map side only behavior

Gang Luo Fri, 29 Jan 2010 07:41:00 -0800

Hi all,
If I use only the map side to process my data (number of reducers set to 0), 
how does Hadoop behave? Will it still merge and sort the spills generated by 
each mapper?


-Gang


----- Original Message ----
From: Gang Luo <[email protected]>
To: [email protected]
Sent: 2010/1/29 (Fri) 8:54:33 AM
Subject: Re: fine granularity operation on HDFS

Yeah, I see how it works. Thanks Amogh.


-Gang



----- Original Message ----
From: Amogh Vasekar <[email protected]>
To: "[email protected]" <[email protected]>
Sent: 2010/1/28 (Thu) 10:00:22 AM
Subject: Re: fine granularity operation on HDFS

Hi Gang,
Yes, PathFilters work only on file paths. I meant that you can include that kind 
of logic at the split level.
The input format's getSplits() method is responsible for computing the splits and 
adding them to a list, and the JobTracker (JT) initializes one mapper task per 
split in that list. You can override getSplits() to add only some of the splits, 
chosen, say, by location or offset. Here is the relevant loop from 
FileInputFormat.getSplits() for reference:
        // Emit full-size splits while more than SPLIT_SLOP splits' worth remains.
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
                                   blkLocations[blkIndex].getHosts()));
          bytesRemaining -= splitSize;
        }

        // Whatever is left becomes the final (possibly smaller or larger) split.
        if (bytesRemaining != 0) {
          splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
                     blkLocations[blkLocations.length-1].getHosts()));
        }

Before each splits.add call you can apply your own logic to decide whether to 
discard that split. However, you need to ensure that your record reader handles 
incomplete records at split boundaries.
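
The filtering idea can be sketched in plain Java without any Hadoop classes. This is only a model of the loop quoted above, not Hadoop's implementation: SplitFilterSketch, Range (a stand-in for FileSplit) and selectSplits are hypothetical names made up for illustration, host lookup is left out, and the predicate stands in for whatever discard logic you need. SPLIT_SLOP is 1.1, as in FileInputFormat.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongPredicate;

/** Plain-Java model of the split loop; hypothetical, not Hadoop's FileInputFormat. */
final class SplitFilterSketch {
    // Same slack factor as FileInputFormat: the last split may be up to 10% oversized.
    static final double SPLIT_SLOP = 1.1;

    /** Hypothetical stand-in for FileSplit: the byte range [start, start + length). */
    static final class Range {
        final long start;
        final long length;

        Range(long start, long length) {
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return start + "+" + length;
        }
    }

    /**
     * Walks the file exactly like the quoted loop, but adds a range only when
     * keepOffset accepts its start offset; rejected ranges are simply skipped.
     */
    static List<Range> selectSplits(long fileLength, long splitSize, LongPredicate keepOffset) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Range> splits = new ArrayList<>();
        long bytesRemaining = fileLength;

        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            long start = fileLength - bytesRemaining;
            if (keepOffset.test(start)) {        // discard logic goes before the add
                splits.add(new Range(start, splitSize));
            }
            bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
            long start = fileLength - bytesRemaining;
            if (keepOffset.test(start)) {
                splits.add(new Range(start, bytesRemaining));
            }
        }
        return splits;
    }
}
```

For a 250-byte file with a split size of 100, selectSplits(250, 100, off -> off >= 100) drops the range starting at offset 0 and keeps 100+100 and 200+50. In a real job the same check would sit inside an overridden getSplits(), and the record reader would still have to deal with records that straddle the kept ranges.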

To get the block locations so you can load blocks separately, the FileSystem 
class exposes a few methods, such as getFileBlockLocations.
Hope this helps.

Amogh

On 1/28/10 7:26 PM, "Gang Luo" <[email protected]> wrote:

Thanks Amogh.

For the second part of my question, I actually meant loading blocks separately 
from HDFS. I don't know whether that is realistic. Anyway, since my goal is to 
process different divisions of a file separately, doing that at the split level 
is OK. But even if I can get the splits from the input format, how do I "add 
only a few splits you need to mapper and discard the others"? (PathFilters only 
work on files, not blocks, I think.)

Thanks.
-Gang



