On Wed, Dec 10, 2008 at 11:12 AM, amitsingh <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I am stuck with some questions based on following scenario.
>
> 1) Hadoop normally splits the input file and distributes the splits across
> slaves (referred to as Psplits from now on), into chunks of 64 MB.
> a) Is there any way to specify split criteria so that, for example, a huge 4 GB
> file is split into 40-odd files (Psplits) respecting record boundaries?


You can set mapred.min.split.size in the JobConf. Setting its value greater
than the block size forces each split to be larger than a block. However, this
may result in splits containing data blocks that are not local to the mapper.


>
> b) Is it even required that these physical splits (Psplits) obey record
> boundaries?
>
> 2) We can get locations of these Psplits on HDFS as follows
> BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0,  length);
> //FileInputFormat line 273
> In FileInputFormat, for each blkLocations (Psplit), multiple logical
> splits (referred to as Lsplits from now on) are created based on a heuristic
> for the number of mappers.
>
> Q) How is the following situation handled in TextInputFormat, which reads
> line by line:
>   i) The input file is split as described in step 1 into more than 2 parts
>   ii) Suppose there is a line of text which starts near the end of Psplit-i
> and ends in Psplit-i+1 (say Psplit2 and Psplit3)
>   iii) Which mapper gets this line spanning multiple Psplits (mapper_i or
> mapper_i+1)?
>   iv) I went through the FileInputFormat code; Lsplits are created only
> within a particular Psplit, not across Psplits. Why so?
>
> Q) In short, if one has to read arbitrary objects (not lines), how does one
> handle records which are partially in one Psplit and partially in another?
>

I am working on this as well and have not found an exact answer, but in my
view mapper_i should handle the line/record which is partially in one split
and partially in the other. Mapper_i+1 should first seek the beginning of a
new record (a line, in this case) and start processing from there.
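That rule matches what TextInputFormat's LineRecordReader does: every reader except the first skips everything up to the first newline (that partial line belongs to the previous split's reader), and each reader keeps reading lines until its position passes the split end, so the last line may run into the next split. A minimal pure-Java simulation of that rule over a plain byte[] (class and method names here are illustrative, not Hadoop's actual classes):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitLineReader {
    // Return the lines "owned" by the split [start, end) of data,
    // following the LineRecordReader boundary rule described above.
    public static List<String> readSplit(byte[] data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        // Unless this is the first split, skip the partial first line:
        // the previous split's reader finishes it.
        if (start != 0) {
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // step past the newline to the start of the next line
        }
        // Keep reading whole lines while the line start has not passed the
        // split end; the last line may extend past 'end' into the next block.
        while (pos <= end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart));
            pos++; // skip the newline
        }
        return lines;
    }
}
```

With "aaaa\nbbbb\ncccc\n" split at byte 7 (mid-"bbbb"), the first reader emits "aaaa" and "bbbb" (reading past its split end), and the second skips to the newline and emits only "cccc" — so each line is read exactly once, by the reader whose split contains its first byte.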

Someone from the Hadoop core team, please correct me if this is wrong and
fill in the details.
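For arbitrary (non-line) records, the usual Hadoop answer is the approach SequenceFile takes: a known sync-marker byte pattern is written between records, so a reader dropped at an arbitrary split offset can scan forward to the next marker and start from the record after it. A rough pure-Java sketch of that scan (the marker value and names are made up for illustration, not SequenceFile's real on-disk format):

```java
import java.util.Arrays;

public class SyncScan {
    // Return the index of the first record boundary at or after 'from',
    // i.e. the position just past the next sync marker.
    public static int nextRecordStart(byte[] data, byte[] marker, int from) {
        for (int i = from; i + marker.length <= data.length; i++) {
            if (Arrays.equals(Arrays.copyOfRange(data, i, i + marker.length),
                              marker)) {
                return i + marker.length;
            }
        }
        // No marker ahead: the remainder belongs to the previous reader.
        return data.length;
    }
}
```

The same hand-off rule as for lines then applies: a reader that does not start at offset 0 first calls this to find its first full record, and it reads its last record to completion even if the record spills into the next split.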

Thanks,
Taran


> --Amit
>
>
>
>
