Hi,

I am stuck on some questions based on the following scenario.

1) Hadoop normally splits an input file into 64 MB chunks and distributes these splits across the slaves (referred to as Psplits from now on).
   a) Is there any way to specify the split criteria so that, for example, a huge 4 GB file is split into 40-odd files (Psplits) respecting record boundaries?
   b) Is it even required that these physical splits (Psplits) obey record boundaries?
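For 1a), my reading of the FileInputFormat source is that the logical split size is clamped as max(minSize, min(maxSize, blockSize)), so one can steer a 4 GB file toward ~40 splits by raising the minimum split size above the block size. A minimal arithmetic sketch (the formula is from my reading of the code; treat it as illustrative, not authoritative):

```java
// Sketch of FileInputFormat's split-size formula, as I understand it:
// the block size is clamped between the configured min and max split sizes.
public class SplitSize {
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long MB = 1L << 20, GB = 1L << 30;
        // With 64 MB blocks, lowering maxSize can shrink a split but not grow it:
        System.out.println(computeSplitSize(64 * MB, 1, 100 * MB) / MB);   // 64
        // Raising minSize to ~100 MB grows splits past the block size,
        // which would give roughly 40 splits for a 4 GB file:
        long bigger = computeSplitSize(64 * MB, 100 * MB, Long.MAX_VALUE);
        System.out.println((4 * GB) / bigger);                             // 40
    }
}
```

Note this only controls split *size*, not record boundaries, which is why 1b) matters.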

2) We can get the locations of these Psplits on HDFS as follows:

BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length); // FileInputFormat line 273

In FileInputFormat, for each entry of blkLocations (each Psplit), multiple logical splits (referred to as Lsplits from now on) are created, based on a heuristic for the number of mappers.
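For concreteness, the enumeration that step 2) refers to can be sketched as a self-contained loop, modeled on my reading of FileInputFormat.getSplits (the SPLIT_SLOP constant and names are from memory and may differ across versions):

```java
import java.util.ArrayList;
import java.util.List;

// Simulation of the Lsplit enumeration loop for one file, following my
// reading of FileInputFormat.getSplits; illustrative, not the real code.
public class SplitSketch {
    static final double SPLIT_SLOP = 1.1; // last chunk may be up to 10% larger

    // Returns (offset, length) pairs for a file of the given size.
    public static List<long[]> splits(long fileSize, long splitSize) {
        List<long[]> result = new ArrayList<>();
        long bytesRemaining = fileSize;
        while ((double) bytesRemaining / splitSize > SPLIT_SLOP) {
            result.add(new long[] { fileSize - bytesRemaining, splitSize });
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            result.add(new long[] { fileSize - bytesRemaining, bytesRemaining });
        }
        return result;
    }

    public static void main(String[] args) {
        long GB = 1L << 30, MB = 1L << 20;
        // A 4 GB file with ~100 MB splits comes out at 41 splits.
        System.out.println(splits(4 * GB, 100 * MB).size()); // 41
    }
}
```

Note the splits are pure (offset, length) byte ranges; nothing here knows about records, which is what my questions below are about.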

Q) How is the following situation handled in TextInputFormat, which reads line by line?
   i) The input file is split as described in step 1 into more than 2 parts.
   ii) Suppose there is a line of text which starts near the end of Psplit-i and ends in Psplit-i+1 (say Psplit2 and Psplit3).
   iii) Which mapper gets this line spanning multiple Psplits (mapper_i or mapper_i+1)?
   iv) I went through the FileInputFormat code; Lsplits are created only within a particular Psplit, not across Psplits. Why so?
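To make ii) and iii) concrete, here is my understanding of the convention as a self-contained simulation (not the actual LineRecordReader, whose details may differ): a reader whose split starts at offset > 0 discards everything up to the first newline, and the previous split's reader reads past its end to finish its last line, so a boundary-spanning line would go to mapper_i.

```java
import java.util.ArrayList;
import java.util.List;

// Simulation of the skip-first-partial-line convention that I believe
// TextInputFormat/LineRecordReader uses; illustrative, not Hadoop code.
public class LineSplitSim {
    // Read the lines belonging to the byte range [start, end) of data.
    public static List<String> readSplit(byte[] data, long start, long end) {
        List<String> lines = new ArrayList<>();
        int pos = (int) start;
        // Rule 1: unless we start at offset 0, skip the partial first line;
        // it belongs to the previous split's reader.
        if (start != 0) {
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // step over the newline itself
        }
        // Rule 2: keep reading whole lines as long as the line *starts*
        // before `end`; the last line may extend past the split boundary.
        while (pos < data.length && pos < end) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart));
            pos++; // skip the trailing newline
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "aaaa\nbbbb\ncccc\n".getBytes();
        // Split the 15 bytes at offset 7, mid-way through "bbbb":
        System.out.println(readSplit(data, 0, 7));  // [aaaa, bbbb]
        System.out.println(readSplit(data, 7, 15)); // [cccc]
    }
}
```

Under this convention every line is read exactly once even though the byte split ignores line boundaries.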

Q) In short, if one has to read arbitrary objects (not lines), how does one handle records which are partially in one Psplit and partially in another?
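As a strawman for this last question: if records are fixed-length, the same convention as for lines seems to generalize by rounding the split start up to the next record boundary (for variable-length records I believe SequenceFile embeds sync markers for the same purpose). A hypothetical sketch, not any existing Hadoop reader:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for fixed-length binary records: round the split
// start up to the next record boundary, and let the last record run past
// `end`. Illustrative only.
public class FixedRecordSim {
    public static List<byte[]> readSplit(byte[] data, long start, long end, int recLen) {
        List<byte[]> records = new ArrayList<>();
        // Align upward: a record straddling `start` belongs to the previous split.
        int pos = (int) ((start + recLen - 1) / recLen * recLen);
        while (pos < end && pos + recLen <= data.length) {
            byte[] rec = new byte[recLen];
            System.arraycopy(data, pos, rec, 0, recLen);
            records.add(rec);
            pos += recLen;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = new byte[20]; // five 4-byte records
        for (int i = 0; i < 20; i++) data[i] = (byte) (i / 4);
        // Byte split at offset 10 cuts record #2 in half; it still goes
        // to the first reader, and the second reader starts at offset 12.
        System.out.println(readSplit(data, 0, 10, 4).size());  // 3
        System.out.println(readSplit(data, 10, 20, 4).size()); // 2
    }
}
```

Is this roughly the intended approach, or is there a standard InputFormat for this?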

--Amit


