Hi,
I am stuck on some questions based on the following scenario.
1) Hadoop normally splits the input file and distributes the splits
across slaves (referred to as Psplits from now on) in chunks of 64 MB.
a) Is there any way to specify split criteria so that, for example, a huge
4 GB file is split into 40-odd pieces (Psplits) respecting record boundaries?
b) Is it even required that these physical splits (Psplits) obey record
boundaries?
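To make 1(a) concrete, here is a toy sketch of the arithmetic I have in mind (plain Java, hypothetical file and record sizes, no Hadoop dependency): with fixed 64 MB chunks, a chunk boundary generally falls in the middle of a record.

```java
public class SplitOffsets {
    public static void main(String[] args) {
        long fileSize = 4L * 1024 * 1024 * 1024; // hypothetical 4 GB input file
        long blockSize = 64L * 1024 * 1024;      // 64 MB physical chunk size
        long recordSize = 1000;                  // hypothetical fixed record length in bytes

        // Number of 64 MB physical chunks (Psplits) the file is cut into
        long numChunks = (fileSize + blockSize - 1) / blockSize;
        System.out.println("Psplits: " + numChunks); // Psplits: 64

        // The chunk boundary is a plain byte offset; it is not a multiple of
        // the record length, so the record at that offset straddles two Psplits.
        System.out.println("boundary mid-record: " + (blockSize % recordSize != 0));
    }
}
```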
2) We can get the locations of these Psplits on HDFS as follows:
BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0,
length); //FileInputFormat line 273
In FileInputFormat, for each blkLocation (Psplit), multiple logical
splits (referred to as Lsplits from now on) are created based on a
heuristic for the number of mappers.
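For reference, the heuristic I mean appears to boil down to the following (my paraphrase of FileInputFormat.computeSplitSize(); the variable names and the 40-map example are mine):

```java
public class SplitSizeHeuristic {
    // My reading of FileInputFormat.computeSplitSize(): the split size is the
    // goal size (total bytes / requested maps), clamped between minSize and blockSize.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long totalSize = 4L * 1024 * 1024 * 1024; // hypothetical 4 GB file
        long blockSize = 64L * 1024 * 1024;       // 64 MB Psplit
        int requestedMaps = 40;                   // numSplits hint passed to getSplits()
        long goalSize = totalSize / requestedMaps; // ~102 MB per split, before clamping
        long minSize = 1;

        // The block size caps the goal, so each Lsplit is at most one Psplit (64 MB),
        // which matches what I see in the code: Lsplits never straddle a Psplit.
        System.out.println(computeSplitSize(goalSize, minSize, blockSize)); // 67108864
    }
}
```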
Q) How is the following situation handled in TextInputFormat, which reads
line by line?
i) The input file is split as described in step 1 into more than two parts.
ii) Suppose there is a line of text which starts near the end of
Psplit-i and ends in Psplit-(i+1) (say Psplit2 and Psplit3).
iii) Which mapper gets this line spanning multiple Psplits (mapper_i
or mapper_(i+1))?
iv) I went through the FileInputFormat code; Lsplits are created only
within a particular Psplit, never across Psplits. Why so?
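To pin down the scenario in ii)/iii), here is a toy reader over an in-memory byte array (plain Java, not the actual LineRecordReader code) following one plausible convention: a reader skips the partial first line unless its split starts at offset 0, and reads past its split end to finish the line in progress. Under such a convention the spanning line would go to mapper_i; I would like to confirm whether Hadoop works this way.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitLineReader {
    // Lines a reader for byte range [start, end) would emit under the convention:
    // skip the partial line at 'start' (unless start == 0), and keep reading past
    // 'end' to finish the last line that was started before 'end'.
    static List<String> readLines(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {                            // skip partial first line
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++;                                   // step past the newline
        }
        List<String> lines = new ArrayList<>();
        while (pos < end && pos < data.length) {     // only *start* lines before 'end'
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++;                                   // may overrun 'end': that is the point
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "aaa\nbbbbbb\ncc\n".getBytes(StandardCharsets.UTF_8);
        // Split the 14-byte input in the middle of "bbbbbb" (at offset 7):
        System.out.println(readLines(data, 0, 7));  // [aaa, bbbbbb] - spanning line to reader 1
        System.out.println(readLines(data, 7, 14)); // [cc]          - reader 2 skips the partial line
    }
}
```

Every line is emitted exactly once, even though the split boundary falls mid-line.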
Q) In short, if one has to read arbitrary objects (not lines), how does one
handle records which lie partially in one Psplit and partially in another?
--Amit