Hi,
I am stuck on some questions based on the following scenario.
1) Hadoop normally splits the input file and distributes the splits
across slaves (referred to as Psplits from now on) in chunks of 64 MB.
a) Is there any way to specify split criteria so that, for example, a huge
4 GB file is split into 40-odd pieces (Psplits) respecting record boundaries?
b) Is it even required that these physical splits (Psplits) obey record
boundaries?
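To make 1(a) concrete, here is a toy sketch of the arithmetic I have in mind (plain Java, hypothetical file and record sizes, no Hadoop dependency): with fixed 64 MB chunks, a chunk boundary generally falls in the middle of a record.

```java
public class SplitOffsets {
    public static void main(String[] args) {
        long fileSize = 4L * 1024 * 1024 * 1024; // hypothetical 4 GB input file
        long blockSize = 64L * 1024 * 1024;      // 64 MB physical chunk size
        long recordSize = 1000;                  // hypothetical fixed record length in bytes

        // Number of 64 MB physical chunks (Psplits) the file is cut into
        long numChunks = (fileSize + blockSize - 1) / blockSize;
        System.out.println("Psplits: " + numChunks); // Psplits: 64

        // The chunk boundary is a plain byte offset; it is not a multiple of
        // the record length, so the record at that offset straddles two Psplits.
        System.out.println("boundary mid-record: " + (blockSize % recordSize != 0));
    }
}
```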
2) We can get the locations of these Psplits on HDFS as follows:
BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0,
length); //FileInputFormat line 273
In FileInputFormat, for each blkLocation (Psplit), multiple logical
splits (referred to as Lsplits from now on) are created based on a
heuristic for the number of mappers.
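For reference, the heuristic I mean appears to boil down to the following (my paraphrase of FileInputFormat.computeSplitSize(); the variable names and the 40-map example are mine):

```java
public class SplitSizeHeuristic {
    // My reading of FileInputFormat.computeSplitSize(): the split size is the
    // goal size (total bytes / requested maps), clamped between minSize and blockSize.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long totalSize = 4L * 1024 * 1024 * 1024; // hypothetical 4 GB file
        long blockSize = 64L * 1024 * 1024;       // 64 MB Psplit
        int requestedMaps = 40;                   // numSplits hint passed to getSplits()
        long goalSize = totalSize / requestedMaps; // ~102 MB per split, before clamping
        long minSize = 1;

        // The block size caps the goal, so each Lsplit is at most one Psplit (64 MB),
        // which matches what I see in the code: Lsplits never straddle a Psplit.
        System.out.println(computeSplitSize(goalSize, minSize, blockSize)); // 67108864
    }
}
```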
Q) How is the following situation handled in TextInputFormat, which reads
line by line?
i) The input file is split as described in step 1 into more than two parts.
ii) Suppose there is a line of text which starts near the end of
Psplit-i and ends in Psplit-(i+1) (say Psplit2 and Psplit3).
iii) Which mapper gets this line spanning multiple Psplits (mapper_i
or mapper_(i+1))?
iv) I went through the FileInputFormat code; Lsplits are created only
within a particular Psplit, never across Psplits. Why so?
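To pin down the scenario in ii)/iii), here is a toy reader over an in-memory byte array (plain Java, not the actual LineRecordReader code) following one plausible convention: a reader skips the partial first line unless its split starts at offset 0, and reads past its split end to finish the line in progress. Under such a convention the spanning line would go to mapper_i; I would like to confirm whether Hadoop works this way.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitLineReader {
    // Lines a reader for byte range [start, end) would emit under the convention:
    // skip the partial line at 'start' (unless start == 0), and keep reading past
    // 'end' to finish the last line that was started before 'end'.
    static List<String> readLines(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {                            // skip partial first line
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++;                                   // step past the newline
        }
        List<String> lines = new ArrayList<>();
        while (pos < end && pos < data.length) {     // only *start* lines before 'end'
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++;                                   // may overrun 'end': that is the point
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "aaa\nbbbbbb\ncc\n".getBytes(StandardCharsets.UTF_8);
        // Split the 14-byte input in the middle of "bbbbbb" (at offset 7):
        System.out.println(readLines(data, 0, 7));  // [aaa, bbbbbb] - spanning line to reader 1
        System.out.println(readLines(data, 7, 14)); // [cc]          - reader 2 skips the partial line
    }
}
```

Every line is emitted exactly once, even though the split boundary falls mid-line.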
Q) In short, if one has to read arbitrary objects (not lines), how does one
handle records which lie partially in one Psplit and partially in another?
--Amit