I have a text file that doesn't contain any newline characters. The records are separated by a special character (e.g. '$'). If I push a single 5 GB file to HDFS, how will the framework identify the boundaries on which the file should be split?
What options do I have in such a scenario so that I can run MapReduce jobs?

1. Replace the record separator with a newline? (Not very convincing, as I have newlines in the data.)
2. Create 64 MB chunks by some preprocessing? (I would love to know if this can be avoided.)
3. I can definitely write my own custom loader (InputFormat) for my MapReduce jobs, but even then, is it possible to reach across HDFS nodes if the blocks are not aligned with record boundaries? (A rough sketch of what I mean is in the P.S. below.)

Thanks,
Prasenjit
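P.S. For option 3, here is roughly what I'm imagining. If I understand correctly, newer Hadoop versions let TextInputFormat take a custom record delimiter via the textinputformat.record.delimiter property, and its record reader reads past the end of a split to finish the last record, which might already solve the boundary problem. A minimal, untested sketch (the class name DollarDelimitedJob and the identity-mapper setup are just for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DollarDelimitedJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Use '$' instead of '\n' as the record delimiter. As I understand it,
            // the record reader reads past the end of its split to complete the
            // last record, so a record straddling an HDFS block boundary should
            // still come back whole.
            conf.set("textinputformat.record.delimiter", "$");

            Job job = Job.getInstance(conf, "dollar-delimited");
            job.setJarByClass(DollarDelimitedJob.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setMapperClass(Mapper.class); // identity mapper, just to test splitting
            job.setNumReduceTasks(0);         // map-only: emit one output line per record
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

If the record reader really does handle a custom delimiter the same way it handles newlines, this would avoid the preprocessing in option 2 entirely. Can anyone confirm?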