On 08/01/2012 09:44 AM, Grandl Robert wrote:
Hi,
This question has probably been answered many times, but I could not find a clear answer after searching on Google.
Does HDFS split the input solely based on a fixed block size, or does it take the semantics of the data into consideration?
For example, if I have a binary file, or if I don't want a block to cut a line of text in half, can I instruct HDFS where each block should end?
Thanks,
Robert
Hadoop can natively understand text-based data. (As long as it's in a
one-record-per-line format.)
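
For example, with the standard TextInputFormat the record reader takes care of lines that cross HDFS block boundaries, so each call to map() receives exactly one complete line. A minimal sketch (new MapReduce API; the class name LineMapper is just illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The framework hands map() one complete line at a time, keyed by the
// line's byte offset, regardless of where the HDFS block boundaries fall.
public class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit (line, 1); any per-line processing goes here.
        context.write(line, ONE);
    }
}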
It obviously does not understand custom binary formats. (E.g. Microsoft
Word files.)
However, Hadoop does provide a framework for creating your own
binary formats that it can understand. There is a class in Hadoop
called SequenceFile which lets you create binary files that are
broken up into logical blocks that Hadoop can split on.
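
For instance, something along these lines writes a splittable binary file of key/value records (the path and record types are just for illustration; SequenceFile.createWriter has several other overloads):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/records.seq");  // illustrative output path

        // The writer periodically inserts sync markers between records, so
        // MapReduce can later split the file on record boundaries rather
        // than cutting a record in half at a block boundary.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, LongWritable.class, BytesWritable.class);
        try {
            for (long i = 0; i < 1000; i++) {
                byte[] payload = ("record-" + i).getBytes("UTF-8");
                writer.append(new LongWritable(i), new BytesWritable(payload));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}

On the job side, SequenceFileInputFormat reads those records back as the same (LongWritable, BytesWritable) pairs.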
HTH,
DR