On 08/01/2012 09:44 AM, Grandl Robert wrote:
Hi,

This question has probably been answered many times, but I could not find a clear answer after 
searching on Google.


Does HDFS split the input solely based on a fixed block size, or does it take the 
semantics of the data into consideration?
For example, if I have a binary file, or if I don't want a block to cut a line of text 
in half, can I instruct HDFS where each block should stop?

Thanks,
Robert


Hadoop can natively understand text-based data, as long as it's in a one-record-per-line format. HDFS itself splits a file purely on byte offsets at the configured block size and knows nothing about its contents; it is the record reader on top that re-joins any line straddling a block boundary, so no line is lost or cut in half.
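
To make that concrete, here is a minimal mapper sketch against the standard org.apache.hadoop.mapreduce API (the class and variable names are my own, not from any particular example):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With the default TextInputFormat, each call to map() receives exactly one
// line of the input file. The framework's LineRecordReader takes care of
// lines that straddle an HDFS block boundary, so a line is never split
// between two mappers and never processed twice.
public class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  @Override
  protected void map(LongWritable byteOffset, Text line, Context context)
      throws IOException, InterruptedException {
    // key = byte offset of the line within the file, value = the line itself
    context.write(new Text(line), byteOffset);
  }
}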

It obviously does not understand custom binary formats. (E.g. Microsoft Word files.)

However, Hadoop does provide a framework for creating your own binary formats that it can understand. There is a class in Hadoop called SequenceFile which writes binary key/value files with periodic sync markers, giving MapReduce safe points at which to split the file into logical blocks.
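
A rough sketch of writing and re-reading a SequenceFile (the path "demo.seq" and the record contents are made up; this uses the FileSystem-based createWriter/Reader calls):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("demo.seq");  // hypothetical output path

    // Write key/value records; the writer periodically inserts sync markers,
    // and those markers are what lets MapReduce split the file safely.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, IntWritable.class, Text.class);
    try {
      for (int i = 0; i < 1000; i++) {
        writer.append(new IntWritable(i), new Text("record-" + i));
      }
    } finally {
      writer.close();
    }

    // Read the records back. A reader started in the middle of the file
    // seeks forward to the next sync marker, so no record is ever cut in
    // half by a split.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      IntWritable key = new IntWritable();
      Text value = new Text();
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    } finally {
      reader.close();
    }
  }
}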

HTH,

DR
