On 08/01/2012 09:44 AM, Grandl Robert wrote:
Hi,
This question has probably been answered many times, but I could not find a clear answer after searching on Google.
Does HDFS split the input solely based on a fixed block size, or does it take the semantics of the data into consideration?
For example, if I have a binary file, or if I don't want a block to cut a line of text in half, can I instruct HDFS where each block should end?
Thanks,
Robert
Hadoop can natively understand text-based data. (As long as it's in a
one-record-per-line format.)
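
For example, with the standard TextInputFormat the record reader takes care of lines that cross HDFS block boundaries, so each call to map() receives exactly one complete line. A minimal sketch (new MapReduce API; the class name LineMapper is just illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The framework hands map() one complete line at a time, keyed by the
// line's byte offset, regardless of where the HDFS block boundaries fall.
public class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit (line, 1); any per-line processing goes here.
        context.write(line, ONE);
    }
}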
It obviously does not understand custom binary formats. (E.g. Microsoft
Word files.)
However, Hadoop does provide a framework for creating your own
binary formats that it can understand. There is a class in Hadoop
called SequenceFile which lets you create binary files that are
broken up into logical blocks that Hadoop can split on.
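
For instance, something along these lines writes a splittable binary file of key/value records (the path and record types are just for illustration; SequenceFile.createWriter has several other overloads):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/records.seq");  // illustrative output path

        // The writer periodically inserts sync markers between records, so
        // MapReduce can later split the file on record boundaries rather
        // than cutting a record in half at a block boundary.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, LongWritable.class, BytesWritable.class);
        try {
            for (long i = 0; i < 1000; i++) {
                byte[] payload = ("record-" + i).getBytes("UTF-8");
                writer.append(new LongWritable(i), new BytesWritable(payload));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}

On the job side, SequenceFileInputFormat reads those records back as the same (LongWritable, BytesWritable) pairs.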
HTH,
DR