Thank you guys. Really helpful.
________________________________
From: Harsh J <ha...@cloudera.com>
To: hdfs-user@hadoop.apache.org
Sent: Wednesday, August 1, 2012 1:03 PM
Subject: Re: HDFS splits based on content semantics

To add onto David's response, also read
http://search-hadoop.com/m/ydCoSysmTd1 for some more info.

On Wed, Aug 1, 2012 at 7:23 PM, David Rosenstrauch <dar...@darose.net> wrote:
> On 08/01/2012 09:44 AM, Grandl Robert wrote:
>>
>> Hi,
>>
>> This question has probably been answered many times, but I could not find
>> a clear answer after searching on Google.
>>
>> Does HDFS split the input solely based on a fixed block size, or does it
>> take the semantics of the content into consideration? For example, if I
>> have a binary file, or if I don't want a block to cut lines of text in
>> half, can I instruct HDFS where each block should stop?
>>
>> Thanks,
>> Robert
>>
>
> Hadoop can natively understand text-based data, as long as it is in a
> one-record-per-line format.
>
> It obviously does not understand custom binary formats (e.g. Microsoft
> Word files).
>
> However, Hadoop does provide a framework for you to create your own binary
> formats that it can understand. There is a class in Hadoop called
> SequenceFile which provides the ability to create binary files that are
> broken up into logical blocks that Hadoop can split on.
>
> HTH,
>
> DR

--
Harsh J
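
A minimal sketch of the SequenceFile approach David describes, for reference.
HDFS itself always cuts files into fixed-size blocks without looking at the
content; it is the InputFormat layer in MapReduce that keeps records intact
(for plain text, TextInputFormat reads past the block boundary to finish the
last line of a split, and the next split skips that partial line). The output
path, key/value types, and record contents below are made up for illustration,
and the code uses the classic
SequenceFile.createWriter(fs, conf, path, keyClass, valueClass) form:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileWriteSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/example.seq");  // hypothetical output path

        SequenceFile.Writer writer = null;
        try {
          // IntWritable/Text are placeholder key/value types; any Writable
          // classes work, including ones wrapping arbitrary binary payloads.
          writer = SequenceFile.createWriter(fs, conf, path,
              IntWritable.class, Text.class);

          IntWritable key = new IntWritable();
          Text value = new Text();
          for (int i = 0; i < 100000; i++) {
            key.set(i);
            value.set("record-" + i);  // stand-in for a real record payload
            // Each append writes one complete record; the writer also emits
            // periodic sync markers, which are what allow MapReduce to start
            // an input split at a record boundary instead of mid-record.
            writer.append(key, value);
          }
        } finally {
          IOUtils.closeStream(writer);
        }
      }
    }

On the read side, a job using SequenceFileInputFormat gets splits that start
at those sync markers, so a record is never cut in half or handed to two
mappers, even though the underlying HDFS blocks are still fixed-size.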