Deepak is right here. The line-reading technique is explained in
further detail at http://wiki.apache.org/hadoop/HadoopMapReduce.

On Fri, Apr 27, 2012 at 2:37 AM, Deepak Nettem <deepaknet...@gmail.com> wrote:
> HDFS doesn't care about the contents of the file. The file gets divided
> into 64MB blocks.
>
> For example, if your input file contains data in a custom format (like
> paragraphs) and you want the file to be split along paragraph boundaries,
> HDFS isn't responsible for that - and rightly so.
>
> The application developer needs to use a custom InputFormat, which
> internally uses a RecordReader and InputSplits. The default TextInputFormat
> makes sure that your mappers get each line as an input. Lines that span
> two blocks are handled at this layer: the InputSplit makes sure that the
> necessary bytes from both blocks are made available, and the RecordReader
> converts that byte view into (key, value) pairs.
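>
> A minimal sketch of the paragraph case (ParagraphDriver is an
> illustrative name, and this assumes a Hadoop release whose
> TextInputFormat honours the textinputformat.record.delimiter key;
> older releases need a custom RecordReader instead): treat a blank line
> as the record delimiter, and each map() call then receives a whole
> paragraph, even one that straddles a block boundary.
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.io.LongWritable;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.mapreduce.Job;
>     import org.apache.hadoop.mapreduce.Mapper;
>     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>     import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
>     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>
>     public class ParagraphDriver {
>       public static void main(String[] args) throws Exception {
>         Configuration conf = new Configuration();
>         // Records now end at a blank line instead of '\n', so each
>         // mapper input value is one whole paragraph.
>         conf.set("textinputformat.record.delimiter", "\n\n");
>         Job job = new Job(conf, "paragraphs");
>         job.setJarByClass(ParagraphDriver.class);
>         job.setInputFormatClass(TextInputFormat.class);
>         // Identity mapper: emits (byte offset, paragraph) unchanged.
>         job.setMapperClass(Mapper.class);
>         job.setOutputKeyClass(LongWritable.class);
>         job.setOutputValueClass(Text.class);
>         job.setNumReduceTasks(0);
>         FileInputFormat.addInputPath(job, new Path(args[0]));
>         FileOutputFormat.setOutputPath(job, new Path(args[1]));
>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>       }
>     }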
>
>
>
> On Thu, Apr 26, 2012 at 4:59 PM, Barry, Sean F <sean.f.ba...@intel.com> wrote:
>
>> I guess what I meant to say was: how does Hadoop make 64MB blocks without
>> cutting off parts of words at the end of each block? Does it only make
>> blocks at whitespace?
>>
>> -SB
>>
>> -----Original Message-----
>> From: Michael Segel [mailto:michael_se...@hotmail.com]
>> Sent: Thursday, April 26, 2012 1:56 PM
>> To: common-user@hadoop.apache.org
>> Subject: Re: Changing the Java heap
>>
>> Not sure I follow your question.
>>
>> The Java child heap size is independent of how files are split on HDFS.
>>
>> I suggest you look at Tom White's book (Hadoop: The Definitive Guide) for
>> how files on HDFS are split into blocks.
>>
>> Blocks are split at a set size, 64MB by default.
>> Your record boundaries are not necessarily on block boundaries, so the
>> task reading block A may begin the last record in block A and then
>> complete reading it at the start of block B. A different task may start
>> with block B and skip the first n bytes until it hits the start of a
>> record.
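>>
>> A standalone sketch of that rule in plain Java (not Hadoop's actual
>> LineRecordReader, whose details vary by version): skip the first,
>> possibly partial, line unless the split starts at byte 0, and let the
>> last read run past the split's end so a straddling record is finished
>> by the reader that started it.
>>
>>     import java.io.IOException;
>>     import java.io.RandomAccessFile;
>>     import java.util.ArrayList;
>>     import java.util.List;
>>
>>     public class SplitReadingSketch {
>>       // Returns the newline-terminated records owned by the split
>>       // [start, start + length). Readers over consecutive splits
>>       // between them see every record exactly once.
>>       static List<String> readSplit(String path, long start, long length)
>>           throws IOException {
>>         long end = start + length; // first byte of the next split
>>         List<String> records = new ArrayList<String>();
>>         RandomAccessFile in = new RandomAccessFile(path, "r");
>>         try {
>>           in.seek(start);
>>           if (start != 0) {
>>             // The previous split's reader finishes the record that
>>             // straddles our start, so discard its tail here.
>>             in.readLine();
>>           }
>>           long pos = in.getFilePointer(); // where the next record starts
>>           while (pos <= end) {
>>             String line = in.readLine();
>>             if (line == null) {
>>               break; // end of file
>>             }
>>             records.add(line);
>>             pos = in.getFilePointer();
>>           }
>>         } finally {
>>           in.close();
>>         }
>>         return records;
>>       }
>>     }
>>
>> Calling readSplit(f, 0, 64 << 20) and then readSplit(f, 64 << 20,
>> 64 << 20) on a two-block file returns every line exactly once between
>> the two calls.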
>>
>> HTH
>>
>> -Mike
>>
>> On Apr 26, 2012, at 3:46 PM, Barry, Sean F wrote:
>>
>> > Within my small 2-node cluster I set up my 4-core slave node to have 4
>> > task slots, and I also limited my Java heap size to -Xmx1024m
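>> >
>> > For reference, a sketch of the mapred-site.xml keys behind that setup
>> > (Hadoop 1.x-era names; other versions spell them differently):
>> >
>> >   <property>
>> >     <name>mapred.tasktracker.map.tasks.maximum</name>
>> >     <value>4</value> <!-- concurrent map slots on this node -->
>> >   </property>
>> >   <property>
>> >     <name>mapred.child.java.opts</name>
>> >     <value>-Xmx1024m</value> <!-- heap for each spawned task JVM -->
>> >   </property>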
>> >
>> > Is there a possibility that when the data gets broken up, it will be
>> > split at a place in the file that is not whitespace? Or is that already
>> > handled when the data on HDFS is broken up into blocks?
>> >
>> > -SB
>>
>>
>
>
> --
> Warm Regards,
> Deepak Nettem <http://www.cs.stonybrook.edu/%7Ednettem/>



-- 
Harsh J
