Deepak is right here. The line-reading technique is explained in further detail at http://wiki.apache.org/hadoop/HadoopMapReduce.
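To make it concrete, here is a rough sketch of the boundary-handling idea. This is simplified and the class/field names are mine, not Hadoop's actual LineRecordReader source, but it uses only real hadoop-common types (LineReader, FSDataInputStream, Text):

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.util.LineReader;

    // Illustrative sketch: reads the lines "owned" by one split of a file,
    // even when the first or last line straddles a block boundary.
    public class SplitAwareLineReader {
        private final LineReader in;
        private long pos;        // current byte offset within the file
        private final long end;  // nominal end offset of this split

        public SplitAwareLineReader(FSDataInputStream file, long start, long end)
                throws IOException {
            file.seek(start);
            this.in = new LineReader(file);
            this.pos = start;
            this.end = end;
            if (start != 0) {
                // Not the first split: the partial line at 'start' belongs
                // to the previous split's reader, so discard bytes up to
                // and including the next newline before emitting records.
                pos += in.readLine(new Text());
            }
        }

        // Reads the next line owned by this split into 'value'.
        public boolean next(Text value) throws IOException {
            // A line is ours if it starts before the split boundary. The
            // read itself may run past 'end' into the next block to finish
            // the line; HDFS streams those extra bytes transparently.
            if (pos < end) {
                int bytesRead = in.readLine(value);
                if (bytesRead == 0) {
                    return false;  // end of file
                }
                pos += bytesRead;
                return true;
            }
            return false;
        }
    }

Every split except the first throws away its leading partial line, and every reader is allowed to run past its nominal end to finish its last line, so each line is consumed exactly once even when it straddles a block boundary.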
On Fri, Apr 27, 2012 at 2:37 AM, Deepak Nettem <deepaknet...@gmail.com> wrote:
> HDFS doesn't care about the contents of the file. The file gets divided
> into 64MB blocks.
>
> For example, if your input file contains data in a custom format (like
> paragraphs) and you want the file split along paragraph boundaries, HDFS
> isn't responsible for that - and rightly so.
>
> The application developer needs to use a custom InputFormat, which
> internally uses a RecordReader and InputSplits. The default text input
> format makes sure that your mappers get each line as input. Lines that
> span two blocks are handled by the InputSplit, which makes sure that the
> necessary bytes from both blocks are made available, and the RecordReader
> actually converts that byte view into (key, value) pairs.
>
> On Thu, Apr 26, 2012 at 4:59 PM, Barry, Sean F <sean.f.ba...@intel.com> wrote:
>
>> I guess what I meant to say was: how does Hadoop make 64MB blocks without
>> cutting off parts of words at the end of each block? Does it only make
>> blocks at whitespace?
>>
>> -SB
>>
>> -----Original Message-----
>> From: Michael Segel [mailto:michael_se...@hotmail.com]
>> Sent: Thursday, April 26, 2012 1:56 PM
>> To: common-user@hadoop.apache.org
>> Subject: Re: Changing the Java heap
>>
>> Not sure of your question.
>>
>> The Java child heap size is independent of how files are split on HDFS.
>>
>> I suggest you look at Tom White's book on HDFS and how files are split
>> into blocks.
>>
>> Blocks are split at a set size, 64MB by default. Your record boundaries
>> are not necessarily on block boundaries, so one task may read the last
>> record in block A and complete it at the start of block B, while a
>> different task may start with block B and skip the first n bytes until it
>> hits the start of a record.
>>
>> HTH
>>
>> -Mike
>>
>> On Apr 26, 2012, at 3:46 PM, Barry, Sean F wrote:
>>
>> > Within my small 2-node cluster, I set up my 4-core slave node to have 4
>> > task trackers, and I also limited my Java heap size to -Xmx1024m.
>> >
>> > Is there a possibility that when the data gets broken up, it will break
>> > at a place in the file that is not whitespace? Or is that already
>> > handled when the data on HDFS is broken up into blocks?
>> >
>> > -SB
>
> --
> Warm Regards,
> Deepak Nettem <http://www.cs.stonybrook.edu/%7Ednettem/>

--
Harsh J