Utkarsh,

This question has been asked several times before. I have previously answered the same question at: http://www.mail-archive.com/mapreduce-user@hadoop.apache.org/msg04282.html
If HDFS says its block size is 64 MB, then that is what the block size is. HDFS is a filesystem: it writes at most 64 MB of bytes per block and does not care what the file carries (no filesystem cares what a file carries). The problem does not lie on the filesystem side.

The question to ask instead is: "How do I read data from HDFS if my records may lie across two blocks? Will I be able to?" It is up to the reader of the blocks to take care of record boundaries, which can easily fall across blocks (and generally only MR has to do this kind of block-boundary-aware reading).

The way MR's LineRecordReader (TextInputFormat) does it is explained here: http://wiki.apache.org/hadoop/HadoopMapReduce

So in short: don't worry, this is already taken care of for you.

On Fri, May 18, 2012 at 2:40 PM, Utkarsh Gupta <utkarsh_gu...@infosys.com> wrote:

> Hi,
>
> I have a doubt about HDFS which may be a very trivial thing, but I am not
> able to understand it.
>
> Since HDFS keeps files in blocks of 64/128 MB, how does HDFS split
> files?
>
> The problem I see is this: suppose I have a long line in my input
> file:
>
> 672364,423746273,4234234,2,342,34,2,34,234,2,34,234,2,342,342
>
> This is to be processed in one map call. But because of blocks, one part of
> this line is in one block and the rest is in the next.
>
> Block1:
> --
> -
> - > this block goes to one mapper process
> -
> -
> 672364,423746273,4234
> <end of block1>
>
> Block2:
> 234,2,342,34,2,34,234,2,34,234,2,342,342
> -
> -
> - > this block goes to another mapper process
>
> How does HDFS avoid this scenario?
>
> Thanks and Regards
> Utkarsh Gupta

--
Harsh J
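
To make the reader-side handling described above concrete, here is a minimal sketch in Python. It is not Hadoop's Java code; it only imitates the convention that LineRecordReader is understood to follow: every split except the first throws away its first (possibly partial) line, and every reader keeps reading whole lines as long as a line starts at or before the end of its split, even if that line runs into the next block. The function names read_split and read_all_splits are hypothetical, and the sketch ignores compression, custom delimiters, and remote block reads.

```python
from typing import List


def read_split(data: bytes, start: int, length: int) -> List[bytes]:
    """Return the lines owned by the split [start, start + length).

    Assumed convention (a simplification of what Hadoop does):
      * a split that does not begin at offset 0 discards everything up to
        and including the first newline at or after `start`;
      * a split reads every line that begins at an offset <= its end,
        continuing past the split boundary to finish that line.
    """
    end = start + length
    pos = start

    if start != 0:
        # The previous split's reader is responsible for this first line.
        newline = data.find(b"\n", pos)
        if newline == -1:
            return []
        pos = newline + 1

    records = []
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            # Last line of the file has no trailing newline.
            records.append(data[pos:])
            pos = len(data)
        else:
            records.append(data[pos:newline])
            pos = newline + 1
    return records


def read_all_splits(data: bytes, split_size: int) -> List[List[bytes]]:
    """Cut `data` into fixed-size splits and read each one independently."""
    return [
        read_split(data, start, min(split_size, len(data) - start))
        for start in range(0, len(data), split_size)
    ]


# The line from the question, placed so that a 27-byte "block" cuts it
# after "672364,423746273,4234", as in the Block1/Block2 illustration.
LONG_LINE = b"672364,423746273,4234234,2,342,34,2,34,234,2,34,234,2,342,342"
data = b"a,b,c\n" + LONG_LINE + b"\nx,y,z\n"

for index, records in enumerate(read_all_splits(data, 27)):
    print(index, records)
```

With 27-byte splits, the reader of the first split returns the long line in full even though its tail sits in the second block, the reader of the second split returns nothing because it skips that tail, and the third returns the remaining line. Every line is processed exactly once, whatever the split size.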