Lei Chen wrote:
It seems that a big file can be split in the middle of a line. But map/reduce will still work properly, since the DFS layer hides the block layout information from the map/reduce tasks.
It's up to the InputFormat to handle records that are split on FileSplit boundaries.
TextInputFormat's reader finishes the line that straddles the end of its split by reading past the split boundary, and a split that starts mid-file skips ahead to the first linebreak it encounters, so the straddling line is read exactly once (by the previous split). See http://svn.apache.org/viewcvs.cgi/lucene/hadoop/trunk/src/java/org/apache/hadoop/mapred/TextInputFormat.java?view=markup for details.
(I added this info to http://wiki.apache.org/lucene-hadoop/HadoopMapReduce).
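To make the boundary rule concrete, here is a minimal standalone sketch (not the actual Hadoop source; `SplitLineDemo` and `readSplit` are hypothetical names, and real LineRecordReader also deals with buffering, seeks, etc.). The convention illustrated: a split owns every line whose first byte falls inside [start, end), which means reading past `end` to finish the last owned line, and skipping ahead to the first newline when the split starts mid-file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the split-boundary handling described above.
public class SplitLineDemo {

    // Return the lines "owned" by the byte range [start, end) of data.
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // A split that begins mid-file skips ahead to the first linebreak
            // at or after start-1; the previous split has already read the
            // line that straddles the boundary. (Starting the scan at start-1
            // handles the case where the split begins exactly at a line start.)
            pos = start - 1;
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // first line start at or after `start`
        }
        List<String> lines = new ArrayList<>();
        // Read every line that *begins* before `end`; the last such line may
        // extend past the boundary, so we deliberately read beyond `end`.
        while (pos < end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart));
            pos++; // step over the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes();
        // Split the file at byte 8, which falls in the middle of "bravo":
        System.out.println(readSplit(data, 0, 8));   // [alpha, bravo]
        System.out.println(readSplit(data, 8, 20));  // [charlie]
    }
}
```

Because every line's first byte lies in exactly one split, each record is read exactly once even though the byte ranges cut through records arbitrarily.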
