[ http://issues.apache.org/jira/browse/HADOOP-473?page=all ]

Dennis Kubes updated HADOOP-473:
--------------------------------

    Attachment: text-input-format2.patch

Sorry this took so long.  I had the patch ready a couple of days ago, but I 
was still in the middle of testing it.  Here is an updated patch that removes 
the read-ahead and fixes the same line-ending problem in the getRecordReader 
method of TextInputFormat.  It has only been lightly tested.

> TextInputFormat does not correctly handle all line endings
> ----------------------------------------------------------
>
>                 Key: HADOOP-473
>                 URL: http://issues.apache.org/jira/browse/HADOOP-473
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.5.0, 0.6.0
>         Environment: All environments
>            Reporter: Dennis Kubes
>         Attachments: text-input-format.patch, text-input-format2.patch
>
>
> The current TextInputFormat readLine method breaks on a single '\r' or 
> '\n' character.  For Windows-formatted text files, whose lines end in 
> "\r\n", this leaves a trailing '\n' in the stream, so the next call to 
> readLine on the same input stream returns an empty string.  The attached 
> patch corrects this by looking for both single- and double-character line 
> endings and positioning the input stream at the start of the next line.  It 
> correctly handles Windows ("\r\n"), Mac ('\r'), and Unix ('\n') line endings.
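
To illustrate the behavior described above, here is a minimal sketch of a 
line reader that treats "\r\n", '\r', and '\n' each as a single line ending.  
It is not the code from the attached patch: the class name LineEndingSketch 
and the use of java.io.PushbackInputStream to peek at the byte after a '\r' 
are assumptions made for this example, and it assumes single-byte (ASCII) 
input.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class LineEndingSketch {

    // Hypothetical helper, not Hadoop's actual readLine.
    // Returns the next line without its terminator, or null at end of stream.
    static String readLine(PushbackInputStream in) throws IOException {
        int b = in.read();
        if (b == -1) {
            return null; // nothing left to read
        }
        StringBuilder line = new StringBuilder();
        while (b != -1) {
            if (b == '\n') {
                break; // Unix line ending
            }
            if (b == '\r') {
                // Look at the next byte: consume it if it is the '\n' of a
                // Windows ending, otherwise push it back for the next call.
                int next = in.read();
                if (next != '\n' && next != -1) {
                    in.unread(next);
                }
                break;
            }
            line.append((char) b); // assumes single-byte characters
            b = in.read();
        }
        return line.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "unix\nwindows\r\nmac\rlast"
                .getBytes(StandardCharsets.US_ASCII);
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(data));
        String line;
        while ((line = readLine(in)) != null) {
            System.out.println("[" + line + "]");
        }
    }
}
```

With this approach the stream is always left at the first byte of the next 
line, so a Windows-formatted file no longer yields an empty string between 
lines, while a genuinely blank line (two consecutive line endings) is still 
returned as an empty string.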

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
