[jira] [Commented] (MAPREDUCE-5777) Support utf-8 text with BOM (byte order marker)

bc Wong (JIRA) Wed, 28 May 2014 15:07:35 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011683#comment-14011683
 ]


bc Wong commented on MAPREDUCE-5777:
------------------------------------

Thanks for the patch, Zhihai.

{noformat}
        int newMaxLineLength = Integer.MAX_VALUE;
        if (maxLineLength < newMaxLineLength - 3) {
          newMaxLineLength = maxLineLength + 3;
        }
{noformat}
The above would be clearer as:
{noformat}
        int newMaxLineLength = (int) Math.min(3L + maxLineLength, Integer.MAX_VALUE);
{noformat}
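
For reference, here is a minimal standalone sketch of that clamp. The class and method names ({{FirstLineLimit}}, {{withBomAllowance}}) are made up for illustration and are not from the patch. The addition is done in {{long}} so it cannot overflow, and since {{Math.min}} on a {{long}} and an {{int}} returns a {{long}}, the result needs a cast back to {{int}}.

```java
// Hypothetical helper, not part of the patch: widens the first-line limit
// by the 3 bytes of a UTF-8 BOM without overflowing int.
final class FirstLineLimit {
    // A UTF-8 BOM is the 3-byte sequence EF BB BF.
    static final int BOM_LENGTH = 3;

    static int withBomAllowance(int maxLineLength) {
        // Add in long arithmetic so values near Integer.MAX_VALUE do not wrap,
        // then clamp to Integer.MAX_VALUE. The cast is safe after the clamp.
        return (int) Math.min((long) BOM_LENGTH + maxLineLength, Integer.MAX_VALUE);
    }
}
```

This gives the same results as the if-based version for every non-negative {{maxLineLength}}.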

Having said that, I think we shouldn't modify the read length for the first 
line. We should just use the existing {{maxLineLength}}:
* {{maxLineLength}} counts bytes, not UTF-8 characters. If a line gets longer 
or shorter because it's in English vs Klingon, then so be it. The BOM should 
simply count as 3 bytes of the first line. A byte-based limit is imprecise 
anyway.
* Most UTF-8 input files do not have a BOM. If we read 3 extra bytes for the 
first line, this could theoretically alter existing behaviour. I think the 
likelihood is small, but I want to be careful.
* If {{maxLineLength < 3}}, which never happens in practice, then it's OK not 
to check for the BOM.
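
To make the points above concrete, here is a rough sketch of what I have in mind. It is not Hadoop's {{LineReader}}; the class {{BomAwareFirstLine}} and its methods are hypothetical and work on an in-memory byte array purely for illustration. The limit is applied in bytes with the BOM included, the BOM is then dropped from the returned text, and no BOM check happens when fewer than 3 bytes were consumed.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: a hypothetical helper, not the actual Hadoop code.
final class BomAwareFirstLine {
    // The UTF-8 byte order mark.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // True if the first `length` bytes of `data` begin with the UTF-8 BOM.
    static boolean startsWithBom(byte[] data, int length) {
        if (length < UTF8_BOM.length) {
            return false; // too short to hold a BOM, so skip the check
        }
        for (int i = 0; i < UTF8_BOM.length; i++) {
            if (data[i] != UTF8_BOM[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns the first line of `data`, keeping at most `maxLineLength`
    // bytes. The BOM, if present, counts toward that limit and is then
    // stripped from the result.
    static String readFirstLine(byte[] data, int maxLineLength) {
        int end = 0;
        while (end < data.length && data[end] != '\n') {
            end++;
        }
        // The limit is in bytes and is not widened for the BOM.
        int consumed = Math.min(end, maxLineLength);
        int start = startsWithBom(data, consumed) ? UTF8_BOM.length : 0;
        return new String(data, start, consumed - start, StandardCharsets.UTF_8);
    }
}
```

With a limit of 5 and an input of BOM + "hello", this keeps 5 bytes, of which 3 are the BOM, so the returned text is "he". That is the "so be it" case: the first line is effectively 3 bytes shorter when a BOM is present.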

> Support utf-8 text with BOM (byte order marker)
> -----------------------------------------------
>
>                 Key: MAPREDUCE-5777
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5777
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 0.22.0, 2.2.0
>            Reporter: bc Wong
>            Assignee: zhihai xu
>         Attachments: MAPREDUCE-5777.patch
>
>
> UTF-8 text may have a BOM. TextInputFormat, KeyValueTextInputFormat and 
> friends should recognize the BOM and not treat it as actual data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
