[
https://issues.apache.org/jira/browse/MAPREDUCE-5777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011683#comment-14011683
]
bc Wong commented on MAPREDUCE-5777:
------------------------------------
Thanks for the patch, Zhihai.
{noformat}
int newMaxLineLength = Integer.MAX_VALUE;
if (maxLineLength < newMaxLineLength - 3) {
    newMaxLineLength = maxLineLength + 3;
}
{noformat}
The above would be clearer as:
{noformat}
int newMaxLineLength = (int) Math.min(3L + maxLineLength, Integer.MAX_VALUE);
{noformat}
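As a quick sanity check that the two forms agree, here is a minimal sketch that puts them side by side. The class and method names are hypothetical and not part of the patch. The one-line form does the addition in {{long}} so it cannot overflow, and it needs an explicit narrowing cast because {{Math.min(long, long)}} returns a {{long}}.

```java
// Hypothetical helper class, for illustration only; not part of the patch.
class MaxLineLengthClamp {
    // Branching form from the patch: add 3 unless that would overflow int.
    static int branching(int maxLineLength) {
        int newMaxLineLength = Integer.MAX_VALUE;
        if (maxLineLength < newMaxLineLength - 3) {
            newMaxLineLength = maxLineLength + 3;
        }
        return newMaxLineLength;
    }

    // One-line form: add in long arithmetic, clamp to Integer.MAX_VALUE,
    // then narrow back to int. The cast is safe because the clamped
    // result always fits in an int.
    static int clamped(int maxLineLength) {
        return (int) Math.min(3L + maxLineLength, Integer.MAX_VALUE);
    }
}
```

Both methods return {{maxLineLength + 3}} for ordinary values and saturate at {{Integer.MAX_VALUE}} once {{maxLineLength}} is within 3 of the limit.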
Having said that, I think we shouldn't modify the read length for the first
line. We should just use the existing {{maxLineLength}}:
* The {{maxLineLength}} counts bytes, not UTF-8 characters. If a line gets
longer or shorter because it's in English vs Klingon, then so be it. The BOM
should count toward the byte length of the first line. The byte length is
imprecise anyway.
* Most UTF-8 input files do not have a BOM. If we read 3 extra bytes for the
first line, this could in theory alter existing behaviour. I think the
likelihood is small, but I just want to be careful.
* If {{maxLineLength < 3}}, which never happens btw, then it's ok to not check
for a BOM.
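To make the suggestion concrete, here is a minimal sketch of checking for the BOM inside whatever was already read for the first line, with the read limit left unchanged. It is not the actual reader code: the class and method names are hypothetical, and it assumes the first line's bytes are already available in an array along with their valid length. The UTF-8 BOM is the byte sequence EF BB BF.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only; not the Hadoop implementation.
class Utf8BomSkipper {
    // UTF-8 encoding of U+FEFF (the byte order mark).
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Returns how many leading bytes to skip: 3 if the first `length` bytes
    // start with the BOM, otherwise 0. If fewer than 3 bytes were read
    // (for example because maxLineLength < 3), no check is made.
    static int bomLength(byte[] firstLine, int length) {
        if (length < UTF8_BOM.length) {
            return 0;
        }
        for (int i = 0; i < UTF8_BOM.length; i++) {
            if (firstLine[i] != UTF8_BOM[i]) {
                return 0;
            }
        }
        return UTF8_BOM.length;
    }

    // Returns the first line's bytes with any leading BOM removed.
    // The BOM bytes were still counted against the read limit.
    static byte[] stripBom(byte[] firstLine, int length) {
        int skip = bomLength(firstLine, length);
        return Arrays.copyOfRange(firstLine, skip, length);
    }
}
```

With this shape, files without a BOM see exactly the same bytes as before, and the BOM, when present, simply uses up 3 bytes of the first line's budget.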
> Support utf-8 text with BOM (byte order marker)
> -----------------------------------------------
>
> Key: MAPREDUCE-5777
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5777
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Affects Versions: 0.22.0, 2.2.0
> Reporter: bc Wong
> Assignee: zhihai xu
> Attachments: MAPREDUCE-5777.patch
>
>
> UTF-8 text may have a BOM. TextInputFormat, KeyValueTextInputFormat and
> friends should recognize the BOM and not treat it as actual data.