[jira] [Commented] (MAPREDUCE-5777) Support utf-8 text with BOM (byte order marker)

zhihai xu (JIRA) Thu, 29 May 2014 02:06:22 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012207#comment-14012207
 ]


zhihai xu commented on MAPREDUCE-5777:
--------------------------------------

BC, thanks, that is a great comment.

Yes, the suggested change below looks better:
      int newMaxLineLength = (int) Math.min(3L + maxLineLength, Integer.MAX_VALUE);
It is also good to know that maxLineLength < 3 never happens.
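
To illustrate why the suggested form is safer, here is a minimal sketch. The
class BomLimit and the method widenedLimit are hypothetical names used only for
this example; they are not part of Hadoop. The 3L literal forces the addition
to happen in long arithmetic, so the sum cannot wrap around before it is
clamped to Integer.MAX_VALUE.

```java
// Hypothetical helper for illustration only; not part of Hadoop.
final class BomLimit {
    // Widen the line-length limit by the 3 BOM bytes, clamped to the int range.
    static int widenedLimit(int maxLineLength) {
        // 3L makes the sum a long, so it cannot overflow before Math.min runs.
        // Math.min(long, long) returns a long, hence the explicit cast.
        return (int) Math.min(3L + maxLineLength, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        System.out.println(widenedLimit(100));               // 103
        System.out.println(widenedLimit(Integer.MAX_VALUE)); // 2147483647
        // Plain int arithmetic wraps around instead:
        System.out.println(3 + Integer.MAX_VALUE);           // -2147483646
    }
}
```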

I want to discuss the other two points:

First, whether the BOM should be counted in the number of bytes of the first 
line.
It looks like these 3 UTF-8 BOM bytes were added to the original document; they 
did not belong to it.
A BOM carries no byte-order information in UTF-8. Many programs on Microsoft 
Windows, such as Notepad, add a BOM at the start when saving text as UTF-8. 
Google Docs adds a BOM when a Microsoft Word document is downloaded as a plain 
text file.
The Google Data API has a UnicodeReader that skips the BOM.
I slightly prefer not to count it in the number of bytes of the first line, 
because we are trying to strip the BOM (treat the file the same as one with no 
BOM).
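
As a rough sketch of what "treat it the same as no BOM" means, the following
checks for the three UTF-8 BOM bytes (EF BB BF) at the start of a line and
skips them before decoding. The class BomStrip and its methods are hypothetical
names for this example and are not taken from the patch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not the code in the patch.
final class BomStrip {
    static final int BOM_LENGTH = 3;

    // True if the buffer starts with the UTF-8 BOM bytes EF BB BF.
    static boolean startsWithBom(byte[] buf) {
        return buf.length >= BOM_LENGTH
            && buf[0] == (byte) 0xEF
            && buf[1] == (byte) 0xBB
            && buf[2] == (byte) 0xBF;
    }

    // Decode a line as UTF-8, skipping a leading BOM if one is present.
    static String decodeFirstLine(byte[] line) {
        int offset = startsWithBom(line) ? BOM_LENGTH : 0;
        return new String(line, offset, line.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] plain = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] withBom = new byte[BOM_LENGTH + plain.length];
        withBom[0] = (byte) 0xEF;
        withBom[1] = (byte) 0xBB;
        withBom[2] = (byte) 0xBF;
        System.arraycopy(plain, 0, withBom, BOM_LENGTH, plain.length);

        System.out.println(decodeFirstLine(plain));   // hello
        System.out.println(decodeFirstLine(withBom)); // hello
    }
}
```

With this approach both inputs yield the same 5-character value, so the BOM
does not show up in the record or in its length.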

Second, if we read 3 extra bytes for the first line, this could theoretically 
alter existing behavior.
Originally I also thought this would be a problem, but then I found the 
following logic in the code:
If the size returned from readLine is no less than maxLineLength, we discard 
the current line and read the next line.
Also, readLine moves the file pointer to the next line, copies up to 
newMaxLineLength bytes into the Text buffer, and returns the real line length 
(refer to the readLine implementation):

     newSize = in.readLine(value, newMaxLineLength,
            Math.max(maxBytesToConsume(pos), newMaxLineLength));
     newSize -= 3; // if a BOM was found
     if (newSize < maxLineLength) {
        return true;
     }
Based on this logic, setting newMaxLineLength larger than the original 
maxLineLength in readLine won't alter existing behavior, because for any line 
we accept, newSize is smaller than maxLineLength and the number of bytes copied 
to the Text buffer is never more than newSize.
I should add a comment in the code to clarify this.
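
The argument above can be checked with a small simulation. This is only a
sketch under stated assumptions: readLine here is a simplified stand-in that
follows the description in this comment (it always advances past the newline
and copies at most the given limit), not Hadoop's actual implementation, and
all class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model for illustration only; not Hadoop's LineRecordReader.
final class BomLineLimitSketch {
    static final int BOM_LENGTH = 3;

    // Result of one simplified readLine call.
    static final class Line {
        final byte[] copied; // at most `limit` bytes of the line, newline excluded
        final int consumed;  // bytes consumed from the input, newline included
        Line(byte[] copied, int consumed) { this.copied = copied; this.consumed = consumed; }
    }

    // Always moves past the newline, but copies no more than `limit` bytes.
    static Line readLine(byte[] data, int pos, int limit) {
        int end = pos;
        while (end < data.length && data[end] != '\n') end++;
        int lineLen = end - pos;
        int consumed = lineLen + (end < data.length ? 1 : 0);
        return new Line(Arrays.copyOfRange(data, pos, pos + Math.min(lineLen, limit)), consumed);
    }

    static boolean hasBom(byte[] b) {
        return b.length >= BOM_LENGTH
            && b[0] == (byte) 0xEF && b[1] == (byte) 0xBB && b[2] == (byte) 0xBF;
    }

    static byte[] prependBom(byte[] data) {
        byte[] out = new byte[BOM_LENGTH + data.length];
        out[0] = (byte) 0xEF; out[1] = (byte) 0xBB; out[2] = (byte) 0xBF;
        System.arraycopy(data, 0, out, BOM_LENGTH, data.length);
        return out;
    }

    // Returns the accepted lines; lines whose size is >= maxLineLength are discarded.
    static List<String> readAll(byte[] data, int maxLineLength) {
        List<String> accepted = new ArrayList<>();
        int pos = 0;
        boolean first = true;
        while (pos < data.length) {
            // Only the first line is read with the widened limit.
            int limit = first ? (int) Math.min(3L + maxLineLength, Integer.MAX_VALUE)
                              : maxLineLength;
            Line line = readLine(data, pos, limit);
            pos += line.consumed;
            int newSize = line.consumed;
            byte[] text = line.copied;
            if (first && hasBom(text)) {
                newSize -= BOM_LENGTH; // do not count the BOM toward the line size
                text = Arrays.copyOfRange(text, BOM_LENGTH, text.length);
            }
            first = false;
            if (newSize < maxLineLength) {
                accepted.add(new String(text, StandardCharsets.UTF_8));
            }
            // Otherwise the line is too long: drop it and read the next one.
        }
        return accepted;
    }

    public static void main(String[] args) {
        byte[] plain = "abc\ntoolongline\nok\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(plain, 5));             // [abc, ok]
        System.out.println(readAll(prependBom(plain), 5)); // [abc, ok]
    }
}
```

In this model an accepted line always has newSize < maxLineLength, so the
widened limit never truncates it, and the input with a BOM produces the same
records as the input without one.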

> Support utf-8 text with BOM (byte order marker)
> -----------------------------------------------
>
>                 Key: MAPREDUCE-5777
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5777
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 0.22.0, 2.2.0
>            Reporter: bc Wong
>            Assignee: zhihai xu
>         Attachments: MAPREDUCE-5777.patch
>
>
> UTF-8 text may have a BOM. TextInputFormat, KeyValueTextInputFormat and 
> friends should recognize the BOM and not treat it as actual data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
