[ 
https://issues.apache.org/jira/browse/FLINK-10134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626693#comment-16626693
 ] 

ASF GitHub Bot commented on FLINK-10134:
----------------------------------------

XuQianJin-Stars edited a comment on issue #6710: [FLINK-10134] UTF-16 support 
for TextInputFormat bug fixed
URL: https://github.com/apache/flink/pull/6710#issuecomment-424188954
 
 
   > I think we need to revisit the way this fix works. At this point, it 
probably makes sense to close the PR and go back to a design discussion first. 
I would suggest to have that design discussion on the JIRA issue first.
   > 
   > (1) Where should the BOM be read? Beginning of the file only? Then it 
should most likely be part of split generation.
   > 
   > (2) Any design cannot have additional logic in the method that is called 
per record, otherwise it will have too much performance impact.
   
   @StephanEwen Thank you for your suggestion. I close the PR first. Then 
discuss how to fix this bug.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> UTF-16 support for TextInputFormat
> ----------------------------------
>
>                 Key: FLINK-10134
>                 URL: https://issues.apache.org/jira/browse/FLINK-10134
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.4.2
>            Reporter: David Dreyfus
>            Priority: Blocker
>              Labels: pull-request-available
>
> It does not appear that Flink supports a charset encoding of "UTF-16". It 
> particular, it doesn't appear that Flink consumes the Byte Order Mark (BOM) 
> to establish whether a UTF-16 file is UTF-16LE or UTF-16BE.
>  
> TextInputFormat.setCharset("UTF-16") calls DelimitedInputFormat.setCharset(), 
> which sets TextInputFormat.charsetName and then modifies the previously set 
> delimiterString to construct the proper byte string encoding of the the 
> delimiter. This same charsetName is also used in TextInputFormat.readRecord() 
> to interpret the bytes read from the file.
>  
> There are two problems that this implementation would seem to have when using 
> UTF-16.
>  # delimiterString.getBytes(getCharset()) in DelimitedInputFormat.java will 
> return a Big Endian byte sequence including the Byte Order Mark (BOM). The 
> actual text file will not contain a BOM at each line ending, so the delimiter 
> will never be read. Moreover, if the actual byte encoding of the file is 
> Little Endian, the bytes will be interpreted incorrectly.
>  # TextInputFormat.readRecord() will not see a BOM each time it decodes a 
> byte sequence with the String(bytes, offset, numBytes, charset) call. 
> Therefore, it will assume Big Endian, which may not always be correct. [1] 
> [https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/io/TextInputFormat.java#L95]
>  
> While there are likely many solutions, I would think that all of them would 
> have to start by reading the BOM from the file when a Split is opened and 
> then using that BOM to modify the specified encoding to a BOM specific one 
> when the caller doesn't specify one, and to overwrite the caller's 
> specification if the BOM is in conflict with the caller's specification. That 
> is, if the BOM indicates Little Endian and the caller indicates UTF-16BE, 
> Flink should rewrite the charsetName as UTF-16LE.
>  I hope this makes sense and that I haven't been testing incorrectly or 
> misreading the code.
>  
> I've verified the problem on version 1.4.2. I believe the problem exists on 
> all versions. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to