@StephanEwen Hello, regarding the two questions you raised yesterday, I have
some opinions of my own, though I am not sure if they are right.
1. Where should the BOM be read? I think we still need to add logic that
processes the BOM when the file is first opened, at the very beginning of the
file. We can add an attribute that records the file's BOM encoding as part of
the BOM-reading logic. For example, this could go in the `createInputSplits`
function.
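To make point 1 concrete, here is a minimal sketch of how the BOM could be sniffed from the first bytes of a file before splits are created. Everything here (the class and method names, the fallback parameter) is hypothetical and not part of Flink's API; it only illustrates the detection order, where the 4-byte UTF-32LE signature must be tested before the 2-byte UTF-16LE one because both start with `FF FE`:

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical helper: detect a BOM from the first bytes of a file.
class BomSniffer {

    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
    private static final byte[] UTF32_BE = {0x00, 0x00, (byte) 0xFE, (byte) 0xFF};
    private static final byte[] UTF32_LE = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x00};
    private static final byte[] UTF16_BE = {(byte) 0xFE, (byte) 0xFF};
    private static final byte[] UTF16_LE = {(byte) 0xFF, (byte) 0xFE};

    /** Returns the charset indicated by the BOM, or {@code fallback} if no BOM is found. */
    static Charset detect(byte[] head, Charset fallback) {
        // UTF-32LE must be checked before UTF-16LE: its BOM also begins with FF FE.
        if (startsWith(head, UTF32_BE)) return Charset.forName("UTF-32BE");
        if (startsWith(head, UTF32_LE)) return Charset.forName("UTF-32LE");
        if (startsWith(head, UTF8_BOM)) return Charset.forName("UTF-8");
        if (startsWith(head, UTF16_BE)) return Charset.forName("UTF-16BE");
        if (startsWith(head, UTF16_LE)) return Charset.forName("UTF-16LE");
        return fallback;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOfRange(data, 0, prefix.length), prefix);
    }
}
```

The detected charset (and the BOM length to skip) could then be stored on the input format when the splits are created, so later reads do not have to re-detect it.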
2. Regarding the second performance problem: we can use the previously
detected BOM to distinguish UTF-8 with BOM, UTF-16 with BOM, and UTF-32 with
BOM, and use it to control the byte step size when processing the end of each
line. I found that the earlier garbled-output bug was actually an encoding
problem, partly caused by improper handling of the trailing bytes of each
line. I have done the following for this problem:
```java
String utf8 = "UTF-8";
String utf16 = "UTF-16";
String utf32 = "UTF-32";
int stepSize = 0;
String charsetName = this.getCharsetName();
if (charsetName.contains(utf8)) {
    stepSize = 1;
} else if (charsetName.contains(utf16)) {
    stepSize = 2;
} else if (charsetName.contains(utf32)) {
    stepSize = 4;
}
// Check if \n is used as delimiter and the end of this line is a \r,
// then remove the \r from the line (stepSize bytes, since one character
// spans stepSize bytes in this encoding).
if (this.getDelimiter() != null && this.getDelimiter().length == 1
        && this.getDelimiter()[0] == NEW_LINE && offset + numBytes >= stepSize
        && bytes[offset + numBytes - stepSize] == CARRIAGE_RETURN) {
    numBytes -= stepSize;
}
numBytes = numBytes - stepSize + 1;
return new String(bytes, offset, numBytes, this.getCharsetName());
```
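The trailing-`\r` part of the idea above can be demonstrated in isolation. The following is a minimal, self-contained sketch (the class, method, and parameter names are mine, not from the PR): in UTF-16 every code unit, including `\r`, occupies two bytes, so the trim must remove `stepSize` bytes rather than one. Note that checking a single byte at `numBytes - stepSize` works for UTF-16LE, where the `0x0D` low byte comes first in the final code unit; UTF-16BE would need the check adjusted.

```java
// Illustrative helper: strip a trailing '\r' that spans stepSize bytes.
class LineTrimDemo {
    private static final byte CARRIAGE_RETURN = 0x0D;

    static String trimLine(byte[] bytes, int offset, int numBytes,
                           String charsetName, int stepSize) throws Exception {
        // In UTF-16LE the 0x0D byte of '\r' is the first byte of the last code unit.
        if (numBytes >= stepSize
                && bytes[offset + numBytes - stepSize] == CARRIAGE_RETURN) {
            numBytes -= stepSize;  // drop the whole multi-byte '\r'
        }
        return new String(bytes, offset, numBytes, charsetName);
    }
}
```

For example, the UTF-16LE bytes of `"abc\r"` are `61 00 62 00 63 00 0D 00`; with `stepSize = 2` the helper drops the last two bytes and decodes `"abc"`, whereas removing only one byte would leave a stray `0x0D` and garble the decoded string.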
These are some of my own ideas. I hope you can give some better suggestions
so we can handle this JIRA issue better. Thank you.
[ Full content available at: https://github.com/apache/flink/pull/6710 ]
This message was relayed via gitbox.apache.org for [email protected]