fhueske commented on a change in pull request #6710: [FLINK-10134] UTF-16
support for TextInputFormat bug fixed
URL: https://github.com/apache/flink/pull/6710#discussion_r221932785
##########
File path:
flink-java/src/main/java/org/apache/flink/api/java/io/TextInputFormat.java
##########
@@ -85,14 +90,26 @@ public void configure(Configuration parameters) {
@Override
public String readRecord(String reusable, byte[] bytes, int offset, int numBytes) throws IOException {
+ String utf8 = "UTF-8";
+ String utf16 = "UTF-16";
+ String utf32 = "UTF-32";
+ int stepSize = 0;
+ String charsetName = this.getCharsetName();
+ if (charsetName.contains(utf8)) {
+ stepSize = 1;
+ } else if (charsetName.contains(utf16)) {
+ stepSize = 2;
+ } else if (charsetName.contains(utf32)) {
+ stepSize = 4;
+ }
//Check if \n is used as delimiter and the end of this line is a \r, then remove \r from the line
if (this.getDelimiter() != null && this.getDelimiter().length == 1
- && this.getDelimiter()[0] == NEW_LINE && offset + numBytes >= 1
- && bytes[offset + numBytes - 1] == CARRIAGE_RETURN){
- numBytes -= 1;
+ && this.getDelimiter()[0] == NEW_LINE && offset + numBytes >= stepSize
Review comment:
We only check the first byte of a character. Are these checks actually
compatible with all encodings (LE and BE)?
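To illustrate the concern, here is a minimal sketch, not the PR's code: in UTF-16BE a carriage return is encoded as 0x00 0x0D, while in UTF-16LE it is 0x0D 0x00, so comparing only the first byte of the last character against CARRIAGE_RETURN matches in LE but not in BE. The class name and the helper `endsWithCarriageReturn` are hypothetical. The sketch assumes a charset with explicit endianness, because Java's plain "UTF-16" encoder prepends a byte order mark.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CarriageReturnCheck {

    // Hypothetical helper (not part of TextInputFormat): returns true if the
    // record ends with the full encoded form of '\r' in the given charset.
    // Assumes a BOM-free charset such as UTF-8, UTF-16BE, or UTF-16LE.
    static boolean endsWithCarriageReturn(byte[] bytes, int offset, int numBytes, Charset charset) {
        byte[] cr = "\r".getBytes(charset);
        if (numBytes < cr.length) {
            return false;
        }
        int start = offset + numBytes - cr.length;
        // Compare every byte of the encoded character, not just the first one.
        for (int i = 0; i < cr.length; i++) {
            if (bytes[start + i] != cr[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] be = "abc\r".getBytes(StandardCharsets.UTF_16BE);
        byte[] le = "abc\r".getBytes(StandardCharsets.UTF_16LE);

        // First byte of the last 2-byte character: 0x00 in BE, 0x0D in LE.
        System.out.println("BE first byte is CR: " + (be[be.length - 2] == '\r'));
        System.out.println("LE first byte is CR: " + (le[le.length - 2] == '\r'));

        // Comparing the whole encoded character works for both byte orders.
        System.out.println("BE full match: "
                + endsWithCarriageReturn(be, 0, be.length, StandardCharsets.UTF_16BE));
        System.out.println("LE full match: "
                + endsWithCarriageReturn(le, 0, le.length, StandardCharsets.UTF_16LE));
    }
}
```

One possible fix along these lines would be to encode the carriage return once with the configured charset and compare all of its bytes; the matched length could then also serve as the amount to subtract from numBytes.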
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services