fhueske commented on a change in pull request #6823: [FLINK-10134] UTF-16
support for TextInputFormat bug refixed
URL: https://github.com/apache/flink/pull/6823#discussion_r226574036
##########
File path:
flink-core/src/main/java/org/apache/flink/api/common/io/DelimitedInputFormat.java
##########
@@ -472,6 +498,7 @@ public void open(FileInputSplit split) throws IOException {
this.offset = splitStart;
if (this.splitStart != 0) {
+ setBomFileCharset(split);
Review comment:
I suggest that we only do the BOM check if the user did not configure a
charset or explicitly configured a UTF charset. If the user explicitly
configured a different charset, we should not check for a BOM, because those
leading bytes could also be valid data.
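To make the rule concrete, here is a minimal sketch of the decision and of the
BOM detection itself. `BomPolicy`, `shouldCheckBom`, and `detectBom` are
hypothetical names for illustration, not existing Flink code, and I am assuming
that "not configured" is represented as `null`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of Flink.
final class BomPolicy {

    // Check for a BOM only if no charset was configured (assumed to be null here)
    // or if the configured charset is one of the UTF charsets.
    static boolean shouldCheckBom(String configuredCharset) {
        if (configuredCharset == null) {
            return true;
        }
        return configuredCharset.toUpperCase(Locale.ROOT).startsWith("UTF");
    }

    // Returns the charset indicated by a leading BOM, or null if there is none.
    // The UTF-32 patterns are tested first because the UTF-32LE BOM (FF FE 00 00)
    // begins with the UTF-16LE BOM (FF FE).
    static Charset detectBom(byte[] head) {
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) {
            return Charset.forName("UTF-32BE");
        }
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) {
            return Charset.forName("UTF-32LE");
        }
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With a rule like this, a file read as `ISO-8859-1` that happens to start with
the bytes `FF FE` is left untouched, since those bytes are valid data in that
charset.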
`InputStreamFSInputWrapper`'s limited seek support makes it a bit tricky to
read the first bytes of a file, since the stream is opened and positioned at
the split start in `FileInputFormat.open()`. Maybe we can implement a protected
method `FileInputFormat.readFileHeader()` that is called after the stream is
opened. `DelimitedInputFormat` can then override the method to check for the
BOM. By default, the method should not do anything.
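Roughly what I have in mind, as a self-contained sketch. The `Sketch*` classes
are hypothetical stand-ins, not the real `FileInputFormat` and
`DelimitedInputFormat`, and the hook signature is an assumption. The sketch
also assumes a stream that supports `mark`/`reset`, which the real wrapper may
not, so the actual implementation would need another way to return to the
split start.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for FileInputFormat.
class SketchFileInputFormat {
    protected InputStream stream;

    public void open(byte[] fileContent, long splitStart) throws IOException {
        stream = new BufferedInputStream(new ByteArrayInputStream(fileContent));
        // Hook: runs once the stream is open and before moving to the split start.
        readFileHeader(stream);
        long toSkip = splitStart;
        while (toSkip > 0) {
            long skipped = stream.skip(toSkip);
            if (skipped <= 0) {
                break; // end of stream reached
            }
            toSkip -= skipped;
        }
    }

    // Does nothing by default; subclasses may inspect the first bytes of the file.
    protected void readFileHeader(InputStream in) throws IOException {
    }
}

// Hypothetical stand-in for DelimitedInputFormat.
class SketchDelimitedInputFormat extends SketchFileInputFormat {
    Charset bomCharset; // stays null if no BOM was found

    @Override
    protected void readFileHeader(InputStream in) throws IOException {
        byte[] head = new byte[3];
        in.mark(head.length);
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // file is shorter than three bytes
            }
            n += r;
        }
        in.reset(); // leave the stream position unchanged for the caller

        // Only UTF-8 and UTF-16 are handled here; UTF-32 is left out for brevity.
        int b0 = head[0] & 0xFF;
        int b1 = head[1] & 0xFF;
        int b2 = head[2] & 0xFF;
        if (n >= 3 && b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) {
            bomCharset = StandardCharsets.UTF_8;
        } else if (n >= 2 && b0 == 0xFE && b1 == 0xFF) {
            bomCharset = StandardCharsets.UTF_16BE;
        } else if (n >= 2 && b0 == 0xFF && b1 == 0xFE) {
            bomCharset = StandardCharsets.UTF_16LE;
        }
    }
}
```

This way a split that does not start at offset 0 still learns the byte order
from the beginning of the file, and formats that do not care about a BOM pay
nothing because the default hook is empty.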
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services