fhueske commented on a change in pull request #6710: [FLINK-10134] UTF-16
support for TextInputFormat bug fixed
URL: https://github.com/apache/flink/pull/6710#discussion_r221890642
##########
File path:
flink-core/src/main/java/org/apache/flink/api/common/io/FileInputFormat.java
##########
@@ -601,41 +602,44 @@ public LocatableInputSplitAssigner
getInputSplitAssigner(FileInputSplit[] splits
if (unsplittable) {
int splitNum = 0;
for (final FileStatus file : files) {
+ String bomCharsetName = getBomCharset(file);
Review comment:
I'm not sure we want to check the BOM during split generation.
1. It might become a bottleneck, since splits are generated by the
JobManager. On the other hand, there is currently an effort to parallelize
split generation.
2. `FileInputFormat` does not currently handle any charset concerns.
An alternative would be to check the BOM in `DelimitedInputFormat` when a
split is opened.
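To make the alternative concrete, here is a rough sketch of BOM detection that could run on the first bytes read when a split is opened. `BomSniffer` and `detect` are hypothetical names used only for illustration; they are not existing Flink classes or methods, and how the result would be wired into `DelimitedInputFormat` is left open.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of Flink): maps a leading BOM to a charset. */
final class BomSniffer {

    private BomSniffer() {}

    /**
     * Returns the charset indicated by a byte order mark at the start of
     * {@code head}, or {@code fallback} if no known BOM is present.
     */
    static Charset detect(byte[] head, Charset fallback) {
        // Check UTF-32 before UTF-16: the UTF-32LE BOM (FF FE 00 00)
        // begins with the UTF-16LE BOM (FF FE).
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) {
            return Charset.forName("UTF-32BE");
        }
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) {
            return Charset.forName("UTF-32LE");
        }
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return fallback;
    }

    /** True if {@code data} begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

One caveat with this approach: only a split that starts at file offset 0 contains the BOM, so splits further into the same file would presumably need to read the first few bytes of the file separately.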