XuQianJin-Stars commented on a change in pull request #6710: [FLINK-10134] UTF-16 support for TextInputFormat bug fixed
URL: https://github.com/apache/flink/pull/6710#discussion_r223259155
 
 

 ##########
 File path: flink-core/src/main/java/org/apache/flink/api/common/io/FileInputFormat.java
 ##########
 @@ -601,41 +602,44 @@ public LocatableInputSplitAssigner getInputSplitAssigner(FileInputSplit[] splits
                if (unsplittable) {
                        int splitNum = 0;
                        for (final FileStatus file : files) {
+                               String bomCharsetName = getBomCharset(file);
 
 Review comment:
   > I'm not sure if we want to check the BOM during split generation.
   > 
   > 1. This might become a bottleneck, since splits are generated by the JobManager. OTOH, there is currently an effort to parallelize split generation.
   > 2. FileInputFormat is currently not handling any charset issues.
   > 
   > An alternative would be to check the BOM in `DelimitedInputFormat` when a split is opened.
   
   @fhueske Hi, if the BOM is checked in `DelimitedInputFormat` when a split is opened, I think the following should be considered:
   1. A file may be divided into splits that are processed on different TaskManagers, so each TaskManager would have to read the BOM at the start of the file to determine its charset.
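
   To make the point concrete, here is a minimal sketch of what such a per-split check could look like. It is not the code in this PR: `BomSniffer`, `detect` and `readHead` are hypothetical names, and it reads from a plain `InputStream` instead of Flink's file system classes. It only covers UTF-8 and UTF-16; UTF-32 is left out, so a UTF-32LE file (`FF FE 00 00`) would be misread as UTF-16LE.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only (not part of Flink).
final class BomSniffer {

    /** Returns the charset indicated by a leading BOM, or the fallback if none is found. */
    static Charset detect(byte[] head, Charset fallback) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            // Assumption: UTF-32LE is not expected, see the note above.
            return StandardCharsets.UTF_16LE;
        }
        return fallback;
    }

    /**
     * Reads up to 3 bytes from the beginning of the file. The stream must be
     * positioned at file offset 0, not at the start offset of the split.
     */
    static byte[] readHead(InputStream fileStart) throws IOException {
        byte[] buf = new byte[3];
        int total = 0;
        while (total < buf.length) {
            int n = fileStart.read(buf, total, buf.length - total);
            if (n < 0) {
                break; // file is shorter than 3 bytes
            }
            total += n;
        }
        return Arrays.copyOf(buf, total);
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

   With this approach, a reader whose split starts at an offset greater than 0 has to open the file a second time at offset 0 just to call `readHead`, and that extra read happens once per split on every TaskManager.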

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
