[ https://issues.apache.org/jira/browse/FLINK-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369322#comment-16369322 ]

ASF GitHub Bot commented on FLINK-8599:
---------------------------------------

Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5521#discussion_r169132404
  
    --- Diff: 
flink-core/src/main/java/org/apache/flink/api/common/io/FileInputFormat.java ---
    @@ -691,6 +691,12 @@ public void open(FileInputSplit fileSplit) throws 
IOException {
                        LOG.debug("Opening input split " + fileSplit.getPath() 
+ " [" + this.splitStart + "," + this.splitLength + "]");
                }
     
    +           if (!exists(fileSplit.getPath())) {
    --- End diff --
    
    you are doubling the number of checks for file existence here, which, when 
working with S3, implies three more HTTP requests, which take time and cost 
money. Better to do the open() call and catch FileNotFoundException, which all 
filesystems are required to throw when given a path that doesn't resolve to a 
file.
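
    The pattern described above — attempt the open directly and treat 
FileNotFoundException as "file missing", instead of probing with a separate 
exists() call — can be sketched with plain java.io. This is an illustrative 
sketch only; it does not use Flink's FileSystem API, and the helper name 
`openOrSkip` is invented for the example:

```java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

public class OpenOrSkip {

    // Try to open the path directly; a FileNotFoundException means the file
    // vanished or never existed, so the caller can skip this split. This
    // avoids a separate existence probe, which on object stores like S3
    // costs extra HTTP round trips.
    static boolean openOrSkip(String path) throws IOException {
        try (FileInputStream in = new FileInputStream(path)) {
            return true; // opened successfully; caller would read the split
        } catch (FileNotFoundException e) {
            return false; // missing file: ignore the split, keep processing
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(openOrSkip("/no/such/file"));
    }
}
```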


> Improve the failure behavior of the FileInputFormat for bad files
> -----------------------------------------------------------------
>
>                 Key: FLINK-8599
>                 URL: https://issues.apache.org/jira/browse/FLINK-8599
>             Project: Flink
>          Issue Type: New Feature
>          Components: DataStream API
>    Affects Versions: 1.4.0, 1.3.2
>            Reporter: Chengzhi Zhao
>            Priority: Major
>
> We have an S3 path that Flink is monitoring for newly available files.
> {code:java}
> val avroInputStream_activity = env.readFile(format, path, 
>   FileProcessingMode.PROCESS_CONTINUOUSLY, 10000)
> {code}
>  
> I am doing both internal and external checkpointing. Let's say a bad file 
> (for example, one with a different schema) is dropped into this folder; 
> Flink will do several retries. I want to set those bad files aside and let 
> the process continue. However, since the file path persists in the 
> checkpoint, when I try to resume from the external checkpoint, it throws 
> the following error because the file can no longer be found.
>  
> {code:java}
> java.io.IOException: Error opening the Input Split s3a://myfile [0,904]: No 
> such file or directory: s3a://myfile{code}
>  
> As [~fhue...@gmail.com] suggested, we could check whether a path exists 
> before trying to read a file, and ignore the input split instead of 
> throwing an exception and causing a failure.
>  
> Also, I am thinking about adding an error output for bad files as an option 
> for users. If any bad files exist, we could move them to a separate path 
> and do further analysis on them. 
>  
> Not sure how people feel about it, but I'd like to contribute if people 
> think this would be an improvement. 
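
The "move bad files to a separate path" idea from the description could look 
roughly like the sketch below, using java.nio only for illustration. The 
helper name `quarantine` and the error-directory layout are assumptions made 
for this example, not part of Flink's API or the proposed patch:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class QuarantineBadFile {

    // Hypothetical helper: move a file that failed processing into an error
    // directory, so the pipeline can continue and the file can be inspected
    // later instead of causing repeated retries.
    static Path quarantine(Path badFile, Path errorDir) throws IOException {
        Files.createDirectories(errorDir);
        return Files.move(badFile, errorDir.resolve(badFile.getFileName()),
                StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("demo");
        Path bad = Files.createFile(tmp.resolve("bad.avro"));
        Path moved = quarantine(bad, tmp.resolve("errors"));
        // the bad file is gone from the input path and parked under errors/
        System.out.println(Files.exists(moved) && !Files.exists(bad));
    }
}
```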



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
