mikedias commented on issue #23782: [SPARK-26875][SQL] Add an option on 
FileStreamSource to include modified files
URL: https://github.com/apache/spark/pull/23782#issuecomment-463862678
 
 
   Maybe it is not clear that the patch does not change any processing behavior; it only adds an option to the check "is this file new and should it be included in the micro-batch?"
   
   This is the current `FileStreamSource` workflow:
    1. When it starts, it reads the checkpoint data to get the list of previously processed files and puts them in a map called `seenFiles`, where the key is the filename and the value is its timestamp.
    2. For each micro-batch, it lists all files in the directory and checks whether the `seenFiles` map contains the filename to determine if it is a new file. This is where I'm proposing the change.
    3. With the list of new files, it creates a `DataSource` instance that handles the correct file format and codecs and sets up a `Dataset`. These are the classes responsible for reading the files; nothing changes here.
    4. The micro-batch gets the `Dataset` and executes it. When it finishes, it updates the checkpoint data with the processed filenames, then goes back to step 2.
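   The check in step 2 can be sketched as follows. This is an illustrative, self-contained sketch, not the actual Spark code: the function name `is_new_file` and the plain dict standing in for the `seenFiles` map are assumptions for illustration.

   ```python
   def is_new_file(seen_files, path, mod_time, include_modified_files):
       """Decide whether a listed file belongs in the next micro-batch.

       seen_files: dict of filename -> timestamp recorded when the file was
       last processed (a simplified stand-in for the `seenFiles` map).
       """
       if path not in seen_files:
           return True  # never processed: always a new file
       if include_modified_files:
           # Proposed behavior: re-include the file if it was modified
           # after the timestamp we recorded for it.
           return mod_time > seen_files[path]
       return False  # current behavior: filename already seen, skip it
   ```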
   
   Answering your question:
   > How would you deal with the simplest situation: Spark is archiving and 
producer is uploading the same?
    - `includeModifiedFiles=true`: the file content will be processed in the 
next micro-batch.
    - `includeModifiedFiles=false`: the file content will be ignored.
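   That scenario can be simulated with a small self-contained sketch (assumed names, not Spark code): a file is processed, then the producer uploads the same filename with a newer modification timestamp.

   ```python
   def select_batch(seen_files, listing, include_modified_files):
       """Pick files for the next micro-batch from a directory listing
       (dict of filename -> modification timestamp)."""
       batch = []
       for path, mod_time in listing.items():
           seen = seen_files.get(path)
           if seen is None or (include_modified_files and mod_time > seen):
               batch.append(path)
       return batch

   seen = {"events.json": 100}     # already processed at mod time 100
   listing = {"events.json": 250}  # producer re-uploaded the same name later

   select_batch(seen, listing, include_modified_files=True)   # included again
   select_batch(seen, listing, include_modified_files=False)  # ignored
   ```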
   
   Regarding possible race conditions, nothing changes: Spark will handle them using the `DataSource`'s existing mechanisms.
   
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 