mikedias commented on issue #23782: [SPARK-26875][SQL] Add an option on FileStreamSource to include modified files
URL: https://github.com/apache/spark/pull/23782#issuecomment-463862678

Maybe it is not clear that the patch does not change any processing behavior; it only adds an option that affects the answer to "is this file new and should it be included in the micro-batch?"

This is the current `FileStreamSource` workflow:

1. When it starts, it reads the checkpoint data to get the list of previously processed files and puts them in a map called `seenFiles`, where the key is the filename and the value is a timestamp.
2. For each micro-batch, it lists all files in the directory and checks whether the `seenFiles` map contains the filename to determine if the file is new or not. This is where I'm proposing the change (see the sketch below).
3. With the list of new files, it creates a `DataSource` instance that handles the correct file format and codecs and sets up a `Dataset`. These are the classes responsible for reading the files; nothing changes here.
4. The micro-batch gets the `Dataset` and executes it. When it finishes, it updates the checkpoint data with the processed filenames. Then it goes back to step 1.

Answering your question:

> How would you deal with the simplest situation: Spark is archiving and producer is uploading the same?

- `includeModifiedFiles=true`: the file content will be processed in the next micro-batch.
- `includeModifiedFiles=false`: the file content will be ignored.

Regarding possible race conditions, nothing changes: Spark will handle them with the `DataSource`'s current mechanisms.
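To make the decision in step 2 concrete, here is a minimal sketch of the "is this file new?" check. This is not the actual `FileStreamSource` code; the `SeenFiles` class, the `isNewFile`/`markSeen` names, and the constructor flag are illustrative stand-ins for how the option could change the comparison against the recorded timestamp:

```scala
import scala.collection.mutable

// Simplified sketch of the per-batch file filtering described in step 2.
// The map mirrors `seenFiles`: key = filename, value = last seen modification timestamp.
class SeenFiles(includeModifiedFiles: Boolean) {
  private val seen = mutable.Map.empty[String, Long]

  /** Decide whether a listed file should be part of the next micro-batch. */
  def isNewFile(path: String, modTimestamp: Long): Boolean = seen.get(path) match {
    case None                                   => true                   // never processed before
    case Some(prevTs) if includeModifiedFiles   => modTimestamp > prevTs  // re-process if modified since last seen
    case Some(_)                                => false                  // already seen, skip
  }

  /** Called after the micro-batch commits, when the checkpoint data is updated (step 4). */
  def markSeen(path: String, modTimestamp: Long): Unit = seen(path) = modTimestamp
}
```

With `includeModifiedFiles=false` the behavior matches today's: membership in the map alone decides, so a re-uploaded file is skipped; with `includeModifiedFiles=true` the timestamp comparison lets the modified file through to the next micro-batch.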
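On the user side, the option would presumably be set like any other file-source option. A hypothetical end-to-end example, assuming the option name from this PR (`includeModifiedFiles`), which is not part of released Spark:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object IncludeModifiedFilesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("include-modified-files").getOrCreate()

    // File streaming sources need an explicit schema.
    val schema = StructType(Seq(StructField("id", StringType), StructField("payload", StringType)))

    // Hypothetical usage: `includeModifiedFiles` is the option proposed in this PR,
    // shown here for illustration only.
    val stream = spark.readStream
      .schema(schema)
      .format("json")
      .option("includeModifiedFiles", "true")
      .load("/data/incoming")

    stream.writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}
```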
