[ https://issues.apache.org/jira/browse/SPARK-19813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-19813:
------------------------------------

    Assignee: Apache Spark  (was: Burak Yavuz)

> maxFilesPerTrigger combo latestFirst may miss old files in combination with maxFileAge in FileStreamSource
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19813
>                 URL: https://issues.apache.org/jira/browse/SPARK-19813
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.1.0
>            Reporter: Burak Yavuz
>            Assignee: Apache Spark
>
> There is a file stream source option called maxFileAge which limits how old
> the files can be, relative to the latest file that has been seen. This is used
> to limit the files that need to be remembered as "processed". Files older
> than the latest processed file by more than maxFileAge are ignored. This
> value defaults to 7 days.
> This causes a problem when both
>  - latestFirst = true
>  - maxFilesPerTrigger < total files to be processed.
> Here is what happens in all combinations:
> 1) latestFirst = false - Since files are processed in order, there won't be
> any unprocessed file older than the latest processed file. All files will be
> processed.
> 2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge
> thresholding mechanism takes one batch to initialize. If maxFilesPerTrigger
> is not set, then all old files get processed in the first batch, and so no
> file is left behind.
> 3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch
> processes the latest X files. That sets the threshold to (timestamp of the
> latest file - maxFileAge), so files older than this threshold will never be
> considered for processing.
> The bug is with case 3.
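The three cases described in the issue can be reproduced with a small standalone model. The sketch below is not Spark's FileStreamSource code; it is a simplified Python simulation that assumes the age threshold is recomputed before each batch as (newest timestamp seen so far - maxFileAge) and that a file is eligible only if it is unseen and not older than that threshold. The names simulate_batches, DAY_MS and SAMPLE_FILES are hypothetical and exist only for illustration.

```python
from typing import List, Optional, Set, Tuple

DAY_MS = 24 * 60 * 60 * 1000  # one day in milliseconds

# Hypothetical sample data: (path, modification time in ms).
# Three old files and two recent ones, more than 7 days apart.
SAMPLE_FILES: List[Tuple[str, int]] = [
    ("day01.json", 1 * DAY_MS),
    ("day02.json", 2 * DAY_MS),
    ("day03.json", 3 * DAY_MS),
    ("day20.json", 20 * DAY_MS),
    ("day21.json", 21 * DAY_MS),
]


def simulate_batches(
    files: List[Tuple[str, int]],
    max_file_age_ms: int,
    latest_first: bool,
    max_files_per_trigger: Optional[int] = None,
) -> List[List[str]]:
    """Return the list of batches (each a list of paths) this model would process."""
    seen: Set[str] = set()
    latest_ts = 0  # newest timestamp among processed files; 0 means nothing seen yet
    batches: List[List[str]] = []

    while True:
        # Assumed rule: the threshold trails the newest processed file by maxFileAge.
        threshold = latest_ts - max_file_age_ms
        candidates = [
            (path, ts) for path, ts in files
            if path not in seen and ts >= threshold
        ]
        # latestFirst = true orders newest files first, otherwise oldest first.
        candidates.sort(key=lambda f: f[1], reverse=latest_first)
        if max_files_per_trigger is not None:
            candidates = candidates[:max_files_per_trigger]
        if not candidates:
            break

        for path, ts in candidates:
            seen.add(path)
            latest_ts = max(latest_ts, ts)
        batches.append([path for path, _ in candidates])

    return batches


def processed(batches: List[List[str]]) -> Set[str]:
    """Flatten the batches into the set of processed paths."""
    return {path for batch in batches for path in batch}


if __name__ == "__main__":
    week = 7 * DAY_MS
    for label, latest_first, limit in [
        ("case 1", False, 2),
        ("case 2", True, None),
        ("case 3", True, 2),
    ]:
        done = processed(simulate_batches(SAMPLE_FILES, week, latest_first, limit))
        missed = sorted(path for path, _ in SAMPLE_FILES if path not in done)
        print(label, "missed:", missed)
```

Under this model, cases 1 and 2 process all five sample files, while case 3 processes only the two newest: after the first batch the threshold moves to day 14, and the three older files are never eligible again.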