cchighman commented on pull request #28841: URL: https://github.com/apache/spark/pull/28841#issuecomment-652853468
> @cchighman Thanks for reading through the huge wall of text!
>
> I agree the option can be provided to batch query only, and consider how to apply the option to structured streaming later (as we don't have solid idea yet). Just to reiterate, I guess we may want to still discuss only lower bound vs lower & upper bound, even in batch case.
>
> I also agree the option can be simply applied to the structured streaming (only for lower bound) on top of current options. That would play as a "filter". As I mentioned in #28422 I already proposed the similar thing, though the purpose was for applying "retention" hence dynamically changing instead of be static.
>
> That said, the problem is, this approach doesn't help to cut down file stream source metadata log, which is another known major issue in file stream source. File stream source remembers every file you processed and never drops anything. My viewpoint is focused on how we can minimize the entries to remember across long query run. We have maxFileAge option but due to some issue it doesn't help minimizing the entries. It wouldn't be good if we introduce another similar option but leave the major issue behind.
>
> That's why I cannot agree simply to apply the option to SS. It deserves another level of consideration.

In terms of cutting down the file stream source metadata log: if we apply this option as a filter, the filtered-out files are never eligible to be added to the metadata log. If we apply this option with offsets, we effectively exclude files whose timestamps fall before a given starting offset and/or outside a specified offset range, which again results in those files not being added to the metadata log, correct?

Right now, when streaming from a folder path, every file in the path has to be considered and added to the log, so both approaches seem like they would cut back on the log size in some shape or form. Is my thinking correct here?
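To make the question concrete, here is a rough sketch in plain Python of what I mean by "filter before logging". This is not Spark code and not how the file stream source is implemented; `FileEntry`, `select_files`, and `record_batch` are hypothetical names made up for illustration, and the set stands in for the metadata log.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for a file discovered in the source path."""
    path: str
    modified_ms: int  # last-modified time in epoch milliseconds


def select_files(
    files: Iterable[FileEntry],
    lower_ms: Optional[int] = None,
    upper_ms: Optional[int] = None,
) -> List[FileEntry]:
    """Keep files with lower_ms <= modified_ms < upper_ms (bounds optional)."""
    selected = []
    for f in files:
        if lower_ms is not None and f.modified_ms < lower_ms:
            continue  # older than the lower bound
        if upper_ms is not None and f.modified_ms >= upper_ms:
            continue  # at or past the upper bound
        selected.append(f)
    return selected


def record_batch(
    metadata_log: Set[str],
    files: Iterable[FileEntry],
    lower_ms: Optional[int] = None,
    upper_ms: Optional[int] = None,
) -> List[FileEntry]:
    """Filter first, then remember only the files that passed the filter.

    Assumption for illustration: the "metadata log" is just a set of paths.
    """
    batch = [
        f for f in select_files(files, lower_ms, upper_ms)
        if f.path not in metadata_log
    ]
    metadata_log.update(f.path for f in batch)
    return batch


listing = [
    FileEntry("part-0001", 100),
    FileEntry("part-0002", 200),
    FileEntry("part-0003", 300),
]

unfiltered_log: Set[str] = set()
record_batch(unfiltered_log, listing)

filtered_log: Set[str] = set()
record_batch(filtered_log, listing, lower_ms=200)
```

With a lower bound of 200, the filtered log only ever holds the two newer paths, while the unfiltered log holds all three. Note that this only limits what enters the log; it does nothing to drop entries that are already recorded, which I understand is the separate concern raised above.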
