[GitHub] [spark] cchighman commented on pull request #28841: [SPARK-31962][SQL][SS] Provide option to load files after a specified date when reading from a folder path

GitBox Thu, 02 Jul 2020 01:13:09 -0700


cchighman commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-652853468



   > @cchighman Thanks for reading through the huge wall of text!
   > 
   > I agree the option can be provided to batch queries only, and we can
   > consider how to apply it to Structured Streaming later (as we don't have
   > a solid idea yet). Just to reiterate, I guess we may still want to discuss
   > lower bound only vs. lower & upper bound, even in the batch case.
   > 
   > I also agree the option can simply be applied to Structured Streaming
   > (only for the lower bound) on top of the current options. That would act
   > as a "filter". As I mentioned, I already proposed a similar thing in
   > #28422, though the purpose there was applying "retention", so the bound
   > changes dynamically instead of being static.
   > 
   > That said, the problem is that this approach doesn't help cut down the
   > file stream source metadata log, which is another known major issue in
   > the file stream source. The file stream source remembers every file it
   > has processed and never drops anything. My viewpoint is focused on how we
   > can minimize the entries to remember across a long query run. We have the
   > maxFileAge option, but due to some issue it doesn't help minimize the
   > entries. It wouldn't be good if we introduced another similar option but
   > left the major issue behind.
   > 
   > That's why I cannot agree to simply apply the option to SS. It deserves
   > another level of consideration.
   
   In terms of cutting down the file stream source metadata log: if we apply 
this option as a filter, the filtered-out files are never available to be 
added to the metadata log.  If we apply this option with offsets, we 
effectively filter out files whose timestamps fall before a given starting 
offset and/or outside a specified offset range, which results in those files 
not being added to the metadata log, correct? 
   
   Right now, when streaming from a folder path, every file in the path has 
to be considered and added to the log, so both approaches seem like they 
would cut back on the log size in some shape or form.  Is my thinking 
correct here?
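   
   To make the question concrete, here is a rough Python sketch of the 
"filter" variant. It is a toy model, not Spark's actual file stream source: 
`FileEntry`, `ToyFileSource`, and `modified_after_ms` are hypothetical names 
made up for illustration, and the strict "modified after" comparison is an 
assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class FileEntry:
    path: str
    modified_ms: int  # last-modified time in epoch milliseconds


@dataclass
class ToyFileSource:
    """Toy stand-in for a file stream source (NOT Spark's implementation)."""

    # Hypothetical lower-bound option; None means "no filter".
    modified_after_ms: Optional[int] = None
    # Simplified metadata log: one entry per file ever admitted.
    metadata_log: List[str] = field(default_factory=list)

    def process_batch(self, listing: List[FileEntry]) -> List[str]:
        """Admit unseen files that pass the filter and record them in the log."""
        seen = set(self.metadata_log)
        admitted = []
        for f in listing:
            # Filtered-out files are dropped before they can reach the log.
            if self.modified_after_ms is not None and f.modified_ms <= self.modified_after_ms:
                continue
            if f.path in seen:
                continue  # already processed in an earlier batch
            admitted.append(f.path)
        self.metadata_log.extend(admitted)
        return admitted


listing = [
    FileEntry("a.json", 1000),
    FileEntry("b.json", 2000),
    FileEntry("c.json", 3000),
]

unfiltered = ToyFileSource()
filtered = ToyFileSource(modified_after_ms=1500)
unfiltered.process_batch(listing)
filtered.process_batch(listing)

print(len(unfiltered.metadata_log), len(filtered.metadata_log))  # prints "3 2"
```

   In this toy model the filtered source never logs `a.json`, which is the 
reduction I'm describing. Every file past the lower bound is still logged and 
never dropped, so the log keeps growing over a long-running query.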


