[GitHub] [spark] HeartSaVioR commented on pull request #28841: [SPARK-31962][SQL][SS] Provide option to load files after a specified date when reading from a folder path

GitBox Thu, 02 Jul 2020 00:59:17 -0700


HeartSaVioR commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-652850721



   @cchighman Thanks for reading through the huge wall of text! 
   
   I agree the option can be provided for batch queries only, and that we can 
consider how to apply it to Structured Streaming later (as we don't have a 
solid idea yet). Just to reiterate, I think we still need to discuss lower 
bound only vs. both lower and upper bounds, even for the batch case.
   
   I also agree the option could simply be applied to Structured Streaming 
(lower bound only) on top of the current options. It would act as a "filter". 
As I mentioned, I already proposed a similar thing in #28422, though the 
purpose there was "retention", so the bound would change dynamically instead 
of being static.
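
   To make the "filter" idea concrete, here is a minimal sketch in plain 
Python (not Spark code). `FileEntry`, `filter_by_modified_time`, `lower_ms` 
and `upper_ms` are hypothetical names for illustration only, and treating both 
bounds as inclusive is an assumption; the PR may decide otherwise.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class FileEntry:
    path: str
    modified_ms: int  # last-modified time in epoch milliseconds


def filter_by_modified_time(
    files: Iterable[FileEntry],
    lower_ms: Optional[int] = None,
    upper_ms: Optional[int] = None,
) -> List[FileEntry]:
    """Keep files whose modification time lies within the given bounds.

    A bound of None means "unbounded" on that side. Both bounds are
    inclusive here, which is an assumption made for this sketch.
    """
    selected = []
    for f in files:
        if lower_ms is not None and f.modified_ms < lower_ms:
            continue  # older than the lower bound
        if upper_ms is not None and f.modified_ms > upper_ms:
            continue  # newer than the upper bound
        selected.append(f)
    return selected
```

   With only `lower_ms` set, this matches the "lower bound only" variant; 
setting both corresponds to the "lower & upper bound" variant under discussion.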
   
   That said, the problem is that this approach doesn't help cut down the 
file stream source metadata log, which is another known major issue in the 
file stream source. The file stream source remembers every file it has 
processed and never drops anything. My viewpoint is focused on how we can 
minimize the entries to remember across a long-running query. We have the 
maxFileAge option, but due to some issue it doesn't help minimize the entries. 
It wouldn't be good to introduce another similar option while leaving the 
major issue behind. 
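
   A toy model of the concern, again in plain Python and not Spark's actual 
file stream source code: `ToyFileSource` and its fields are hypothetical. It 
only illustrates that a static lower-bound filter limits which files are 
picked up, while the list of remembered files still grows with every 
processed file.

```python
from typing import List, Tuple


class ToyFileSource:
    """Hypothetical stand-in for a source that remembers processed files."""

    def __init__(self, lower_ms: int) -> None:
        self.lower_ms = lower_ms
        self.log: List[str] = []  # every processed path; nothing is dropped

    def run_batch(self, listing: List[Tuple[str, int]]) -> List[str]:
        """Process files from `listing` ((path, modified_ms) pairs) that pass
        the lower bound and have not been processed before."""
        seen = set(self.log)
        new_files = [
            path
            for path, modified_ms in listing
            if modified_ms >= self.lower_ms and path not in seen
        ]
        self.log.extend(new_files)  # the log only ever grows
        return new_files
```

   In this model the filter keeps old files out of the log, but every file 
that passes the bound stays in `log` for the lifetime of the query.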
   
   That's why I cannot agree to simply applying the option to SS. It deserves 
another level of consideration.


