cchighman commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-652807186


   @HeartSaVioR Thank you for your detailed comments.  I've been digging into 
the PR you mentioned along with the associated Kafka Batch sources, etc.  I'm 
leaning towards separating the PRs mainly to reduce complexity in any one PR.  
I have a few questions.
   
   1.) By separating these PRs, the offset-based semantics would apply only to 
structured streaming, correct?  Meaning, _modifiedDateFilter_ would be used only 
for the batch case?  The Kafka batch example uses batch reading with 
offset-based semantics, but that seems unintuitive for the file data source use 
case.
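   For reference, this is the Kafka batch pattern I mean: a non-streaming read 
that still uses offset/timestamp semantics via `startingOffsetsByTimestamp`. 
This is only a sketch, not runnable on its own (it assumes an existing 
`SparkSession` named `spark`, a reachable Kafka broker, and an illustrative 
topic name and timestamp):

```python
# Hedged sketch: batch read from Kafka with timestamp-based starting offsets.
# `spark`, the broker address, topic name, and timestamp are placeholders.
df = (spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "topic1")
      # JSON mapping of topic -> {partition: timestamp in ms}; Spark resolves
      # each partition's starting offset from the given timestamp.
      .option("startingOffsetsByTimestamp", """{"topic1": {"0": 1593561600000}}""")
      .load())
```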
   
   2.) _startingOffsetsByTimestamp_ and its associated semantics refer to _the 
start point of timestamp when a query is started_.  In the file stream source 
use case, there seems to be a distinctive difference between a file's 
_modified date_ and the time the query itself is started.  From what I'm 
gathering, because an offset represents a file itself, the language in this 
sense would actually relate to the modified timestamp on the file as opposed to 
when the query itself was started?  In effect, the file stream offset is based 
on the modified time of the file itself?
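   To make that distinction concrete, here is a minimal sketch in plain Python 
(not Spark code; the function and catalog names are hypothetical) of selecting 
files by their own modification timestamps, independent of when the query 
happens to start:

```python
def files_after_timestamp(files, threshold_ms):
    """Select files whose modification time is at or after a threshold.

    `files` maps path -> modification timestamp in milliseconds. The
    threshold is compared against each file's own mtime, not against
    the wall-clock time the query was started.
    """
    return sorted(path for path, mtime in files.items() if mtime >= threshold_ms)

# The query may start "now", but selection is driven purely by file mtimes.
catalog = {
    "/data/a.json": 1_000,
    "/data/b.json": 2_000,
    "/data/c.json": 3_000,
}
print(files_after_timestamp(catalog, 2_000))  # ['/data/b.json', '/data/c.json']
```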
   
   3.) If a file already exists in SeenFilesMap but is subsequently modified, 
I'm guessing the entire file will be reconsumed, since we don't consider 
partial files, correct?
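   In other words, the behavior I'm assuming looks like this toy model (plain 
Python, names hypothetical; this is not the actual SeenFilesMap 
implementation): a changed modification timestamp makes the whole file eligible 
again, with no notion of a partial file:

```python
class SeenFilesSketch:
    """Toy model of a seen-files map: a file counts as 'new' if its path is
    unseen or if its modification timestamp has changed. Any change
    re-qualifies the entire file; partial reads are not modeled."""

    def __init__(self):
        self._seen = {}  # path -> last-seen modification timestamp (ms)

    def is_new(self, path, mtime):
        return self._seen.get(path) != mtime

    def mark_seen(self, path, mtime):
        self._seen[path] = mtime

sketch = SeenFilesSketch()
sketch.mark_seen("/data/a.json", 1_000)
print(sketch.is_new("/data/a.json", 1_000))  # False: already consumed
print(sketch.is_new("/data/a.json", 2_000))  # True: modified, whole file again
```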
   
   4.) Is there an ideal way to exclude the streaming use case from 
_PartitioningAwareFileIndex_?
   
   Thank you


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
