[GitHub] [spark] cchighman edited a comment on pull request #28841: [SPARK-31962][SQL][SS] Provide option to load files after a specified date when reading from a folder path

GitBox Thu, 02 Jul 2020 00:48:18 -0700


cchighman edited a comment on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-652843996



   @HeartSaVioR I still think implementing this at the 
_PartitioningAwareFileIndex_ level makes a lot of sense and bypasses the 
complexities you mentioned above.  There are cases where you begin streaming 
from a file source that has hundreds of thousands of files, many with the same 
timestamp, and you want to start processing at a specified point.  
_PartitioningAwareFileIndex_ is processed before any other structured 
streaming options are considered during _fetchMaxOffset_.  I believe 
_modifiedDateFilter_ is a good way to choose where streaming starts, and it is 
limited to that use case.  I believe the offset semantics still apply in full, 
but they would apply to the files returned from _InMemoryFileIndex_ or 
_MetadataLogFileIndex_.
   
   This option is intuitive for the consumer because, for any given path, 
they can explicitly set the population of files considered for structured 
streaming.  `allFiles` in `fetchMaxOffset` would then return that filtered 
population, which is the starting point for the earliest/latest offsets.  Do 
you see the difference?
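
To make the idea concrete, here is a minimal sketch of narrowing a file listing by modification time before anything downstream sees it. It is not the code in this PR and does not use Spark's classes: `ListedFile` and `filesModifiedAfter` are hypothetical names, and interpreting the threshold as UTC is an assumption made only for the example.

```scala
import java.time.{LocalDateTime, ZoneOffset}

// Hypothetical stand-in for a listed file; this is NOT Spark's FileStatus.
final case class ListedFile(path: String, modificationTimeMs: Long)

object ModifiedDateFilterSketch {
  // Keep only files modified strictly after the threshold.
  // Assumption: the threshold is interpreted as UTC.
  def filesModifiedAfter(
      files: Seq[ListedFile],
      threshold: LocalDateTime): Seq[ListedFile] = {
    val thresholdMs = threshold.toInstant(ZoneOffset.UTC).toEpochMilli
    files.filter(_.modificationTimeMs > thresholdMs)
  }
}
```

Anything that later computes offsets from the listing would only ever see the files that pass this filter, which is the sense in which the option defines the starting population.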
   
   Granted, I can see how this could be implemented in _FileStreamSource_.  
It seems, though, that the problems you're describing shouldn't affect how we 
would ultimately filter files based on parameters that aim to bound what is 
currently a more or less unbounded problem.  I'm asking just to understand 
whether the complexity comes down to adding an extra layer of filtering when 
the options are specified.
   
   It seems most of the consideration would go into this area, in relation to 
_seenFilesMap_ and _metadataLogCurrentOffset_:
   `private def fetchMaxOffset(limit: ReadLimit): FileStreamSourceOffset = synchronized {`
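
For discussion, here is a deliberately simplified model of what "an extra layer of filtering" inside a source could look like next to seen-file tracking and an offset counter. It is a sketch under stated assumptions, not Spark's `FileStreamSource`: `Candidate`, `SourceSketch`, `startAfterMs`, and `maxFiles` are hypothetical names, the seen-file state is a plain set of paths, and nothing is written to a metadata log.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a listed file; not a Spark class.
final case class Candidate(path: String, modificationTimeMs: Long)

// Simplified model only: the real source tracks more state than this.
final class SourceSketch(startAfterMs: Option[Long]) {
  private val seen = mutable.Set.empty[String] // loosely plays the role of a seen-files map
  private var currentOffset: Long = -1L        // loosely plays the role of the log offset

  // Returns the offset after this call and the files picked for the batch.
  def fetchMaxOffset(
      listing: Seq[Candidate],
      maxFiles: Option[Int]): (Long, Seq[Candidate]) = synchronized {
    // Extra filtering layer: applied only when the option is specified.
    val eligible =
      startAfterMs.fold(listing)(ms => listing.filter(_.modificationTimeMs > ms))
    // Existing-style logic: skip files already seen, oldest first, then cap.
    val unseen =
      eligible.filterNot(c => seen.contains(c.path)).sortBy(_.modificationTimeMs)
    val batch = maxFiles.fold(unseen)(n => unseen.take(n))
    if (batch.nonEmpty) {
      batch.foreach(c => seen += c.path)
      currentOffset += 1
    }
    (currentOffset, batch)
  }
}
```

In this model the date filter only shrinks the candidate set; the seen-file check and the offset bookkeeping run unchanged on whatever remains.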
   




