cchighman edited a comment on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-644793641


   Thanks for your comments, @bart-samwel.  I like your way of thinking; there 
are a lot of unique cases here.  To provide more context, here is the scenario 
I'm looking to cover, which is a current issue for consumers:  
   
   - Imagine you have a massive, massive data lake with routine ETL operations.
   - Every couple of hours or so, a CSV file containing perhaps 50 million 
events is dropped in a "Delta" folder for each dataset, and you have a lot of 
these datasets.
   - Going back a handful of years, the folder hierarchy has been fairly 
deterministic, which seems to be a common practice, such that you have 
_/dataset/delta/yyyy-mm-dd/dataset_guid_timestamp.csv_ as the folder structure.
   - A number of teams may need to begin consuming these files, but they are 
only interested in files modified on or after a particular date; anything 
earlier is of no interest.  They hope to consume all the delta files from that 
modified date up to the current date without needing to write code that 
enumerates and concatenates the matching paths themselves.
   - From this perspective, enterprise consumers would benefit from being able 
to specify a modified timestamp to help _checkpoint_ which deltas they're 
interested in consuming.
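   
   To illustrate the kind of glue code consumers end up writing by hand in this 
scenario, here is a minimal sketch in plain Python (standard library only, no 
Spark).  The function name `files_modified_after` and its parameters are 
hypothetical and are not part of this PR; the sketch only assumes that the 
files sit somewhere under a root folder and that the file system's modification 
time is the value being compared.

```python
from datetime import datetime, timezone
from pathlib import Path


def files_modified_after(root, cutoff):
    """Return sorted CSV paths under `root` modified strictly after `cutoff`.

    `root` and `cutoff` are illustrative names. `cutoff` should be a
    timezone-aware datetime so the comparison does not depend on the
    local time zone of the machine running the code.
    """
    cutoff_ts = cutoff.timestamp()
    selected = []
    # Walk every dated sub-folder and keep only files newer than the cutoff.
    for path in Path(root).rglob("*.csv"):
        if path.is_file() and path.stat().st_mtime > cutoff_ts:
            selected.append(str(path))
    return sorted(selected)


# Hypothetical usage: everything under the delta folder changed since 2020-06-01.
# paths = files_modified_after("/dataset/delta", datetime(2020, 6, 1, tzinfo=timezone.utc))
```

   Each consuming team would then have to pass the resulting list of paths to 
its reader, which is the concatenation step a modified-timestamp option on the 
data source is meant to make unnecessary.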
   
   Granted, this context is specific to non-streaming file data sources.  I had 
hoped to find an equivalent in Structured Streaming, but the closest I found 
were _latestFirst_ and _maxFileAge_, each of which has its own use cases but 
neither of which solves this particular one.  The connection between my change 
here and Structured Streaming is that Structured Streaming also leverages 
InMemoryFileIndex and actively passes a parameter map to its constructor.  I'll 
provide a PR to complete support there as well, but separately from this MVP 
piece.
   




