cchighman edited a comment on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-652843996
@HeartSaVioR I still think implementing this at the
_PartitioningAwareFileIndex_ level makes sense and sidesteps the
complexities you mentioned above. Consider a stream that starts from a file
source holding hundreds of thousands of files, many with the same timestamp,
where you want to begin processing at a specified point.
_PartitioningAwareFileIndex_ is processed before any other structured
streaming options are considered during _fetchMaxOffset_. I believe
_modifiedDateFilter_ is a good way to choose where streaming starts, and it is
limited to that use case. I believe the offset semantics still apply in full,
but they would apply to the files returned from _InMemoryFileIndex_ or
_MetadataLogFileIndex_.
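
To make the idea concrete, here is a minimal sketch of filtering a file listing by modification time before anything downstream sees it. It is plain Scala, not Spark code: `FileEntry`, `filterByModifiedTime`, and the `modifiedAfterMs` parameter are hypothetical names used only for illustration, and the inclusive comparison is an assumption.

```scala
object ModifiedDateFilterSketch {
  // Hypothetical stand-in for a listed file; not Spark's or Hadoop's FileStatus.
  final case class FileEntry(path: String, modificationTime: Long)

  // Keep only files modified at or after the cutoff (epoch milliseconds).
  // None means the option was not set, so every file passes through unchanged.
  def filterByModifiedTime(
      files: Seq[FileEntry],
      modifiedAfterMs: Option[Long]): Seq[FileEntry] =
    modifiedAfterMs match {
      case Some(cutoff) => files.filter(_.modificationTime >= cutoff)
      case None         => files
    }
}
```

Because the filter runs on the listing itself, files that share a timestamp are kept or dropped together, and whatever consumes the listing afterwards only ever sees the reduced population.
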
This option is intuitive for the consumer because, for any given path,
they can explicitly set the population of files considered for
structured streaming. `allFiles` in `fetchMaxOffset` would then return the
starting point that represents the earliest/latest offsets. Do you see the
difference?
Granted, I can see how this could be implemented in
_FileStreamSource_. Still, it seems the problems you're describing
shouldn't affect how we would ultimately filter files based on parameters
meant to bound what is currently a largely unbounded problem. I'm asking
only to understand whether the added complexity amounts to an extra
layer of filtering when the options are specified.
It seems most of the consideration would go into this area, in relation to
_seenFilesMap_ and _metadataLogCurrentOffset_:
`private def fetchMaxOffset(limit: ReadLimit): FileStreamSourceOffset = synchronized {`
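
As a rough illustration of the "extra layer of filtering" question, the sketch below models where such a filter could sit relative to seen-file tracking and the offset counter. It is a simplified stand-in written for this discussion, not the actual _FileStreamSource_ code: `SourceModel`, `seenFiles`, `currentOffset`, `maxFiles`, and `modifiedAfterMs` are all hypothetical, and details such as purging and metadata log writes are left out.

```scala
import scala.collection.mutable

object FetchMaxOffsetSketch {
  // Hypothetical stand-in for a listed file.
  final case class FileEntry(path: String, modificationTime: Long)

  // Simplified model of a streaming file source; every name here is illustrative.
  final class SourceModel(modifiedAfterMs: Option[Long]) {
    private val seenFiles = mutable.Set.empty[String]
    private var currentOffset: Long = -1L

    def fetchMaxOffset(allFiles: Seq[FileEntry], maxFiles: Int): Long = synchronized {
      // Extra filtering layer: applied only when the option is specified.
      val candidates =
        modifiedAfterMs.fold(allFiles)(cutoff => allFiles.filter(_.modificationTime >= cutoff))

      // Drop files already handed out, order by time, then apply the read limit.
      val batch = candidates
        .filterNot(f => seenFiles.contains(f.path))
        .sortBy(_.modificationTime)
        .take(maxFiles)

      batch.foreach(f => seenFiles += f.path)
      // The offset advances only when the batch actually contains new files.
      if (batch.nonEmpty) currentOffset += 1
      currentOffset
    }
  }
}
```

In this model the filter touches neither the seen-file set nor the offset bookkeeping; it only shrinks the candidate list before those steps run. Whether the real implementation is that simple is exactly the open question.
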