cchighman edited a comment on pull request #28841: URL: https://github.com/apache/spark/pull/28841#issuecomment-650726701
> Please take a look at how Kafka data source options apply with both batch and streaming query. The semantic of the option should be applied differently.
>
> http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries
>
> `startingOffsetsByTimestamp`, `startingOffsets`, `endingOffsetsByTimestamp`, `endingOffsets`
>
> If we are not fully sure about how to do it, let's only apply the option to batch query, and file an issue to address for the streaming query.
>
> Btw, that said, I prefer to have lower bound + upper bound instead of only lower bound, as commented earlier on reviewing.

@HeartSaVioR Hmm, I'm wondering if this isn't a different feature. The goal of this feature is to begin reading from a file data source with files that have a particular modified date. Its key value is really the ability to _start_ at a particular _physical_ location. With a Kafka data source, you're exclusively dealing with an event stream where event sourcing patterns leveraging offsets are at play. I wonder, though, whether structured streaming always implies an event source, particularly when streaming from a file source.

For example, _modifiedDateFilter_ applies specifically to the point in time when you begin structured streaming on a file data source. You would not have an offset yet in _committedOffsets_. The offset use case would imply you are restarting an existing, checkpointed stream. When using an offset with a Kafka data source, some write has already occurred, so a written checkpoint exists. With a file data source whose files have not yet been read or written, I'm curious how I would apply offset bounds in this way. I was thinking I would have to be reading from a data source that had already been used with structured streaming and checkpointing in order for a committed offset to exist. Does this make sense?

It seems like once you've written a checkpoint while writing to a stream from the readStream dataframe that's loading files, you would have clear context to apply offset-based semantics.
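
To make the "modified date" semantics concrete, here is a minimal plain-Python sketch (not Spark code, and not the implementation in this PR) of selecting files by modification time with an optional lower and upper bound, as suggested in the review. The helper name `filter_by_modified_time` and the parameters `modified_after` / `modified_before` are hypothetical, and treating both bounds as exclusive is an assumption made only for illustration.

```python
import os
from datetime import datetime, timezone
from typing import Iterable, List, Optional


def filter_by_modified_time(
    paths: Iterable[str],
    modified_after: Optional[datetime] = None,
    modified_before: Optional[datetime] = None,
) -> List[str]:
    """Return the paths whose mtime lies inside the optional bounds.

    Hypothetical helper for illustration only. Both bounds are
    exclusive (an assumption), and a bound of None means "unbounded".
    Bounds must be timezone-aware datetimes.
    """
    selected = []
    for path in paths:
        # Read the file's physical modification time as an aware UTC datetime.
        mtime = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
        if modified_after is not None and mtime <= modified_after:
            continue  # at or before the lower bound: skip
        if modified_before is not None and mtime >= modified_before:
            continue  # at or after the upper bound: skip
        selected.append(path)
    return selected
```

Note that this filter depends only on each file's physical timestamp, so it can be applied before any offset has been committed. That is the distinction from offset-based bounds drawn above.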
