cchighman commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-650726701


   > Please take a look at how Kafka data source options apply with both batch and streaming queries. The semantics of the option should be applied differently.
   > 
   > http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries
   > 
   > `startingOffsetsByTimestamp`, `startingOffsets`, `endingOffsetsByTimestamp`, `endingOffsets`
   > 
   > If we are not fully sure about how to do it, let's only apply the option to the batch query, and file an issue to address the streaming query.
   > 
   > Btw, that said, I prefer to have a lower bound + upper bound instead of only a lower bound, as commented earlier in review.
   
   @HeartSaVioR 
   
   Hmm, I'm wondering whether this isn't a different feature.  The goal of this feature is to begin reading from a file data source with only those files that have a particular modified date.  Its key value really lies in the ability to _start_ at a particular _physical_ location.
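   
   To make that concrete, here's a rough sketch of the intended usage. The option name follows this thread (_modifiedDateFilter_); the value format, schema, and path are placeholders, so treat it as illustrative only:
   
```scala
// Illustrative only: option name per this thread; the value format,
// schema, and path are placeholders. The idea is that only files whose
// modification time falls at/after the given point in time are read.
val df = spark.readStream
  .format("parquet")
  .schema(eventSchema) // hypothetical schema for the incoming files
  .option("modifiedDateFilter", "2020-06-01T00:00:00Z")
  .load("s3a://bucket/landing/")
```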
   
   With a Kafka data source, you're exclusively dealing with an event stream, where event-sourcing patterns leveraging offsets are at play.  I wonder, though, whether structured streaming always implies an event source, particularly when streaming from a file source.
   
   For example, _modifiedDateFilter_ applies specifically to a point in time 
when you begin structured streaming on a file data source.  You would not have 
an offset yet in _availableOffsets_.  The offset use case would imply you are 
restarting an existing, checkpointed stream.
   
   When using an offset with a Kafka data source, some write has already occurred, so a committed checkpoint exists.  With a file data source whose files have not yet been read or written, I'm curious how I would apply offset bounds in this way.  My thinking was that I would have to be reading from a data source that had already gone through structured streaming with checkpointing in order for a (committed) offset to exist.
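   
   For contrast, this is how the Kafka batch options quoted above bound a read on both ends against an already-existing log (per the linked integration guide; servers, topic, partitions, and timestamps here are illustrative):
   
```scala
// Kafka batch query bounded on both ends, per the linked docs.
// Timestamps are epoch milliseconds; topic/partitions are illustrative.
val kafkaDf = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topic1")
  .option("startingOffsetsByTimestamp", """{"topic1": {"0": 1591000000000, "1": 1591000000000}}""")
  .option("endingOffsetsByTimestamp", """{"topic1": {"0": 1593000000000, "1": 1593000000000}}""")
  .load()
```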
   
   Does this make sense?  It seems that once you've written a checkpoint while streaming out from the readStream DataFrame that's loading files, you would have clear context for applying offset-based semantics.
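   
   In other words, something like the sketch below (paths are hypothetical): offsets only start to exist once this query begins committing micro-batches to its checkpoint location.
   
```scala
// Minimal sketch: once this query runs, each micro-batch commits an
// offset to the checkpoint; only then do offset-based semantics
// (e.g. restart position) have anything to refer to. Paths hypothetical.
val query = df.writeStream
  .format("parquet")
  .option("path", "s3a://bucket/output/")
  .option("checkpointLocation", "s3a://bucket/checkpoints/files-job/")
  .start()
```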

