[GitHub] [hudi] kazdy opened a new issue #3724: [SUPPORT] Spark start reading stream from hudi dataset starting from given commit time

GitBox Mon, 27 Sep 2021 04:19:06 -0700


kazdy opened a new issue #3724:
URL: https://github.com/apache/hudi/issues/3724



   **Describe the problem you faced**
   I wanted to query a Hudi dataset incrementally with Spark Structured 
Streaming and simply write the stream to the console with a processing-time 
trigger of 3 s. 
   This works, but the first batch contains all the data, starting from the 
first commit. 
   I would like readStream to start from a specific commit time instead (as 
the Flink streaming query that Hudi supports already allows).
   
   I looked at the code, and there seems to be no option I can set to get 
this behavior.
   I know streaming reads are not documented yet and the work is in progress. 
Are you planning to add such functionality?
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create a non-empty Hudi dataset.
   2. Call spark.readStream.format("hudi").load(basePath) on the dataset.
   3. Call writeStream.format("console") on the resulting DataFrame to print 
each batch to the console while the data keeps changing.
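
   For reference, the steps above roughly correspond to the PySpark sketch 
below. It assumes an existing SparkSession with the Hudi Spark bundle on the 
classpath; the function name `build_console_stream` and its parameters are 
made up for this illustration.

```python
def build_console_stream(spark, base_path, interval="3 seconds"):
    """Stream a Hudi table to the console, one micro-batch per trigger.

    `spark` is an existing SparkSession and `base_path` is the base path
    of a non-empty Hudi table (both supplied by the caller).
    """
    # Step 2: open a streaming read on the Hudi table.
    df = spark.readStream.format("hudi").load(base_path)

    # Step 3: writeStream is a property of the DataFrame, not of the session.
    return (
        df.writeStream.format("console")
        .trigger(processingTime=interval)
        .start()
    )
```

   Calling `build_console_stream(spark, basePath).awaitTermination()` keeps 
the query running; the first micro-batch is the one that contains every 
record since the first commit.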
   
   **Expected behavior**
   
   I would like to be able to specify the commit time from which Hudi starts 
the stream of records (as in the Spark incremental query or the Flink 
streaming query).
   The first batch returned from 
spark.readStream.format("hudi").load(basePath) should start from the 
specified commit time.
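
   For comparison, this is roughly how the batch (non-streaming) Spark 
incremental query takes a starting commit time. The two option keys are the 
Hudi datasource read options for incremental queries; the helper names and 
the sample instant are placeholders for this sketch.

```python
def incremental_read_options(begin_instant):
    """Build Hudi read options for a batch incremental query.

    `begin_instant` is a commit time string such as "20210927041900";
    only records committed after it are returned.
    """
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }


def read_changes_since(spark, base_path, begin_instant):
    """Batch-read the records written to a Hudi table after `begin_instant`."""
    return (
        spark.read.format("hudi")
        .options(**incremental_read_options(begin_instant))
        .load(base_path)
    )
```

   What I am asking for is the streaming equivalent: a way to pass a 
starting commit like this to readStream.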
   
   **Environment Description**
   
   * Hudi version : 0.9.0
   
   * Spark version : 3.1.2
   
   * Hive version : -
   
   * Hadoop version : 3.2
   
   * Storage (HDFS/S3/GCS..) : local storage
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   What I'm trying to do is capture the changes happening in one Hudi 
dataset, so that I can build an incremental pipeline in Spark and process them 
further. 
   
   If there is currently a better way of doing this in Spark, could you please 
guide me? 
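
   One workaround I can think of is to poll with batch incremental queries 
and advance the starting commit myself. The sketch below is only an 
assumption about how that could look, not an existing Hudi feature: 
`poll_changes`, `advance_checkpoint` and `handle_batch` are hypothetical 
names, and the checkpoint lives in memory only, so it is lost on restart.

```python
import time

QUERY_TYPE_KEY = "hoodie.datasource.query.type"
BEGIN_KEY = "hoodie.datasource.read.begin.instanttime"


def advance_checkpoint(current, commit_times):
    """Return the newest commit time seen so far.

    Hudi commit times are fixed-width digit strings, so plain string
    comparison orders them chronologically.
    """
    return max([current, *commit_times])


def poll_changes(spark, base_path, begin_instant, handle_batch,
                 interval_s=3, max_polls=None):
    """Repeatedly run batch incremental queries, starting after `begin_instant`."""
    checkpoint = begin_instant
    polls = 0
    while max_polls is None or polls < max_polls:
        df = (
            spark.read.format("hudi")
            .option(QUERY_TYPE_KEY, "incremental")
            .option(BEGIN_KEY, checkpoint)
            .load(base_path)
        )
        # _hoodie_commit_time is the metadata column Hudi adds to every record.
        rows = df.select("_hoodie_commit_time").distinct().collect()
        commits = [row[0] for row in rows]
        if commits:
            handle_batch(df)  # caller-supplied processing step
            checkpoint = advance_checkpoint(checkpoint, commits)
        polls += 1
        time.sleep(interval_s)
    return checkpoint
```

   This gives me control over the starting commit, but I lose the 
checkpointing and trigger handling that readStream would provide, which is 
why a built-in option would be preferable.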
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
