kazdy opened a new issue, #7778: URL: https://github.com/apache/hudi/issues/7778
**Describe the problem you faced**

I have a Hudi table that is read by a Spark structured streaming job with checkpointing enabled, with the checkpoint saved to S3. When the table is updated and old commits are cleaned, the instant saved in the checkpoint no longer exists on the timeline, and Hudi throws an NPE.

**To Reproduce**

Steps to reproduce the behavior:

1. Create a Hudi table
2. Insert data into the table
3. Consume the table using Spark structured streaming with a checkpoint
4. Insert more data into the Hudi table (create a few commits)
5. Clean commits (leave only the last one)
6. Start the streaming read again using the previously saved checkpoint
7. The streaming read fails with an NPE

**Expected behavior**

Following the Kafka structured streaming source, it would be good to have a "fail on data loss" config for Spark streaming jobs:

* if `failOnDataLoss` is true -> throw an error warning about potential data loss
* else -> start reading from the earliest available instant

**Environment Description**

* Hudi version : 0.12.1 amzn
* Spark version : 3.3.1 amzn
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : emr serverless

**Additional context**

Add any other context about the problem here.

**Stacktrace**

```Add the stacktrace of the error.```

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
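The requested `failOnDataLoss` behavior can be sketched as a small decision function. This is a hedged illustration of the proposed semantics only, not Hudi's actual API; the names `resolve_start_instant` and `DataLossError` are hypothetical, and instants are modeled as plain sortable strings rather than real timeline timestamps:

```python
class DataLossError(Exception):
    """Raised when the checkpointed instant was cleaned and failOnDataLoss is true."""


def resolve_start_instant(checkpoint_instant, available_instants, fail_on_data_loss):
    """Pick the instant to resume a streaming read from after a restart.

    checkpoint_instant: the instant saved in the streaming checkpoint.
    available_instants: instants still present on the timeline after cleaning.
    fail_on_data_loss: the proposed config, mirroring the Kafka source option.
    """
    if checkpoint_instant in available_instants:
        # Normal resume: the checkpointed instant was not cleaned.
        return checkpoint_instant
    if fail_on_data_loss:
        # Commits between the checkpoint and the earliest surviving instant
        # are gone, so surface the gap instead of silently skipping it.
        raise DataLossError(
            f"Checkpointed instant {checkpoint_instant} was cleaned; "
            "some commits may have been lost."
        )
    # Fall back to the earliest instant still available on the timeline.
    return min(available_instants)
```

With `failOnDataLoss` false, a restart after cleaning would resume from the earliest surviving commit instead of failing with an NPE; with it true, the job fails fast with an explicit data-loss error.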
