kazdy opened a new issue, #7778: URL: https://github.com/apache/hudi/issues/7778
**Describe the problem you faced**

I have a Hudi table that is read by a Spark structured streaming job with checkpointing enabled, with the checkpoint saved to S3. When the table is updated and old commits are cleaned, the instant saved in the checkpoint no longer exists on the timeline, and Hudi throws an NPE.

**To Reproduce**

Steps to reproduce the behavior:

1. Create a Hudi table
2. Insert data into the table
3. Consume the table using Spark structured streaming with a checkpoint
4. Insert more data into the Hudi table (create a few commits)
5. Clean commits (leave only the last one)
6. Start the streaming read again using the previously saved checkpoint
7. The streaming read fails with an NPE

**Expected behavior**

Following the Kafka structured streaming source, it would be good to have a "fail on data loss" config for Spark streaming jobs:

* if `failOnDataLoss` is true -> throw an error warning about potential data loss
* else -> start reading from the earliest available instant

**Environment Description**

* Hudi version : 0.12.1 amzn
* Spark version : 3.3.1 amzn
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : emr serverless

**Additional context**

Add any other context about the problem here.

**Stacktrace**

```Add the stacktrace of the error.```

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
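The requested `failOnDataLoss` behavior can be sketched as a small decision function. This is a hedged illustration of the proposed semantics only, not Hudi's actual API; the names `resolve_start_instant` and `DataLossError` are hypothetical, and instants are modeled as plain sortable strings rather than real timeline timestamps:

```python
class DataLossError(Exception):
    """Raised when the checkpointed instant was cleaned and failOnDataLoss is true."""


def resolve_start_instant(checkpoint_instant, available_instants, fail_on_data_loss):
    """Pick the instant to resume a streaming read from after a restart.

    checkpoint_instant: the instant saved in the streaming checkpoint.
    available_instants: instants still present on the timeline after cleaning.
    fail_on_data_loss: the proposed config, mirroring the Kafka source option.
    """
    if checkpoint_instant in available_instants:
        # Normal resume: the checkpointed instant was not cleaned.
        return checkpoint_instant
    if fail_on_data_loss:
        # Commits between the checkpoint and the earliest surviving instant
        # are gone, so surface the gap instead of silently skipping it.
        raise DataLossError(
            f"Checkpointed instant {checkpoint_instant} was cleaned; "
            "some commits may have been lost."
        )
    # Fall back to the earliest instant still available on the timeline.
    return min(available_instants)
```

With `failOnDataLoss` false, a restart after cleaning would resume from the earliest surviving commit instead of failing with an NPE; with it true, the job fails fast with an explicit data-loss error.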
