codope commented on issue #12477: URL: https://github.com/apache/hudi/issues/12477#issuecomment-2547586697
> My assumption is that in Spark SQL we are unable to set `hoodie.file.index.enable` as false and thus the error of `FileNotFoundException` occurs.

You could set it as a Spark session config.

> Scenario is a read SQL is setup and while the read operation is underway an independent write operation to the same table is done which causes a failure on the initial read operation initiated through Spark SQL.

Generally speaking, Hudi guarantees snapshot isolation between writers and readers through its timeline and multi-version concurrency control (MVCC). Hudi does not delete the last version of any data file unless the cleaner is configured to do so (your configs suggest no change to the default cleaner configs). I would like to understand more about your use case, and also how the file is getting deleted:

- Are you using OSS Hudi or EMR Hudi? If it's the latter, did you also try with the 0.15.0 version of OSS Hudi?
- Could you zip the `.hoodie` folder under the base path of the erroneous table and share it with us?

We have many production use cases with concurrent read and write scenarios and a data freshness latency of just a few minutes. For example: https://aws.amazon.com/blogs/big-data/how-nerdwallet-uses-aws-and-apache-hudi-to-build-a-serverless-real-time-analytics-platform/

If it's just a single writer and multiple readers, Hudi employs MVCC by default. I will need to review the script shared above to understand further what's going on.
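As a sketch of the session-config suggestion above: in a Spark SQL session the property can be set with a `SET` statement before the read query runs (the table name here is hypothetical; the exact behavior depends on your Hudi and Spark versions).

```sql
-- Disable Hudi's file index for the current Spark session (sketch).
SET hoodie.file.index.enable=false;

-- Subsequent reads in this session pick up the config;
-- `my_hudi_table` is a placeholder for your actual table.
SELECT * FROM my_hudi_table LIMIT 10;
```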
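To share the timeline metadata requested above, packaging the `.hoodie` folder under the table's base path is enough; a minimal sketch (the base path here is hypothetical, and a tarball works as well as a zip for sharing):

```shell
# Hypothetical base path; substitute your Hudi table's actual location.
BASE_PATH="/tmp/hudi/my_table"
mkdir -p "${BASE_PATH}/.hoodie"   # placeholder so this sketch runs end-to-end

# Package the Hudi timeline metadata for sharing.
tar -czf /tmp/hoodie_metadata.tar.gz -C "${BASE_PATH}" .hoodie
```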
