parisni opened a new issue, #7846:
URL: https://github.com/apache/hudi/issues/7846
Hudi 0.12.2
Spark 3.2.1
-----------------
Once an incremental read is made, all subsequent reads of the table keep returning
the incremental result. It looks like something about the incremental query is
being cached (a hedged workaround sketch follows the snippet below).
```scala
// query 1: plain snapshot read
spark.read.format("hudi")
.option("hoodie.metadata.enable","true")
.table("database.hudi_table").count()
// 1000
// query 2: incremental read starting from the given instant
spark.read.format("hudi")
.option("hoodie.metadata.enable","true")
.option("hoodie.datasource.query.type","incremental")
.option("hoodie.datasource.read.begin.instanttime","20230203191804078")
.table("database.hudi_table").count()
// 200
// query 3: plain snapshot read, identical to query 1
spark.read.format("hudi")
.option("hoodie.metadata.enable","true")
.table("database.hudi_table").count()
// 200, but should be 1000 (same read as query 1)
```
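If the stale count comes from Spark caching the resolved relation for the table, invalidating the catalog entry might force a clean re-read. This is only a workaround sketch, not verified against Hudi 0.12.2; whether `refreshTable` also clears whatever Hudi caches about the query type is an assumption.
```scala
// Hedged workaround sketch: invalidate Spark's cached metadata/relation for
// the table, then retry the snapshot read. Assumes (unverified) that the
// stale incremental result lives in a cache this call clears.
spark.catalog.refreshTable("database.hudi_table")

spark.read.format("hudi")
  .option("hoodie.metadata.enable", "true")
  .table("database.hudi_table").count()
// expected: 1000 again, if the cache theory holds
```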
Also odd: query 2 launches far fewer tasks when run in a fresh Spark session:
- query 1 before query 2: 25k tasks for query 2
- query 2 without query 1 (fresh session): 17 tasks
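One hedged way to probe whether a cached incremental relation explains both symptoms: pin the query type back to snapshot explicitly on the third read. If that restores the full count, the `hoodie.datasource.query.type` set in query 2 is likely sticking to the table resolution; this is a diagnostic sketch, not confirmed behavior.
```scala
// Diagnostic sketch: same as query 3, but with the query type forced back
// to snapshot. If this returns 1000 while the unpinned query 3 returns 200,
// the incremental option from query 2 is leaking into later reads.
spark.read.format("hudi")
  .option("hoodie.metadata.enable", "true")
  .option("hoodie.datasource.query.type", "snapshot")
  .table("database.hudi_table").count()
```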