bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1898187110

   ```
   scala> spark.time({
        |   val df = spark.read
        |     .format("org.apache.hudi")
        |     .option("hoodie.datasource.query.type", "read_optimized")
        |     .load("s3://path/table/")
        |
        |   val count = df.filter(
        |     (df("year") === 2024) &&
        |     (df("month") === 1) &&
        |     (df("day") === 16) &&
        |     (df("account_id") === "id1")
        |   ).count()
        |
        |   println(s"Count: $count")
        | })
   Count: 47
   Time taken: 30477 ms
   ```
   
   ```
   scala> spark.time({
        |   val df = spark.read
        |     .format("org.apache.hudi")
        |     .option("hoodie.datasource.query.type", "snapshot")
        |     .load("s3://path/table/")
        |
        |   val count = df.filter(
        |     (df("year") === 2024) &&
        |     (df("month") === 1) &&
        |     (df("day") === 16) &&
        |     (df("account_sid") === "id1")
        |   ).count()
        |
        |   println(s"Count: $count")
        | })
   24/01/18 10:06:51 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
   Count: 47
   Time taken: 22594 ms
   ```
   
   For a clean experiment, I created two separate Spark sessions for the queries above.
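   On top of using separate sessions, single-run `spark.time` numbers can be skewed by one-off warm-up costs (JIT compilation, file listing, metadata caching). A minimal plain-Scala sketch of a fairer harness, with one discarded warm-up run and the median of several timed runs — the `Bench` object and `runs` parameter are my own illustration, not part of Spark:

   ```scala
   object Bench {
     // Run the block once untimed to warm up, then take the median of
     // `runs` timed executions. The median smooths out one-off costs
     // that can dominate a single spark.time measurement.
     def medianTimeMs[T](runs: Int)(body: => T): Double = {
       require(runs > 0, "need at least one timed run")
       body // warm-up run, result discarded
       val samples = Seq.fill(runs) {
         val start = System.nanoTime()
         body
         (System.nanoTime() - start) / 1e6 // milliseconds
       }.sorted
       samples(samples.size / 2)
     }
   }
   ```

   With a harness like this, each query type could be timed as `Bench.medianTimeMs(5) { df.filter(...).count() }` inside its own session before comparing the two numbers.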
   
   This is confusing because it contradicts expectations: `read_optimized` actually takes more time to read the same data than `snapshot` does.
   
   Can we say for sure that using bloom filters through Parquet's native filtering is simply not effective for Hudi?

