jklim96 opened a new issue, #7254:
URL: https://github.com/apache/hudi/issues/7254

   **Describe the problem you faced**
   
   Hi all,
   I have a question about the performance of incremental queries. I'm 
comparing incremental queries against a simple filter on the 
`_hoodie_commit_time` column. From my initial investigation, incremental 
queries are faster for deltas spanning a small number of commits, but as the 
number of commits grows, they actually take longer to complete than the column 
filter. Is this behaviour expected, and if so, why? My expectation was that 
incremental queries would outperform the filter in all scenarios, since they 
should scan less data.
   
   
   here are some results of the performance of the two:
   ```
                1 commit    10 commits    20 commits
   Incremental  2'05"       3'12"         4'28"
   Filter       2'55"       2'51"         3'03"
   ```
   attaching code snippets for reference:
   ```python
   # incremental query:
   beginTime = '20220928105015966'

   incremental_read_options = {
       'hoodie.datasource.query.type': 'incremental',
       'hoodie.datasource.read.begin.instanttime': beginTime,
   }

   df = spark.read.format("hudi") \
       .options(**incremental_read_options) \
       .load("s3://bucketpath/")

   df.groupby("column_name").count().collect()

   # filter query:
   df = spark.read.format("hudi") \
       .load("s3://bucketpath/")

   df.where("_hoodie_commit_time >= '20220928105015966'") \
       .groupby("column_name").count().collect()
   ```
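   For anyone reproducing this, the options dict above can be factored into a small helper that also supports bounding the range with Hudi's documented `hoodie.datasource.read.end.instanttime` option. This is an illustrative sketch, not part of the job above; the helper name and the example instant values are made up.

   ```python
   def build_incremental_read_options(begin_time, end_time=None):
       """Build Hudi incremental read options.

       begin_time: exclusive start instant for the incremental pull.
       end_time:   optional instant that caps the incremental range.
       """
       opts = {
           'hoodie.datasource.query.type': 'incremental',
           'hoodie.datasource.read.begin.instanttime': begin_time,
       }
       if end_time is not None:
           # Bound the pull at a specific instant instead of reading to latest.
           opts['hoodie.datasource.read.end.instanttime'] = end_time
       return opts

   # Usage with Spark (same shape as the snippet above; path is illustrative):
   # df = spark.read.format("hudi") \
   #     .options(**build_incremental_read_options('20220928105015966')) \
   #     .load("s3://bucketpath/")
   ```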
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create an EMR Serverless Job on AWS
   2. Run the code specified in the section above
   
   **Expected behavior**
   
   Incremental queries are more performant than filter queries
   
   **Environment Description**
   
   * Hudi version : 0.11.1-amzn-0
   
   * Spark version : 3.3.0
   
   * Hive version : 3.1.3
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   N/A
   
   **Stacktrace**
   
   N/A
   
   
