[ https://issues.apache.org/jira/browse/HUDI-1970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433587#comment-17433587 ]
Raymond Xu commented on HUDI-1970:
----------------------------------
* 1B records (randomized values in the example trip model)
* 100 partitions, evenly distributed, year=*/month=*/day=*, 50 parquet files per partition
* EMR 6.2, Spark 3.0.1-amzn-0
* S3 storage, parquet with snappy compression
* hudi: 109.8 GB = 22.4 MB parquet x 5000
* delta: 70.9 GB = 14.5 MB parquet x 5000
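The totals in the last two bullets can be sanity-checked with a little arithmetic. The sketch below assumes binary units (1 GB = 1024 MB) and treats the per-file sizes as rounded averages; neither is stated in the comment, and `total_gb` is a made-up helper name used only for illustration.

```python
# Sanity check of the dataset sizes listed above.
# Assumptions (not stated in the comment): 1 GB = 1024 MB, and the per-file
# sizes are rounded averages, so small deviations from the totals are expected.

PARTITIONS = 100
FILES_PER_PARTITION = 50


def total_gb(avg_file_mb: float, n_files: int) -> float:
    """Total size in GB for n_files files averaging avg_file_mb MB each."""
    return avg_file_mb * n_files / 1024


n_files = PARTITIONS * FILES_PER_PARTITION  # matches the "x 5000" in the bullets
hudi_gb = total_gb(22.4, n_files)
delta_gb = total_gb(14.5, n_files)

print(n_files)
print(round(hudi_gb, 1), round(delta_gb, 1))
```

Under these assumptions both results land within about half a percent of the reported 109.8 GB and 70.9 GB, which is consistent with the per-file sizes being rounded.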
|SQL|Hudi 0.9.0|
|select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0|129.352|108.312|104.914|
|select count(*) from hudi_trips_snapshot|96.001|83.839|66.973|
|select count(*) from hudi_trips_snapshot where year = '2020' and month = '03' and day = '01'|1.880|1.776|1.767|
|select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where year='2020' and month='03' and day='01' and fare between 20 and 50|3.650|3.147|3.086|
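For anyone wanting to reproduce timings like these, a minimal harness could look like the sketch below. It is not how the numbers above were collected; `time_runs` is a hypothetical helper, and the workload is a stand-in callable so that the snippet runs without a Spark session.

```python
import time
from typing import Callable, List


def time_runs(workload: Callable[[], object], runs: int = 3) -> List[float]:
    """Run workload `runs` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()  # the result is discarded; only elapsed time matters
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in workload so the sketch is self-contained.
timings = time_runs(lambda: sum(range(100_000)), runs=3)
print([f"{t:.3f}" for t in timings])

# With a live SparkSession named `spark` (an assumption), the workload might be:
#   lambda: spark.sql("select count(*) from hudi_trips_snapshot").collect()
```

Calling an action such as `collect()` matters here, because Spark evaluates queries lazily and the timer would otherwise measure little more than query planning.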
> Performance testing/certification of key SQL DMLs
> -------------------------------------------------
>
> Key: HUDI-1970
> URL: https://issues.apache.org/jira/browse/HUDI-1970
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Performance, Spark Integration
> Reporter: Vinoth Chandar
> Assignee: Raymond Xu
> Priority: Blocker
> Fix For: 0.10.0
>
>