rubenssoto commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-677672123


   I ran more tests, this time with the same table; the only difference is the 
partitioning strategy.
   I use Athena.
   
   Table01 with regular parquet
   
   query:
   select city,origin, count(1) from 
   parquet_demand_coverage where created_date_brt >= '2020-01-01'
   group by city,origin
   order by count(1) desc
   limit 20
   
   Time to Execute: 6.19 seconds
   Table Size: 35.7 GB
   Number of Partitions: 693
   Number of Files: 916
   Partition by: day
   Data Scanned by Athena: 512 MB
   
   
   Table02 with Hudi
   
   
   query:
   select city,origin, count(1) from 
   demand_coverage where created_year_month_brt >= '2020-01-01'
   group by city,origin
   order by count(1) desc
   limit 20
   
   Time to Execute: 18.77 seconds
   Table Size: 59 GB (larger because Hudi keeps commit files, but the 
original data size is almost the same)
   Number of Partitions: 24
   Number of Files: 124
   Data Scanned by Athena: 480 MB
   
   It's a big performance difference.
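
   To put the two runs side by side, here is a rough Python sketch that uses only
   the figures reported above. It derives the slowdown factor, the average file
   size, and the effective scan rate for each table. The dictionary keys and the
   `summarize` helper are illustrative names, and 1 GB is assumed to be 1024 MB.
   The Hudi table size includes commit files, so its average file size is likely
   an overestimate.

```python
# Figures copied from the two test runs above (no new measurements).
runs = {
    "parquet": {"seconds": 6.19, "size_gb": 35.7, "files": 916, "scanned_mb": 512},
    "hudi": {"seconds": 18.77, "size_gb": 59.0, "files": 124, "scanned_mb": 480},
}


def summarize(run):
    """Derive per-table ratios; assumes 1 GB = 1024 MB."""
    return {
        # Total table size divided evenly across the reported file count.
        "avg_file_mb": run["size_gb"] * 1024 / run["files"],
        # Data scanned by Athena per second of wall-clock query time.
        "scan_mb_per_s": run["scanned_mb"] / run["seconds"],
    }


summary = {name: summarize(run) for name, run in runs.items()}
slowdown = runs["hudi"]["seconds"] / runs["parquet"]["seconds"]

for name, stats in summary.items():
    print(
        f"{name}: {stats['avg_file_mb']:.1f} MB per file, "
        f"{stats['scan_mb_per_s']:.1f} MB scanned per second"
    )
print(f"Hudi / Parquet execution time: {slowdown:.2f}x")
```

   Both queries scan a similar amount of data, so the gap shows up as a much
   lower effective scan rate on the Hudi table rather than as extra bytes read.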


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

