rubenssoto commented on issue #1981: URL: https://github.com/apache/hudi/issues/1981#issuecomment-677672123
I ran more tests, this time on the same table; the only difference is the partitioning strategy. I used Athena for both queries.

Table01 with regular Parquet

query: select city,origin, count(1) from parquet_demand_coverage where created_date_brt >= '2020-01-01' group by city,origin order by count(1) desc limit 20

Time to execute: 6.19 seconds
Table size: 35.7 GB
Number of partitions: 693
Number of files: 916
Partitioned by: day
Data scanned by Athena: 512 MB

Table02 with Hudi

query: select city,origin, count(1) from demand_coverage where created_year_month_brt >= '2020-01-01' group by city,origin order by count(1) desc limit 20

Time to execute: 18.77 seconds
Table size: 59 GB (the larger size is because Hudi keeps commit files; the original data size is almost the same)
Number of partitions: 24
Number of files: 124
Data scanned by Athena: 480 MB

That is a big performance difference: the Hudi query takes about 3x as long (18.77 s vs. 6.19 s) even though it scans slightly less data.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: [email protected]
