codejoyan opened a new issue #3581: URL: https://github.com/apache/hudi/issues/3581
**Environment**
- Hudi version: 0.7
- Spark: 2.4.7
- DFS: Google Cloud Storage

**Issue**

The Hudi table is partitioned by day and has around 95 partitions. Each partition is between 5 GB and 15 GB, and the total table size is around 930 GB. When I run a query (`count(*)`, `count(distinct)`, `select *`) against a single day partition with default configurations in Hudi 0.8.0, it takes around 3 minutes. This was very slow, so I tried the two approaches below:

1. Compression
2. The `hoodie.file.index.enable` property

**Sample query:**

```sql
select count(distinct visit_nbr) from cntry_visit_hudi_tgt where cntry = 'US' and part_date = '2020-11-02'
```

```
bash-4.2$ gsutil du -sh gs://hudi-storage/hudi_table_test_v1
931.61 GiB  gs://hudi-storage/hudi_table_test_v1
7.07 GiB    gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-02/
6.93 GiB    gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-03/
6.14 GiB    gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-04/
6.61 GiB    gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-05/
7.42 GiB    gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-06/
8.58 GiB    gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-07/
8.11 GiB    gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-08/
6.78 GiB    gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-09/
6.63 GiB    gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-10/
```

**Approach 1:** Used Snappy compression with a compression ratio of 0.95 while writing the data with 0.8.0. This reduced the table size to 540 GB, but queries showed only marginal improvement. From the Spark UI it appears that file listing is taking most of the time. Note that the `hoodie.file.index.enable` property is not available in 0.8.0.

```
bash-4.2$ gsutil du -sh gs://hudi-storage/hudi_table_test_v1
540.21 GiB  gs://hudi-storage/hudi_table_test_v1
```

```scala
option("hoodie.parquet.compression.codec", "SNAPPY").
option("hoodie.parquet.compression.ratio", "0.95").
```
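For reference, the write that produced the compressed table looked roughly like the sketch below. The table name, record key field, and save mode are placeholders for this report; the two compression options are the ones shown above.

```scala
import org.apache.spark.sql.SaveMode

// Sketch of the Approach 1 write; table name, record key, and save mode
// are placeholders -- only the compression options are the real settings.
df.write.format("hudi").
  option("hoodie.table.name", "cntry_visit_hudi_tgt").             // placeholder
  option("hoodie.datasource.write.recordkey.field", "visit_nbr").  // placeholder
  option("hoodie.datasource.write.partitionpath.field", "cntry,part_date").
  option("hoodie.parquet.compression.codec", "SNAPPY").
  option("hoodie.parquet.compression.ratio", "0.95").
  mode(SaveMode.Overwrite).                                        // placeholder
  save("gs://hudi-storage/hudi_table_test_v1")
```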
<img width="1703" alt="Screenshot 2021-09-01 at 6 12 04 PM" src="https://user-images.githubusercontent.com/48707638/131673053-4c1bb20c-9c86-48a2-98af-94da9417ec84.png">

**Approach 2:** Tried `hudi-spark-bundle_2.12-0.9.0-SNAPSHOT.jar` and set `hoodie.file.index.enable = true` while reading:

```scala
val df = spark.read.format("hudi")
  .option("hoodie.file.index.enable", "true")
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
  .load("gs://hudi-storage/hudi_table_test_v1")

df.createOrReplaceTempView("cntry_visit_hudi_tgt")

spark.sql("select count(distinct visit_nbr) from cntry_visit_hudi_tgt where cntry = 'US' and part_date = '2020-11-02'").show(false)
```

<img width="1709" alt="Screenshot 2021-09-01 at 6 13 50 PM" src="https://user-images.githubusercontent.com/48707638/131673284-34e1a495-d4e0-41e7-a9c3-9b5bee501400.png">

It still appears that most of the time is spent in file listing.
