codejoyan opened a new issue #3581:
URL: https://github.com/apache/hudi/issues/3581


   **Environment**
   - Hudi Version - 0.7
   - Spark - 2.4.7
   - DFS - Google Cloud Storage
   
   **Issue**
   
   The Hudi table is partitioned by day and has around 95 partitions. Each partition is between 5 GB and 15 GB, and the total size is around 930 GB. When I run a query (count(*), count(distinct), select *) on a single day partition with default configurations in Hudi 0.8.0, it takes around 3 minutes. This was very slow, so I tried the two approaches below.
   1. Compression
   2. The `hoodie.file.index.enable` property
   
   **Sample Query:**
   ```
   select count(distinct visit_nbr) from cntry_visit_hudi_tgt where cntry = 'US' and part_date = '2020-11-02'
   ```
   
   ```
   bash-4.2$ gsutil du -sh gs://hudi-storage/hudi_table_test_v1
   931.61 GiB   gs://hudi-storage/hudi_table_test_v1

   7.07 GiB   gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-02/
   6.93 GiB   gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-03/
   6.14 GiB   gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-04/
   6.61 GiB   gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-05/
   7.42 GiB   gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-06/
   8.58 GiB   gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-07/
   8.11 GiB   gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-08/
   6.78 GiB   gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-09/
   6.63 GiB   gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-10/
   ```
   
   ![image 
(1)](https://user-images.githubusercontent.com/48707638/131673155-ccaeb72b-8d03-4f1a-bc3f-60cc2e845d06.png)
   
   **Approach 1:**
   Used Snappy compression with a compression ratio of 0.95 in 0.8.0 while writing the data. This reduced the table size to 540 GB, but the query was only marginally faster. From the Spark UI it appeared that file listing was taking most of the time, and the `hoodie.file.index.enable` property is not available in 0.8.0.
    
   ```
   bash-4.2$ gsutil du -sh gs://hudi-storage/hudi_table_test_v1
   540.21 GiB   gs://hudi-storage/hudi_table_test_v1
   ```

   ```
   option("hoodie.parquet.compression.codec", "SNAPPY").
   option("hoodie.parquet.compression.ratio", "0.95").
   ```
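   For context, a minimal sketch of a full write with these two compression options in place. The record key, precombine, and partition fields below are assumptions for illustration, not this table's actual settings; only the compression options and the table path come from this issue.

   ```
   // Hypothetical write showing where the compression options go.
   // Key/precombine/partition fields are assumed, not the real table config.
   import org.apache.spark.sql.SaveMode

   inputDf.write.format("hudi")
     .option("hoodie.table.name", "hudi_table_test_v1")
     .option("hoodie.datasource.write.recordkey.field", "visit_nbr")
     .option("hoodie.datasource.write.precombine.field", "visit_nbr")
     .option("hoodie.datasource.write.partitionpath.field", "cntry,part_date")
     .option("hoodie.parquet.compression.codec", "SNAPPY")
     .option("hoodie.parquet.compression.ratio", "0.95")
     .mode(SaveMode.Append)
     .save("gs://hudi-storage/hudi_table_test_v1")
   ```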
   
   <img width="1703" alt="Screenshot 2021-09-01 at 6 12 04 PM" src="https://user-images.githubusercontent.com/48707638/131673053-4c1bb20c-9c86-48a2-98af-94da9417ec84.png">
   
   **Approach 2:**
   Tried using hudi-spark-bundle_2.12-0.9.0-SNAPSHOT.jar and set "hoodie.file.index.enable" to true while reading:
   
   ```
   val df = spark.read.format("hudi")
     .option("hoodie.file.index.enable", "true")
     .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
     .load("gs://hudi-storage/hudi_table_test_v1")
   df.createOrReplaceTempView("cntry_visit_hudi_tgt")
   spark.sql("select count(distinct visit_nbr) from cntry_visit_hudi_tgt where cntry = 'US' and part_date = '2020-11-02'").show(false)
   ```
   <img width="1709" alt="Screenshot 2021-09-01 at 6 13 50 PM" src="https://user-images.githubusercontent.com/48707638/131673284-34e1a495-d4e0-41e7-a9c3-9b5bee501400.png">
   It still appears that most of the query time is spent in file listing.
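   One related option I have not tried yet: Hudi's metadata table (available since 0.7.0) is meant to serve file listings from the table's metadata instead of a storage `LIST` call. A minimal sketch of enabling it on the read path, assuming the same table; whether it helps depends on the metadata table having been populated at write time:

   ```
   // Sketch: metadata-table-backed file listing on the read path.
   // hoodie.metadata.enable is a standard Hudi config; table path is this issue's.
   val dfMeta = spark.read.format("hudi")
     .option("hoodie.metadata.enable", "true")
     .load("gs://hudi-storage/hudi_table_test_v1")
   dfMeta.createOrReplaceTempView("cntry_visit_hudi_tgt")
   spark.sql(
     "select count(distinct visit_nbr) from cntry_visit_hudi_tgt " +
     "where cntry = 'US' and part_date = '2020-11-02'"
   ).show(false)
   ```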
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

