BruceKellan opened a new issue, #7643:
URL: https://github.com/apache/hudi/issues/7643

   **Describe the problem you faced**
   
   We are testing the Hudi connector on a copy-on-write table using Trino 405 
(the latest stable version), but we ran into a serious performance problem. 
   Our tables will have a very large number of partitions, so we made a 
minimal test set for this.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   test table data:
   
[hudi_reproduce.tar.gz](https://github.com/apache/hudi/files/10389539/hudi_reproduce.tar.gz)
   desc: 
   This test table has many partitions and is partitioned by `day` and `type`. There are 
657 rows of data in total.
   
   <img width="342" alt="image" src="https://user-images.githubusercontent.com/13477122/211738768-09bc6156-bb49-4e0f-82ba-5c058220ee89.png">
   <img width="690" alt="image" src="https://user-images.githubusercontent.com/13477122/211742605-050b65b0-d09b-4618-90a5-98b5ebd3f8a1.png">
   
   
   1. Import the data and run a HiveQL statement to repair the partitions.
   ```sql
   CREATE EXTERNAL TABLE `website.hudi_reproduce`(
     `_hoodie_commit_time` string,
     `_hoodie_commit_seqno` string,
     `_hoodie_record_key` string,
     `_hoodie_partition_path` string,
     `_hoodie_file_name` string,
     `uniquekey` string)
   PARTITIONED BY (
     `day` bigint,
     `type` bigint)
   ROW FORMAT SERDE
     'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
   WITH SERDEPROPERTIES (
     'hoodie.query.as.ro.table'='false',
     'path'='hdfs://xxx/hudi/warehouse/hudi_reproduce')
   STORED AS INPUTFORMAT
     'org.apache.hudi.hadoop.HoodieParquetInputFormat'
   OUTPUTFORMAT
     'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
   LOCATION
     'hdfs://xxx/hudi/warehouse/hudi_reproduce'
   TBLPROPERTIES (
     'last_commit_time_sync'='20230111113655773',
     'last_modified_time'='1673406649',
     'spark.sql.sources.provider'='hudi',
     'spark.sql.sources.schema.numPartCols'='2',
     'spark.sql.sources.schema.numParts'='1',
     'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},{"name":"uniqueKey","type":"string","nullable":true,"metadata":{}},{"name":"day","type":"long","nullable":true,"metadata":{}},{"name":"type","type":"long","nullable":true,"metadata":{}}]}',
     'spark.sql.sources.schema.partCol.0'='day',
     'spark.sql.sources.schema.partCol.1'='type',
     'transient_lastDdlTime'='1673406649');

   -- repair partitions.
   MSCK REPAIR TABLE website.hudi_reproduce;
   ```
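   As a sanity check after the repair, listing the partitions in the same Hive session should show all of the `day`/`type` combinations present in the test data (a minimal check; the expected count depends on the uploaded dataset):
   ```sql
   -- List the partitions the Hive metastore now knows about;
   -- the result should match the day/type directories on HDFS.
   SHOW PARTITIONS website.hudi_reproduce;
   ```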
   
   2. Run the query in Trino:
   ```sql
   -- we want to query the data where type is between 1 and 9
   -- and day is between 20230101 and 20230104
   select count(1)
   from hudi.website.hudi_reproduce
   where day between 20230101 and 20230104
     and type between 1 and 9;
   ```
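   One way to check whether the partition predicates are reaching the connector is to look at the query plan (a sketch; the exact plan shape and pushdown details depend on the Trino version and Hudi connector):
   ```sql
   -- If partition pruning works, the scan should only cover the
   -- day/type combinations in range, not all 657 rows' partitions.
   EXPLAIN
   select count(1)
   from hudi.website.hudi_reproduce
   where day between 20230101 and 20230104
     and type between 1 and 9;
   ```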
   
   3. Observe that the query is too slow:
   <img width="1261" alt="image" src="https://user-images.githubusercontent.com/13477122/211741572-b58cbc3f-50c5-4fb0-8ad7-83cf53a12e81.png">
   
   **Expected behavior**
   
   The query should run as fast as it does against the equivalent Hive table.
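   
   For a baseline, assuming a `hive` catalog pointing at the same metastore is also configured in this Trino cluster (an assumption; catalog names vary per deployment), the same query through the Hive connector can be timed for comparison:
   ```sql
   -- Baseline: identical predicate through the hive connector.
   select count(1)
   from hive.website.hudi_reproduce
   where day between 20230101 and 20230104
     and type between 1 and 9;
   ```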
   
   **Environment Description**
   
   * Hudi version : 0.12.1
   
   * Hive version : 2.3.9
   
   * Hadoop version : 2.8.5
   
   * Trino version: 405
   
   * Number of trino worker: 8
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   **Additional context**
   
   Sharing our Trino server.log; we hope this helps.
   
[hudi_reproduce_trino_server_log.log](https://github.com/apache/hudi/files/10389658/hudi_reproduce_trino_server_log.log)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to