[GitHub] [hudi] Junyewu opened a new issue, #7417: [SUPPORT] With HoodieROTablePathFilter is too slow load normal parquets in hudi release

GitBox Thu, 08 Dec 2022 23:56:33 -0800


Junyewu opened a new issue, #7417:
URL: https://github.com/apache/hudi/issues/7417


   
   **Describe the problem you faced**
   
   With HoodieROTablePathFilter configured, loading normal parquet files 
becomes too slow once the data reaches a certain order of magnitude.
   
   For example: 500 partitions and 50,000 data files.
   
   data path: s3://bucket1/{baseDir}/{partitionDir}/{partitionDir}/{data file}
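
To make the scale above concrete, here is a minimal Python sketch of how the number of storage lookups grows when a path filter is invoked once per file. It is a toy model, not Hudi's implementation: `CountingFilter`, `make_paths`, `count_lookups`, the one-level `base/part=NNNN/` layout, and the even split of 100 files per partition are all hypothetical and chosen only to match the 500 partitions and 50,000 files mentioned above.

```python
from pathlib import PurePosixPath


def make_paths(num_partitions: int, files_per_partition: int) -> list:
    """Build hypothetical file paths laid out as base/partition/file."""
    return [
        f"base/part={p:04d}/file-{f:04d}.parquet"
        for p in range(num_partitions)
        for f in range(files_per_partition)
    ]


class CountingFilter:
    """Toy path filter that counts simulated storage metadata lookups."""

    def __init__(self, cache_by_directory: bool) -> None:
        self.cache_by_directory = cache_by_directory
        self.lookups = 0
        self._seen_dirs = set()

    def _storage_lookup(self, directory: str) -> bool:
        # Stands in for a remote metadata call (assumption: one call per check).
        self.lookups += 1
        return True

    def accept(self, path: str) -> bool:
        directory = str(PurePosixPath(path).parent)
        if self.cache_by_directory:
            if directory in self._seen_dirs:
                # Directory already checked, so no further lookup is needed.
                return True
            self._seen_dirs.add(directory)
        return self._storage_lookup(directory)


def count_lookups(paths: list, cache_by_directory: bool) -> int:
    """Run every path through a fresh filter and return the lookup count."""
    flt = CountingFilter(cache_by_directory)
    accepted = sum(1 for p in paths if flt.accept(p))
    assert accepted == len(paths)
    return flt.lookups


if __name__ == "__main__":
    paths = make_paths(500, 100)
    print(count_lookups(paths, cache_by_directory=False))
    print(count_lookups(paths, cache_by_directory=True))
```

Under these assumptions the uncached filter performs one lookup per file (50,000) and the directory-cached one performs one per partition (500). On S3 each lookup would be a remote request, so the per-file count is what dominates the load time in this model.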
   
   
   
   **To Reproduce**
   Steps to reproduce the behavior:
   1. Submit the Spark application:
   ```
   spark-sql --master yarn \
   --conf spark.hadoop.mapreduce.input.pathFilter.class=org.apache.hudi.hadoop.HoodieROTablePathFilter
   ```
   
   2. Create a temporary view:
   ```
   create or replace temporary view {user_view} using parquet options (path "s3://bucket1/{baseDir}/");
   ```
   
   The load is then very slow.
   
   
   **Environment Description**
   
   * Hudi version : 0.10.0
   
   * Spark version : 3.1.1
   
   * Hive version : 3.1.2
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) :  S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   Applying the PR [https://github.com/apache/hudi/pull/3719] mitigates this 
problem. Running the same statement again:
   ```
   create or replace temporary view {user_view} using parquet options (path "s3://bucket1/{baseDir}/");
   ```
   finishes in about 60 seconds:
   ```
   22/12/09 14:01:41 WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints (spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may impact query planning performance.
   
   Time taken: 61.771 seconds
   ```
   
   
   At the same time, we have not been able to reproduce the problem described in 
[https://github.com/apache/hudi/issues/4188]. In our Spark cluster, this PR 
[HUDI-3719] has been used to query partitioned tables for half a year, for example:
   ```
   ==create table==
   CREATE EXTERNAL TABLE `pickinglogs`(
     `_hoodie_commit_time` string COMMENT '',
     `_hoodie_commit_seqno` string COMMENT '',
     `_hoodie_record_key` string COMMENT '',
     `_hoodie_partition_path` string COMMENT '',
     `_hoodie_file_name` string COMMENT '',
     `id` string COMMENT 'ID',
   
   .......
   
     `meta_es_offset` string COMMENT '',
     `meta_type` string COMMENT '',
     `meta_status` int COMMENT '',
     `meta_md5` string COMMENT '',
     `ptk_time_create` string COMMENT '')
   PARTITIONED BY (
     `year` string COMMENT '',
     `month` string COMMENT '',
     `day` string COMMENT '')
   ROW FORMAT SERDE
     'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
   STORED AS INPUTFORMAT
     'org.apache.hudi.hadoop.HoodieParquetInputFormat'
   OUTPUTFORMAT
     'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
   LOCATION
   
   
   
   ==query for sparksql==
   spark-sql> select count(id) from pickinglogs where year=2022 and month between '08' and '10';
   441834287
   Time taken: 22.095 seconds, Fetched 1 row(s)
   
   ```
   
   
   
   **Stacktrace**
   
   
   

