Junyewu opened a new issue, #7417:
URL: https://github.com/apache/hudi/issues/7417
**Describe the problem you faced**
When plain Parquet files are loaded with HoodieROTablePathFilter configured, the load
becomes very slow once the number of files reaches a certain order of magnitude.
For example: 500 partitions and 50000 data files.
Data path: s3://bucket1/{baseDir}/{partitionDir}/{partitionDir}/{data file}
**To Reproduce**
Steps to reproduce the behavior:
1. Submit the Spark application
```
spark-sql --master yarn \
--conf spark.hadoop.mapreduce.input.pathFilter.class=org.apache.hudi.hadoop.HoodieROTablePathFilter
```
2. Create a temporary view
```
create or replace temporary view {user_view} using parquet options (path
"s3://bucket1/{baseDir}/");
```
The load is then very slow.
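To put a number on the slowdown, the two steps above can be combined into one timed, non-interactive run. This is only a sketch: the view name `user_view`, the `RUN`, `BASE_PATH` and `VIEW_NAME` variables, and the `build_sql` helper are hypothetical names introduced here, and the `{baseDir}` placeholder is left exactly as in the report. By default the script only prints the statement; set `RUN=1` to actually execute it through `spark-sql -e`.

```shell
#!/usr/bin/env bash
# Hypothetical reproduction helper (not part of the original report).
# All names here are placeholders; {baseDir} is kept as an unfilled placeholder.
DEFAULT_PATH='s3://bucket1/{baseDir}/'
BASE_PATH="${BASE_PATH:-$DEFAULT_PATH}"
VIEW_NAME="${VIEW_NAME:-user_view}"

# Build the CREATE VIEW statement from step 2 for a given view name and path.
build_sql() {
  printf 'create or replace temporary view %s using parquet options (path "%s");' "$1" "$2"
}

SQL="$(build_sql "$VIEW_NAME" "$BASE_PATH")"

if [ "${RUN:-0}" = "1" ]; then
  # Same settings as step 1; `time` reports wall-clock time for the whole run,
  # which also includes Spark startup.
  time spark-sql --master yarn \
    --conf spark.hadoop.mapreduce.input.pathFilter.class=org.apache.hudi.hadoop.HoodieROTablePathFilter \
    -e "$SQL"
else
  # Dry run: only show the statement that would be executed.
  echo "$SQL"
fi
```

Running it once with and once without the `--conf` line gives a rough comparison of how much of the time is attributable to the path filter.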
**Environment Description**
* Hudi version : 0.10.0
* Spark version : 3.1.1
* Hive version : 3.1.2
* Hadoop version : 3.2.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
**Additional context**
Applying the PR [https://github.com/apache/hudi/pull/3719] mitigates this
problem. Running the following again:
```
create or replace temporary view {user_view} using parquet options (path
"s3://bucket1/{baseDir}/");
```
finishes in about 60 seconds:
```
22/12/09 14:01:41 WARN SharedInMemoryCache: Evicting cached table partition
metadata from memory due to size constraints
(spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may
impact query planning performance.
Time taken: 61.771 seconds
```
At the same time, we have not been able to reproduce the problem reported in
[https://github.com/apache/hudi/issues/4188]. In our Spark cluster, this PR ([HUDI-3719])
has been used to query partitioned tables for half a year, for example:
```
==create table==
CREATE EXTERNAL TABLE `pickinglogs`(
`_hoodie_commit_time` string COMMENT '',
`_hoodie_commit_seqno` string COMMENT '',
`_hoodie_record_key` string COMMENT '',
`_hoodie_partition_path` string COMMENT '',
`_hoodie_file_name` string COMMENT '',
`id` string COMMENT 'ID',
.......
`meta_es_offset` string COMMENT '',
`meta_type` string COMMENT '',
`meta_status` int COMMENT '',
`meta_md5` string COMMENT '',
`ptk_time_create` string COMMENT '')
PARTITIONED BY (
`year` string COMMENT '',
`month` string COMMENT '',
`day` string COMMENT '')
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
==query for sparksql==
spark-sql> select count(id) from pickinglogs where year=2022 and month
between '08' and '10';
441834287
Time taken: 22.095 seconds, Fetched 1 row(s)
```
**Stacktrace**