tarunguptanit opened a new issue, #6174:
URL: https://github.com/apache/hudi/issues/6174
I have a Hudi table that was created using Hudi 0.5.3.
The table is partitioned by year/month/date.
We recently upgraded the Hudi library to Hudi 0.9.0 and started noticing
performance issues while reading.
It seems that partition pruning is not happening when reading through Hudi
0.9.0.
The operation below takes ~1 second through Hudi 0.5.3.
```
scala> val hudiDirectory =
"s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS/2022/05/04/"
hudiDirectory: String =
s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS/2022/05/04/
scala> val hoodieIncrementalView =
spark.read.format("org.apache.hudi").load(hudiDirectory)
22/07/22 00:55:54 INFO DefaultSource: Constructing hoodie (as parquet) data
source with options :Map(hoodie.datasource.view.type -> read_optimized, path ->
s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS/2022/05/04/)
22/07/22 00:55:55 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
from s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS
22/07/22 00:55:55 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
[hdfs://ip-10-0-1-118.ec2.internal:8020], Config:[Configuration:
core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml,
yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml,
__spark_hadoop_conf__.xml, file:/etc/spark/conf.dist/hive-site.xml],
FileSystem: [S3AFileSystem{uri=s3a://podsofaupgradetesting-v2,
workingDir=s3a://podsofaupgradetesting-v2/user/hadoop, inputPolicy=normal,
partSize=104857600, enableMultiObjectsDelete=true, maxKeys=5000,
readAhead=65536, blockSize=33554432, multiPartThreshold=2147483647,
boundedExecutor=BlockingThreadPoolExecutorService{SemaphoredDelegatingExecutor{permitCount=25,
available=25, waiting=0}, activeCount=0},
unboundedExecutor=java.util.concurrent.ThreadPoolExecutor@68032a80[Running,
pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0],
statistics {93 bytes read, 0 bytes written, 7 read ops, 0 large read ops, 0
write ops}, metrics {{
Context=S3AFileSystem}
{FileSystemId=d9111a33-e2a3-4cd5-b278-fbff8ed2341a-podsofaupgradetesting-v2}
{fsURI=s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS/2022/05/04}
{files_created=0} {files_copied=0} {files_copied_bytes=0} {files_deleted=0}
{fake_directories_deleted=0} {directories_created=0} {directories_deleted=0}
{ignored_errors=0} {op_copy_from_local_file=0} {op_exists=3}
{op_get_file_status=6} {op_glob_status=0} {op_is_directory=1} {op_is_file=0}
{op_list_files=0} {op_list_located_status=0} {op_list_status=1} {op_mkdirs=0}
{op_rename=0} {object_copy_requests=0} {object_delete_requests=0}
{object_list_requests=5} {object_continue_list_requests=0}
{object_metadata_requests=10} {object_multipart_aborted=0} {object_put_bytes=0}
{object_put_requests=0} {object_put_requests_completed=0}
{stream_write_failures=0} {stream_write_block_uploads=0}
{stream_write_block_uploads_committed=0} {stream_write_block_uploads_aborted=0}
{stream_write_total_time=0} {stream_write_total_data=0}
{object_put_requests_active=0} {object_put_bytes_pending=0}
{stream_write_block_uploads_active=0} {stream_write_block_uploads_pending=0}
{stream_write_block_uploads_data_pending=0} {stream_read_fully_operations=0}
{stream_opened=1} {stream_bytes_skipped_on_seek=0} {stream_closed=1}
{stream_bytes_backwards_on_seek=0} {stream_bytes_read=93}
{stream_read_operations_incomplete=1} {stream_bytes_discarded_in_abort=0}
{stream_close_operations=1} {stream_read_operations=1} {stream_aborted=0}
{stream_forward_seek_operations=0} {stream_backward_seek_operations=0}
{stream_seek_operations=0} {stream_bytes_read_in_close=0}
{stream_read_exceptions=0} }}]
22/07/22 00:55:55 INFO HoodieTableConfig: Loading dataset properties from
s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS/.hoodie/hoodie.properties
22/07/22 00:55:55 INFO HoodieTableMetaClient: Finished Loading Table of type
COPY_ON_WRITE from s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS
22/07/22 00:55:56 INFO HoodieActiveTimeline: Loaded instants
java.util.stream.ReferencePipeline$Head@5af850f1
22/07/22 00:55:56 INFO HoodieTableFileSystemView: Adding file-groups for
partition :2022/05/04, #FileGroups=3
22/07/22 00:55:56 INFO AbstractTableFileSystemView: addFilesToView:
NumFiles=4, FileGroupsCreationTime=15, StoreTimeTaken=0
22/07/22 00:55:56 INFO HoodieROTablePathFilter: Based on hoodie metadata
from base path: s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS, caching 3
files under s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS/2022/05/04
hoodieIncrementalView: org.apache.spark.sql.DataFrame =
[_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 143 more fields]
```
The same set of commands through Hudi 0.9.0 takes ~15 seconds. Note that
Spark spins up 1500 tasks, which is why I think partition pruning is not
happening.
```
scala> val hudiDirectory =
"s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS/2022/05/04/"
hudiDirectory: String =
s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS/2022/05/04/
scala> val hoodieIncrementalView =
spark.read.format("org.apache.hudi").load(hudiDirectory)
[Stage 27:> (0 + 0)
/ 1]
scala> val hoodieIncrementalView =
spark.read.format("org.apache.hudi").load(hudiDirectory)
[Stage 29:=====================================================>(221 + 3) /
224]
scala> val hoodieIncrementalView =
spark.read.format("org.apache.hudi").load(hudiDirectory)
[Stage 30:=======================================> (1191 + 176) /
1500]
scala> val hoodieIncrementalView =
spark.read.format("org.apache.hudi").load(hudiDirectory)
hoodieIncrementalView: org.apache.spark.sql.DataFrame =
[_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 143 more fields]
```
Initially, I was getting a warning: `WARN HoodieFileIndex: No partition
columns available from hoodie.properties. Partition pruning will not work`. I
added the configuration `hoodie.table.partition.fields=partition_date` to the
table's hoodie.properties. The warning disappeared after that, but it still
takes ~15 seconds to read a single partition.
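For context, my understanding is that the 0.9.0 `HoodieFileIndex` prunes partitions from a filter on the partition column, rather than from a partition-directory path passed to `load()`. A minimal sketch of that pattern, assuming `partition_date` is the partition column configured above (the column name is my assumption from hoodie.properties):

```scala
// Sketch only: load the table base path (not the partition directory)
// and push the partition value down as a filter, so HoodieFileIndex
// can prune to a single partition.
val basePath = "s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS"
val df = spark.read.format("org.apache.hudi")
  .load(basePath)
  .filter(col("partition_date") === "2022/05/04") // assumed partition column
```

I have not confirmed whether this avoids the 1500-task listing on this table; it is the pattern I would expect pruning to apply to.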
I tried to upgrade the Hudi table through the Hudi CLI using `upgrade table
--toVersion TWO`; however, that runs into problems similar to those described
in https://github.com/apache/hudi/issues/3894.
I want to understand whether the read performance can be restored by adding
any configuration.
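One configuration I am considering (assumption: I believe `hoodie.file.index.enable` exists as a read option in 0.9.0 to fall back from `HoodieFileIndex` to the older path-filter based listing, but I have not verified this on our build):

```scala
// Sketch only: disable the new file index, if the option is supported,
// to fall back to the 0.5.x-style HoodieROTablePathFilter listing.
val df = spark.read.format("org.apache.hudi")
  .option("hoodie.file.index.enable", "false") // assumed option name
  .load(hudiDirectory)
```

Please correct me if this option is wrong or has side effects.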
Please let me know if you need any additional information to troubleshoot
this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]