tarunguptanit opened a new issue, #6174:
URL: https://github.com/apache/hudi/issues/6174
I have a Hudi table that was created using Hudi 0.5.3.
The table is partitioned by year/month/date.
We recently upgraded the Hudi library to Hudi 0.9.0 and started noticing
performance issues while reading.
It seems that partition pruning is not happening when reading through Hudi
0.9.0.
The operation below takes ~1 second through Hudi 0.5.3.
```
scala> val hudiDirectory =
"s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS/2022/05/04/"
hudiDirectory: String =
s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS/2022/05/04/
scala> val hoodieIncrementalView =
spark.read.format("org.apache.hudi").load(hudiDirectory)
22/07/22 00:55:54 INFO DefaultSource: Constructing hoodie (as parquet) data
source with options :Map(hoodie.datasource.view.type -> read_optimized, path ->
s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS/2022/05/04/)
22/07/22 00:55:55 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
from s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS
22/07/22 00:55:55 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
[hdfs://ip-10-0-1-118.ec2.internal:8020], Config:[Configuration:
core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml,
yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml,
__spark_hadoop_conf__.xml, file:/etc/spark/conf.dist/hive-site.xml],
FileSystem: [S3AFileSystem{uri=s3a://podsofaupgradetesting-v2,
workingDir=s3a://podsofaupgradetesting-v2/user/hadoop, inputPolicy=normal,
partSize=104857600, enableMultiObjectsDelete=true, maxKeys=5000,
readAhead=65536, blockSize=33554432, multiPartThreshold=2147483647,
boundedExecutor=BlockingThreadPoolExecutorService{SemaphoredDelegatingExecutor{permitCount=25,
available=25, waiting=0}, activeCount=0},
unboundedExecutor=java.util.concurrent.ThreadPoolExecutor@68032a80[Running,
pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0],
statistics {93 bytes read, 0 bytes written, 7 read ops, 0 large read ops, 0
write ops}, metrics {{
Context=S3AFileSystem}
{FileSystemId=d9111a33-e2a3-4cd5-b278-fbff8ed2341a-podsofaupgradetesting-v2}
{fsURI=s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS/2022/05/04}
{files_created=0} {files_copied=0} {files_copied_bytes=0} {files_deleted=0}
{fake_directories_deleted=0} {directories_created=0} {directories_deleted=0}
{ignored_errors=0} {op_copy_from_local_file=0} {op_exists=3}
{op_get_file_status=6} {op_glob_status=0} {op_is_directory=1} {op_is_file=0}
{op_list_files=0} {op_list_located_status=0} {op_list_status=1} {op_mkdirs=0}
{op_rename=0} {object_copy_requests=0} {object_delete_requests=0}
{object_list_requests=5} {object_continue_list_requests=0}
{object_metadata_requests=10} {object_multipart_aborted=0} {object_put_bytes=0}
{object_put_requests=0} {object_put_requests_completed=0}
{stream_write_failures=0} {stream_write_block_uploads=0}
{stream_write_block_uploads_committed=0} {stream_write_block_uploads_aborted=0}
{stream_write_total_time=0} {stream_write_total_data=0}
{object_put_requests_active=0} {object_put_bytes_pending=0}
{stream_write_block_uploads_active=0} {stream_write_block_uploads_pending=0}
{stream_write_block_uploads_data_pending=0} {stream_read_fully_operations=0}
{stream_opened=1} {stream_bytes_skipped_on_seek=0} {stream_closed=1}
{stream_bytes_backwards_on_seek=0} {stream_bytes_read=93}
{stream_read_operations_incomplete=1} {stream_bytes_discarded_in_abort=0}
{stream_close_operations=1} {stream_read_operations=1} {stream_aborted=0}
{stream_forward_seek_operations=0} {stream_backward_seek_operations=0}
{stream_seek_operations=0} {stream_bytes_read_in_close=0}
{stream_read_exceptions=0} }}]
22/07/22 00:55:55 INFO HoodieTableConfig: Loading dataset properties from
s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS/.hoodie/hoodie.properties
22/07/22 00:55:55 INFO HoodieTableMetaClient: Finished Loading Table of type
COPY_ON_WRITE from s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS
22/07/22 00:55:56 INFO HoodieActiveTimeline: Loaded instants
java.util.stream.ReferencePipeline$Head@5af850f1
22/07/22 00:55:56 INFO HoodieTableFileSystemView: Adding file-groups for
partition :2022/05/04, #FileGroups=3
22/07/22 00:55:56 INFO AbstractTableFileSystemView: addFilesToView:
NumFiles=4, FileGroupsCreationTime=15, StoreTimeTaken=0
22/07/22 00:55:56 INFO HoodieROTablePathFilter: Based on hoodie metadata
from base path: s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS, caching 3
files under s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS/2022/05/04
hoodieIncrementalView: org.apache.spark.sql.DataFrame =
[_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 143 more fields]
```
The same set of commands through Hudi 0.9.0 takes ~15 seconds. Note that
Spark spins up 1500 tasks, which is why I think partition pruning is not
happening.
```
scala> val hudiDirectory =
"s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS/2022/05/04/"
hudiDirectory: String =
s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS/2022/05/04/
scala> val hoodieIncrementalView =
spark.read.format("org.apache.hudi").load(hudiDirectory)
[Stage 27:> (0 + 0)
/ 1]
scala> val hoodieIncrementalView =
spark.read.format("org.apache.hudi").load(hudiDirectory)
[Stage 29:=====================================================>(221 + 3) /
224]
scala> val hoodieIncrementalView =
spark.read.format("org.apache.hudi").load(hudiDirectory)
[Stage 30:=======================================> (1191 + 176) /
1500]
scala> val hoodieIncrementalView =
spark.read.format("org.apache.hudi").load(hudiDirectory)
hoodieIncrementalView: org.apache.spark.sql.DataFrame =
[_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 143 more fields]
```
Initially, I was getting a warning: `WARN HoodieFileIndex: No partition
columns available from hoodie.properties. Partition pruning will not work`. I
added the configuration `hoodie.table.partition.fields=partition_date` to the
table's hoodie.properties. The warning disappeared after that, but it still
takes ~15 seconds to read a single partition.
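For context, my understanding is that the 0.9.0 `HoodieFileIndex` prunes partitions from a filter on the partition column, rather than from a partition-directory path passed to `load()`. A minimal sketch of that pattern, assuming `partition_date` is the partition column configured above (the column name is my assumption from hoodie.properties):

```scala
// Sketch only: load the table base path (not the partition directory)
// and push the partition value down as a filter, so HoodieFileIndex
// can prune to a single partition.
val basePath = "s3a://podsofaupgradetesting-v2/HZ_CUST_ACCOUNTS"
val df = spark.read.format("org.apache.hudi")
  .load(basePath)
  .filter(col("partition_date") === "2022/05/04") // assumed partition column
```

I have not confirmed whether this avoids the 1500-task listing on this table; it is the pattern I would expect pruning to apply to.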
I tried to upgrade the Hudi table through the Hudi CLI using `upgrade table
--toVersion TWO`; however, that runs into problems similar to those described
in https://github.com/apache/hudi/issues/3894.
I want to understand whether the read performance can be restored by adding
any configuration.
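One configuration I am considering (assumption: I believe `hoodie.file.index.enable` exists as a read option in 0.9.0 to fall back from `HoodieFileIndex` to the older path-filter based listing, but I have not verified this on our build):

```scala
// Sketch only: disable the new file index, if the option is supported,
// to fall back to the 0.5.x-style HoodieROTablePathFilter listing.
val df = spark.read.format("org.apache.hudi")
  .option("hoodie.file.index.enable", "false") // assumed option name
  .load(hudiDirectory)
```

Please correct me if this option is wrong or has side effects.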
Please let me know if you need any additional information to troubleshoot
this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]