ganczarek commented on issue #4656:
URL: https://github.com/apache/hudi/issues/4656#issuecomment-1057245313


   @nsivabalan Thank you for your reply.
   
   Regarding your question about table metadata: it was enabled during writes 
(`HoodieMetadataConfig.ENABLE.key -> "true"`), but I disabled it for reads. My 
initial intuition was to use table metadata, but it didn't bring much 
improvement. I think that scanning the HFile in 
[HoodieHFileReader::getRecordByKey](https://github.com/apache/hudi/blob/69ee790a47a5fa90a6acd954a9330cce3ae31c3b/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java#L249)
 for each partition with block caching disabled may make the whole process 
take longer.
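   
   A minimal sketch of the write/read setup described above, expressed as two 
option maps. The string key `hoodie.metadata.enable` is what 
`HoodieMetadataConfig.ENABLE.key` resolves to; the names `MetadataOptions`, 
`writeOpts` and `readOpts` are hypothetical, and the Spark calls in the 
comments are only an assumption about how the maps would be passed:
   ```scala
   object MetadataOptions {
     // String form of HoodieMetadataConfig.ENABLE.key
     val MetadataEnable = "hoodie.metadata.enable"

     // Writer side: metadata table enabled
     val writeOpts: Map[String, String] = Map(MetadataEnable -> "true")

     // Reader side: metadata table disabled
     val readOpts: Map[String, String] = Map(MetadataEnable -> "false")
   }

   // Hypothetical usage with Spark (assumed, not executed here):
   //   df.write.format("hudi").options(MetadataOptions.writeOpts) ...
   //   spark.read.format("hudi").options(MetadataOptions.readOpts).load(basePath)
   ```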
   
   I ran `org.apache.hudi.utilities.HoodieCleaner` with the configs you 
suggested, but the clean operation did nothing and finished after 30 seconds:
   ```
   22/03/02 16:27:39 INFO AbstractTableFileSystemView: Took 9902 ms to read  17 
instants, 15201 replaced file groups
   22/03/02 16:27:39 INFO ClusteringUtils: Found 0 files in pending clustering 
operations
   22/03/02 16:27:39 INFO S3NativeFileSystem: Opening 
's3://bucket/table/.hoodie/20220124110227018.clean' for reading
   22/03/02 16:27:40 INFO CleanPlanner: Incremental Cleaning mode is enabled. 
Looking up partition-paths that have since changed since last cleaned at 
20220119150624588. New Instant to retain : 
Option{val=[20220119150624588__commit__COMPLETED]}
   22/03/02 16:27:40 INFO CleanPlanner: Nothing to clean here. It is already 
clean
   ```
   
   I lowered the config values and ran HoodieCleaner again. This time I could 
see that it actually did something. The config parameters I used:
   ```
   hoodie.cleaner.commits.retained = 5
   hoodie.keep.min.commits = 6
   hoodie.keep.max.commits = 7
   ```
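   
   As far as I understand, Hudi expects these three settings to be strictly 
ordered (`hoodie.cleaner.commits.retained` < `hoodie.keep.min.commits` < 
`hoodie.keep.max.commits`), and the values above satisfy that. A small sketch 
of the check, where `RetentionCheck`, `isConsistent` and `lowered` are 
hypothetical names and the ordering rule itself is an assumption worth 
verifying against the Hudi version in use:
   ```scala
   object RetentionCheck {
     // Assumed rule: retained < keep.min < keep.max (not taken from Hudi source)
     def isConsistent(conf: Map[String, String]): Boolean = {
       val retained = conf("hoodie.cleaner.commits.retained").toInt
       val keepMin  = conf("hoodie.keep.min.commits").toInt
       val keepMax  = conf("hoodie.keep.max.commits").toInt
       retained < keepMin && keepMin < keepMax
     }

     // The lowered values from the cleaner run above
     val lowered: Map[String, String] = Map(
       "hoodie.cleaner.commits.retained" -> "5",
       "hoodie.keep.min.commits"         -> "6",
       "hoodie.keep.max.commits"         -> "7"
     )

     def main(args: Array[String]): Unit =
       println(isConsistent(lowered))
   }
   ```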
   
   I can see that the read loads the latest instant 
(`20220302163151203__clean__COMPLETED`), but this had no impact on read 
performance whatsoever:
   ```
   22/03/02 16:35:27 INFO HoodieActiveTimeline: Loaded instants upto : 
Option{val=[20220302163151203__clean__COMPLETED]}
   22/03/02 16:35:27 INFO FileSystemViewManager: Creating InMemory based view 
for basePath s3://bucket/table
   22/03/02 16:35:35 INFO AbstractTableFileSystemView: Took 8784 ms to read  17 
instants, 15201 replaced file groups
   22/03/02 16:35:35 INFO ClusteringUtils: Found 0 files in pending clustering 
operations
   22/03/02 16:35:35 INFO AbstractTableFileSystemView: Building file system 
view for partition (date=2022-01-01/auditsource=auth/audittype=requestreceived)
   22/03/02 16:35:35 INFO AbstractTableFileSystemView: addFilesToView: 
NumFiles=40, NumFileGroups=39, FileGroupsCreationTime=3, StoreTimeTaken=0
   22/03/02 16:35:35 INFO HoodieROTablePathFilter: Based on hoodie metadata 
from base path: s3://bucket/table, caching 39 files under 
s3://bucket/table/date=2022-01-01/source=test/type=test
   22/03/02 16:35:44 INFO AbstractTableFileSystemView: Took 8541 ms to read  17 
instants, 15201 replaced file groups
   ```
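   
   The excerpt above reports the timeline read ("Took ... ms to read  17 
instants") twice within a single read. A rough sketch for summing those 
durations from the log follows; `TimelineReadTime`, `totalMillis` and `sample` 
are hypothetical names, and the sample lines are the ones above with the 
e-mail line wrapping undone:
   ```scala
   object TimelineReadTime {
     // Matches the AbstractTableFileSystemView timing line; group 1 is the duration in ms
     private val Took =
       """Took (\d+) ms to read\s+(\d+) instants, (\d+) replaced file groups""".r

     // Sum the milliseconds reported by every matching log line
     def totalMillis(lines: Seq[String]): Long =
       lines.flatMap(l => Took.findFirstMatchIn(l).map(_.group(1).toLong)).sum

     val sample: Seq[String] = Seq(
       "22/03/02 16:35:35 INFO AbstractTableFileSystemView: Took 8784 ms to read  17 instants, 15201 replaced file groups",
       "22/03/02 16:35:35 INFO ClusteringUtils: Found 0 files in pending clustering operations",
       "22/03/02 16:35:44 INFO AbstractTableFileSystemView: Took 8541 ms to read  17 instants, 15201 replaced file groups"
     )

     def main(args: Array[String]): Unit =
       println(totalMillis(sample))
   }
   ```
   
   For the two timing lines shown, this adds up to 17325 ms, i.e. roughly 17 
seconds spent reading instants and replaced file groups in this excerpt alone.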
   
   I also tested reading the table with the latest version of Hudi, `v0.10.1`. 
It reduced the read time from 132 to 65 seconds, but that's still a considerable 
amount of time.



