lokeshj1703 commented on code in PR #13591:
URL: https://github.com/apache/hudi/pull/13591#discussion_r2228545794
##########
hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java:
##########
@@ -240,15 +241,17 @@ storage, new StoragePath(logFilePathPattern)).stream()
       HoodieFileGroupId fileGroupId = new HoodieFileGroupId(FSUtils.getRelativePartitionPath(HoodieCLI.getTableMetaClient().getBasePath(), firstLogFile), FSUtils.getFileIdFromLogPath(firstLogFile));
       FileSlice fileSlice = new FileSlice(fileGroupId, HoodieTimeline.INIT_INSTANT_TS, null, logFilePaths.stream()
           .map(l -> new HoodieLogFile(new StoragePath(l))).collect(Collectors.toList()));
+      TypedProperties fileGroupReaderProperties = buildFileGroupReaderProperties();
       try (HoodieFileGroupReader<IndexedRecord> fileGroupReader = HoodieFileGroupReader.<IndexedRecord>newBuilder()
           .withReaderContext(readerContext)
           .withHoodieTableMetaClient(HoodieCLI.getTableMetaClient())
           .withFileSlice(fileSlice)
           .withDataSchema(readerSchema)
           .withRequestedSchema(readerSchema)
           .withLatestCommitTime(client.getActiveTimeline().getCommitAndReplaceTimeline().lastInstant().map(HoodieInstant::requestedTime).orElse(HoodieInstantTimeGenerator.getCurrentInstantTimeStr()))
-          .withProps(buildFileGroupReaderProperties())
+          .withProps(fileGroupReaderProperties)
           .withShouldUseRecordPosition(false)
+          .withEnableOptimizedLogBlockScan(getEnableOptimizedLogBlocksScan(fileGroupReaderProperties))
Review Comment:
I think we are always setting this to the default value. Should we use the same here and remove the other changes?
`ENABLE_OPTIMIZED_LOG_BLOCKS_SCAN.defaultValue()`
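The pattern the reviewer is describing can be sketched as follows: when the property is never set on the reader properties, the helper just returns the config's default, so passing `ENABLE_OPTIMIZED_LOG_BLOCKS_SCAN.defaultValue()` directly would be equivalent. This is a minimal stand-in sketch, not Hudi's actual API: `java.util.Properties` replaces `TypedProperties`, and the key name and default are hypothetical.

```java
import java.util.Properties;

public class DefaultFlagSketch {
    // Hypothetical stand-ins for Hudi's ENABLE_OPTIMIZED_LOG_BLOCKS_SCAN ConfigProperty.
    static final String KEY = "hoodie.optimized.log.blocks.scan.enable";
    static final boolean DEFAULT_VALUE = false;

    // Mirrors a helper like getEnableOptimizedLogBlocksScan(props):
    // read the flag from the properties, falling back to the default.
    static boolean getEnableOptimizedLogBlocksScan(Properties props) {
        return Boolean.parseBoolean(props.getProperty(KEY, String.valueOf(DEFAULT_VALUE)));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Nothing was ever set, so the helper only ever returns DEFAULT_VALUE here,
        // which is the reviewer's argument for using defaultValue() directly.
        System.out.println(getEnableOptimizedLogBlocksScan(props)); // prints "false"
    }
}
```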
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/FileGroupReaderBasedAppendHandle.java:
##########
@@ -85,8 +85,10 @@ public void doAppend() {
     // Initializes the record iterator, log compaction requires writing the deletes into the delete block of the resulting log file.
     try (HoodieFileGroupReader<T> fileGroupReader = HoodieFileGroupReader.<T>newBuilder().withReaderContext(readerContext).withHoodieTableMetaClient(hoodieTable.getMetaClient())
         .withLatestCommitTime(instantTime).withPartitionPath(partitionPath).withLogFiles(logFiles).withBaseFileOption(Option.empty()).withDataSchema(writeSchemaWithMetaFields)
-        .withRequestedSchema(writeSchemaWithMetaFields).withEnableOptimizedLogBlockScan(true).withInternalSchema(internalSchemaOption).withProps(props).withEmitDelete(true)
-        .withShouldUseRecordPosition(usePosition).withSortOutput(hoodieTable.requireSortedRecords()).build()) {
+        .withRequestedSchema(writeSchemaWithMetaFields).withInternalSchema(internalSchemaOption).withProps(props).withEmitDelete(true)
+        .withShouldUseRecordPosition(usePosition).withSortOutput(hoodieTable.requireSortedRecords())
+        // instead of using config.enableOptimizedLogBlocksScan(), we set to true as log compaction blocks only supported in scanV2
+        .withEnableOptimizedLogBlockScan(true).build()) {
Review Comment:
I did not get why we are setting it to true here. In most other places it is set via a config.
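The tension the reviewer raises can be sketched as two call-site patterns: most readers thread the flag through from config, while this log-compaction path pins it to `true` because, per the diff comment, log-compaction blocks are only readable by the V2 (optimized) scan. The builder and names below are simplified stand-ins, not Hudi's real `HoodieFileGroupReader` API.

```java
public class ScanModeSketch {
    // Minimal stand-in for a reader builder with the optimized-scan flag.
    static class ReaderBuilder {
        boolean optimizedLogBlockScan;

        ReaderBuilder withEnableOptimizedLogBlockScan(boolean enable) {
            this.optimizedLogBlockScan = enable;
            return this;
        }

        String build() {
            return "scan=" + (optimizedLogBlockScan ? "V2" : "V1");
        }
    }

    // The common pattern: the flag comes from user/table config.
    static String fromConfig(boolean configuredValue) {
        return new ReaderBuilder().withEnableOptimizedLogBlockScan(configuredValue).build();
    }

    // The log-compaction call site: pinned to true because the blocks
    // it must read are only supported by the V2 scanner.
    static String forLogCompaction() {
        return new ReaderBuilder().withEnableOptimizedLogBlockScan(true).build();
    }

    public static void main(String[] args) {
        System.out.println(fromConfig(false));   // prints "scan=V1"
        System.out.println(forLogCompaction());  // prints "scan=V2"
    }
}
```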
##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -895,7 +895,7 @@ public static <T> HoodieData<HoodieRecord> convertMetadataToRecordIndexRecords(H
         .collect(Collectors.toList());
     // Extract revived and deleted keys
     Pair<Set<String>, Set<String>> revivedAndDeletedKeys = getRevivedAndDeletedKeysFromMergedLogs(dataTableMetaClient, instantTime, allLogFilePaths, finalWriterSchemaOpt,
-        currentLogFilePaths, partitionPath, readerContextFactory.getContext());
+        currentLogFilePaths, partitionPath, readerContextFactory.getContext(), metadataConfig);
Review Comment:
This should use the write config property since it is reading the data files
##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/PartitionBucketIndexManager.scala:
##########
@@ -195,7 +196,8 @@ class PartitionBucketIndexManager extends BaseProcedure
     logInfo("Perform OVERWRITE with dry-run disabled.")
     val partitionsToRescale = rescalePartitionsMap.keys
     // get all fileSlices need to read
-    val allFilesMap = FSUtils.getFilesInPartitions(context, metaClient, HoodieMetadataConfig.newBuilder.enable(mdtEnable).build,
+    val metadataConfig = HoodieMetadataConfig.newBuilder.enable(mdtEnable).build
+    val allFilesMap = FSUtils.getFilesInPartitions(context, metaClient, metadataConfig,
Review Comment:
Redundant change?
##########
hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java:
##########
@@ -240,15 +241,17 @@ storage, new StoragePath(logFilePathPattern)).stream()
       HoodieFileGroupId fileGroupId = new HoodieFileGroupId(FSUtils.getRelativePartitionPath(HoodieCLI.getTableMetaClient().getBasePath(), firstLogFile), FSUtils.getFileIdFromLogPath(firstLogFile));
       FileSlice fileSlice = new FileSlice(fileGroupId, HoodieTimeline.INIT_INSTANT_TS, null, logFilePaths.stream()
           .map(l -> new HoodieLogFile(new StoragePath(l))).collect(Collectors.toList()));
+      TypedProperties fileGroupReaderProperties = buildFileGroupReaderProperties();
Review Comment:
NIT: We can probably call it defaultFileGroupReaderProperties
##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -895,7 +895,7 @@ public static <T> HoodieData<HoodieRecord> convertMetadataToRecordIndexRecords(H
         .collect(Collectors.toList());
     // Extract revived and deleted keys
     Pair<Set<String>, Set<String>> revivedAndDeletedKeys = getRevivedAndDeletedKeysFromMergedLogs(dataTableMetaClient, instantTime, allLogFilePaths, finalWriterSchemaOpt,
-        currentLogFilePaths, partitionPath, readerContextFactory.getContext());
+        currentLogFilePaths, partitionPath, readerContextFactory.getContext(), metadataConfig);
Review Comment:
I think in general we need to verify which config to use in this class. From what I checked, we are reading data files here.
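The reviewer's point boils down to picking the reader properties based on what is being scanned: data-table log files should be read with the write config's properties, not the metadata-table config that happens to be in scope. A minimal sketch of that selection, with `java.util.Properties` and both config sources as hypothetical stand-ins:

```java
import java.util.Properties;

public class ConfigChoiceSketch {
    // Hypothetical stand-in for the metadata-table config in scope at the call site.
    static Properties metadataConfigProps() {
        Properties p = new Properties();
        p.setProperty("source", "metadata-config");
        return p;
    }

    // Hypothetical stand-in for the write config the reviewer suggests.
    static Properties writeConfigProps() {
        Properties p = new Properties();
        p.setProperty("source", "write-config");
        return p;
    }

    // When the file group reader is scanning *data* files, take the
    // write config's properties; only metadata-table reads should use
    // the metadata config.
    static Properties propsForRead(boolean readingDataFiles) {
        return readingDataFiles ? writeConfigProps() : metadataConfigProps();
    }

    public static void main(String[] args) {
        System.out.println(propsForRead(true).getProperty("source")); // prints "write-config"
    }
}
```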
##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/client/functional/TestMetadataUtilRLIandSIRecordGeneration.java:
##########
@@ -728,6 +728,7 @@ Set<String> getRecordKeys(String partition, String baseInstantTime, String fileI
.withHoodieTableMetaClient(datasetMetaClient)
.withProps(properties)
.withEmitDelete(true)
+ // revisit if this needs to enable log optimized scan
Review Comment:
Does something need to be done here?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]