Re: [PR] [HUDI-8289] Unmerged log scanner deprecation [hudi]

via GitHub Wed, 04 Jun 2025 21:14:33 -0700


the-other-tim-brown commented on code in PR #13383:
URL: https://github.com/apache/hudi/pull/13383#discussion_r2127900320



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -1701,31 +1677,38 @@ private static List<HoodieColumnRangeMetadata<Comparable>> readColumnRangeMetada
    * Read column range metadata from log file.
    */
   @VisibleForTesting
-  public static List<HoodieColumnRangeMetadata<Comparable>> getLogFileColumnRangeMetadata(String filePath, HoodieTableMetaClient datasetMetaClient,
+  public static List<HoodieColumnRangeMetadata<Comparable>> getLogFileColumnRangeMetadata(String filePath, String partitionPath,
+                                                                                          HoodieTableMetaClient datasetMetaClient,
                                                                                           List<String> columnsToIndex, Option<Schema> writerSchemaOpt,
                                                                                           int maxBufferSize) throws IOException {
     if (writerSchemaOpt.isPresent()) {
       List<Pair<String, Schema.Field>> fieldsToIndex = columnsToIndex.stream().map(fieldName -> HoodieAvroUtils.getSchemaForField(writerSchemaOpt.get(), fieldName))
           .collect(Collectors.toList());
       // read log file records without merging

Review Comment:
   The code on master currently does not merge the log file records. The dedupe
   option is set to true by default for upserts, and the documentation says to
   disable it only when the records are already unique, so that the data does
   not end up with duplicates. So maybe this is why? The only impact on the
   column stats is that the range may be wider than required. Merging requires
   holding all records in a map, which increases the memory overhead, whereas
   the un-merged option just iterates through the records.
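
To make the trade-off concrete, here is a minimal, self-contained sketch. It is not Hudi code: `ColumnRangeSketch`, `LogRecord`, `unmergedRange`, and `mergedRange` are hypothetical names, and "the latest record per key wins" is a simplifying assumption standing in for whatever merge logic the table actually uses. It shows that the un-merged pass keeps only a running min and max, while the merged pass first holds one entry per key in a map, and that the un-merged range can only be equal to or wider than the merged one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ColumnRangeSketch {

  /** Hypothetical stand-in for a log record: a record key plus one indexed column value. */
  static final class LogRecord {
    final String key;
    final long value;

    LogRecord(String key, long value) {
      this.key = key;
      this.value = value;
    }
  }

  /** Un-merged: iterate over every record, keeping only a running min and max. */
  static long[] unmergedRange(List<LogRecord> records) {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    for (LogRecord r : records) {
      min = Math.min(min, r.value);
      max = Math.max(max, r.value);
    }
    return new long[] {min, max};
  }

  /**
   * Merged (assumption: the latest record per key wins): every key must be held
   * in a map before the range can be computed, so memory grows with the key count.
   */
  static long[] mergedRange(List<LogRecord> records) {
    Map<String, LogRecord> latestByKey = new HashMap<>();
    for (LogRecord r : records) {
      latestByKey.put(r.key, r);
    }
    return unmergedRange(new ArrayList<>(latestByKey.values()));
  }

  public static void main(String[] args) {
    List<LogRecord> log = new ArrayList<>();
    log.add(new LogRecord("k1", 100)); // superseded by the later k1 record
    log.add(new LogRecord("k2", 20));
    log.add(new LogRecord("k1", 30));

    long[] unmerged = unmergedRange(log);
    long[] merged = mergedRange(log);
    System.out.println("unmerged: [" + unmerged[0] + ", " + unmerged[1] + "]");
    System.out.println("merged:   [" + merged[0] + ", " + merged[1] + "]");
  }
}
```

In this example the un-merged range is [20, 100] and the merged range is [20, 30]. The wider range still contains every live value, so it stays safe for pruning; it is just less selective.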



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

