Re: [PR] fix: Fix duplicate field exception in hive query with where clause [hudi]

via GitHub Mon, 24 Nov 2025 18:01:28 -0800


cshuo commented on code in PR #14337:
URL: https://github.com/apache/hudi/pull/14337#discussion_r2558295718



##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieFileGroupReaderBasedRecordReader.java:
##########
@@ -330,6 +330,9 @@ private static Schema createRequestedSchema(Schema tableSchema, JobConf jobConf)
     // if they are actually written to the file, then it is ok to read them from the file
     tableSchema.getFields().forEach(f -> partitionColumns.remove(f.name().toLowerCase(Locale.ROOT)));
     return HoodieAvroUtils.generateProjectionSchema(tableSchema,
-        Arrays.stream(jobConf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR).split(",")).filter(c -> !partitionColumns.contains(c)).collect(Collectors.toList()));
+        // The READ_COLUMN_NAMES_CONF_STR includes all columns from the query, including those used in the WHERE clause,
+        // so any column referenced in the filter (non-partition) will appear twice if already present in the project schema,
+        // here distinct() is used here to deduplicate the read columns.
+        Arrays.stream(jobConf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR).split(",")).filter(c -> !partitionColumns.contains(c)).distinct().collect(Collectors.toList()));

Review Comment:
   It's possible that the value of READ_COLUMN_NAMES_CONF_STR passed in by Hive
   contains duplicate columns. I noticed similar deduplication logic in the
   Hive-Iceberg integration:
https://github.com/apache/hive/blob/1a48853e946ad1c9219a34835d3fe917eba1a756/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergSerDe.java#L164
   
   > Does distinct() change the ordering of the columns in the original column list?
   
   No, the order of the columns remains the same as before.
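
To illustrate the ordering point, here is a minimal standalone sketch, not the Hudi code itself. The class `ReadColumnDedup`, the method `dedupReadColumns`, and the sample column names are hypothetical and exist only for this example. It relies on the documented behaviour of `Stream.distinct()`: for an ordered stream such as the one produced by `Arrays.stream`, the first occurrence of each element is kept and the encounter order is unchanged.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; it mirrors the shape of the
// filter + distinct pipeline in the diff but is not part of Hudi.
public class ReadColumnDedup {

  // Splits a comma-separated column list, drops partition columns, and
  // removes duplicates while keeping the first occurrence of each column.
  static List<String> dedupReadColumns(String readColumns, Set<String> partitionColumns) {
    return Arrays.stream(readColumns.split(","))
        .filter(c -> !partitionColumns.contains(c))
        .distinct() // stable for ordered streams: first occurrence wins
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Assumed sample input: "name" appears twice, as it might when a column
    // is both projected and referenced in the WHERE clause.
    List<String> columns =
        dedupReadColumns("id,name,ts,name,dt", Collections.singleton("dt"));
    System.out.println(columns);
  }
}
```

With the assumed input above, the duplicate `name` and the partition column `dt` are dropped, and the remaining columns stay in their original relative order.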



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
