Re: [PR] fix: Fix duplicate field exception in hive query with where clause [hudi]

via GitHub Mon, 24 Nov 2025 09:53:03 -0800


yihua commented on code in PR #14337:
URL: https://github.com/apache/hudi/pull/14337#discussion_r2557213107



##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieFileGroupReaderBasedRecordReader.java:
##########
@@ -330,6 +330,9 @@ private static Schema createRequestedSchema(Schema tableSchema, JobConf jobConf)
     // if they are actually written to the file, then it is ok to read them from the file
     tableSchema.getFields().forEach(f -> partitionColumns.remove(f.name().toLowerCase(Locale.ROOT)));
     return HoodieAvroUtils.generateProjectionSchema(tableSchema,
-        Arrays.stream(jobConf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR).split(",")).filter(c -> !partitionColumns.contains(c)).collect(Collectors.toList()));
+        // The READ_COLUMN_NAMES_CONF_STR includes all columns from the query, including those used in the WHERE clause,
+        // so any column referenced in the filter (non-partition) will appear twice if already present in the project schema,
+        // here distinct() is used here to deduplicate the read columns.
+        Arrays.stream(jobConf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR).split(",")).filter(c -> !partitionColumns.contains(c)).distinct().collect(Collectors.toList()));

Review Comment:
   Could we add a unit test for this?  Do the duplicated columns come from Hive or from our own logic that sets `ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR`?
   
   Does `distinct()` change the ordering of the columns in the original column list?
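
On the ordering question, a minimal standalone sketch may help. It mirrors the `filter(...).distinct().collect(...)` pipeline from the diff but does not use any Hudi or Hive classes; the class name, the helper `dedupReadColumns`, and the sample column names are hypothetical and exist only for illustration. Per the `Stream.distinct()` Javadoc, distinct selection is stable for ordered streams (the first occurrence in encounter order is kept), and a stream built from an array is ordered, so the relative order of the remaining columns should be unchanged. It does not answer where the duplicates originate.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class ReadColumnDedupSketch {

  // Hypothetical helper: same stream pipeline as the changed line in the diff,
  // with the JobConf lookup replaced by a plain comma-separated string.
  public static List<String> dedupReadColumns(String readColumnNames, Set<String> partitionColumns) {
    return Arrays.stream(readColumnNames.split(","))
        .filter(c -> !partitionColumns.contains(c)) // drop partition columns
        .distinct()                                 // keep the first occurrence of each name
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // "name" appears twice, as if it were both projected and used in a WHERE clause.
    List<String> columns = dedupReadColumns("id,name,ts,name,dt", Collections.singleton("dt"));
    System.out.println(columns); // prints [id, name, ts]
  }
}
```

A unit test along these lines could assert on the exact list rather than on a set, so that an ordering regression would also be caught.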



