cshuo commented on code in PR #14337:
URL: https://github.com/apache/hudi/pull/14337#discussion_r2558295718
##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieFileGroupReaderBasedRecordReader.java:
##########
@@ -330,6 +330,9 @@ private static Schema createRequestedSchema(Schema tableSchema, JobConf jobConf)
     // if they are actually written to the file, then it is ok to read them from the file
     tableSchema.getFields().forEach(f -> partitionColumns.remove(f.name().toLowerCase(Locale.ROOT)));
     return HoodieAvroUtils.generateProjectionSchema(tableSchema,
-        Arrays.stream(jobConf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR).split(",")).filter(c -> !partitionColumns.contains(c)).collect(Collectors.toList()));
+        // The READ_COLUMN_NAMES_CONF_STR includes all columns from the query, including those used in the WHERE clause,
+        // so any column referenced in the filter (non-partition) will appear twice if already present in the project schema,
+        // here distinct() is used here to deduplicate the read columns.
+        Arrays.stream(jobConf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR).split(",")).filter(c -> !partitionColumns.contains(c)).distinct().collect(Collectors.toList()));
Review Comment:
It's possible that the value of READ_COLUMN_NAMES_CONF_STR coming from Hive contains
duplicate columns. I noticed similar deduplication logic in the Hive-Iceberg
integration:
https://github.com/apache/hive/blob/1a48853e946ad1c9219a34835d3fe917eba1a756/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergSerDe.java#L164
> Does distinct() change the ordering of the columns in the original column list?
No, the order of the columns remains the same as before.
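For reference, `Stream.distinct()` is stable on ordered streams such as the one returned by `Arrays.stream`: for duplicated elements, the first occurrence in encounter order is kept. A minimal standalone sketch of the same pipeline follows; the class name `ReadColumnDedup`, the helper `dedupReadColumns`, and the sample column names are hypothetical and only for illustration, not part of the PR.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ReadColumnDedup {

  // Hypothetical helper mirroring the stream pipeline in the diff: split the
  // comma-separated column list, drop partition columns, then deduplicate.
  public static List<String> dedupReadColumns(String readColumnNames, Set<String> partitionColumns) {
    return Arrays.stream(readColumnNames.split(","))
        .filter(c -> !partitionColumns.contains(c))
        // distinct() keeps the first occurrence of each element, so the
        // relative order of the remaining columns is unchanged.
        .distinct()
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Assumed sample input: "dt" is a partition column, and "price" appears
    // twice, as it might if it were both selected and used in the WHERE clause.
    Set<String> partitionColumns = new HashSet<>(Arrays.asList("dt"));
    String readColumns = "id,price,name,price,dt";
    // prints [id, price, name]
    System.out.println(dedupReadColumns(readColumns, partitionColumns));
  }
}
```

The second `price` is dropped and `id`, `price`, `name` keep their original relative order, so the projection schema sees the same column ordering as before the change.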