yihua commented on code in PR #14337:
URL: https://github.com/apache/hudi/pull/14337#discussion_r2557213107
##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieFileGroupReaderBasedRecordReader.java:
##########
@@ -330,6 +330,9 @@ private static Schema createRequestedSchema(Schema tableSchema, JobConf jobConf)
// if they are actually written to the file, then it is ok to read them from the file
tableSchema.getFields().forEach(f -> partitionColumns.remove(f.name().toLowerCase(Locale.ROOT)));
return HoodieAvroUtils.generateProjectionSchema(tableSchema,
- Arrays.stream(jobConf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR).split(",")).filter(c -> !partitionColumns.contains(c)).collect(Collectors.toList()));
+ // The READ_COLUMN_NAMES_CONF_STR includes all columns from the query, including those used in the WHERE clause,
+ // so any column referenced in the filter (non-partition) will appear twice if already present in the project schema,
+ // here distinct() is used here to deduplicate the read columns.
+ Arrays.stream(jobConf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR).split(",")).filter(c -> !partitionColumns.contains(c)).distinct().collect(Collectors.toList()));
Review Comment:
Could we add a unit test for this? Do the duplicated columns come from Hive
or from our own logic that sets `ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR`?
Does `distinct()` change the order of the columns relative to the original
column list?
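
On the ordering question, a minimal standalone sketch may help. It assumes the same stream pipeline as in the diff, lifted into a hypothetical helper `dedupReadColumns` (not part of Hudi), with made-up column names. `Arrays.stream` produces an ordered stream, and for ordered streams `Stream.distinct()` is documented as stable: the first occurrence of each element is kept in encounter order. Where the duplicates originate (Hive or Hudi) is not something this sketch can answer.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ReadColumnDedupSketch {

  // Hypothetical helper mirroring the pipeline in the diff; the name and
  // signature are assumptions for illustration, not Hudi API.
  static List<String> dedupReadColumns(String readColumnNames, Set<String> partitionColumns) {
    return Arrays.stream(readColumnNames.split(","))
        // drop partition columns, as the existing code does
        .filter(c -> !partitionColumns.contains(c))
        // keep the first occurrence of each remaining column, in encounter order
        .distinct()
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Made-up input: "name" appears twice, "dt" is a partition column.
    Set<String> partitionColumns = new HashSet<>(Collections.singletonList("dt"));
    List<String> columns = dedupReadColumns("id,name,dt,price,name", partitionColumns);
    System.out.println(columns); // prints [id, name, price]
  }
}
```

A unit test along these lines could assert on the exact list rather than on a set, so that any change in column order would fail the test.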