zhang-yue1 commented on issue #14297:
URL: https://github.com/apache/hudi/issues/14297#issuecomment-3551653968
> Thanks for the feedback, did you try to disable the Tez and use mr engine
> instead?
I looked into the issue and found that the duplicate columns are introduced in
the following class:
org.apache.hudi.hadoop.HoodieParquetInputFormatBase
Specifically, the line causing this is:
    return HoodieAvroUtils.generateProjectionSchema(
        tableSchema,
        Arrays.stream(jobConf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR).split(","))
            .filter(c -> !partitionColumns.contains(c))
            .collect(Collectors.toList()));
READ_COLUMN_NAMES_CONF_STR includes all columns from the query,
including those used in the WHERE clause.
Since the filter only removes partition columns, non-partition columns used in
conditions remain in the list.
As a result, those columns are read twice, which can cause duplication
downstream.
The root cause is that Hudi currently does not deduplicate the list of
columns after combining it with the projection, so any non-partition column
referenced in the filter appears twice if it is already present in the table
schema.
I tried adding .distinct() to the stream, and the duplication issue no
longer occurs.
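
To illustrate the effect of .distinct() outside of Hudi, here is a minimal,
self-contained sketch. The class name ProjectionDedupSketch, the method
projectedColumns, and the sample column names are hypothetical and only stand
in for the real logic. The sketch assumes the read-column string can contain
the same name more than once, and it is not the actual patch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ProjectionDedupSketch {

    // Hypothetical stand-in for the projection logic: split the comma-separated
    // read-column string, drop partition columns, and remove repeated names.
    public static List<String> projectedColumns(String readColumnNames,
                                                Set<String> partitionColumns) {
        return Arrays.stream(readColumnNames.split(","))
                .filter(c -> !partitionColumns.contains(c))
                .distinct() // keeps the first occurrence and preserves order
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Assumed sample input: "id" is listed twice, "dt" is a partition column.
        String readColumnNames = "id,name,id,dt";
        Set<String> partitionColumns = Collections.singleton("dt");
        System.out.println(projectedColumns(readColumnNames, partitionColumns));
        // prints [id, name]
    }
}
```

Without the .distinct() call, the same input would yield [id, name, id], which
matches the kind of repeated column described above.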
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.