Re: [I] hudi version 1.0.2 AvroRuntimeException: Duplicate field .. in Hudi record [hudi]

via GitHub Wed, 19 Nov 2025 01:19:54 -0800


zhang-yue1 commented on issue #14297:
URL: https://github.com/apache/hudi/issues/14297#issuecomment-3551653968


   > Thanks for the feedback, did you try to disable the Tez and use mr engine
   > instead?
   
   I looked into the issue and found that the duplicate columns are
   introduced in the following class:
   
   org.apache.hudi.hadoop.HoodieParquetInputFormatBase
   
   
   Specifically, the line causing this is:
   
   return HoodieAvroUtils.generateProjectionSchema(
       tableSchema,
       Arrays.stream(jobConf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR).split(","))
           .filter(c -> !partitionColumns.contains(c))
           .collect(Collectors.toList()));
   
   
   READ_COLUMN_NAMES_CONF_STR contains every column referenced by the query,
   including those used in the WHERE clause.

   The filter only removes partition columns, so non-partition columns used
   in conditions stay in the list.

   As a result, a column can be listed twice, which can cause the duplication
   downstream.

   The root cause is that Hudi currently does not deduplicate the column list
   after combining it with the projection, so any non-partition column
   referenced in the filter will appear twice if it is already present in the
   table schema.
   
   I tried adding .distinct() to the stream, and the duplication issue no
   longer occurs.
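
   A minimal, standalone sketch of that change, using only the JDK. The class
   name, the helper name projectionColumns, and the sample column names are
   made up for illustration; the real code reads the list from the JobConf
   and passes the result to generateProjectionSchema, and where exactly
   .distinct() sits in the actual stream is an assumption here:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class ProjectionColumnsSketch {

    // Hypothetical helper, not part of Hudi: splits the comma-separated
    // column list, drops partition columns, and removes repeated names
    // while keeping the order in which they were first seen.
    static List<String> projectionColumns(String readColumnNames,
                                          Set<String> partitionColumns) {
        return Arrays.stream(readColumnNames.split(","))
                .filter(c -> !partitionColumns.contains(c))
                .distinct() // the proposed addition
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up input: "price" is listed twice, "dt" is a partition column.
        String readColumns = "id,price,dt,price";
        System.out.println(
                projectionColumns(readColumns, Collections.singleton("dt")));
        // prints [id, price]
    }
}
```

   Stream.distinct() compares elements with equals(), so column names that
   differ only in case would still be treated as different columns.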


