raghusrhyme opened a new issue, #18915:
URL: https://github.com/apache/hudi/issues/18915

   ### Bug Description
   
   **What happened:**
   Custom JAR transformers that project columns (e.g. `ColumnFilter` with `mode=include`) drop `_corrupt_record` because they are unaware of the error-table contract. `ErrorTableAwareChainedTransformer` calls `validate()` after every transformer in the chain, so the pipeline fails with `HoodieValidationException: Invalid condition, columnName=_corrupt_record is not present in transformer output schema`.
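
The failure mode can be modelled without Spark or Hudi. The sketch below is a simplified stand-in, not Hudi's actual code: a schema is a plain list of column names, `ChainSketch`, `project`, and `applyChain` are hypothetical names chosen for illustration, and `IllegalStateException` stands in for `HoodieValidationException`. It assumes only what this report states, namely that validation runs after every transformer and fails when `_corrupt_record` is absent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Simplified model of the reported behaviour; NOT Hudi's implementation.
// A "schema" is just a list of column names.
class ChainSketch {
    static final String CORRUPT_RECORD = "_corrupt_record";

    // Hypothetical include-mode projection: keeps only the listed columns,
    // so it drops _corrupt_record unless the user lists it explicitly.
    static UnaryOperator<List<String>> project(List<String> include) {
        return schema -> {
            List<String> out = new ArrayList<>();
            for (String col : schema) {
                if (include.contains(col)) {
                    out.add(col);
                }
            }
            return out;
        };
    }

    // Stand-in for the check described in the report: fail if the column is missing.
    static void validate(List<String> schema) {
        if (!schema.contains(CORRUPT_RECORD)) {
            throw new IllegalStateException("columnName=" + CORRUPT_RECORD
                + " is not present in transformer output schema");
        }
    }

    // Applies each transformer in order and validates after every step.
    static List<String> applyChain(List<String> schema,
                                   List<UnaryOperator<List<String>>> chain) {
        List<String> current = schema;
        for (UnaryOperator<List<String>> transformer : chain) {
            current = transformer.apply(current);
            validate(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<String> input = List.of("id", "name", "ts", CORRUPT_RECORD);
        try {
            applyChain(input, List.of(project(List.of("id", "name"))));
        } catch (IllegalStateException e) {
            System.out.println("Chain failed: " + e.getMessage());
        }
    }
}
```

In this model the chain only succeeds when the projection's include list happens to name `_corrupt_record`, which is the coupling a custom transformer has no reason to know about.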
   
   **What you expected:**
   The pipeline should complete successfully, with `_corrupt_record` re-injected whenever a transformer drops it.
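
One possible reading of "re-injected" is sketched below as a per-row model. This is an assumption about what a fix could look like, not the project's chosen design: `ReinjectSketch` and `reinjectIfDropped` are hypothetical names, rows are plain maps rather than Spark `Row`s, and a real implementation would have to decide how the dropped values are recovered at the `Dataset` level.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Simplified per-row model of the expected behaviour; NOT Hudi's implementation.
class ReinjectSketch {
    static final String CORRUPT_RECORD = "_corrupt_record";

    // Hypothetical wrapper: runs the transformer, then restores _corrupt_record
    // from the input row if the transformer dropped it.
    static UnaryOperator<Map<String, Object>> reinjectIfDropped(
            UnaryOperator<Map<String, Object>> transformer) {
        return row -> {
            Map<String, Object> out = new LinkedHashMap<>(transformer.apply(row));
            if (row.containsKey(CORRUPT_RECORD) && !out.containsKey(CORRUPT_RECORD)) {
                out.put(CORRUPT_RECORD, row.get(CORRUPT_RECORD));
            }
            return out;
        };
    }

    public static void main(String[] args) {
        // A projection that keeps only "id" and drops every other column.
        UnaryOperator<Map<String, Object>> keepIdOnly = row -> {
            Map<String, Object> out = new LinkedHashMap<>();
            out.put("id", row.get("id"));
            return out;
        };

        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 1);
        row.put("name", "a");
        row.put(CORRUPT_RECORD, null);

        Map<String, Object> result = reinjectIfDropped(keepIdOnly).apply(row);
        System.out.println(result.keySet()); // prints [id, _corrupt_record]
    }
}
```

The wrapper leaves transformers that already preserve the column untouched, so it would only change behaviour for the projection case described in this issue.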
   
   **Steps to reproduce:**
   1. Enable error table (`hoodie.errortable.enable=true`)
   2. Configure a custom transformer that calls `dataset.select(explicitColumns)` without including `_corrupt_record`
   3. Run HoodieStreamer
   
   ### Environment
   
   **Hudi version:** master (0.16.0-SNAPSHOT)
   **Query engine:** Spark 3.5
   **Relevant configs:** `hoodie.errortable.enable=true`, 
`hoodie.errortable.write.class=<any concrete ErrorTableWriter>`
   
   ### Logs and Stack Trace
   
   ```
   org.apache.hudi.exception.HoodieValidationException: Invalid condition, columnName=_corrupt_record is not present in transformer output schema
     at org.apache.hudi.utilities.streamer.ErrorTableUtils.validate(ErrorTableUtils.java:88)
     at org.apache.hudi.utilities.transform.ErrorTableAwareChainedTransformer.apply(ErrorTableAwareChainedTransformer.java:59)
   ```

