wombatu-kun commented on code in PR #18914:
URL: https://github.com/apache/hudi/pull/18914#discussion_r3411377231
##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/ErrorTableAwareChainedTransformer.java:
##########
@@ -55,8 +55,9 @@ public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession, Datas
     for (TransformerInfo transformerInfo : transformers) {
       Transformer transformer = transformerInfo.getTransformer();
       dataset = transformer.apply(jsc, sparkSession, dataset, transformerInfo.getProperties(properties, transformers));
-      // validate in every stage to ensure ErrorRecordColumn not dropped by one of the transformer and added by next transformer.
-      ErrorTableUtils.validate(dataset);
+      // Re-inject _corrupt_record if the transformer dropped it (e.g. custom JAR transformers
+      // that do column projection like ColumnFilter with mode=include).
+      dataset = ErrorTableUtils.addNullValueErrorTableCorruptRecordColumn(dataset);
Review Comment:
Confirmed reachable in production: StreamSync applies the chain, then calls
processErrorEvents with CUSTOM_TRANSFORMER_FAILURE, and that extraction in
SourceFormatAdapter keys off _corrupt_record being non-null. If an earlier
transformer marks rows and a later one projects the column away, re-injecting
it as null here makes every row match the isNull filter in processErrorEvents.
The marked rows then flow into the main write path: they are not just dropped
from the error table but silently written to the target table.

The new test testCorruptRecordReInjectedAfterTransformerDropsIt sets up exactly
this populate-then-drop case (t1 marks, t2 drops), yet it asserts only column
presence and count, so it passes whether or not the marked data survives. It
would be safer to extract the column before re-injecting, or at minimum to log
a WARN when a non-null column is dropped and to assert the row outcome in that
test.
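
To make the failure mode concrete, here is a minimal, Spark-free sketch of the populate-then-drop sequence. It is only a model under stated assumptions: rows are plain maps rather than a Dataset<Row>, and markCorrupt, projectAway, reinjectAsNull, countErrors and countMainPath are hypothetical stand-ins for t1, t2, the re-injection in this PR, and the non-null / isNull split, not Hudi APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CorruptRecordReinjectionSketch {
    static final String CORRUPT_RECORD = "_corrupt_record";

    // Three input rows; the column exists up front and is null everywhere.
    static List<Map<String, String>> sampleRows() {
        List<Map<String, String>> rows = new ArrayList<>();
        for (String value : new String[] {"10", "not-a-number", "30"}) {
            Map<String, String> row = new HashMap<>();
            row.put("value", value);
            row.put(CORRUPT_RECORD, null);
            rows.add(row);
        }
        return rows;
    }

    // Hypothetical t1: marks rows whose "value" is not numeric.
    static List<Map<String, String>> markCorrupt(List<Map<String, String>> rows) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> row : rows) {
            Map<String, String> copy = new HashMap<>(row);
            if (!copy.get("value").matches("\\d+")) {
                copy.put(CORRUPT_RECORD, "value=" + copy.get("value"));
            }
            out.add(copy);
        }
        return out;
    }

    // Hypothetical t2: a projection that drops the error column entirely.
    static List<Map<String, String>> projectAway(List<Map<String, String>> rows) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> row : rows) {
            Map<String, String> copy = new HashMap<>(row);
            copy.remove(CORRUPT_RECORD);
            out.add(copy);
        }
        return out;
    }

    // Models the re-injection: add the column as null when it is missing.
    static List<Map<String, String>> reinjectAsNull(List<Map<String, String>> rows) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> row : rows) {
            Map<String, String> copy = new HashMap<>(row);
            if (!copy.containsKey(CORRUPT_RECORD)) {
                copy.put(CORRUPT_RECORD, null);
            }
            out.add(copy);
        }
        return out;
    }

    // Rows that would be routed to the error table (column is non-null).
    static long countErrors(List<Map<String, String>> rows) {
        return rows.stream().filter(r -> r.get(CORRUPT_RECORD) != null).count();
    }

    // Rows that would continue to the main write path (column is null).
    static long countMainPath(List<Map<String, String>> rows) {
        return rows.stream().filter(r -> r.get(CORRUPT_RECORD) == null).count();
    }

    public static void main(String[] args) {
        List<Map<String, String>> marked = markCorrupt(sampleRows());
        List<Map<String, String>> reinjected = reinjectAsNull(projectAway(marked));

        System.out.println("errors after t1: " + countErrors(marked));                     // 1
        System.out.println("errors after t2 + re-injection: " + countErrors(reinjected));  // 0
        System.out.println("rows on main path: " + countMainPath(reinjected));             // 3
    }
}
```

In this model the marked row is indistinguishable from a clean one once the column has been dropped and re-added as null, which is the outcome the test should assert against rather than only checking that the column exists.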