voonhous commented on PR #19029:
URL: https://github.com/apache/hudi/pull/19029#issuecomment-4757665668

   Two more correctness things after a closer look, both about the fix being 
narrower/less durable than it first looks:
   
   1. Durability across engines. This corrects the table schema, but existing 
parquet files still carry the old `timestamp-micros` label with millis values - 
only Hudi's own reader applies the #14161 repair. External readers 
(Trino/Presto, Athena, BigQuery external, Spark-native parquet) will misread 
those files until they're physically rewritten. So the fix is only durable for 
non-Hudi engines after the affected base files get rewritten via 
clustering/compaction under the corrected schema. Worth calling out in the doc, 
with "rewrite the files" as the second migration step.
   
   2. The flag is ignored on some write paths. Tracing `deduceWriterSchema`: 
it's honored on the internal-schema reconcile path (schema-on-read / 
`saveInternalSchema`), the MOR merge handles, and the Spark default path only 
when `setNullForMissingColumns=true`. It's *not* consulted for `MERGE INTO` 
writes or when `hoodie.avro.schema.validate=true` (both take the else branch in 
`deduceWriterSchemaWithoutReconcile`, which doesn't reconcile), nor on the 
legacy `reconcileSchemasLegacy` path (uses `shouldPromoteType`, hardcoded 
false). Consistent with the PR's scope, but a user who sets the flag and writes 
via `MERGE INTO` will see no effect and no error. Might be worth documenting 
which paths honor it, and whether `MERGE INTO` needs covering - the current 
tests only hit the sync path.
   
   Lower priority, didn't verify: column stats / data-skipping bounds for the 
relabeled column were computed under the old logical type - worth a sanity 
check that pruning stays correct after the relabel.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to