szehon-ho commented on PR #15470: URL: https://github.com/apache/iceberg/pull/15470#issuecomment-4636432500
Thanks for tracking this down — the diagnosis is spot on and the fix is correct. My one substantive concern is about how the inline approach changes parallelism and I/O, and I think there's a not-too-complex alternative worth considering. **Concern: unit of parallelism shifts from delete-file to manifest.** By folding the byte rewrite into the per-manifest task, two things change versus the previous flow: 1. **Parallelism is now bounded by manifest count, not delete-file count.** Within a task, the entries are rewritten with a sequential stream, so a single delete manifest referencing many delete files becomes one hot task doing all that byte-rewriting serially. Previously `rewritePositionDeletes` repartitioned by (deduped) delete-file count, fanning the work across the cluster. A spec-legal table with a small number of large delete manifests could regress noticeably here. 2. **A delete file referenced from manifests in different tasks is rewritten once per task** (the per-task UUID staging is what makes that safe). The copy plan is a `Set`, but those entries have different sources and the same target, so they aren't deduped — redundant rewrite + redundant copy to the same destination. **Suggested alternative: keep two phases, but propagate the measured size.** The reason the original code had the bug wasn't the two-phase structure — it was that the rewrite phase never fed the new size back into the manifest. We can keep the parallel/deduped rewrite and just carry the sizes forward: 1. Enumerate distinct delete files referenced by the manifests being rewritten (metadata-only). 2. Rewrite delete-file bytes in parallel, deduped (repartition by delete-file count), emitting `path -> actual size`; collect into a small map and broadcast it. 3. Rewrite manifests (repartition by manifest count, metadata-only) and stamp `file_size_in_bytes` from the broadcast map. This restores cluster-wide parallelism on the expensive step, rewrites each delete file exactly once (no wasted I/O), and still fixes the bug. It mostly reuses the existing `rewritePositionDeleteFile`; the only thing flowing between phases is a `Map<String, Long>`. Costs are an extra metadata pass over manifests (cheap relative to byte rewrites) and broadcasting the size map (fine unless the delete-file count is enormous, in which case a join would be the fallback). The HDFS in-progress-write race you mention doesn't apply here since the file is closed before its size is read in a separate phase. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
