szehon-ho commented on PR #15470:
URL: https://github.com/apache/iceberg/pull/15470#issuecomment-4636432500

   Thanks for tracking this down — the diagnosis is spot on and the fix is 
correct. My one substantive concern is about how the inline approach changes 
parallelism and I/O, and I think there's a not-too-complex alternative worth 
considering.
   
   **Concern: unit of parallelism shifts from delete-file to manifest.**
   
   By folding the byte rewrite into the per-manifest task, two things change 
versus the previous flow:
   
   1. **Parallelism is now bounded by manifest count, not delete-file count.** 
Within a task, the entries are rewritten with a sequential stream, so a single 
delete manifest referencing many delete files becomes one hot task doing all 
that byte-rewriting serially. Previously `rewritePositionDeletes` repartitioned 
by (deduped) delete-file count, fanning the work across the cluster. A 
spec-legal table with a small number of large delete manifests could regress 
noticeably here.
   2. **A delete file referenced from manifests in different tasks is rewritten 
once per task** (the per-task UUID staging is what makes that safe). The copy 
plan is a `Set`, but those entries have different sources and the same target, 
so they aren't deduped — redundant rewrite + redundant copy to the same 
destination.
   
   **Suggested alternative: keep two phases, but propagate the measured size.**
   
   The reason the original code had the bug wasn't the two-phase structure — it 
was that the rewrite phase never fed the new size back into the manifest. We 
can keep the parallel/deduped rewrite and just carry the sizes forward:
   
   1. Enumerate distinct delete files referenced by the manifests being 
rewritten (metadata-only).
   2. Rewrite delete-file bytes in parallel, deduped (repartition by 
delete-file count), emitting `path -> actual size`; collect into a small map 
and broadcast it.
   3. Rewrite manifests (repartition by manifest count, metadata-only) and 
stamp `file_size_in_bytes` from the broadcast map.
   
   This restores cluster-wide parallelism on the expensive step, rewrites each 
delete file exactly once (no wasted I/O), and still fixes the bug. It mostly 
reuses the existing `rewritePositionDeleteFile`; the only thing flowing between 
phases is a `Map<String, Long>`. Costs are an extra metadata pass over 
manifests (cheap relative to byte rewrites) and broadcasting the size map (fine 
unless the delete-file count is enormous, in which case a join would be the 
fallback). The HDFS in-progress-write race you mention doesn't apply here since 
the file is closed before its size is read in a separate phase.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to