Hey Folks,

  I would like to raise awareness of a change I've been working on in
PR #15590 and see if anyone is interested to take a look.

To start with a problem, when a single snapshot adds a large number of
data files, our MergingSnapshotProducer will accumulate all of them in
memory before the commit. This unbounded collection can lead to memory
pressures or OOM failure during large writes. Think of full table
compaction, CTAS on a table with wide columns and stats.

Inspired by the rolling manifest writer, my approach to the problem is
to automatically flush the accumulated data files to the manifest
files once the count reaches a given threshold and now defaults to
100k. This helps spread the manifest writing I/O over the course of
add operation and puts a fixed ceiling on the peak memory during
commit. Flushed manifests are lazily read back for validation, and
covered for commit retry and clean up on failure by current mechanism.
For any commits adding fewer than 100k entries in a single snapshot,
there's no behavior change.

However, the default of 100k interacts with existing
MIN_FILE_GROUP_SIZE of 10k to control how files are grouped for
manifest writing, this effectively limits the manifest write
parallelism to 10, even on the hardware with more available cores. I
ran an appendFiles benchmark locally indicating 18% latency increased
for appending 1M files (12s -> 14.1s) in exchange for the flat memory
ceiling.

Would appreciate the thoughts and feedback on if there's a better way
to solve this problem. Also whether or not we shall allow this
threshold to be configured or balanced more adaptively.

Thanks,
Hongyue Zhang

link to PR: https://github.com/apache/iceberg/pull/15590

Reply via email to