paultmathew commented on issue #2152:
URL:
https://github.com/apache/iceberg-python/issues/2152#issuecomment-4393959745
PR up at #3335. Plan to land streaming in three reviewable PRs to keep diffs
scoped:
1. **PR1 — #3335 (this thread).** `Transaction.append/overwrite` accept
`pa.RecordBatchReader`. Unpartitioned only. Microbatched by
`write.target-file-size-bytes` via the new `bin_pack_record_batches` helper,
files committed in one snapshot via `fast_append`. Memory bound: `N_workers ×
target_file_size`. Two semantics caveats called out in docstrings: (a)
`target_file_size` is currently uncompressed in-memory Arrow bytes (matches
existing `bin_pack_arrow_table`), and (b) `RecordBatchReader` is single-pass so
retry is the caller's responsibility.
2. **PR2 — follow-up.** Switch the streaming internals to a rolling
`pq.ParquetWriter` + `OutputStream.tell()` (now possible thanks to #2998).
Drops peak memory from `N_workers × target_file_size` to roughly one batch per
worker, and makes `write.target-file-size-bytes` reflect actual on-disk
compressed bytes (matches Java/Spark/Flink). No public API change.
3. **PR3 — partitioned streaming.** Genuinely the harder case. Open design
questions I'd love input on **before** I start coding:
- **Partition cardinality**: a streaming reader with high-cardinality
partition columns implies many concurrent writers (one per partition value).
Bound it via `max_open_files`-style spill, or sort-then-stream, or pushdown to
caller? iceberg-go #369 punted on this and partitioned streaming hasn't
followed there yet either.
- **Crash/retry idempotency**: rolling writes per partition value commit
metadata only at end-of-stream; partial crashes leave orphans. Today's pa.Table
path is naturally transactional because all files are written before commit.
The streaming path inverts that — worth being explicit about whether we accept
orphan-data risk in exchange for streaming, or build cleanup into the write
path.
This staging mirrors iceberg-go #369's — they shipped unpartitioned first
and partitioned hasn't followed yet for the same design reasons. Happy to
reorder if maintainers prefer otherwise.
Also pulled out a small companion test-state-isolation fix in #3334 (noticed
while running the integration suite repeatedly during this work) — independent
of this PR but worth landing first to clean up `make test-integration-exec` for
everyone.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]