sumedhsakdeo opened a new pull request, #3045:
URL: https://github.com/apache/iceberg-python/pull/3045
# Rationale for this change
Closes #3036
## Summary
Adds a read-throughput micro-benchmark measuring records/sec and peak Arrow memory across the `streaming` and `concurrent_files` configurations introduced in PRs 0-2.
## Synthetic Data
- **Files**: 32 Parquet files, 500,000 rows each (16M total rows)
- **Schema**: 5 columns — `id` (int64), `value` (float64), `label`
(string), `flag` (bool), `ts` (timestamp[us, tz=UTC])
- **Batch size**: PyArrow default of 131,072 rows per batch (~4 batches
per file)
- **Setup**: A session-scoped fixture creates a `SqlCatalog` plus a table, then writes and appends all 32 files once (sketched below)
- **Memory tracking**: `pa.total_allocated_bytes()` (the PyArrow C++ memory pool, not the Python heap that `tracemalloc` would report)
- **Runs**: 3 iterations per config, reports mean ± stdev
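
The fixture looks roughly like the sketch below. This is illustrative rather than the exact code in the PR: the names `bench_table` and `make_file_batch` are invented here, and details (column contents, warehouse paths) may differ from `tests/benchmark/test_read_benchmark.py`.

```python
import pyarrow as pa
import pyarrow.compute as pc
import pytest
from pyiceberg.catalog.sql import SqlCatalog

NUM_FILES = 32
ROWS_PER_FILE = 500_000


def make_file_batch(file_idx: int) -> pa.Table:
    """One 500k-row Arrow table matching the 5-column benchmark schema."""
    start = file_idx * ROWS_PER_FILE
    ids = pa.array(range(start, start + ROWS_PER_FILE), type=pa.int64())
    return pa.table({
        "id": ids,
        "value": pc.cast(ids, pa.float64()),
        "label": pa.array([f"label-{i % 10}" for i in range(ROWS_PER_FILE)]),
        "flag": pa.array([i % 2 == 0 for i in range(ROWS_PER_FILE)]),
        "ts": pa.array([0] * ROWS_PER_FILE, type=pa.timestamp("us", tz="UTC")),
    })


@pytest.fixture(scope="session")
def bench_table(tmp_path_factory):
    warehouse = tmp_path_factory.mktemp("warehouse")
    catalog = SqlCatalog(
        "bench",
        uri=f"sqlite:///{warehouse}/catalog.db",
        warehouse=f"file://{warehouse}",
    )
    catalog.create_namespace("bench")
    tbl = catalog.create_table("bench.reads", schema=make_file_batch(0).schema)
    for i in range(NUM_FILES):  # one append per file => 32 separate data files
        tbl.append(make_file_batch(i))
    return tbl
```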
## Configurations (6 parameterized tests)
All tests use PyArrow's default `batch_size` of 131,072 rows. The only variable under test is the concurrency model (a harness sketch follows the table):
| ID | Mode | Description |
|---|---|---|
| default | `streaming=False` | Current behavior — `executor.map` + `list()`, all files in parallel |
| streaming-cf1 | `streaming=True, concurrent_files=1` | Sequential streaming, one file at a time |
| streaming-cf2 | `streaming=True, concurrent_files=2` | Bounded concurrent streaming, 2 files |
| streaming-cf4 | `streaming=True, concurrent_files=4` | Bounded concurrent streaming, 4 files |
| streaming-cf8 | `streaming=True, concurrent_files=8` | Bounded concurrent streaming, 8 files |
| streaming-cf16 | `streaming=True, concurrent_files=16` | Bounded concurrent streaming, 16 files |
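
The parameterized harness is sketched below. Caveats: the `streaming` and `concurrent_files` keyword arguments come from the earlier PRs in this stack, and the exact call site (assumed here to be `to_arrow_batch_reader`) should be verified against those PRs; polling `pa.total_allocated_bytes()` between batches only approximates the true allocation peak.

```python
import statistics
import time

import pyarrow as pa
import pytest

CONFIGS = [
    pytest.param({"streaming": False}, id="default"),
    *(
        pytest.param({"streaming": True, "concurrent_files": n}, id=f"streaming-cf{n}")
        for n in (1, 2, 4, 8, 16)
    ),
]


@pytest.mark.benchmark
@pytest.mark.parametrize("read_kwargs", CONFIGS)
def test_read_throughput(bench_table, read_kwargs):
    times, peaks = [], []
    for _ in range(3):  # 3 iterations per config, report mean ± stdev
        base = pa.total_allocated_bytes()
        peak, rows = base, 0
        start = time.perf_counter()
        # Assumed API shape from the stacked PRs; verify before reuse.
        for batch in bench_table.scan().to_arrow_batch_reader(**read_kwargs):
            rows += batch.num_rows
            peak = max(peak, pa.total_allocated_bytes())  # sampled per batch
        times.append(time.perf_counter() - start)
        peaks.append(peak - base)
    mean_t = statistics.mean(times)
    print(
        f"{read_kwargs}: {rows / mean_t / 1e6:.0f}M rows/s, "
        f"{mean_t:.2f} ± {statistics.stdev(times):.2f} s, "
        f"peak {max(peaks) / 2**20:.0f} MB Arrow memory"
    )
```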
## Benchmark Results (local SSD, macOS, 16-core, Python 3.13)
| Config | Throughput (rows/s) | Time (s) | Peak Arrow Mem (MB) |
|---|---|---|---|
| `default` (executor.map, all files parallel) | 196M | 0.08 ± 0.02 | 637 |
| `streaming, concurrent_files=1` | 60M | 0.27 ± 0.00 | **10** |
| `streaming, concurrent_files=2` | 107M | 0.15 ± 0.00 | **42** |
| `streaming, concurrent_files=4` | 178M | 0.09 ± 0.00 | **114** |
| `streaming, concurrent_files=8` | **225M** | 0.07 ± 0.00 | 269 |
| `streaming, concurrent_files=16` | **222M** | 0.07 ± 0.00 | 479 |
### Key observations
- **`concurrent_files=1` reduces peak memory 63x** (637 MB → 10 MB) —
processes one file at a time, ideal for memory-constrained environments
- **`concurrent_files=4` nearly matches default throughput** (178M vs 196M rows/s) at **82% less memory** (114 MB vs 637 MB)
- **`concurrent_files=8` beats default by 15%** (225M vs 196M rows/s) at
**58% less memory** (269 MB vs 637 MB) — the sweet spot on this hardware
- **`concurrent_files=16` shows no gain over `concurrent_files=8`** — on local SSD, GIL contention and memory bandwidth become the bottleneck rather than IO. On network storage (S3/GCS), where IO latency dominates, higher concurrency values are expected to scale further
- Memory scales roughly linearly with `concurrent_files`, giving users a predictable knob to trade memory for throughput (usage sketch below)
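
For instance, a memory-constrained consumer would cap concurrency at one file in flight (same caveat as above: the kwargs are assumed from the stacked PRs, and `process` is a hypothetical downstream step):

```python
# Stream with at most one file materialized at a time (~10 MB peak above).
for batch in bench_table.scan().to_arrow_batch_reader(streaming=True, concurrent_files=1):
    process(batch)  # hypothetical consumer
```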
## How to run
```bash
uv run pytest tests/benchmark/test_read_benchmark.py -v -s -m benchmark
```
## Are these changes tested?
Yes — this PR is a benchmark test itself (6 parameterized test cases).
## Are there any user-facing changes?
No — benchmark infrastructure only.