sumedhsakdeo opened a new pull request, #3045:
URL: https://github.com/apache/iceberg-python/pull/3045

   # Rationale for this change
   
     Closes #3036
   
     ## Summary
   
     Adds a read-throughput micro-benchmark that measures records/sec and peak Arrow memory across the `streaming` and `concurrent_files` configurations introduced in PRs 0-2.
   
     ## Synthetic Data
   
     - **Files**: 32 Parquet files, 500,000 rows each (16M rows total)
     - **Schema** (5 columns): `id` (int64), `value` (float64), `label` (string), `flag` (bool), `ts` (timestamp[us, tz=UTC])
     - **Batch size**: PyArrow default of 131,072 rows per batch (~4 batches per file)
     - **Setup**: a session-scoped fixture creates a SqlCatalog and a table, then writes and appends all 32 files once (sketched below)
     - **Memory tracking**: `pa.total_allocated_bytes()` (the PyArrow C++ memory pool, not the Python heap via tracemalloc)
     - **Runs**: 3 iterations per config, reported as mean ± stdev
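
     A minimal sketch of that session-scoped fixture. The names (`make_file`, `bench_table`) and the one-append-per-file layout are illustrative assumptions, not the PR's actual code:

   ```python
   # Illustrative fixture: builds 32 x 500k-row files in a local SqlCatalog table.
   import numpy as np
   import pyarrow as pa
   import pytest
   from pyiceberg.catalog.sql import SqlCatalog

   N_FILES, ROWS_PER_FILE = 32, 500_000

   def make_file(start: int) -> pa.Table:
       ids = np.arange(start, start + ROWS_PER_FILE)
       return pa.table({
           "id": pa.array(ids, pa.int64()),
           "value": pa.array(np.random.rand(ROWS_PER_FILE), pa.float64()),
           "label": pa.array([f"label-{i % 100}" for i in ids], pa.string()),
           "flag": pa.array(ids % 2 == 0),
           "ts": pa.array(ids, pa.timestamp("us", tz="UTC")),  # ints as epoch micros
       })

   @pytest.fixture(scope="session")
   def bench_table(tmp_path_factory):
       warehouse = tmp_path_factory.mktemp("warehouse")
       catalog = SqlCatalog(
           "bench",
           uri=f"sqlite:///{warehouse}/catalog.db",
           warehouse=f"file://{warehouse}",
       )
       catalog.create_namespace("default")
       table = catalog.create_table("default.read_bench", schema=make_file(0).schema)
       for i in range(N_FILES):  # each append should land as one ~500k-row Parquet file
           table.append(make_file(i * ROWS_PER_FILE))
       return table
   ```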
   
     ## Configurations (6 parameterized tests)
   
     All tests use PyArrow's default batch size of 131,072 rows. The variable under test is the concurrency model (a sketch of one timed iteration follows the table):
   
     | ID | Mode | Description |
     |---|---|---|
     | default | `streaming=False` | Current behavior: `executor.map` + `list()`, all files in parallel |
     | streaming-cf1 | `streaming=True, concurrent_files=1` | Sequential streaming, one file at a time |
     | streaming-cf2 | `streaming=True, concurrent_files=2` | Bounded concurrent streaming, 2 files |
     | streaming-cf4 | `streaming=True, concurrent_files=4` | Bounded concurrent streaming, 4 files |
     | streaming-cf8 | `streaming=True, concurrent_files=8` | Bounded concurrent streaming, 8 files |
     | streaming-cf16 | `streaming=True, concurrent_files=16` | Bounded concurrent streaming, 16 files |
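
     A sketch of one timed iteration, as referenced above. Treat the kwarg placement as an assumption: here `streaming` and `concurrent_files` are passed to `to_arrow_batch_reader()`, but the exact call site in the PR stack may differ:

   ```python
   # One timed pass; sampling pa.total_allocated_bytes() per batch approximates
   # peak Arrow memory. The reader kwargs are assumed, not confirmed, API.
   import time
   import pyarrow as pa

   def run_once(table, *, streaming: bool = False, concurrent_files: int | None = None):
       kwargs = (
           {"streaming": True, "concurrent_files": concurrent_files} if streaming else {}
       )
       base = peak = pa.total_allocated_bytes()  # Arrow C++ pool, not tracemalloc
       rows = 0
       start = time.perf_counter()
       for batch in table.scan().to_arrow_batch_reader(**kwargs):
           rows += batch.num_rows
           peak = max(peak, pa.total_allocated_bytes())  # sample after each batch
       elapsed = time.perf_counter() - start
       return rows / elapsed, elapsed, (peak - base) / 2**20  # rows/s, seconds, MB
   ```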
   
     ## Benchmark Results (local SSD, macOS, 16-core, Python 3.13)
   
     | Config | Throughput (rows/s) | Time (s) | Peak Arrow Mem (MB) |
     |---|---|---|---|
     | `default` (executor.map, all files parallel) | 196M | 0.08 ± 0.02 | 637 |
     | `streaming, concurrent_files=1` | 60M | 0.27 ± 0.00 | **10** |
     | `streaming, concurrent_files=2` | 107M | 0.15 ± 0.00 | **42** |
     | `streaming, concurrent_files=4` | 178M | 0.09 ± 0.00 | **114** |
     | `streaming, concurrent_files=8` | **225M** | 0.07 ± 0.00 | 269 |
     | `streaming, concurrent_files=16` | **222M** | 0.07 ± 0.00 | 479 |
   
     ### Key observations
   
     - **`concurrent_files=1` cuts peak memory ~63x** (637 MB → 10 MB): it processes one file at a time, ideal for memory-constrained environments
     - **`concurrent_files=4` roughly matches default throughput** (178M vs 196M rows/s) at **82% less memory** (114 MB vs 637 MB)
     - **`concurrent_files=8` beats the default by 15%** (225M vs 196M rows/s) at **58% less memory** (269 MB vs 637 MB), the sweet spot on this hardware
     - **`concurrent_files=16` plateaus at the `concurrent_files=8` level**: on local SSD, GIL contention and memory bandwidth become the bottleneck rather than IO. On network storage (S3/GCS), where IO latency dominates, higher concurrency values are expected to scale further
     - Peak memory scales roughly linearly with `concurrent_files`, giving users a predictable knob to trade memory for throughput
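
     Putting the pieces together, the six cases and the 3-iteration mean ± stdev reporting might be wired up roughly as follows (a hypothetical shape; the real harness lives in `tests/benchmark/test_read_benchmark.py`):

   ```python
   # Hypothetical parameterization; ids mirror the configuration table above.
   import statistics
   import pytest

   CONFIGS = [pytest.param({"streaming": False}, id="default")] + [
       pytest.param({"streaming": True, "concurrent_files": n}, id=f"streaming-cf{n}")
       for n in (1, 2, 4, 8, 16)
   ]

   @pytest.mark.benchmark
   @pytest.mark.parametrize("cfg", CONFIGS)
   def test_read_throughput(bench_table, cfg):
       results = [run_once(bench_table, **cfg) for _ in range(3)]  # 3 iterations
       times = [t for _, t, _ in results]
       print(f"time {statistics.mean(times):.2f} ± {statistics.stdev(times):.2f} s, "
             f"peak {max(m for _, _, m in results):.0f} MB")
   ```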
   
     ## How to run
   
   ```bash
   uv run pytest tests/benchmark/test_read_benchmark.py -v -s -m benchmark
   ```
   ## Are these changes tested?

   Yes; this PR is itself a benchmark test (6 parameterized test cases).

   ## Are there any user-facing changes?

   No, benchmark infrastructure only.

