vustef opened a new pull request, #1684: URL: https://github.com/apache/iceberg-rust/pull/1684
## Which issue does this PR close?

- Closes #.

## What changes are included in this PR?

### Problem

The iceberg-rust arrow reader had a critical performance bottleneck: CPU-intensive parquet operations (decompression, decoding) ran on a single thread, even though multiple files were being processed concurrently.

**Root cause**: While `try_buffer_unordered()` enabled concurrent file processing, `try_flatten_unordered()` polled each file's `ArrowRecordBatchStream` sequentially on the same thread. As a result, the CPU-heavy `ParquetRecordBatchStream::poll_next()` calls, which perform parquet decompression and decoding, were all bottlenecked on a single thread.

**Impact**: Multi-core systems were severely underutilized when processing multiple parquet files, as evidenced by flamegraphs showing single-threaded execution in the critical parquet processing path.

### Solution

Modified the arrow reader so that each file's parquet stream is polled on a dedicated tokio task:

1. Added a `stream_to_receiver()` method: wraps each `ArrowRecordBatchStream` in a `tokio::spawn()` task that performs the CPU-intensive polling. **Consideration**: perhaps we should use `spawn_blocking`.
2. Updated `process_file_scan_task()`: each file's stream is now wrapped with `stream_to_receiver()` before being returned.
3. Preserved the existing concurrency patterns: the same `try_buffer_unordered()` + `try_flatten_unordered()` flow is used, but the underlying polling now scales across threads.

### How it fixes the issue

- Before: all parquet streams were polled sequentially, creating a single-threaded CPU bottleneck. A TPCH lineitem scan on SF100 on an r7i.8xlarge EC2 instance took 56s.
  Network throughput: <img width="1044" height="278" alt="image" src="https://github.com/user-attachments/assets/20dfd1d2-6205-4ca5-9f3b-1fb5ca5c60a7" />
- After: each parquet stream is polled in its own `tokio::spawn()` task, giving true multi-threaded CPU utilization. The TPCH lineitem scan now takes 20s.
  Network throughput: <img width="855" height="337" alt="image" src="https://github.com/user-attachments/assets/8c775d45-27e5-43a1-83f4-c7d963896187" />

The key insight was that `try_flatten_unordered()` only provides async concurrency for the futures that *create* the streams; the actual stream polling, where the CPU work happens, was still sequential. By moving each stream's polling to its own tokio task via `stream_to_receiver()`, multiple files' CPU-intensive parquet operations can run simultaneously on different threads.

**Additional considerations**:
- [ ] Should we use `spawn_blocking` in `stream_to_receiver()`?
- [ ] Should we add a new configuration method on `ScanBuilder` to limit this new concurrency, or is it fine that it is the same as `concurrency_limit_data_files`?

## Are these changes tested?