vustef opened a new pull request, #1871: URL: https://github.com/apache/iceberg-rust/pull/1871
## Which issue does this PR close?

https://relationalai.atlassian.net/browse/RAI-44217

## What changes are included in this PR?

This PR adds true parallelism to the full scan. By default, the full scan (in `reader.rs`) has only concurrency, not parallelism; for the negative impact of that, see the description of this closed draft: https://github.com/apache/iceberg-rust/pull/1684. Upstream, the design intent is for parallelism to be added in the layers above, but since we diverge from upstream at the moment, we can apply this workaround to enable parallelism right away.

What are the issues? When the `FileScanTaskStream` is processed, we process it concurrently, but without spawning there is no parallelism. The impact of not spawning per stream item is minimal here, since those operations are IO-heavy and concurrency is nearly enough. However, the output of processing each file in the stream is a record batch stream, and processing a record batch stream is CPU-heavy. Right now we also process that concurrently (with `try_flatten_unordered(N)`), but that is not enough: for CPU-heavy work we need actual parallelism.

So what can we do? First, we can create a channel, spawn a task, and return the receiver side of the channel, while the spawned task populates the transmitter side (a sketch of this pattern is included at the end of this description). Within that task we have two options:

1. Spawn for each file.
2. Don't spawn, since these are IO-bound operations.

I chose to spawn, with the idea of squeezing out parallelism. In some cases this will add latency, so we may make it configurable (or decide on a different default in this PR).

Then, for each file, we need to process the batches in its `record_batch_stream`. Since this already happens inside the spawned task (and if we chose option 2 above, we should at least spawn around processing the `record_batch_stream`), the CPU-heavy work is parallelized across files. But if there is only one file, its batches won't be processed in parallel. So again we have two strategies:

1. Iterate through batches without spawning per batch.
2. Spawn per batch.

I chose to spawn per batch in this PR.

For incremental streams, we already spawn per file but not per batch. To align the two code paths, this PR changes that code to spawn per batch as well.

## Are these changes tested?

Existing tests exercise these code paths. Performance has not been measured yet, as that is a manual process on EC2 instances.
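For illustration only, here is a minimal sketch of the channel-plus-spawn pattern described above, assuming a tokio multi-threaded runtime and the `futures`, `tokio`, and `tokio_stream` crates. `FileScanTask`, `RecordBatch`, `open_file`, and `process_batch` are hypothetical placeholders, not the actual iceberg-rust types or APIs; the real implementation in this PR differs in its details.

```rust
use futures::{Stream, StreamExt};
use tokio::sync::mpsc;
use tokio_stream::wrappers::ReceiverStream;

type Result<T> = std::result::Result<T, Box<dyn std::error::Error + Send + Sync>>;

// Hypothetical placeholders for the real reader internals.
struct FileScanTask;
struct RecordBatch;

// Placeholder: open one file and collect its record batches (IO-heavy).
async fn open_file(_task: FileScanTask) -> Result<Vec<RecordBatch>> {
    Ok(vec![RecordBatch])
}

// Placeholder: CPU-heavy per-batch work (decoding, filtering, projection).
fn process_batch(batch: RecordBatch) -> Result<RecordBatch> {
    Ok(batch)
}

/// Turn a stream of file scan tasks into a stream of processed batches,
/// spawning per file and per batch so CPU-heavy work can run in parallel.
fn parallel_scan(
    tasks: impl Stream<Item = FileScanTask> + Send + 'static,
    channel_capacity: usize,
) -> impl Stream<Item = Result<RecordBatch>> {
    let (tx, rx) = mpsc::channel(channel_capacity);

    // Outer task drives the file scan task stream and owns the sender side;
    // the receiver side is returned to the caller as a stream.
    tokio::spawn(async move {
        let mut tasks = std::pin::pin!(tasks);
        while let Some(task) = tasks.next().await {
            let tx = tx.clone();
            // Option 1 above: spawn per file (IO-heavy).
            tokio::spawn(async move {
                match open_file(task).await {
                    Ok(batches) => {
                        for batch in batches {
                            let tx = tx.clone();
                            // Spawn per batch so a single large file still parallelizes.
                            tokio::spawn(async move {
                                let _ = tx.send(process_batch(batch)).await;
                            });
                        }
                    }
                    Err(e) => {
                        let _ = tx.send(Err(e)).await;
                    }
                }
            });
        }
        // When all senders are dropped, the channel closes and the stream ends.
    });

    ReceiverStream::new(rx)
}
```

Note that, like `try_flatten_unordered(N)`, this sketch does not preserve batch order across files, and in practice the CPU-heavy per-batch work could instead go through `tokio::task::spawn_blocking` to avoid starving the IO-bound tasks.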
