vustef opened a new pull request, #1871:
URL: https://github.com/apache/iceberg-rust/pull/1871

   ## Which issue does this PR close?
   
   https://relationalai.atlassian.net/browse/RAI-44217
   
   ## What changes are included in this PR?
   
   Adding true parallelism here. By default, the full scan (in `reader.rs`) offers
only concurrency, not parallelism. For the negative impact of that, see the
description of this closed draft:
https://github.com/apache/iceberg-rust/pull/1684. While the upstream design
intends for parallelism to be handled in the upper layers, since we diverge from
upstream at the moment, we can apply this hack to enable parallelism for us right away.
   What are the issues?
   When the `FileScanTaskStream` is processed, we process it concurrently;
however, without spawning, there is no parallelism. The impact of not spawning
for each stream item here is minimal, though, as those operations are IO-heavy
and concurrency is nearly enough. However, the output of processing each file in
the file stream is a record batch stream, and processing a record batch stream
is a CPU-heavy operation. Right now we process that concurrently as well (with
`try_flatten_unordered(N)`), but that is not enough: for CPU-heavy work we
definitely need parallelism.
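   To illustrate the distinction the paragraph above draws, here is a minimal, hedged sketch using only the standard library (the actual code uses async tasks, not OS threads; `decode_batch` is a hypothetical stand-in for CPU-heavy record-batch work): interleaving CPU-bound items on one thread gives concurrency but no speedup, while spawning gives true parallelism.

```rust
use std::thread;
use std::time::Instant;

// Hypothetical stand-in for decoding/projecting one record batch (CPU-heavy).
fn decode_batch(n: u64) -> u64 {
    (0..n).fold(0u64, |acc, x| acc.wrapping_add(x * x))
}

fn main() {
    let work = vec![5_000_000u64; 4];

    // Single-threaded: items can interleave (concurrency), but CPU-bound
    // work still runs one item at a time.
    let t = Instant::now();
    let sequential: u64 = work.iter().map(|&n| decode_batch(n)).sum();
    println!("single-threaded: {:?}", t.elapsed());

    // True parallelism: each CPU-heavy item runs on its own OS thread.
    let t = Instant::now();
    let handles: Vec<_> = work
        .iter()
        .map(|&n| thread::spawn(move || decode_batch(n)))
        .collect();
    let parallel: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("parallel: {:?}", t.elapsed());

    assert_eq!(sequential, parallel);
}
```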
   
   So what can we do?
   First, we can create a channel, spawn a task, and return the receiver side of
the channel. In the spawned task, we populate the transmitter side. Here we have
two options:
   1. Spawn for each file.
   2. Don't spawn, since these are IO-bound operations.
   I chose to spawn, with the idea of squeezing out parallelism. In some cases
this will add latency, though, and we may make it an option (or decide on a
different default in this PR).
   Then, for each file, we need to process the batches in the
`record_batch_stream`. Since this already happens in a spawned task (if we
choose option 2 above, we should at least spawn around the processing of
`record_batch_stream`), the CPU-heavy work will be parallelized. But if we only
have one file, processing its batches won't be parallelized. So again we have
two strategies:
   1. Iterate through the batches without spawning per batch.
   2. Spawn per batch.
   I chose to spawn per batch in this PR.
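   Strategy 2 can be sketched like this, again as a hedged stdlib analogy (the real code would spawn async tasks; `process_batch` is a hypothetical stand-in for the CPU-heavy per-batch work): each batch gets its own spawned worker, so even a single file's batches run in parallel, with results funneled through a channel.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for CPU-heavy per-batch work (decode, filter, project).
fn process_batch(batch: u64) -> u64 {
    batch * 2
}

// Spawn one worker per batch; collect results from the channel.
fn process_file(batches: Vec<u64>) -> Vec<u64> {
    let (tx, rx) = mpsc::channel();
    for batch in batches {
        let tx = tx.clone();
        thread::spawn(move || {
            let _ = tx.send(process_batch(batch));
        });
    }
    drop(tx); // close the channel once all workers finish
    let mut out: Vec<u64> = rx.iter().collect();
    out.sort(); // spawning per batch does not preserve batch order
    out
}

fn main() {
    assert_eq!(process_file(vec![1, 2, 3]), vec![2, 4, 6]);
    println!("ok");
}
```

   The sort above highlights the trade-off of per-batch spawning: throughput at the cost of ordering (and some spawn overhead per batch).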
   
   For incremental streams, we already spawn for each file; however, we don't
spawn per batch. To align the two paths, I'm modifying that code to spawn per
batch as well.
   
   ## Are these changes tested?
   
   Existing tests exercise these code paths. I haven't measured performance yet,
as that is a manual process on EC2 instances.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

