2010YOUY01 commented on PR #13540:
URL: https://github.com/apache/datafusion/pull/13540#issuecomment-2504046122
Thank you for the comments @jayzhan211 , I have updated.
Now I think the concurrent generator might not be very straightforward, I'll
first leave some rationale here.
If we can come to agreement I'll add more doc to it (else revert back to
`Box` version for simplicity)
----
### Rationale for `StreamingMemoryStream` design
Suppose we want to run `select ... from generate_series(1,100)` in two
partitions
And the underlying batch generator is wrapped with Mutex
```
pub struct StreamingMemoryStream {
...
generator: Arc<Mutex<dyn StreamingBatchGenerator>>,
}
```
It's possible to implement the UDTF in 3 ways:
1.
```
[generator_1_50] --- [StreamingMemoryStream stream1] --> xxStream1
[generator_50_100] --- [StreamingMemoryStream stream2] --> xxStream1
```
2.
```
[generator_1_100] --- [StreamingMemoryStream stream1] --> Repartition -->
xxStream1
|->
xxStream2
```
3.
```
[generator_1_100] --- [StreamingMemoryStream stream1] --> xxStream1
|-- [StreamingMemoryStream stream2] --> xxStream2
```
1 and 2 is the common pattern for datafusion scanning operators to do
plan-time parallelism, `Mutex` is not used in this case
3 make the `StreamingBatchGenerator` being able to concurrently accessed by
multiple streams. It's trying to make the interface more general-purpose for
future usage, and won't be used by `generate_series()`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]