Re: [PR] Add generate_series() udtf (and introduce 'lazy' `MemoryExec`) [datafusion]

via GitHub Wed, 27 Nov 2024 09:30:16 -0800


2010YOUY01 commented on PR #13540:
URL: https://github.com/apache/datafusion/pull/13540#issuecomment-2504046122


   Thank you for the comments @jayzhan211 , I have updated.
   
   Now I think the concurrent generator might not be very straightforward, I'll 
first leave some rationale here.
   If we can come to agreement I'll add more doc to it (else revert back to 
`Box` version for simplicity)
   
   ----
   ### Rationale for `StreamingMemoryStream` design
   Suppose we want to run `select ... from generate_series(1,100)` in two 
partitions
   And the underlying batch generator is wrapped with Mutex
   ```
   pub struct StreamingMemoryStream {
       ...
       generator: Arc<Mutex<dyn StreamingBatchGenerator>>,
   }
   ```
   It's possible to implement the UDTF in 3 ways:
   1.
   ```
   [generator_1_50]   --- [StreamingMemoryStream  stream1] --> xxStream1
   [generator_50_100] --- [StreamingMemoryStream  stream2] --> xxStream1
   ```
   2.
   ```
   [generator_1_100] --- [StreamingMemoryStream  stream1] --> Repartition --> 
xxStream1
                                                                          |-> 
xxStream2                                                                       
         
   ```
   3.
   ```
   [generator_1_100] --- [StreamingMemoryStream  stream1] --> xxStream1
                     |-- [StreamingMemoryStream  stream2] --> xxStream2         
                                                                                
                                             
   ```
   1 and 2 is the common pattern for datafusion scanning operators to do 
plan-time parallelism, `Mutex` is not used in this case
   3 make the `StreamingBatchGenerator` being able to concurrently accessed by 
multiple streams. It's trying to make the interface more general-purpose for 
future usage, and won't be used by `generate_series()`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Add generate_series() udtf (and introduce 'lazy' `MemoryExec`) [datafusion]

Reply via email to