baiguoname commented on issue #18381:
URL: https://github.com/apache/datafusion/issues/18381#issuecomment-3507291146

   > [@baiguoname](https://github.com/baiguoname) do let us know if the above 
suggestions work so that we can bring the issue to meaningful closure
   
   I don't think the previous suggestions will work. Here is a more 
detailed description of the scenario I have in mind:
   
   I have two `RecordBatch`es:
   
   `RecordBatch1` :
   ```
   code   value
   "A"      0.1
   "A"      0.2
   "C"      0.3
   "B"      0.4
   ```
   
   `RecordBatch2`:
   ```
   code   value
   "A"      0.5
   "D"      0.6
   ```
   
   And I have a `DataFrame` named `df` that receives the following 
stream of `RecordBatch`es:
   
   `Poll::Ready(Some(RecordBatch1))` -> `Poll::Ready(None)` -> 
`Poll::Ready(Some(RecordBatch2))` -> `Poll::Ready(None)`
   
   Suppose there is a method called `collect_but_not_consume` on `df` that 
collects the `df` like `collect` does, but without consuming it. When I call 
this method:
   
   ```rust
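   use datafusion::functions_aggregate::expr_fn::min;
   use datafusion::prelude::*;

   // Group by `code`, take the minimum `value` per group, then sort by `code`.
   let df = df
       .aggregate(
           vec![col("code")],
           vec![min(col("value")).alias("value")],
       )?
       .sort(vec![col("code").sort(true, true)])?;
   // `collect_but_not_consume` is the hypothetical method proposed above;
   // it does not exist in DataFusion today.
   let stream = df.collect_but_not_consume();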
   ```
   
   The output from the stream would be:
   1. For `Poll::Ready(Some(RecordBatch1))`, since the method behaves 
like `collect`, there is no output.
   2. For the first `Poll::Ready(None)`, the stream emits its result as 
`collect` normally would, but the `stream` is not consumed, so it continues 
to receive `RecordBatch`es from its children. For the `aggregate` operator, 
the `min` accumulator maintains its state for future reuse. The `sort` 
operator, as a pipeline breaker, does not retain historical data and sorts 
only the result for `RecordBatch1`.
   The output:
   ```
   code   value
   "A"      0.1
   "B"      0.4
   "C"      0.3
   ```
   3. For `Poll::Ready(Some(RecordBatch2))`, no output.
   4. For the second `Poll::Ready(None)`, the `min` for `"A"` still 
reflects the values seen in `RecordBatch1`.
   The output:
   ```
   code  value
   "A"    0.1
   "D"    0.6
   ```
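
The expected behaviour can be modelled without DataFusion. The following is a minimal sketch in plain Rust, not an implementation of any existing DataFusion API. All names in it (`Event`, `EpochMin`, `run`, `scenario`) are hypothetical. It assumes that each `Poll::Ready(None)` acts as an epoch boundary, that the per-code minimum is never reset, and that only codes seen since the previous boundary are emitted, sorted by code. That last rule is an interpretation of the two outputs shown above.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical input events mirroring the poll sequence in the scenario.
enum Event {
    /// Stands in for `Poll::Ready(Some(batch))`: rows of (code, value).
    Batch(Vec<(&'static str, f64)>),
    /// Stands in for `Poll::Ready(None)`, treated here as an epoch boundary.
    End,
}

/// Per-code minimum whose state survives across epoch boundaries.
#[derive(Default)]
struct EpochMin {
    /// Running minimum per code; never reset.
    mins: HashMap<String, f64>,
    /// Codes seen in the current epoch; a BTreeSet keeps them sorted.
    touched: BTreeSet<String>,
}

impl EpochMin {
    /// Fold one batch into the retained state. Nothing is emitted here.
    fn update(&mut self, rows: &[(&str, f64)]) {
        for &(code, value) in rows {
            let entry = self.mins.entry(code.to_string()).or_insert(value);
            *entry = entry.min(value);
            self.touched.insert(code.to_string());
        }
    }

    /// Emit the codes touched in this epoch in sorted order, then clear
    /// only the per-epoch set. The running minimums are kept.
    fn emit(&mut self) -> Vec<(String, f64)> {
        let touched = std::mem::take(&mut self.touched);
        touched
            .into_iter()
            .map(|code| {
                let min = self.mins[&code];
                (code, min)
            })
            .collect()
    }
}

/// Drive the state machine: one output per `Event::End`.
fn run(events: Vec<Event>) -> Vec<Vec<(String, f64)>> {
    let mut state = EpochMin::default();
    let mut outputs = Vec::new();
    for event in events {
        match event {
            Event::Batch(rows) => state.update(&rows),
            Event::End => outputs.push(state.emit()),
        }
    }
    outputs
}

/// The two batches and two end-of-stream signals from the scenario.
fn scenario() -> Vec<Event> {
    vec![
        Event::Batch(vec![("A", 0.1), ("A", 0.2), ("C", 0.3), ("B", 0.4)]),
        Event::End,
        Event::Batch(vec![("A", 0.5), ("D", 0.6)]),
        Event::End,
    ]
}
```

Calling `run(scenario())` yields two result sets, one per boundary. The second contains `"A"` with `0.1` rather than `0.5`, because the minimum from the first epoch is retained while the sorted output covers only the codes seen in the second epoch.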
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

