rdettai opened a new pull request #8658:
URL: https://github.com/apache/arrow/pull/8658


   > This happens when executing a DataFusion query plan with hash aggregation 
where the data source is not ready on the first call by the Executor, so the 
async state machine enters a pending state.
   > 
   > In the Stream implementations of GroupedHashAggregateStream and 
HashAggregateStream, self.finished is set to true on the first call to 
poll_next(). If the inner stream returns Poll::Pending on that first call, the 
next call resolves to Poll::Ready(None), finishing the stream instead of 
actually consuming the inner data.
   > 
   > I think this does not happen with most current sources because they 
never return Poll::Pending. Parquet is implemented with a blocking call inside 
poll_next() (which is also problematic, but another issue), Memory yields 
directly, and CSV also always yields Poll::Ready.
   > 
   > An analysis should be performed on all physical plans to check if the 
issue occurs in other places.
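
The poll_next() behaviour described above can be reproduced in a small, std-only sketch. Everything here is hypothetical and simplified for illustration: `MiniStream` stands in for the real `Stream` trait, `SlowSource` is an invented input that returns Poll::Pending on its first poll, and `SumStream` is a toy aggregate with a `buggy` switch. It is not the DataFusion code, only a model of the early `finished = true` assignment and of deferring it until the input is exhausted.

```rust
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical stand-in for a Stream trait, so the sketch needs only std.
trait MiniStream {
    type Item;
    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>>;
}

// Invented input: Pending on the first poll, then yields its values one by one.
struct SlowSource {
    polled_once: bool,
    values: Vec<i64>,
}

impl MiniStream for SlowSource {
    type Item = i64;
    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<i64>> {
        let this = self.get_mut();
        if !this.polled_once {
            this.polled_once = true;
            cx.waker().wake_by_ref(); // ask to be polled again
            return Poll::Pending;
        }
        Poll::Ready(this.values.pop())
    }
}

// Toy aggregate: sums the whole input and emits a single result.
struct SumStream {
    input: SlowSource,
    acc: i64,
    finished: bool,
    buggy: bool, // true models the behaviour reported in the issue
}

impl MiniStream for SumStream {
    type Item = i64;
    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<i64>> {
        let this = self.get_mut();
        if this.finished {
            return Poll::Ready(None);
        }
        if this.buggy {
            // Bug: marked finished before the input has produced anything.
            this.finished = true;
        }
        loop {
            match Pin::new(&mut this.input).poll_next(cx) {
                // With the bug, the next call hits the `finished` check above.
                Poll::Pending => return Poll::Pending,
                Poll::Ready(Some(v)) => this.acc += v,
                Poll::Ready(None) => {
                    // Fix: only mark finished once the result is ready.
                    this.finished = true;
                    return Poll::Ready(Some(this.acc));
                }
            }
        }
    }
}

// Waker that does nothing; the driver below simply polls in a loop.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Polls a SumStream to completion and collects everything it yields.
fn drain(buggy: bool) -> Vec<i64> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut stream = SumStream {
        input: SlowSource { polled_once: false, values: vec![1, 2, 3] },
        acc: 0,
        finished: false,
        buggy,
    };
    let mut out = Vec::new();
    loop {
        match Pin::new(&mut stream).poll_next(&mut cx) {
            Poll::Pending => continue,
            Poll::Ready(Some(v)) => out.push(v),
            Poll::Ready(None) => return out,
        }
    }
}
```

In this sketch, `drain(true)` returns an empty vector because the stream ends right after the first Poll::Pending, while `drain(false)` returns a single sum once the input has been consumed.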
   
   
   https://issues.apache.org/jira/browse/ARROW-10577


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

