rdettai opened a new pull request #8658: URL: https://github.com/apache/arrow/pull/8658
> This happens when executing a DataFusion query plan with hash aggregation where the data source is not ready on the first call by the executor, so the async state machine moves to a pending state.
>
> In the Stream implementations of GroupedHashAggregateStream and HashAggregateStream, self.finished is set to true on the first call to poll_next(). If the inner stream returns Poll::Pending on that first call, the next call resolves to Poll::Ready(None), finishing the stream instead of actually consuming the inner data.
>
> I think this does not happen with most current sources because they never return Poll::Pending. Parquet is implemented with a blocking call inside poll_next() (which is also problematic, but a separate issue), Memory yields directly, and CSV also always yields Poll::Ready.
>
> An analysis should be performed on all physical plans to check whether the issue occurs in other places.

https://issues.apache.org/jira/browse/ARROW-10577

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
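
The failure mode described in this pull request can be sketched in a few lines of Rust. This is only an illustration under stated assumptions, not the DataFusion code: `MiniStream`, `SlowSource`, `SumAggregate`, `NoopWake`, and `first_item` are hypothetical names, the trait is a simplified stand-in for `futures::Stream` (it takes `&mut self` instead of a pinned reference), and a plain sum stands in for hash aggregation. With `eager_finish` set, the aggregate marks itself finished before its input has produced anything, so a single `Poll::Pending` from the source is enough to end the stream without a result.

```rust
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical, simplified stand-in for `futures::Stream` (no pinning).
trait MiniStream {
    type Item;
    fn poll_next(&mut self, cx: &mut Context<'_>) -> Poll<Option<Self::Item>>;
}

/// Source that is not ready on the first poll, then yields its values.
struct SlowSource {
    polled_once: bool,
    values: std::vec::IntoIter<i64>,
}

impl SlowSource {
    fn new(values: Vec<i64>) -> Self {
        SlowSource { polled_once: false, values: values.into_iter() }
    }
}

impl MiniStream for SlowSource {
    type Item = i64;
    fn poll_next(&mut self, cx: &mut Context<'_>) -> Poll<Option<i64>> {
        if !self.polled_once {
            self.polled_once = true;
            cx.waker().wake_by_ref(); // ask to be polled again
            return Poll::Pending;
        }
        Poll::Ready(self.values.next())
    }
}

/// Emits the sum of its input once, then ends.
struct SumAggregate<S> {
    input: S,
    sum: i64,
    finished: bool,
    eager_finish: bool, // true reproduces the reported behaviour
}

impl<S> SumAggregate<S> {
    fn new(input: S, eager_finish: bool) -> Self {
        SumAggregate { input, sum: 0, finished: false, eager_finish }
    }
}

impl<S: MiniStream<Item = i64>> MiniStream for SumAggregate<S> {
    type Item = i64;
    fn poll_next(&mut self, cx: &mut Context<'_>) -> Poll<Option<i64>> {
        if self.finished {
            return Poll::Ready(None);
        }
        if self.eager_finish {
            // Problem: set before the input has been consumed.
            self.finished = true;
        }
        loop {
            match self.input.poll_next(cx) {
                // State (`sum`, `finished`) must survive this early return.
                Poll::Pending => return Poll::Pending,
                Poll::Ready(Some(v)) => self.sum += v,
                Poll::Ready(None) => {
                    // Input exhausted: only now is it safe to finish.
                    self.finished = true;
                    return Poll::Ready(Some(self.sum));
                }
            }
        }
    }
}

/// Waker that does nothing; the driver below simply polls in a loop.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls until the stream is ready and returns the first item it reports.
fn first_item<S: MiniStream>(mut stream: S) -> Option<S::Item> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(item) = stream.poll_next(&mut cx) {
            return item;
        }
    }
}

fn main() {
    let eager = first_item(SumAggregate::new(SlowSource::new(vec![1, 2, 3]), true));
    let deferred = first_item(SumAggregate::new(SlowSource::new(vec![1, 2, 3]), false));
    println!("eager finish: {:?}, deferred finish: {:?}", eager, deferred);
}
```

In this sketch the remedy is to set the finished flag only once the input reports `Poll::Ready(None)`; whether the actual patch takes the same approach is not stated in the message above.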
