[
https://issues.apache.org/jira/browse/ARROW-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated ARROW-10577:
-----------------------------------
Labels: pull-request-available (was: )
> [Rust][DataFusion] Hash Aggregator stream finishes unexpectedly after going
> to Pending state
> --------------------------------------------------------------------------------------------
>
> Key: ARROW-10577
> URL: https://issues.apache.org/jira/browse/ARROW-10577
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust, Rust - DataFusion
> Affects Versions: 2.0.0
> Reporter: Remi Dettai
> Priority: Blocker
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> This happens when executing a DataFusion query plan with hash aggregation
> where the data source is not ready on the first call by the Executor, and the
> async state machine is passed to a _pending_ state
> In the {{Stream}} implem of {{GroupedHashAggregateStream}} and
> {{HashAggregateStream}}, the state is set to {{self.finished = true}} on the
> first call to {{poll_next()}}. If the inner stream is {{Poll::Pending}} on
> the first call, this means that the next call resolves to
> {{Poll::Ready(None)}}, thus finishing the stream instead of actually
> consuming the inner data.
> I think that it does not happen with most current sources because they never
> trigger the {{Poll::Pending}} state. Parquet is implemented with a blocking
> call inside {{poll_next()}} (which is also problematic but an other issue),
> Memory yields directly, and CSV also always yields {{Poll::Ready}}
> An analysis should be performed on all physical plans to check if the issue
> occurs in other places.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)