andygrove commented on issue #23194: URL: https://github.com/apache/datafusion/issues/23194#issuecomment-4848793652
> > Now that AQE/AQP has been fleshed out with an implementation and substantive discussion, I'm curious to know if you still prefer:

> > My personal preference is to see AQP implemented outside the core datafusion crate / repo, be validated that it works, and is broadly applicable, and then potentially bring it into the core to be maintained along with the other code here

It makes sense to prove this out in real distributed consumers first. We are already doing that in Ballista, and @gabotechs is in datafusion-distributed.

The friction for Ballista isn't the AQE logic. It is that core DataFusion increasingly assumes a plan runs in one process, and that breaks the moment a plan is split at a pipeline breaker, serialized, and executed one partition per task. Three we hit just upgrading to DF 54 (apache/datafusion-ballista#1906):

- `DataSourceExec`'s shared scan work queue is only divided correctly when all partitions are polled together. A task polling its one partition drained the queue and read the whole table (https://github.com/apache/datafusion-ballista/issues/1907).
- An uncorrelated scalar subquery's `ScalarSubqueryExpr` only decodes inside its `ScalarSubqueryExec`. Once stage splitting separates them, the stage no longer round trips (https://github.com/apache/datafusion-ballista/issues/1909).
- Runtime dynamic filter pushdown delivers a filter from build side to probe scan within one plan instance. Split across a stage boundary it never arrives and the probe stalls.

I'll start filing issues in core DF when I hit things like this, that are regressions from Ballista's point of view, to help (hopefully) with the conversation.

My own preference is that modeling some of this in core (eventually) would help, because it lets core catch these regressions before a release rather than after. All three shipped in DF 54 and only surfaced once we upgraded.
Perhaps this is a new crate in core that models a distributed-style approach, but still in-process, just to catch these kinds of issues.
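For illustration only, the first failure mode above (a shared scan work queue drained by a task that polls a single partition) can be sketched in a few lines of Rust. This is not DataFusion or Ballista code: `SharedScanQueue`, `run_in_process`, and `run_one_task` are hypothetical names, and the queue is a simplified stand-in that assumes work is handed out one file per poll.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a scan's shared work queue: every partition
/// pulls its next unit of work (here, a file name) from the same queue.
#[derive(Clone)]
struct SharedScanQueue {
    files: Arc<Mutex<VecDeque<String>>>,
}

impl SharedScanQueue {
    fn new(files: &[&str]) -> Self {
        let files = files.iter().map(|f| f.to_string()).collect();
        Self { files: Arc::new(Mutex::new(files)) }
    }

    /// One "poll" of a partition: take the next file, if any is left.
    fn poll_partition(&self) -> Option<String> {
        self.files.lock().unwrap().pop_front()
    }
}

/// Single-process execution: all partitions are polled together
/// (round-robin here), so the queue ends up divided between them.
fn run_in_process(queue: &SharedScanQueue, partitions: usize) -> Vec<Vec<String>> {
    let mut out = vec![Vec::new(); partitions];
    let mut done = false;
    while !done {
        done = true;
        for files in out.iter_mut() {
            if let Some(f) = queue.poll_partition() {
                files.push(f);
                done = false;
            }
        }
    }
    out
}

/// Distributed-style execution (assumed model): the task rebuilds the plan,
/// gets its own fresh queue, and polls only one partition until exhaustion.
fn run_one_task(files: &[&str]) -> Vec<String> {
    let queue = SharedScanQueue::new(files);
    let mut read = Vec::new();
    while let Some(f) = queue.poll_partition() {
        read.push(f);
    }
    read
}

fn main() {
    let files = ["a.parquet", "b.parquet", "c.parquet", "d.parquet"];
    // Two partitions polled together share the four files.
    println!("in-process: {:?}", run_in_process(&SharedScanQueue::new(&files), 2));
    // A single task polling one partition reads every file.
    println!("one task:   {:?}", run_one_task(&files));
}
```

Under these assumptions, an in-process harness that rebuilds the plan per partition and polls each one in isolation, as `run_one_task` does, would expose the over-read without needing a real cluster.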
