avantgardnerio commented on issue #23194: URL: https://github.com/apache/datafusion/issues/23194#issuecomment-4812304433
@gabotechs, I've reviewed dfd's architecture and your proposal above, and I think we can arrive at a common proposal that satisfies DF, dfd, and Ballista. Ideas:

1. The PoC's `StageBoundary` takes the same shape as `SamplerExec`, parameterized by a single capacity N. Small N = sampler-shaped (bounded peek with NDV/null-pct/velocity); N=∞ = pipeline breaker (exact stats, gated on spill). Same operator, one knob: streaming-vs-eager collapses into a scalar (with other possible tricks down the road, such as memory-pressure awareness, spill, and upstream pipeline-breaker awareness).
2. I think `SamplerExec` provides genuinely richer stats, and I think we can build stats that are a superset of `LoadInfo` and today's `PartitionStatistics` (adding `PartitionSortExtrema` while we are at it).
3. The `StageBoundary`/`SamplerExec` takes a `listener: Option<Arc<dyn StatsListener>>` in its constructor. DF would pass `None`, but upstream projects could get async notifications.

This way dfd's coordination layer stays completely intact: your oneshot→gRPC delivery becomes the `StatsListener` impl, and DF picks up the observation primitive natively.

Do you think this would be a fruitful direction? If you'd like to merge in `SamplerExec`, DF would have the perfect slot for it. Or, if you prefer, the work could be lifted directly with attribution.

As an aside, I would like to correct a confusing sentence from before. When I said:

> What this PR hopes to do is add the necessary primitives within Datafusion

I meant primitives + intra-DF-AQ-execution. This way, multiple projects don't need to redo the work of AQE-aware optimizations; the entire DF community would help with the AQE project. And what were previously single-core operations within DF (windows without `PARTITION BY`) become parallelized within a worker/executor.
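
To make ideas 1 and 3 concrete, here is a minimal Rust sketch of the shape being proposed. It is not the PoC's code or dfd's: `ObservedStats`, `Capacity`, `observe`, `on_stats`, and `CollectingListener` are names invented for illustration, the input is a plain slice rather than a record-batch stream, and the callback is synchronous where a real listener would presumably hand off to a channel for async delivery.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stats payload; a real one would carry NDV, velocity, etc.
#[derive(Debug, Clone, PartialEq)]
pub struct ObservedStats {
    pub rows_seen: usize,
    pub null_count: usize,
    /// True only when the whole input was observed (pipeline-breaker case).
    pub exact: bool,
}

/// Hypothetical listener trait; the method name and signature are assumptions.
pub trait StatsListener: Send + Sync {
    fn on_stats(&self, partition: usize, stats: &ObservedStats);
}

/// The single knob: small N behaves like a sampler, unbounded like a breaker.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Capacity {
    Bounded(usize),
    Unbounded,
}

pub struct StageBoundary {
    capacity: Capacity,
    listener: Option<Arc<dyn StatsListener>>,
}

impl StageBoundary {
    /// DF itself would pass `None`; a distributed engine passes its listener.
    pub fn new(capacity: Capacity, listener: Option<Arc<dyn StatsListener>>) -> Self {
        Self { capacity, listener }
    }

    /// Peek at up to `capacity` values of one partition and publish the stats.
    pub fn observe(&self, partition: usize, input: &[Option<i64>]) -> ObservedStats {
        let limit = match self.capacity {
            Capacity::Bounded(n) => n.min(input.len()),
            Capacity::Unbounded => input.len(),
        };
        let peeked = &input[..limit];
        let stats = ObservedStats {
            rows_seen: peeked.len(),
            null_count: peeked.iter().filter(|v| v.is_none()).count(),
            exact: limit == input.len(),
        };
        if let Some(listener) = &self.listener {
            listener.on_stats(partition, &stats);
        }
        stats
    }
}

/// Stand-in for a coordination layer: records every notification it receives.
#[derive(Default)]
pub struct CollectingListener {
    pub seen: Mutex<Vec<(usize, ObservedStats)>>,
}

impl StatsListener for CollectingListener {
    fn on_stats(&self, partition: usize, stats: &ObservedStats) {
        self.seen.lock().unwrap().push((partition, stats.clone()));
    }
}
```

In this sketch, `StageBoundary::new(Capacity::Bounded(2), Some(listener))` reports stats over the first two values and marks them inexact, while `StageBoundary::new(Capacity::Unbounded, None)` consumes the whole partition and reports exact stats without notifying anyone. A dfd-style integration would replace `CollectingListener` with an impl that forwards over its existing oneshot→gRPC path.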
