milenkovicm commented on issue #1359:
URL:
https://github.com/apache/datafusion-ballista/issues/1359#issuecomment-3732471645
> * Im curious about optimizations like two-phased aggregations, which would
span stages (local agg in one, global in the next). I’ve seen limited examples
of adaptive two-phase aggregations where the local agg can be disabled in the
presence of a large keyspace. Have you given any thought to how the design
might allow/disallow future extensions for rules that span > 1 stage and aim to
adapt a running stage? Sounds like such rules wouldn’t be possible with a pass
after each stage, which is fine, but wanted to make that thought explicit.
I believe
`datafusion.execution.skip_partial_aggregation_probe_ratio_threshold` controls
this, in ballista case this would mean it will just write everything to shuffle
file. Not sure if we could make some kind of optimizer rule which could use
cardinality from previous stage to help with this.
Anyway, good part of work to make it easier to experiment with this things
> * For observability sake, do we plan to log out or keep in memory each
round of optimizations and the original execution graph?
At the moment prototype holds whole plan and traverse whole plane every
after every stage. It may be good for observability but it may affect planning
time, thus it may not be the best way forward. At the moment we can keep it as
it is until we get it running and then consider how to handle planning and
observability. I do agree that is a bit of a challenge to understand what
happened with plan.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]