thinkharderdev commented on issue #284:
URL: https://github.com/apache/arrow-ballista/issues/284#issuecomment-1258123350
👍
I would add a couple of things to this discussion. Currently, partitioning is a bit restrictive: either the client has to specify the partitioning or we fall back to a single default partitioning. This has several limitations:
1. Partitioning should be sensitive to cluster capacity. Adding more partitions is probably counter-productive if you don't have enough executor capacity to execute them concurrently; all you are doing is adding scheduling overhead and shuffle exchange cost (see the sketch after this list).
2. The client probably has no information about the cluster capacity (which may itself be dynamic based on auto-scaling rules), nor does it necessarily know how much data a given query scans. So the client lacks the information needed to decide on an appropriate number of partitions.
3. Allowing the client to specify the number of partitions can be
problematic for multi-tenancy since there is no way to prevent a greedy client
from consuming too many cluster resources.
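To make point 1 concrete, here is a minimal sketch of the kind of scheduler-side heuristic I have in mind. This is not Ballista's actual API; `ClusterCapacity` and its fields are hypothetical, and the clamping rule is just one possible policy:
```
/// Hypothetical snapshot of cluster capacity; the real scheduler tracks
/// executor metadata differently, so treat these fields as assumptions.
struct ClusterCapacity {
    executors: usize,
    task_slots_per_executor: usize,
}

/// Clamp a requested partition count to the number of task slots that can
/// actually run concurrently: any partitions beyond that only add
/// scheduling overhead and shuffle-exchange cost.
fn effective_partition_count(requested: usize, capacity: &ClusterCapacity) -> usize {
    let concurrent_slots = capacity.executors * capacity.task_slots_per_executor;
    requested.min(concurrent_slots.max(1))
}
```
Because the scheduler (unlike the client) can see the current executor pool, a heuristic like this could also be re-evaluated as auto-scaling changes capacity, and could enforce per-tenant caps for point 3.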
There may not be an optimal general solution for all of this, so I think it might be useful to make this functionality pluggable. Something I've been tinkering with is putting the planning behind a `trait` to allow for custom implementations. Something like:
```
trait BallistaPlanner {
    /// Split a physical plan into query stages, each terminated by a shuffle write.
    fn plan_query_stages<'a>(
        &'a mut self,
        job_id: &'a str,
        execution_plan: Arc<dyn ExecutionPlan>,
    ) -> Result<Vec<Arc<ShuffleWriterExec>>>;
}

trait BallistaPlannerFactory {
    type Output: BallistaPlanner;

    /// Build a planner from the scheduler state, so it can see e.g. executor capacity.
    fn new_planner<T: 'static + AsLogicalPlan, U: 'static + AsExecutionPlan>(
        state: &Arc<SchedulerState<T, U>>,
    ) -> Self::Output;
}
```
The existing `BallistaPlanner` could be the default implementation, of course, but this could give us a nice mechanism for experimenting with different implementations as well.
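For illustration, a capacity-aware planner could then be slotted in without touching the scheduler core. This is only a sketch assuming the trait above; `CapacityAwarePlanner`, its field, and the stage-splitting step are hypothetical, and the import paths may vary by version:
```
use std::sync::Arc;

use ballista_core::error::Result; // paths approximate
use ballista_core::execution_plans::ShuffleWriterExec; // paths approximate
use datafusion::physical_plan::ExecutionPlan;

/// Hypothetical planner that caps fan-out based on a capacity snapshot
/// taken when the planner was constructed by its factory.
struct CapacityAwarePlanner {
    max_concurrent_tasks: usize,
}

impl BallistaPlanner for CapacityAwarePlanner {
    fn plan_query_stages<'a>(
        &'a mut self,
        job_id: &'a str,
        execution_plan: Arc<dyn ExecutionPlan>,
    ) -> Result<Vec<Arc<ShuffleWriterExec>>> {
        // Sketch only: repartition `execution_plan` so that no stage produces
        // more than `self.max_concurrent_tasks` partitions, then split it into
        // shuffle stages the same way the default planner does today.
        let _ = (job_id, execution_plan, self.max_concurrent_tasks);
        todo!("stage-splitting logic elided")
    }
}
```
The factory trait is what makes this practical: since `new_planner` receives the scheduler state, a custom factory could snapshot executor capacity there and bake it into the planner it returns.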