thinkharderdev opened a new issue, #30: URL: https://github.com/apache/arrow-ballista/issues/30
Now that Ballista is its own top-level project, I want to extend a previous discussion around what exactly Ballista is (previous discussion: https://github.com/apache/arrow-datafusion/issues/1916). I believe the consensus from that discussion was that Ballista should be a standalone system, but in practice I think we have adopted somewhat of a “both/and” approach, in the sense that we have added several extension points to Ballista (while also providing default implementations that allow you to run Ballista out of the box). I think this is a reasonable direction and agree that it is important that Ballista be something you can use “as is”. That raises the question: what are the use cases for Ballista that we would like to optimize for?

To take a concrete example, I think the current architecture is optimized for batch processing using a straightforward map-reduce implementation. This has many advantages:
* Scheduling is relatively straightforward.
* Cluster utilization is efficient since the unit of schedulable work is a single task, so whenever a task slot frees up you can just schedule a pending task on it.
* It is resilient to task failures since all intermediate results are serialized to disk, so recovering from a spurious task failure is just a matter of rescheduling the task.

However, for non-batch-oriented work it has some serious drawbacks:
* Results can only be returned once the entire query finishes. We can’t start streaming results as soon as they are available.
* Stages are scheduled sequentially, so each stage is bound by its slowest partition.
* Queries that return large result sets can be quite resource intensive, as each partition must serialize its entire shuffle result to disk.

There has already been some excellent proof-of-concept work done on a different scheduling paradigm in https://github.com/apache/arrow-datafusion/pull/1842 (something my team hopes to help push forward in the near future), but this raises some questions of its own. Namely, when do we use streaming vs. map-reduce execution? In the PoC it is a very simple heuristic, but in real use cases I’m not sure there is a one-size-fits-all solution. In some cases you may want to use only streaming execution and rely on auto-scaling or dynamic partitioning to make sure each query can be scheduled promptly. Or you may only care about resource utilization and want to disable streaming execution entirely. This is one particular example, but you can imagine many others.

To bring this back around to some concrete options, I think there are a few different ways we can go:
1. Ballista is a distributed computing framework which can be customized for many different use cases. There are default implementations available which allow you to use Ballista as a standalone system, but the internal implementation is defined in terms of interfaces which allow for customized implementations as needed (see the first sketch after this list).
2. Ballista is a standalone system with limited customization capabilities which is highly optimized for its intended use case. You can plug in certain things (ObjectStore implementations, UDFs, UDAFs, etc.), but the core functionality (scheduler, state management, etc.) is what it is, and if it doesn’t work for your use case then this is not the right solution for you.
3. Ballista is a standalone system which is highly configurable but not highly extensible.
   It ships with, for instance, schedulers optimized for different use cases which can be enabled through runtime configuration (or build-time features), but you can’t just plug in your own custom scheduler implementation (without upstreaming it :)). (The second sketch below illustrates this kind of configuration-driven selection.)
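
To make option 1 a bit more concrete, here is a minimal sketch of what a trait-based extension point for scheduling could look like. Everything in it (the `SchedulingPolicy` trait, the `Task` and `SchedulerError` placeholders, and the two implementations) is invented for illustration and is not an existing Ballista API:

```rust
/// Placeholder types standing in for Ballista's richer internal equivalents.
pub struct Task {
    pub job_id: String,
    pub stage_id: usize,
    pub partition: usize,
}

pub struct SchedulerError(pub String);

/// Hypothetical extension point for option 1: the scheduling policy is a trait,
/// and the stock binary wires in a default implementation.
pub trait SchedulingPolicy: Send + Sync {
    /// Given the number of free task slots, pick which pending tasks to launch next.
    fn next_tasks(&self, free_slots: usize) -> Result<Vec<Task>, SchedulerError>;
}

/// Default out-of-the-box behaviour: stage-by-stage map-reduce scheduling,
/// with shuffle results written to disk between stages.
pub struct MapReduceScheduling;

impl SchedulingPolicy for MapReduceScheduling {
    fn next_tasks(&self, free_slots: usize) -> Result<Vec<Task>, SchedulerError> {
        // Only tasks from currently runnable stages are eligible; downstream
        // stages wait until all upstream shuffle outputs exist.
        let _ = free_slots;
        Ok(Vec::new())
    }
}

/// An embedder could plug in something different, e.g. the pipelined execution
/// explored in https://github.com/apache/arrow-datafusion/pull/1842.
pub struct StreamingScheduling;

impl SchedulingPolicy for StreamingScheduling {
    fn next_tasks(&self, free_slots: usize) -> Result<Vec<Task>, SchedulerError> {
        // Launch downstream tasks as soon as upstream partitions start producing
        // data, rather than waiting for whole stages to finish.
        let _ = free_slots;
        Ok(Vec::new())
    }
}

/// The scheduler is generic over the policy, so the default deployment uses
/// `MapReduceScheduling` while custom deployments can swap in their own.
pub struct Scheduler<P: SchedulingPolicy> {
    pub policy: P,
}
```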

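For contrast, here is a minimal sketch of option 3, where the set of schedulers is closed but selectable through configuration. The `ExecutionMode` enum and the accepted config values are likewise invented for illustration:

```rust
/// Hypothetical runtime configuration for option 3: the user selects one of the
/// built-in scheduling modes, but cannot supply their own implementation.
#[derive(Debug, Clone, Copy)]
pub enum ExecutionMode {
    /// Stage-by-stage map-reduce with shuffle files on disk (today's behaviour).
    BatchMapReduce,
    /// Pipelined execution that streams results as they become available.
    Streaming,
}

impl ExecutionMode {
    /// Parse a config value such as "batch" or "streaming"
    /// (the accepted values are invented for illustration).
    pub fn parse(value: &str) -> Option<Self> {
        match value {
            "batch" => Some(ExecutionMode::BatchMapReduce),
            "streaming" => Some(ExecutionMode::Streaming),
            _ => None,
        }
    }
}

fn main() {
    // In a real deployment this would come from CLI flags or a config file.
    let mode = ExecutionMode::parse("batch").expect("unknown scheduler mode");
    match mode {
        ExecutionMode::BatchMapReduce => println!("using the default map-reduce scheduler"),
        ExecutionMode::Streaming => println!("using the streaming scheduler"),
    }
}
```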