houqp edited a comment on pull request #1560: URL: https://github.com/apache/arrow-datafusion/pull/1560#issuecomment-1017186504
> This is a great point. My current thinking is to leverage modern orchestration technologies such as k8s to help. That is, leave resource allocation on the physical cluster to k8s, and define a few universal slot sizes for executors so that we don't need to deal with hardware-level resource allocation ourselves; this also keeps things simple. I know there will be use cases that still prefer the traditional way of managing a single physical cluster, so we do need to evaluate the impact on that model. The main reason I am thinking about moving away from Spark (and looking at DF as a replacement) is its deployment model and the operational cost that comes with it, so I think being able to leverage k8s (or another orchestration platform) would be a good feature for us to explore and support.
>
> That said, I do have a fairly strong feeling on this topic. Specifically, I think it would be good for Ballista to be self-contained and not have a hard dependency on complex technologies like k8s, Nomad, etc. The k8s scheduling feature should be one of the options users can configure. I am 100% with you on putting a simple deployment and operation model front and center as one of the big selling points for Ballista.
>
> Just throwing out a random idea here: for k8s-backed clusters, what if, instead of running a fixed set of executors ahead of time, we let the scheduler create and schedule a new pod for each task it needs to assign? Each executor pod would run to completion and report back its task status before exiting. This way, we avoid all the unnecessary polling required for a long-running executor and can let k8s fully manage resource allocation at the task level. There would also be no state management required from the scheduler's point of view, because it effectively deals with an unbounded pool of executors. This setup could easily be extended to other container orchestration systems such as AWS ECS; in fact, we could even run each task as an AWS Lambda function if we wanted.
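The pod-per-task idea above can be sketched as a small simulation. This is purely illustrative: `Task`, `run_worker`, and `schedule` are hypothetical names, not Ballista or Kubernetes APIs, and a thread pool stands in for the k8s control plane. The point it demonstrates is the shape of the model: one ephemeral worker per task, each worker running to completion and reporting status exactly once, with no long-lived executor registry for the scheduler to poll.

```python
# Minimal sketch of the proposed "pod per task" model (assumed names throughout).
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    payload: str


def run_worker(task: Task) -> tuple[str, str]:
    """Stand-in for a single-use executor pod: do the work, then exit."""
    result = task.payload.upper()      # pretend this is executing a query stage
    return (task.task_id, result)      # "report back task status before exit"


def schedule(tasks: list[Task]) -> dict[str, str]:
    """Scheduler side: launch one ephemeral worker per task, collect reports."""
    # A real version would create a k8s Job/Pod per task; a thread pool
    # stands in for the orchestrator here so the sketch is runnable.
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(run_worker, tasks))


statuses = schedule([Task("t1", "scan"), Task("t2", "join")])
print(statuses)  # {'t1': 'SCAN', 't2': 'JOIN'}
```

Because each worker is single-use, the scheduler holds no executor state between tasks; scaling is entirely the orchestrator's problem, which is what makes the same shape portable to ECS or Lambda.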
> Completely agree. I want to add that, since we are a community here, it is great to share different viewpoints and make a reasonable decision that we commit to improving and maintaining going forward. I am hoping that whichever way we choose to move forward, we will make this a competing (hopefully better) platform than Spark. Perhaps, as @yahoNanJing said earlier, we can keep both modes for now and make the final decision once we are able to gather more data from running different real-world workloads on larger clusters? Technical decisions are usually much easier to make when they are backed by real data :)
