houqp edited a comment on pull request #1560:
URL: https://github.com/apache/arrow-datafusion/pull/1560#issuecomment-1017186504


   > This is a great point. My current thinking is to leverage modern 
orchestration technologies such as k8s to help. That is, if we leave it to k8s 
to figure out resource allocation on the physical cluster, we can set a few 
universal slot sizes for executors so we don't need to deal with hardware-level 
resource allocation - this is also for simplicity. I know there will still be 
use cases that prefer the traditional way of managing a single physical 
cluster, so we do need to evaluate the impact on that model. The main reason I 
am thinking about moving away from Spark (and looking at DF as a replacement) 
is its deployment model and the operational cost that comes with it.
   
   I think being able to leverage k8s (or another orchestration platform) would 
be a good feature for us to explore and support, although I do have a fairly 
strong opinion on this topic. Specifically, I think it would be good for 
Ballista to be self-contained and not have a hard dependency on complex 
technologies like k8s, Nomad, etc. For the k8s scheduling feature, I think we 
should make it one of the options users can choose to enable. I am 100% with 
you on putting a simple deployment and operational model front and center as 
one of the big selling points for Ballista.
   
   Just throwing out random ideas here. For k8s-backed clusters, what if, 
instead of running a fixed set of executors ahead of time, we let the scheduler 
create and schedule a new pod for each task it needs to assign? Each executor 
pod would then run to completion and report back its task status before 
exiting. This way, we avoid all the unnecessary polling required for a 
long-running executor and can let k8s fully manage resource allocation at the 
task level. There would be no state management required from the scheduler's 
point of view either, because it would effectively be dealing with an 
"infinite" pool of executors ready to accept tasks. This setup could easily be 
extended to other container orchestration systems like AWS ECS as well. In 
fact, we could even run each task as an AWS Lambda function if we wanted to.
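   
   To make the idea concrete, here is a rough, illustrative sketch (not a 
proposal for the actual implementation) of what launching one run-to-completion 
pod per task could look like from the scheduler's side, assuming the kube-rs 
crate; the image name, container args, and resource requests below are all 
hypothetical:

```rust
// Illustrative sketch only: the scheduler launches one short-lived pod per task.
use k8s_openapi::api::core::v1::Pod;
use kube::{
    api::{Api, PostParams},
    Client,
};
use serde_json::json;

async fn launch_task_pod(
    task_id: &str,
    scheduler_url: &str,
) -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::namespaced(client, "ballista");

    // Run-to-completion pod: it executes exactly one task, reports status back
    // to the scheduler, then exits; k8s handles placement and restarts nothing.
    let pod: Pod = serde_json::from_value(json!({
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": { "name": format!("ballista-task-{}", task_id) },
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "executor",
                "image": "ballista-executor:latest",
                "args": ["--task-id", task_id, "--scheduler-url", scheduler_url],
                "resources": { "requests": { "cpu": "1", "memory": "1Gi" } }
            }]
        }
    }))?;

    pods.create(&PostParams::default(), &pod).await?;
    Ok(())
}
```

   The same pattern would map fairly directly onto ECS RunTask or a Lambda 
invocation, since in all three cases the unit of work is "launch one 
container/function per task and let the platform place it."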
   
   > Completely agree. I want to add that since we are a community here, it is 
great to share different viewpoints and reach a reasonable decision that we 
commit to improving and maintaining going forward. I am hoping that whichever 
way we choose to move forward, we will make it a competitive (hopefully better) 
platform than Spark.
   
   Perhaps, as @yahoNanJing said earlier, we can keep both modes for now and 
make the final decision once we are able to gather more data from running 
different real-world workloads on larger clusters? Technical decisions are 
usually much easier to make when they are backed by real data :)

