realno edited a comment on pull request #1560:
URL: https://github.com/apache/arrow-datafusion/pull/1560#issuecomment-1016790718


   @yahoNanJing and @houqp, thanks for the great discussion, and @houqp, thanks 
for sharing the old issue. It was a really good read, and I'm happy to see these 
topics had been discussed before. 
   
   I definitely agree that there is no clear cut and that each mode has its pros 
and cons. My preference is a simpler design that works well enough in practice. 
The reasoning is closely tied to our use case: we are planning to deploy at 
least thousands of clusters, allocating resources dynamically, so the key 
criteria are stability, error handling and recovery, and low management 
overhead per cluster. That is why I lean toward simplifying the communication 
model and the state, since both have a big impact on those goals.
   
   > Because there's no master for the cluster topology, it's not so easy to do 
global optimization for the task assignment, like local shuffle, etc.
   
   I am not sure there would be a difference in optimization. I think we can 
achieve the same results; the difference is the communication model.
   
   > It's not so efficient to fetch next task from the scheduler. Like current 
implement, it has to scan all of the tasks to check whether it's able to be 
scheduled. In the design, we also have attached our benchmark results. As the 
scheduler runs with scheduling more and more tasks, the performance of the 
poll/pull model will be downgraded. However, for the push model, it can achieve 
job level task checking and the task scheduling performance will not be 
downgraded.
   > CPU waste and may need 100ms latency which is not good for interactive 
queries which may need to be finished within 1s.
   
   I think these two points stem from the current implementation, and there are 
other ways to improve or fix them. For example, have the scheduler split tasks 
by partition and readiness instead of scanning the whole list, and change to a 
model that doesn't involve a sleep loop.
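   To make the idea concrete, here is a minimal sketch (the type and method 
names are invented for illustration, not the Ballista API) of indexing tasks by 
readiness, so answering an executor poll is O(1) and completing a stage only 
touches its direct dependents, rather than rescanning every task:

```rust
use std::collections::{HashMap, VecDeque};

#[derive(Clone, Debug, PartialEq)]
struct Task {
    job_id: String,
    stage_id: usize,
    partition: usize,
}

#[derive(Default)]
struct ReadyQueueScheduler {
    // Tasks whose inputs are all available; popped directly on a poll.
    ready: VecDeque<Task>,
    // Tasks blocked on an upstream (job, stage), keyed by what they wait on.
    waiting: HashMap<(String, usize), Vec<Task>>,
}

impl ReadyQueueScheduler {
    fn submit_ready(&mut self, task: Task) {
        self.ready.push_back(task);
    }

    fn submit_waiting(&mut self, upstream: (String, usize), task: Task) {
        self.waiting.entry(upstream).or_default().push(task);
    }

    // Called when a stage finishes: promote its dependents without
    // touching any unrelated task in the system.
    fn stage_completed(&mut self, upstream: (String, usize)) {
        if let Some(tasks) = self.waiting.remove(&upstream) {
            self.ready.extend(tasks);
        }
    }

    // O(1) answer to an executor's poll request, independent of the
    // total number of queued tasks.
    fn next_task(&mut self) -> Option<Task> {
        self.ready.pop_front()
    }
}
```

   With this shape the per-poll cost no longer grows with the number of 
outstanding tasks, which addresses the "scan all of the tasks" concern without 
changing the pull-based communication model.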
   
   > My current thinking on this is the scheduler state complexity introduced 
through the push model might be needed in the long run for optimal task 
scheduling when we can take more resource related factors into account instead 
of just available task slots. Having a global view of the system generally 
yields better task placements.
   
   This is a great point. My current thinking is to leverage modern 
orchestration technologies such as k8s to help. That is, leave it to k8s to 
figure out resource allocation on the physical cluster; we can set a few 
universal slot sizes for executors so we don't need to deal with hardware-level 
resource allocation, again for simplicity. I know there will be use cases that 
still prefer the traditional way of managing a single physical cluster, and we 
do need to evaluate the impact on that model. That said, the main reason I am 
thinking about moving away from Spark (and looking at DF as a replacement) is 
its deployment model and the operational cost that comes with it.
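   A tiny sketch of the "universal slot sizes" idea: a fixed menu of executor 
shapes, so the scheduler never reasons about raw hardware and k8s handles the 
bin-packing. The size names and resource numbers below are invented purely for 
illustration:

```rust
// Hypothetical fixed menu of executor shapes; each maps to the task
// slots it exposes to the scheduler and the (cpu cores, memory GiB)
// it would request from k8s. The actual values are assumptions.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ExecutorSize {
    Small,
    Medium,
    Large,
}

impl ExecutorSize {
    // (task slots, cpu cores, memory in GiB) requested from k8s.
    fn resources(self) -> (usize, usize, usize) {
        match self {
            ExecutorSize::Small => (2, 2, 8),
            ExecutorSize::Medium => (4, 4, 16),
            ExecutorSize::Large => (8, 8, 32),
        }
    }
}
```

   The scheduler then only needs to count free slots per size class, and the 
hardware-level packing of executors onto nodes stays entirely inside k8s.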
   
   > On the scaling front, even though push based model incurs more state 
management overhead on the scheduler side, it has its own scaling strength as 
well. For example, active message exchanges between scheduler and executors 
scale much better with larger executor pool size because heart beat messages 
are much easier to process compared to task poll requests. It is also easier 
for the scheduler to reduce its own load by scheduling proactively schedule 
less tasks v.s. rate limiting executor poll requests in a polling model.
   
   Agreed. I think there are different ways to limit the polling rate, 
especially if something like k8s comes into play. If the executors are more or 
less universally sized and dynamically managed, we can potentially use 
auto-scaling techniques to control it. That is, instead of all executors 
polling the scheduler constantly, only new executors and ones that have 
finished their work need to poll.
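   The "only poll when something frees up" idea above can be sketched as a slot 
pool on the executor side, where finishing a task is the event that triggers 
the next poll instead of a fixed sleep interval (the names here are 
illustrative, not the Ballista API):

```rust
// Hypothetical slot pool gating an executor's poll requests: a fully
// busy executor generates no scheduler traffic at all, and a poll is
// issued only when capacity actually opens up.
struct TaskSlotPool {
    total_slots: usize,
    running: usize,
}

impl TaskSlotPool {
    fn new(total_slots: usize) -> Self {
        Self { total_slots, running: 0 }
    }

    // True only while the executor has spare capacity; this condition
    // gates every poll to the scheduler.
    fn should_poll(&self) -> bool {
        self.running < self.total_slots
    }

    fn task_started(&mut self) {
        assert!(self.running < self.total_slots, "no free slot");
        self.running += 1;
    }

    // Completing a task frees a slot and returns whether a new poll
    // should be issued, replacing the sleep loop with an event.
    fn task_finished(&mut self) -> bool {
        self.running -= 1;
        self.should_poll()
    }
}
```

   Under this scheme the aggregate poll rate scales with task completions 
rather than with the number of executors, which is the property the auto-scaling 
argument relies on.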
   
    
   > So I think there is no clear cut here. Perhaps another angle would be a 
hybrid model where we still use the base model as the foundation, but try to 
move as much state management logic into the executors to reduce the overhead 
on the scheduler side.
   
   Completely agree. I want to add that since we are a community here, it is 
great to share different viewpoints and make a reasonable decision that we 
commit to improving and maintaining going forward. I am hoping that whichever 
way we choose to move forward, we will make this a competing (hopefully better) 
platform than Spark. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
