realno commented on pull request #1560:
URL:
https://github.com/apache/arrow-datafusion/pull/1560#issuecomment-1016790718
@yahoNanJing and @houqp, thanks for the great discussion, and @houqp, thanks
for sharing the old issue. It was a really good read, and I'm happy to see that
some of this was discussed before.
I definitely agree that there is no clear-cut answer and that each mode has
its pros and cons. My preference is a simpler design that works well enough in
practice. The reason is closely tied to our use case: we are planning to deploy
at least thousands of clusters, allocating resources dynamically, so the key
criteria are stability, error handling and recovery, and low management
overhead per cluster. That is why I favor simplifying the communication model
and the state management, since both have a big impact on those goals.
> Because there's no master for the cluster topology, it's not so easy to do
global optimization for the task assignment, like local shuffle, etc.
I am not sure there will be a difference in optimization. I think we can
achieve the same thing; the difference is only the communication model.
> It's not so efficient to fetch the next task from the scheduler. In the current implementation, it has to scan all of the tasks to check whether each one can be scheduled. In the design, we have also attached our benchmark results. As the scheduler schedules more and more tasks, the performance of the poll/pull model degrades. However, the push model can do job-level task checking, and its scheduling performance does not degrade.
> CPU waste, and it may add 100ms of latency, which is not good for interactive queries that may need to finish within 1s.
I think these two points are artifacts of the current implementation, and
there are other ways to improve or fix them. For example, the scheduler could
split the tasks based on partition and readiness instead of scanning the whole
list, and we could move to a model that doesn't involve a sleep loop.
> My current thinking on this is the scheduler state complexity introduced
through the push model might be needed in the long run for optimal task
scheduling when we can take more resource related factors into account instead
of just available task slots. Having a global view of the system generally
yields better task placements.
This is a great point. My current thinking is to leverage modern
orchestration technologies such as k8s to help. That is, leave it to k8s to
figure out resource allocation on the physical cluster; if we set a few
universal slot sizes for executors, we don't need to deal with hardware-level
resource allocation ourselves, which again keeps things simple. I know there
will be use cases that still prefer the traditional way of managing a single
physical cluster, and we do need to evaluate the impact on that model. That
said, the main reason I am thinking about moving away from Spark (and looking
at DF as a replacement) is its deployment model and the operational cost that
comes with it.
> On the scaling front, even though the push-based model incurs more state management overhead on the scheduler side, it has its own scaling strengths as well. For example, active message exchanges between the scheduler and executors scale much better with a larger executor pool, because heartbeat messages are much easier to process than task poll requests. It is also easier for the scheduler to reduce its own load by proactively scheduling fewer tasks vs. rate-limiting executor poll requests in a polling model.
Agreed. I think there are different ways to limit the polling rate,
especially if something like k8s comes into play. If the executors are more or
less universally sized and dynamically managed, we can potentially use
auto-scaling techniques to control it. That is, instead of all executors
polling the scheduler constantly, only new executors and ones that have
finished their work need to poll.
> So I think there is no clear cut here. Perhaps another angle would be a hybrid model where we still use the base model as the foundation, but try to move as much state management logic as possible into the executors to reduce the overhead on the scheduler side.
Completely agree. I want to add that since we are a community here, it is
great to share different opinions and then make a reasonable decision that we
commit to improving and maintaining going forward. Whichever way we choose, I
hope we make it a competitive (and hopefully better) platform than Spark.