zwangsheng opened a new pull request, #36358:
URL: https://github.com/apache/spark/pull/36358
### What changes were proposed in this pull request?
Add Inter-Pod anti-affinity to Executor Pod.
### Why are the changes needed?
When Spark runs on Kubernetes, executor pods can pile up on a few nodes under
certain conditions (uneven resource allocation in Kubernetes, high load on some
nodes and low load on others), causing shuffle data skew. This leads to
performance bottlenecks or application failures, such as shuffle fetch timeouts
and refused connections once the connection limit is reached.
### How does this PR help?
Add an anti-affinity term to executor pods to provide simple anti-affinity
scheduling at application granularity.
The resulting executor pod YAML fragment:
```yaml
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: spark-app-selector
          operator: In
          values:
          - spark-test   # appId
      topologyKey: kubernetes.io/hostname
```
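For illustration, a term like the one above could be constructed
programmatically once the application ID is known. This is a minimal sketch
(the helper name and dict-based structure are illustrative, not the PR's actual
Scala code), mirroring the YAML field for field:

```python
def anti_affinity_term(app_id: str, weight: int = 100) -> dict:
    """Build a preferred pod anti-affinity stanza keyed on the
    spark-app-selector label, mirroring the YAML fragment above."""
    return {
        "podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,
                    "podAffinityTerm": {
                        "labelSelector": {
                            "matchExpressions": [
                                {
                                    "key": "spark-app-selector",
                                    "operator": "In",
                                    "values": [app_id],
                                }
                            ]
                        },
                        # Spread across distinct hosts.
                        "topologyKey": "kubernetes.io/hostname",
                    },
                }
            ]
        }
    }

affinity = anti_affinity_term("spark-test")
```

Because the term is keyed on `spark-app-selector`, the anti-affinity only
applies between executors of the same application, not across applications.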
### Why should we use this?
The functionality in this PR was tested on a cluster with three Kubernetes
nodes (node-1, node-2, node-3).
Whether cluster resources are plentiful or scarce, the scheduling outcome is
the same with the feature enabled or disabled: Kubernetes assigns pods to
low-load nodes based on global resources, and when only one node has low load,
Kubernetes schedules all executor pods onto that node.
Here are the experimental results (each table shows which node each executor
pod was scheduled to):
Experiment 1:
All three nodes carry a small, equal load.
Enable Feature:
| round | exec-1 | exec-2 | exec-3 | exec-4 | exec-5 | exec-6 | exec-7 |
| ----- | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| 1 | node-1 | node-2 | node-3 | node-2 | node-1 | node-3 | node-1 |
| 2 | node-1 | node-2 | node-3 | node-2 | node-1 | node-3 | node-1 |
| 3 | node-1 | node-2 | node-3 | node-2 | node-1 | node-3 | node-1 |
| 4 | node-1 | node-2 | node-3 | node-2 | node-1 | node-3 | node-1 |
Disable Feature:
| round | exec-1 | exec-2 | exec-3 | exec-4 | exec-5 | exec-6 | exec-7 |
| ----- | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| 1 | node-1 | node-2 | node-3 | node-2 | node-1 | node-3 | node-1 |
| 2 | node-1 | node-2 | node-3 | node-2 | node-1 | node-3 | node-1 |
| 3 | node-1 | node-2 | node-3 | node-2 | node-1 | node-3 | node-1 |
| 4 | node-1 | node-2 | node-3 | node-2 | node-1 | node-3 | node-1 |
Experiment 2:
Node 1 has no idle resources; Node 2 and Node 3 carry a small, equal load.
Enable Feature:
| round | exec-1 | exec-2 | exec-3 | exec-4 |
| ----- | ------ | ------ | ------ | ------ |
| 1 | node-2 | node-3 | node-2 | node-3 |
| 2 | node-2 | node-3 | node-2 | node-3 |
| 3 | node-2 | node-3 | node-2 | node-3 |
| 4 | node-2 | node-3 | node-2 | node-3 |
Disable Feature:
| round | exec-1 | exec-2 | exec-3 | exec-4 |
| ----- | ------ | ------ | ------ | ------ |
| 1 | node-2 | node-3 | node-2 | node-3 |
| 2 | node-2 | node-3 | node-2 | node-3 |
| 3 | node-2 | node-3 | node-2 | node-3 |
| 4 | node-2 | node-3 | node-2 | node-3 |
---
If some nodes are busy or the load is unbalanced, leaving the feature off means
that Kubernetes keeps picking the node with the lowest load and placing pods
there until it is no longer the lowest-loaded node. With the feature enabled,
pods are first allocated to low-load nodes and then, following the
application-granularity anti-affinity, to other low-load nodes that do not
already host executors of this application.
Experiment 3:
Node 1 has no idle resources; Node 2 has a higher load than Node 3.
Enable Feature:
| round | exec-1 | exec-2 | exec-3 | exec-4 |
| ----- | ------ | ------ | ------ | ------ |
| 1 | node-3 | node-2 | node-3 | node-2 |
| 2 | node-3 | node-2 | node-3 | node-2 |
| 3 | node-3 | node-2 | node-3 | node-2 |
Disable Feature:
| round | exec-1 | exec-2 | exec-3 | exec-4 |
| ----- | ------ | ------ | ------ | ------ |
| 1 | node-3 | node-3 | node-3 | node-3 |
| 2 | node-3 | node-3 | node-3 | node-3 |
| 3 | node-3 | node-3 | node-3 | node-3 |
---
The experimental results above show that under normal circumstances the
feature makes no difference compared with leaving it off; in extreme cases,
enabling it alleviates the accumulation of pods on a single node and prevents
performance bottlenecks to a certain extent.
### Will this make any difference?
This PR adds a `podAntiAffinity` term with
`preferredDuringSchedulingIgnoredDuringExecution` to executor pods, providing
application-granularity preferred anti-affinity on top of Kubernetes' global
resource-oriented scheduling (and any other customized scheduling policies).
We hope Kubernetes can spread executor pods at the application level as much
as possible while adhering to its original scheduling rules.
### Why choose this?
> Why choose inter-pod affinity and anti-affinity?
We are concerned with how pods gather or spread relative to each other, which
is exactly what inter-pod (anti-)affinity expresses.
> Why choose `preferredDuringSchedulingIgnoredDuringExecution`?
According to the Kubernetes documentation [Assigning Pods to
Nodes](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity),
Kubernetes provides `requiredDuringSchedulingIgnoredDuringExecution` and
`preferredDuringSchedulingIgnoredDuringExecution`.
- `requiredDuringSchedulingIgnoredDuringExecution`: The scheduler can't
schedule the Pod unless the rule is met. This functions like `nodeSelector`,
but with a more expressive syntax.
- `preferredDuringSchedulingIgnoredDuringExecution`: The scheduler tries
to find a node that meets the rule. If a matching node is not available, the
scheduler still schedules the Pod.
It's not hard to see that `preferredDuringSchedulingIgnoredDuringExecution`
better fits the problem raised in this PR. We want Kubernetes to spread
executor pods as much as possible, but in the worst case, where only one node
has resources left, we still need to be able to schedule executor pods onto
that one node.
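For contrast, the hard variant of the same rule would look like the fragment
below (note that required terms carry no `weight` and are plain terms rather
than `podAffinityTerm` wrappers). Using it would block scheduling entirely once
every node already runs an executor of the application, which is exactly the
worst case the PR needs to survive:

```yaml
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
      - key: spark-app-selector
        operator: In
        values:
        - spark-test   # appId
    topologyKey: kubernetes.io/hostname
```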
> Why would we want to influence Kubernetes scheduling from Spark code?
We need to apply anti-affinity to executor pods at application granularity, so
we add the anti-affinity term after the applicationId is generated and before
executors are allocated.
Because the applicationId is not known in advance, we cannot pin it in
`pod-template.yaml`.
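A rough sketch of that ordering constraint: the term can only be attached once
the applicationId exists, so the patching step must sit between app-ID
generation and executor pod creation. The helper below is illustrative (plain
dicts, not Spark's actual builder classes) and merges into any affinity the
pod already carries rather than overwriting it:

```python
def add_app_anti_affinity(pod: dict, app_id: str) -> dict:
    """Attach a preferred anti-affinity term against pods of the same
    Spark application, preserving any affinity already on the pod."""
    affinity = pod.setdefault("affinity", {})
    anti = affinity.setdefault("podAntiAffinity", {})
    terms = anti.setdefault(
        "preferredDuringSchedulingIgnoredDuringExecution", [])
    terms.append({
        "weight": 100,
        "podAffinityTerm": {
            "labelSelector": {"matchExpressions": [{
                "key": "spark-app-selector",
                "operator": "In",
                "values": [app_id],  # only known after submission
            }]},
            "topologyKey": "kubernetes.io/hostname",
        },
    })
    return pod

# The appId is generated at runtime, which is why pod-template.yaml
# cannot carry this term statically.
pod = add_app_anti_affinity({"containers": []}, "spark-app-123")
```

Merging via `setdefault` means a user-supplied node affinity from a pod
template would survive the patch.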
> What are the negative effects?
From the perspective of the normal scheduling policy, there is no significant
negative impact: experiments and reasoning show the new feature does not
change existing scheduling behavior.
However, spreading executor pods does, to some extent, reduce the locality of
shuffle data; that is, the number of blocks fetched from localhost decreases.
Judging from the experiments and results so far, this trade-off is worthwhile.
If we do not spread executors, they will gather under certain circumstances,
leading to shuffle data skew and significantly degrading task performance.
More seriously, once the connection limit is reached, executor block fetches
fail, stages fail, and even the whole application can fail.
Spreading executors increases network connections and traffic at the cluster
level, but the extra consumption is scattered across the cluster rather than
concentrated on a few nodes, so overall stability is acceptable.
> Why does the placement of Executor Pods affect shuffle data skew?
The number of executor pods on a node is related to the amount of shuffle data
on that node, so the shuffle skew problem can currently be alleviated by
controlling how executor pods are distributed.
For now this is only a first, anti-affinity-based step; later the spreading
may be driven directly by shuffle volume.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
Local
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]