zwangsheng opened a new pull request, #36358:
URL: https://github.com/apache/spark/pull/36358
### What changes were proposed in this pull request?
Add Inter-Pod anti-affinity to Executor Pod.
### Why are the changes needed?
When Spark runs on Kubernetes, executor pods can pile up on a few nodes under
certain conditions (uneven resource allocation in Kubernetes, high load on some
nodes and low load on others), causing shuffle data skew. This leads to
performance bottlenecks or application failures, such as shuffle fetch timeouts
and refused connections once the connection limit is reached.
### How does this PR help?
Add an anti-affinity term to executor pods to provide simple anti-affinity
scheduling at application granularity.
The resulting executor pod YAML fragment:
```yaml
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: spark-app-selector
          operator: In
          values:
          - spark-test   # appId
      topologyKey: kubernetes.io/hostname
```
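For illustration, a term like the one above could be constructed
programmatically once the application ID is known. This is a minimal sketch
(the helper name and dict-based structure are illustrative, not the PR's actual
Scala code), mirroring the YAML field for field:

```python
def anti_affinity_term(app_id: str, weight: int = 100) -> dict:
    """Build a preferred pod anti-affinity stanza keyed on the
    spark-app-selector label, mirroring the YAML fragment above."""
    return {
        "podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,
                    "podAffinityTerm": {
                        "labelSelector": {
                            "matchExpressions": [
                                {
                                    "key": "spark-app-selector",
                                    "operator": "In",
                                    "values": [app_id],
                                }
                            ]
                        },
                        # Spread across distinct hosts.
                        "topologyKey": "kubernetes.io/hostname",
                    },
                }
            ]
        }
    }

affinity = anti_affinity_term("spark-test")
```

Because the term is keyed on `spark-app-selector`, the anti-affinity only
applies between executors of the same application, not across applications.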
### Why should we use this?
The functionality in this PR was tested on a cluster with three Kubernetes
nodes (node-1, node-2, node-3).
Whether cluster resources are plentiful or scarce, the scheduling outcome is
the same with the feature enabled or disabled: Kubernetes assigns pods to
low-load nodes based on global resources, and when only one node has low load,
Kubernetes schedules all executor pods onto that node.
Here are the experimental results (each table shows which node each executor
pod was scheduled to):
Experiment 1:
All three nodes carry a small, equal load.
Enable Feature:
| round | exec-1 | exec-2 | exec-3 | exec-4 | exec-5 | exec-6 | exec-7 |
| ----- | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| 1 | node-1 | node-2 | node-3 | node-2 | node-1 | node-3 | node-1 |
| 2 | node-1 | node-2 | node-3 | node-2 | node-1 | node-3 | node-1 |
| 3 | node-1 | node-2 | node-3 | node-2 | node-1 | node-3 | node-1 |
| 4 | node-1 | node-2 | node-3 | node-2 | node-1 | node-3 | node-1 |
Disable Feature:
| round | exec-1 | exec-2 | exec-3 | exec-4 | exec-5 | exec-6 | exec-7 |
| ----- | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| 1 | node-1 | node-2 | node-3 | node-2 | node-1 | node-3 | node-1 |
| 2 | node-1 | node-2 | node-3 | node-2 | node-1 | node-3 | node-1 |
| 3 | node-1 | node-2 | node-3 | node-2 | node-1 | node-3 | node-1 |
| 4 | node-1 | node-2 | node-3 | node-2 | node-1 | node-3 | node-1 |
Experiment 2:
Node 1 has no idle resources; Node 2 and Node 3 carry a small, equal load.
Enable Feature:
| round | exec-1 | exec-2 | exec-3 | exec-4 |
| ----- | ------ | ------ | ------ | ------ |
| 1 | node-2 | node-3 | node-2 | node-3 |
| 2 | node-2 | node-3 | node-2 | node-3 |
| 3 | node-2 | node-3 | node-2 | node-3 |
| 4 | node-2 | node-3 | node-2 | node-3 |
Disable Feature:
| round | exec-1 | exec-2 | exec-3 | exec-4 |
| ----- | ------ | ------ | ------ | ------ |
| 1 | node-2 | node-3 | node-2 | node-3 |
| 2 | node-2 | node-3 | node-2 | node-3 |
| 3 | node-2 | node-3 | node-2 | node-3 |
| 4 | node-2 | node-3 | node-2 | node-3 |
---
If some nodes are busy or the load is unbalanced, leaving the feature off means
that Kubernetes keeps picking the node with the lowest load and placing pods
there until it is no longer the lowest-loaded node. With the feature enabled,
pods are first allocated to low-load nodes and then, following the
application-granularity anti-affinity, to other low-load nodes that do not
already host executors of this application.
Experiment 3:
Node 1 has no idle resources; Node 2 has a higher load than Node 3.
Enable Feature:
| round | exec-1 | exec-2 | exec-3 | exec-4 |
| ----- | ------ | ------ | ------ | ------ |
| 1 | node-3 | node-2 | node-3 | node-2 |
| 2 | node-3 | node-2 | node-3 | node-2 |
| 3 | node-3 | node-2 | node-3 | node-2 |
Disable Feature:
| round | exec-1 | exec-2 | exec-3 | exec-4 |
| ----- | ------ | ------ | ------ | ------ |
| 1 | node-3 | node-3 | node-3 | node-3 |
| 2 | node-3 | node-3 | node-3 | node-3 |
| 3 | node-3 | node-3 | node-3 | node-3 |
---
The experimental results above show that under normal circumstances the
feature makes no difference compared with leaving it off; in extreme cases,
enabling it alleviates the accumulation of pods on a single node and prevents
performance bottlenecks to a certain extent.
### Will this make any difference?
This PR adds a `podAntiAffinity` term with
`preferredDuringSchedulingIgnoredDuringExecution` to executor pods, providing
application-granularity preferred anti-affinity on top of Kubernetes' global
resource-oriented scheduling (and any other customized scheduling policies).
We hope Kubernetes can spread executor pods at the application level as much
as possible while adhering to its original scheduling rules.
### Why choose this?
> Why choose inter-pod affinity and anti-affinity?
We are concerned with how pods gather or spread relative to each other, which
is exactly what inter-pod (anti-)affinity expresses.
> Why choose `preferredDuringSchedulingIgnoredDuringExecution`?
According to the Kubernetes documentation [Assigning Pods to
Nodes](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity),
Kubernetes provides `requiredDuringSchedulingIgnoredDuringExecution` and
`preferredDuringSchedulingIgnoredDuringExecution`.
- `requiredDuringSchedulingIgnoredDuringExecution`: The scheduler can't
schedule the Pod unless the rule is met. This functions like `nodeSelector`,
but with a more expressive syntax.
- `preferredDuringSchedulingIgnoredDuringExecution`: The scheduler tries
to find a node that meets the rule. If a matching node is not available, the
scheduler still schedules the Pod.
It's not hard to see that `preferredDuringSchedulingIgnoredDuringExecution`
better fits the problem raised in this PR. We want Kubernetes to spread
executor pods as much as possible, but in the worst case, where only one node
has resources left, we still need to be able to schedule executor pods onto
that one node.
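For contrast, the hard variant of the same rule would look like the fragment
below (note that required terms carry no `weight` and are plain terms rather
than `podAffinityTerm` wrappers). Using it would block scheduling entirely once
every node already runs an executor of the application, which is exactly the
worst case the PR needs to survive:

```yaml
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
      - key: spark-app-selector
        operator: In
        values:
        - spark-test   # appId
    topologyKey: kubernetes.io/hostname
```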
> Why would we want to influence Kubernetes scheduling from Spark code?
We need to apply anti-affinity to executor pods at application granularity, so
we add the anti-affinity term after the applicationId is generated and before
executors are allocated.
Because the applicationId is not known in advance, we cannot pin it in
`pod-template.yaml`.
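A rough sketch of that ordering constraint: the term can only be attached once
the applicationId exists, so the patching step must sit between app-ID
generation and executor pod creation. The helper below is illustrative (plain
dicts, not Spark's actual builder classes) and merges into any affinity the
pod already carries rather than overwriting it:

```python
def add_app_anti_affinity(pod: dict, app_id: str) -> dict:
    """Attach a preferred anti-affinity term against pods of the same
    Spark application, preserving any affinity already on the pod."""
    affinity = pod.setdefault("affinity", {})
    anti = affinity.setdefault("podAntiAffinity", {})
    terms = anti.setdefault(
        "preferredDuringSchedulingIgnoredDuringExecution", [])
    terms.append({
        "weight": 100,
        "podAffinityTerm": {
            "labelSelector": {"matchExpressions": [{
                "key": "spark-app-selector",
                "operator": "In",
                "values": [app_id],  # only known after submission
            }]},
            "topologyKey": "kubernetes.io/hostname",
        },
    })
    return pod

# The appId is generated at runtime, which is why pod-template.yaml
# cannot carry this term statically.
pod = add_app_anti_affinity({"containers": []}, "spark-app-123")
```

Merging via `setdefault` means a user-supplied node affinity from a pod
template would survive the patch.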
> What are the negative effects?
From the perspective of the normal scheduling policy, there is no significant
negative impact: experiments and reasoning show the new feature does not
change existing scheduling behavior.
However, spreading executor pods does, to some extent, reduce the locality of
shuffle data; that is, the number of blocks fetched from localhost decreases.
Judging from the experiments and results so far, this trade-off is worthwhile.
If we do not spread executors, they will gather under certain circumstances,
leading to shuffle data skew and significantly degrading task performance.
More seriously, once the connection limit is reached, executor block fetches
fail, stages fail, and even the whole application can fail.
Spreading executors increases network connections and traffic at the cluster
level, but the extra consumption is scattered across the cluster rather than
concentrated on a few nodes, so overall stability is acceptable.
> Why does the placement of Executor Pods affect shuffle data skew?
The number of executor pods on a node is related to the amount of shuffle data
on that node, so the shuffle skew problem can currently be alleviated by
controlling how executor pods are distributed.
For now this is only a first, anti-affinity-based step; later the spreading
may be driven directly by shuffle volume.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
Local
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]