jameschen1519 opened a new issue #90: Completed (and sometimes deleted) pods are still marked as "Running" and consume resources
URL: https://github.com/apache/incubator-yunikorn-core/issues/90

When running Spark jobs via spark-submit on Kubernetes with the YuniKorn scheduler applied to a driver/executor pairing, the scheduler does not mark these jobs as complete upon termination; they remain in the "Running" state. This holds resources in the job queues to which the drivers/executors are assigned, eventually resulting in resource starvation until the drivers are manually deleted. Unfortunately, deleting these pods does not necessarily free the resources either: after enough cycles of starting and stopping YuniKorn-scheduled Spark pods, all queue resources are reported as consumed even when no YuniKorn-scheduled Spark pods remain.

(It is also worth noting that the driver and executor pods enter the same queue regardless of what the executor podTemplateFile specifies. We are unsure whether this is a feature or a bug.)

Reproduction steps are listed below. Please let me know if any clarification is needed; thanks.
~~~~~~~~~~~~~~~~~~~~~~~~
Environment setup:

For setting up the YuniKorn pods (used the Helm chart, but also tried setting it up manually):

`helm install ./yunikorn --namespace test ./yunikorn --generate-name`

queues.yaml snippet to be put into YuniKorn with `kubectl -n test edit configmap yunikorn-scheduler`:
```
queues.yaml: |
  partitions:
    - name: default
      placementrules:
        - name: provided
          create: false
      queues:
        - name: root
          submitacl: '*'
          queues:
            - name: driver
              resources:
                guaranteed:
                  memory: 10000
                  vcore: 1000
                max:
                  memory: 40000
                  vcore: 9000
            - name: executors
              resources:
                guaranteed:
                  memory: 1000
                  vcore: 1000
                max:
                  memory: 15000
                  vcore: 6000
```

Command used:
```
spark-submit \
  --master k8s://https://<YOUR K8S IP>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=<YOUR SPARK IMAGE> \
  --conf spark.kubernetes.namespace=test \
  --conf spark.driver.extraClassPath=/opt/hadoop/etc/hadoop \
  --conf spark.ssl.enabled=false \
  --conf spark.authenticate=false \
  --conf spark.kubernetes.driver.podTemplateFile=/driver.yaml \
  --conf spark.kubernetes.executor.podTemplateFile=/executor.yaml \
  --conf spark.network.crypto.enabled=false \
  <YOUR SPARK JAR FILE; E.G. hdfs://<YOUR HDFS URL>/sparkexample.jar> &
```

Under driver.yaml:
```
apiVersion: v1
kind: Pod
metadata:
  labels:
    spark-app-id: spark-00001
    queue: root.driver
spec:
  schedulerName: yunikorn
```

Under executor.yaml:
```
apiVersion: v1
kind: Pod
metadata:
  labels:
    spark-app-id: spark-00001
    queue: root.executors
spec:
  schedulerName: yunikorn
```
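For reference, the starvation described above follows directly from the queue limits in the queues.yaml snippet. The sketch below is illustrative only: the per-driver memory figure of 10000 is an assumption (borrowed from the queue's `guaranteed` value); actual leakage per run depends on the pod's real resource requests.

```python
# Illustrative arithmetic for the root.driver queue starvation.
# Values taken from the queues.yaml snippet above; the per-driver
# memory held by a stuck "Running" application is a hypothetical
# assumption, not a measured value.

DRIVER_QUEUE_MAX_MEM = 40000   # root.driver "max" memory from queues.yaml
LEAKED_DRIVER_MEM = 10000      # assumed memory each completed-but-"Running" driver keeps

def runs_until_starved(queue_max: int, leaked_per_run: int) -> int:
    """How many stuck drivers the queue can accumulate before no
    further driver fits under the queue's max memory."""
    return queue_max // leaked_per_run

print(runs_until_starved(DRIVER_QUEUE_MAX_MEM, LEAKED_DRIVER_MEM))
```

Under these assumed numbers, only a handful of spark-submit cycles are needed before the root.driver queue reports itself full and new drivers can no longer be scheduled.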
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
