Romain Manni-Bucau created SPARK-55536:
------------------------------------------
Summary: Reusable PVC are not usable
Key: SPARK-55536
URL: https://issues.apache.org/jira/browse/SPARK-55536
Project: Spark
Issue Type: Bug
Components: Kubernetes
Affects Versions: 4.0.2
Reporter: Romain Manni-Bucau
Side note: tested on 4.0.1 and 4.0.2. I didn't test 4.1.1, but from the code on
main I assume it is affected as well.
Background:
https://issues.apache.org/jira/browse/SPARK-35416?focusedCommentId=18058245&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-18058245
Long story short: the same PVC can be assigned twice to different executors
(and even to the driver if the size limit and storage class are the same),
which leaves the pods unschedulable.
I assume the in-memory copy of the state can be stale when the events lag
behind the actual state.
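To illustrate the kind of lag I have in mind, here is a minimal sketch. It is
not Spark's code: `CachedPvcView`, `pick` and `apply_event` are hypothetical
names, and the model assumes the view is only updated when an event arrives.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CachedPvcView:
    """Hypothetical in-memory view of the PVCs believed to be reusable."""
    free: List[str]

    def pick(self) -> Optional[str]:
        # Trusts the cached list only; nothing is re-checked against the cluster.
        return self.free[0] if self.free else None

    def apply_event(self, pvc: str) -> None:
        # The view changes only once the (possibly delayed) event is processed.
        if pvc in self.free:
            self.free.remove(pvc)


view = CachedPvcView(free=["pvc-driver"])
first = view.pick()    # picked for one executor
second = view.pick()   # event not processed yet: same PVC picked again
view.apply_event("pvc-driver")
third = view.pick()    # only now does the view know the PVC is taken
```

In this toy model `first` and `second` are the same claim, which is the double
assignment described above.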
Concretely, I'm using the Thrift server (the Spark one) with a Spark
application. 95% of the time the driver PVC is assigned to the "next" executor,
so the application never runs because the pod is not schedulable. I also got
some cases where 2 executors had the same PVC.
I suspect these are two different issues:
# the one about a wrong state assumption
# the driver PVC handling, which is not specific to the driver
My proposal would be to always fetch the PVCs and check their status and, if
one is bound, just use another one. In addition, for executors that are
submitted but not yet scheduled, also check the PVC status, if any, to recover
from the remaining edge cases.
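A rough sketch of that proposal, under stated assumptions: `PvcStatus`,
`used_by`, `pick_reusable_pvc` and `conflicting_pending_pods` are hypothetical
names, the `fetch` callable stands in for a live lookup against the API server,
and "bound" is modelled simply as "already held by some pod". How that maps to
the real Kubernetes objects is left open here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass(frozen=True)
class PvcStatus:
    name: str
    used_by: Optional[str]  # hypothetical: pod currently holding the claim


# Hypothetical stand-in for a live lookup of the PVC status.
FetchStatus = Callable[[str], Optional[PvcStatus]]


def pick_reusable_pvc(candidates: Iterable[str], fetch: FetchStatus) -> Optional[str]:
    """Return the first candidate whose freshly fetched status shows no holder."""
    for name in candidates:
        status = fetch(name)
        if status is not None and status.used_by is None:
            return name
    return None


def conflicting_pending_pods(pending: Dict[str, str], fetch: FetchStatus) -> List[str]:
    """List submitted-but-unscheduled pods whose PVC is held by another pod."""
    conflicts = []
    for pod, pvc in pending.items():
        status = fetch(pvc)
        if status is not None and status.used_by not in (None, pod):
            conflicts.append(pod)
    return conflicts


# Toy data: the driver holds its own claim, a second claim is free.
live = {
    "pvc-driver": PvcStatus("pvc-driver", used_by="driver"),
    "pvc-a": PvcStatus("pvc-a", used_by=None),
}
chosen = pick_reusable_pvc(["pvc-driver", "pvc-a"], live.get)
conflicts = conflicting_pending_pods(
    {"exec-1": "pvc-driver", "exec-2": "pvc-a"}, live.get
)
```

With this toy data the driver's claim is skipped, and the pending executor that
was handed the driver's claim is reported so it could be given another one.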