Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/21241#discussion_r186853331
--- Diff: docs/running-on-kubernetes.md ---
@@ -561,6 +561,13 @@ specific to Spark on Kubernetes.
This is distinct from <code>spark.executor.cores</code>: it is only
used for specifying the executor pod CPU request, and takes precedence
over <code>spark.executor.cores</code> for that purpose if set. Task
parallelism, e.g., the number of tasks an executor can run concurrently,
is not affected by this.
</tr>
+<tr>
+ <td><code>spark.kubernetes.executor.maxInitFailures</code></td>
+ <td>10</td>
+ <td>
+ Maximum number of times executors are allowed to fail with an
Init:Error state before the application is failed. Note that Init:Error failures
should not be caused by Spark itself, because Spark does not attach
init-containers to pods; init-containers can be attached by the cluster itself.
Users should check with their cluster administrator if these kinds of executor
pod startup failures occur frequently.
--- End diff ---
We can also just start with a minimal set and just keep adding them as we
find more root causes.
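For reference, the proposed setting would be passed like any other Spark conf. A hedged sketch, assuming the property merges as written in this diff; the master URL, namespace, image name, and example jar path below are placeholders, not values from the PR:

```shell
# Hypothetical usage of the proposed spark.kubernetes.executor.maxInitFailures
# property; all cluster-specific values here are illustrative placeholders.
spark-submit \
  --master k8s://https://example-k8s-apiserver:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.namespace=default \
  --conf spark.kubernetes.container.image=example.com/spark:latest \
  --conf spark.kubernetes.executor.maxInitFailures=10 \
  local:///opt/spark/examples/jars/spark-examples.jar
```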
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]