Hi Team,
I am running an Apache Spark 3.4.1 application on K8s with the below
configuration related to executor rolling and ignoring decommission fetch
failures.

spark.plugins: "org.apache.spark.scheduler.cluster.k8s.ExecutorRollPlugin"
spark.kubernetes.executor.rollInterval: "1800s"
spark.kubernetes.executor.rollPolicy: "OUTLIER_NO_FALLBACK"
spark.kubernetes.executor.minTasksPerExecutorBeforeRolling: "100"

spark.stage.ignoreDecommissionFetchFailure: "true"
spark.scheduler.maxRetainedRemovedDecommissionExecutors: "20"

spark.decommission.enabled: "true"
spark.storage.decommission.enabled: "true"
spark.storage.decommission.fallbackStorage.path: "some-s3-path"
spark.storage.decommission.shuffleBlocks.maxThreads: "16"
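
For completeness, here is a minimal sketch of how I am applying these same
settings programmatically through the SparkSession builder (the app name is
just a placeholder, and the fallback path is the same "some-s3-path" value
as above; the identical key/value pairs can also be passed via spark-submit
--conf):

import org.apache.spark.sql.SparkSession

// Sketch only: mirrors the configuration listed above.
// These keys (plugins, decommission) need to be in place before the
// SparkContext starts, which the builder handles at getOrCreate().
val spark = SparkSession.builder()
  .appName("placeholder-app")
  // Executor rolling
  .config("spark.plugins", "org.apache.spark.scheduler.cluster.k8s.ExecutorRollPlugin")
  .config("spark.kubernetes.executor.rollInterval", "1800s")
  .config("spark.kubernetes.executor.rollPolicy", "OUTLIER_NO_FALLBACK")
  .config("spark.kubernetes.executor.minTasksPerExecutorBeforeRolling", "100")
  // Ignore decommission fetch failures
  .config("spark.stage.ignoreDecommissionFetchFailure", "true")
  .config("spark.scheduler.maxRetainedRemovedDecommissionExecutors", "20")
  // Graceful decommission with fallback storage
  .config("spark.decommission.enabled", "true")
  .config("spark.storage.decommission.enabled", "true")
  .config("spark.storage.decommission.fallbackStorage.path", "some-s3-path")
  .config("spark.storage.decommission.shuffleBlocks.maxThreads", "16")
  .getOrCreate()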

When an executor is decommissioned in the middle of a stage, I notice
shuffle fetch failures in tasks, and the ignore-decommission configurations
above are not respected; the stage goes into retry. The decommissioned
executor's logs clearly show that the decommission was fully graceful and
that blocks were replicated to other active executors/fallback storage.

May I know how I should be using executor rolling without triggering stage
failures? I am using executor rolling to avoid executors being removed by
K8s due to memory pressure or OOM issues, as my Spark job is shuffle-heavy
and has a lot of window functions. Any help will be super useful.



Arun Ravi M V
B.Tech (Batch: 2010-2014)

Computer Science and Engineering

Govt. Model Engineering College
Cochin University Of Science And Technology
Kochi
