Hi Team, I am running an Apache Spark 3.4.1 application on K8s with the configuration below, which relates to executor rolling and ignoring decommission fetch failures.

spark.plugins: "org.apache.spark.scheduler.cluster.k8s.ExecutorRollPlugin"
spark.kubernetes.executor.rollInterval: "1800s"
spark.kubernetes.executor.rollPolicy: "OUTLIER_NO_FALLBACK"
spark.kubernetes.executor.minTasksPerExecutorBeforeRolling: "100"
spark.stage.ignoreDecommissionFetchFailure: "true"
spark.scheduler.maxRetainedRemovedDecommissionExecutors: "20"
spark.decommission.enabled: "true"
spark.storage.decommission.enabled: "true"
spark.storage.decommission.fallbackStorage.path: "some-s3-path"
spark.storage.decommission.shuffleBlocks.maxThreads: "16"

When an executor is decommissioned in the middle of a stage, I see shuffle fetch failures in tasks, and the ignore-decommission settings above are not respected: the stage goes into retry. The logs of the decommissioned executor clearly show that the decommission was fully graceful and that blocks were replicated to other active executors or to the fallback storage.

How should I use executor rolling without triggering stage failures? I am using executor rolling to avoid executors being removed by K8s due to memory pressure or OOM issues, as my Spark job is shuffle-heavy and uses a lot of window functions.

Any help would be much appreciated.

Arun Ravi M V
B.Tech (Batch: 2010-2014)
Computer Science and Engineering
Govt. Model Engineering College
Cochin University Of Science And Technology
Kochi
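
For anyone trying to reproduce this, here is a minimal sketch that renders the settings listed above as `--conf` arguments. It assumes they are passed to `spark-submit` on the command line, which the message does not actually state. The helper name `to_spark_submit_args` is hypothetical and not part of Spark, and `some-s3-path` is kept as the placeholder it is.

```python
# Settings copied verbatim from the message above.
SETTINGS = {
    "spark.plugins": "org.apache.spark.scheduler.cluster.k8s.ExecutorRollPlugin",
    "spark.kubernetes.executor.rollInterval": "1800s",
    "spark.kubernetes.executor.rollPolicy": "OUTLIER_NO_FALLBACK",
    "spark.kubernetes.executor.minTasksPerExecutorBeforeRolling": "100",
    "spark.stage.ignoreDecommissionFetchFailure": "true",
    "spark.scheduler.maxRetainedRemovedDecommissionExecutors": "20",
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    "spark.storage.decommission.fallbackStorage.path": "some-s3-path",  # placeholder
    "spark.storage.decommission.shuffleBlocks.maxThreads": "16",
}


def to_spark_submit_args(settings):
    """Hypothetical helper: flatten a settings dict into --conf key=value pairs."""
    args = []
    for key, value in settings.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Print one flag per line so it can be pasted after `spark-submit`.
    flat = to_spark_submit_args(SETTINGS)
    for flag, pair in zip(flat[0::2], flat[1::2]):
        print(f"{flag} {pair} \\")
```

Keeping the settings in one dictionary makes it easy to toggle a single key, such as `spark.stage.ignoreDecommissionFetchFailure`, between runs when comparing behaviour.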