[
https://issues.apache.org/jira/browse/SPARK-57650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-57650.
----------------------------------
Fix Version/s: 4.3.0, 4.2.1
Assignee: Hyukjin Kwon
Resolution: Fixed
https://github.com/apache/spark/pull/56715
> YarnClusterSuite tests intermittently time out due to the test mini-cluster's
> default AM resource limit
> -------------------------------------------------------------------------------------------------------
>
> Key: SPARK-57650
> URL: https://issues.apache.org/jira/browse/SPARK-57650
> Project: Spark
> Issue Type: Test
> Components: Tests, YARN
> Affects Versions: 4.2.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.3.0, 4.2.1
>
>
> h3. Symptom
> Several {{YarnClusterSuite}} tests fail intermittently on memory-constrained
> CI (observed on the scheduled Maven Scala 2.13 JDK 21 branch-4.2 build and
> the JDK 17 branch-4.x build) with a 3-minute timeout:
> {code}
> The code passed to eventually never returned normally. Attempted 190 times
> over 3.0 minutes. Last failure message: handle.getState().isFinal() was
> false. (BaseYarnClusterSuite.scala:213)
> {code}
> Affected tests: the two "ensuring redaction" tests, "yarn-cluster should
> respect conf overrides in SparkHadoopUtil (SPARK-16414, SPARK-23630)", and
> the SPARK-35672 'local' URI scheme jar tests.
> h3. Root cause
> The mini {{CapacityScheduler}} set up in {{BaseYarnClusterSuite}} configures
> the queue but never sets
> {{yarn.scheduler.capacity.maximum-am-resource-percent}}, so it defaults to
> 0.1. On a small CI cluster, that default caps the queue's total AM resource
> budget at ~1GB, which is smaller than the 1-2GB AM/driver memory these tests
> request. The applications therefore get stuck in the ACCEPTED state (never
> activated) and the suite times out. The YARN diagnostics show {{Queue's AM
> resource limit exceeded. AM
> Resource Request = <memory:2048>; Queue Resource Limit for AM =
> <memory:1024>}} repeated >1000 times.
> h3. Fix
> Set {{maximum-am-resource-percent}} to 1.0, both globally and for the
> root.default queue, in {{BaseYarnClusterSuite}} so that test AMs can use the
> whole queue and applications are always activated. This is a test-only
> change, and it removes the flakiness deterministically regardless of runner
> memory.
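
The arithmetic behind the root cause and the fix can be sketched as below. This is a simplified model, not the CapacityScheduler's actual code: the class and method names are hypothetical, the 10240 MB queue size is an assumption chosen so that 0.1 reproduces the 1024 MB limit seen in the diagnostics, and per-user limits, allocation rounding, and any scheduler special cases are ignored.

```java
/** Simplified model of a queue's AM resource limit (hypothetical names, not YARN's API). */
final class AmLimitSketch {

    /** AM budget in MB: queue memory scaled by maximum-am-resource-percent. */
    static long amLimitMb(long queueMemoryMb, double maxAmResourcePercent) {
        return (long) Math.floor(queueMemoryMb * maxAmResourcePercent);
    }

    /** True if a new AM request fits within the remaining AM budget. */
    static boolean fitsAmLimit(long amUsedMb, long amRequestMb, long amLimitMb) {
        return amUsedMb + amRequestMb <= amLimitMb;
    }

    public static void main(String[] args) {
        // Assumption: queue size picked so that 0.1 gives the 1024 MB limit from the diagnostics.
        long queueMemoryMb = 10240;
        // AM request size taken from the diagnostics in the issue description.
        long amRequestMb = 2048;

        // Compare the default percent (0.1) with the value the fix sets (1.0).
        for (double percent : new double[] {0.1, 1.0}) {
            long limit = amLimitMb(queueMemoryMb, percent);
            boolean fits = fitsAmLimit(0, amRequestMb, limit);
            System.out.println("percent=" + percent + " limitMb=" + limit + " fits=" + fits);
        }
    }
}
```

Under these assumptions the 2048 MB request does not fit the budget at 0.1 and does fit at 1.0, which matches the behavior described in the issue.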