[ 
https://issues.apache.org/jira/browse/SPARK-57650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-57650.
----------------------------------
    Fix Version/s: 4.3.0
                   4.2.1
         Assignee: Hyukjin Kwon
       Resolution: Fixed

https://github.com/apache/spark/pull/56715

> YarnClusterSuite tests intermittently time out due to the test mini-cluster's 
> default AM resource limit
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-57650
>                 URL: https://issues.apache.org/jira/browse/SPARK-57650
>             Project: Spark
>          Issue Type: Test
>          Components: Tests, YARN
>    Affects Versions: 4.2.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.3.0, 4.2.1
>
>
> h3. Symptom
> Several {{YarnClusterSuite}} tests fail intermittently on memory-constrained 
> CI (observed on the scheduled Maven Scala 2.13 JDK 21 branch-4.2 build and 
> the JDK 17 branch-4.x build) with a 3-minute timeout:
> {code}
> The code passed to eventually never returned normally. Attempted 190 times 
> over 3.0 minutes. Last failure message: handle.getState().isFinal() was 
> false. (BaseYarnClusterSuite.scala:213)
> {code}
> Affected tests: the two "ensuring redaction" tests, "yarn-cluster should 
> respect conf overrides in SparkHadoopUtil (SPARK-16414, SPARK-23630)", and 
> the SPARK-35672 'local' URI scheme jar tests.
> h3. Root cause
> The mini {{CapacityScheduler}} set up in {{BaseYarnClusterSuite}} configures 
> the queue but never sets 
> {{yarn.scheduler.capacity.maximum-am-resource-percent}}, so the property 
> keeps its default of 0.1. On a small CI cluster, that default caps the 
> queue's total AM resource budget at ~1GB, which is smaller than the 1-2GB 
> AM/driver memory these tests request. The applications therefore stay in the 
> ACCEPTED state, are never activated, and the suite times out. The YARN 
> diagnostics show {{Queue's AM resource limit exceeded. AM 
> Resource Request = <memory:2048>; Queue Resource Limit for AM = 
> <memory:1024>}} repeated more than 1000 times.
> h3. Fix
> Set {{maximum-am-resource-percent}} to 1.0 (both globally and for 
> root.default) in {{BaseYarnClusterSuite}} so that test AMs can use the whole 
> queue and applications are always activated. This is a test-only change, and 
> it removes the flakiness regardless of how much memory the runner has.
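
The arithmetic behind the root cause and the fix can be sketched as below. This is a simplified model, not the CapacityScheduler's actual algorithm: it assumes the AM budget is simply queue memory times the configured percent, and it ignores rounding to the minimum allocation and any special cases the scheduler applies. The 10240MB queue size is a hypothetical value chosen so that 0.1 yields the ~1GB limit seen in the diagnostics. The object and method names are illustrative, and the per-queue key is written on the assumption that it follows the usual yarn.scheduler.capacity.<queue-path>.<property> pattern.

```scala
object AmResourceLimitSketch {
  // Global key as quoted in the issue; the per-queue key is an assumption
  // based on the yarn.scheduler.capacity.<queue-path>.<property> pattern.
  val GlobalKey = "yarn.scheduler.capacity.maximum-am-resource-percent"
  val QueueKey = "yarn.scheduler.capacity.root.default.maximum-am-resource-percent"

  // Simplified model (assumption): AM budget = queue memory * percent.
  def amLimitMb(queueMemoryMb: Long, maxAmPercent: Double): Long =
    math.floor(queueMemoryMb * maxAmPercent).toLong

  // In this model an AM is activated only if it fits in the remaining budget.
  def canActivate(amRequestMb: Long, amUsedMb: Long, limitMb: Long): Boolean =
    amUsedMb + amRequestMb <= limitMb

  def main(args: Array[String]): Unit = {
    val queueMemoryMb = 10240L // hypothetical queue size on a small runner
    val amRequestMb = 2048L    // AM request reported in the YARN diagnostics

    val defaultLimit = amLimitMb(queueMemoryMb, 0.1) // default percent
    val fixedLimit = amLimitMb(queueMemoryMb, 1.0)   // percent after the fix

    // The overrides described in the fix, as plain key/value pairs.
    val overrides = Map(GlobalKey -> "1.0", QueueKey -> "1.0")

    println(s"limit at 0.1: ${defaultLimit}MB, activated: " +
      canActivate(amRequestMb, 0L, defaultLimit))
    println(s"limit at 1.0: ${fixedLimit}MB, activated: " +
      canActivate(amRequestMb, 0L, fixedLimit))
    overrides.foreach { case (key, value) => println(s"$key=$value") }
  }
}
```

Under these assumptions, the 2048MB request does not fit in the 1024MB budget at 0.1 but fits once the percent is 1.0. That matches the issue's point that the outcome no longer depends on how much memory the runner has.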


