LuciferYang opened a new pull request, #37938:
URL: https://github.com/apache/spark/pull/37938
### What changes were proposed in this pull request?
After SPARK-17321, `YarnShuffleService` will persist data to local shuffle
state db/reload data from local shuffle state db only when Yarn NodeManager
start with `YarnConfiguration#NM_RECOVERY_ENABLED = true`.
`YarnShuffleIntegrationSuite` not set
`YarnConfiguration#NM_RECOVERY_ENABLED` and the default value of the
configuration is false, so `YarnShuffleIntegrationSuite` will neither trigger
data persistence to the db nor verify the reload of data.
This pr aims to let `YarnShuffleIntegrationSuite` restart the verification
of registeredExecFile reload scenarios, to achieve this goal, this pr make the
following changes:
1. Add a new un-document configuration
`spark.yarn.shuffle.integrationTesting` to `YarnShuffleService`, and Initialize
`_recoveryPath` when `_recoveryPath == null &&
spark.yarn.shuffle.integrationTesting == true`.
2. Only set `spark.yarn.shuffle.integrationTesting = true` in
`YarnShuffleIntegrationSuite`, and add assertions to check `registeredExecFile`
is not null to ensure that registeredExecFile reload scenarios will be verified.
### Why are the changes needed?
Fix registeredExecFile reload test scenarios.
Why not test by configuring `YarnConfiguration#NM_RECOVERY_ENABLED` as true?
This configuration has been tried
**Hadoop 3.3.4**
```
build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none
-DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite
-Phadoop-3
```
```
2022-09-10T11:44:42.1710230Z Cause: java.lang.ClassNotFoundException:
org.apache.hadoop.shaded.org.iq80.leveldb.DBException
2022-09-10T11:44:42.1715234Z at
java.net.URLClassLoader.findClass(URLClassLoader.java:387)
2022-09-10T11:44:42.1719347Z at
java.lang.ClassLoader.loadClass(ClassLoader.java:419)
2022-09-10T11:44:42.1723090Z at
sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
2022-09-10T11:44:42.1726759Z at
java.lang.ClassLoader.loadClass(ClassLoader.java:352)
2022-09-10T11:44:42.1731028Z at
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:313)
2022-09-10T11:44:42.1735424Z at
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:370)
2022-09-10T11:44:42.1740303Z at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
2022-09-10T11:44:42.1745576Z at
org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceInit(MiniYARNCluster.java:597)
2022-09-10T11:44:42.1828858Z at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
2022-09-10T11:44:42.1829712Z at
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109)
2022-09-10T11:44:42.1830633Z at
org.apache.hadoop.yarn.server.MiniYARNCluster.serviceInit(MiniYARNCluster.java:327)
2022-09-10T11:44:42.1831431Z at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
2022-09-10T11:44:42.1832279Z at
org.apache.spark.deploy.yarn.BaseYarnClusterSuite.beforeAll(BaseYarnClusterSuite.scala:112)
```
**Hadoop 2.7.4**
```
build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none
-DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite
-Phadoop-2
```
```
YarnShuffleIntegrationWithLevelDBBackendSuite:
org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite
*** ABORTED ***
java.lang.IllegalArgumentException: Cannot support recovery with an
ephemeral server port. Check the setting of yarn.nodemanager.address
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:395)
at
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:272)
at
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at
org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceStart(MiniYARNCluster.java:560)
at
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at
org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:278)
at
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
...
Run completed in 3 seconds, 992 milliseconds.
Total number of tests run: 0
Suites: completed 1, aborted 1
Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
*** 1 SUITE ABORTED ***
```
From the above test, we need to use a fixed port to enable Yarn NodeManager
recovery, but this is difficult to be guaranteed in UT, so this pr try a
workaround way.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GitHub Actions
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]