LuciferYang opened a new pull request #33963:
URL: https://github.com/apache/spark/pull/33963
### What changes were proposed in this pull request?
As described in SPARK-36636, if test cases configured with `local-cluster[n,
c, m]` run back to back within 1 second, their workdir names collide, because
the app ID uses the format `app-yyyyMMddHHmmss-0000` and the workdir name is
currently derived from it in tests. The related logs are as follows:
```
java.io.IOException: Failed to create directory /spark-mine/work/app-20210908074432-0000/1
	at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:578)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
21/09/08 22:44:32.266 dispatcher-event-loop-0 INFO Worker: Asked to launch executor app-20210908074432-0000/0 for test
21/09/08 22:44:32.266 dispatcher-event-loop-0 ERROR Worker: Failed to launch executor app-20210908074432-0000/0 for test.
java.io.IOException: Failed to create directory /spark-mine/work/app-20210908074432-0000/0
	at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:578)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```
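The collision can be reproduced in miniature. The following is only a sketch in Python, not Spark's own code: `make_app_id` is a hypothetical helper that mimics the `app-yyyyMMddHHmmss-0000` format, and it assumes the trailing counter restarts at 0 for each test case (as the `-0000` suffix in the logs suggests).

```python
from datetime import datetime


def make_app_id(submit_time, counter):
    """Hypothetical helper mimicking the app-yyyyMMddHHmmss-NNNN format."""
    # The timestamp has one-second resolution, so sub-second differences vanish.
    return "app-{}-{:04d}".format(submit_time.strftime("%Y%m%d%H%M%S"), counter)


# Two test cases start within the same second (different microseconds).
first_start = datetime(2021, 9, 8, 7, 44, 32, 100000)
second_start = datetime(2021, 9, 8, 7, 44, 32, 900000)

# Assumption: each case starts a fresh cluster, so the counter is 0 both times.
first_id = make_app_id(first_start, 0)
second_id = make_app_id(second_start, 0)

# Both cases end up with the same ID, hence the same workdir name.
ids_collide = first_id == second_id
print(first_id, second_id, ids_collide)
```

Because the two IDs are equal, any workdir path built from them is equal too, which is what leads to the `Failed to create directory` errors above.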
Since the default value of `spark.deploy.maxExecutorRetries` is 10, a test
failure occurs when 5 consecutive cases with `local-cluster[3, 1, 1024]`
complete within 1 second:
1. case 1: uses worker directories `/app-202109102324-0000/0`,
`/app-202109102324-0000/1`, `/app-202109102324-0000/2`
2. case 2: retries 3 times, then uses worker directories
`/app-202109102324-0000/3`, `/app-202109102324-0000/4`,
`/app-202109102324-0000/5`
3. case 3: retries 6 times, then uses worker directories
`/app-202109102324-0000/6`, `/app-202109102324-0000/7`,
`/app-202109102324-0000/8`
4. case 4: retries 9 times, then uses worker directories
`/app-202109102324-0000/9`, `/app-202109102324-0000/10`,
`/app-202109102324-0000/11`
5. case 5: retries more than **10** times, then **fails**
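The retry arithmetic in the list above can be checked with a small simulation. This is a simplified model written in Python, not Spark's scheduling logic: `simulate` is a hypothetical function, each directory collision is counted as one retry, and a case is treated as failed once its retries exceed the limit, following the wording of the list.

```python
MAX_EXECUTOR_RETRIES = 10  # default of spark.deploy.maxExecutorRetries, per the text
EXECUTORS_PER_CASE = 3     # n in local-cluster[3, 1, 1024]


def simulate(num_cases):
    """Return (case, retries, claimed_dirs, failed) for each consecutive case."""
    used_dirs = set()  # executor directories already present under the shared app dir
    results = []
    for case in range(1, num_cases + 1):
        retries = 0
        next_id = 0  # every case starts numbering its executors from 0
        claimed = []
        failed = False
        while len(claimed) < EXECUTORS_PER_CASE:
            if next_id in used_dirs:
                # Directory left over from an earlier case: counts as one retry.
                retries += 1
                if retries > MAX_EXECUTOR_RETRIES:
                    failed = True
                    break
            else:
                used_dirs.add(next_id)
                claimed.append(next_id)
            next_id += 1
        results.append((case, retries, claimed, failed))
    return results


results = simulate(5)
for case, retries, claimed, failed in results:
    print(case, retries, claimed, "FAILED" if failed else "ok")
```

Under these assumptions, cases 1 to 4 need 0, 3, 6, and 9 retries, and case 5 runs past the limit before reaching a free directory.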
To avoid this issue, this PR changes tests with config
`local-cluster[n, c, m]` to use a temporary workdir.
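The idea behind the fix can be illustrated outside Spark. The sketch below is in Python and is not the change made in this PR: `make_work_dir` is a hypothetical helper built on `tempfile.mkdtemp`, standing in for whatever temporary-directory utility the tests actually use.

```python
import os
import shutil
import tempfile


def make_work_dir():
    """Hypothetical helper: create a fresh, uniquely named work directory."""
    return tempfile.mkdtemp(prefix="spark-work-")


app_id = "app-20210908074432-0000"  # identical app ID for both test cases

# Each test case gets its own temporary work directory.
work_a = make_work_dir()
work_b = make_work_dir()

executor_dirs = []
for work_dir in (work_a, work_b):
    executor_dir = os.path.join(work_dir, app_id, "0")
    # No exist_ok here: makedirs would raise if the path already existed.
    os.makedirs(executor_dir)
    executor_dirs.append(executor_dir)

both_created = all(os.path.isdir(d) for d in executor_dirs)
print(both_created)

# Clean up the temporary directories created for this illustration.
for work_dir in (work_a, work_b):
    shutil.rmtree(work_dir)
```

Even though both cases share the same app ID, the executor directories live under different parents, so neither creation attempt collides with the other.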
### Why are the changes needed?
Avoid unit test (UT) failures caused by repeated workdir name collisions.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Pass GA or Jenkins Tests.
- Manual test: **Additional information will be provided**
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]