[GitHub] [spark] LuciferYang opened a new pull request #33963: [SPARK-36636][CORE][TEST] LocalSparkCluster change to use tmp workdir in test to avoid directory name collision

GitBox Fri, 10 Sep 2021 11:20:45 -0700


LuciferYang opened a new pull request #33963:
URL: https://github.com/apache/spark/pull/33963



   ### What changes were proposed in this pull request?
   As described in SPARK-36636, if test cases configured with `local-cluster[n, c, m]` run back to back within 1 second, their workdir names collide, because the app ID uses the format `app-yyyyMMddHHmmss-0000` and, in tests, the workdir name is derived from it. The related logs are as follows:
   
   ```
   java.io.IOException: Failed to create directory /spark-mine/work/app-20210908074432-0000/1
        at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:578)
        at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
        at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
        at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   21/09/08 22:44:32.266 dispatcher-event-loop-0 INFO Worker: Asked to launch executor app-20210908074432-0000/0 for test
   21/09/08 22:44:32.266 dispatcher-event-loop-0 ERROR Worker: Failed to launch executor app-20210908074432-0000/0 for test.
   java.io.IOException: Failed to create directory /spark-mine/work/app-20210908074432-0000/0
        at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:578)
        at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
        at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
        at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   ```
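
To make the collision concrete, here is a minimal Python sketch (not Spark code) of an ID builder with one-second resolution. The function name `make_app_id` is hypothetical, and the assumption that each freshly started local cluster begins its counter at `0000` is inferred from the `-0000` suffix in the format above.

```python
from datetime import datetime


def make_app_id(submit_time: datetime, counter: int) -> str:
    """Build an id shaped like app-yyyyMMddHHmmss-0000 (one-second resolution)."""
    return f"app-{submit_time:%Y%m%d%H%M%S}-{counter:04d}"


# Two clusters started 400 ms apart within the same second.
# Assumption: each fresh cluster starts its application counter at 0.
first = make_app_id(datetime(2021, 9, 8, 7, 44, 32, 100_000), 0)
second = make_app_id(datetime(2021, 9, 8, 7, 44, 32, 500_000), 0)

print(first)            # app-20210908074432-0000
print(first == second)  # True: both map to the same workdir name
```

Because the sub-second part of the timestamp is dropped, both submissions produce the same ID and therefore the same workdir name.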
   
   Since the default value of `spark.deploy.maxExecutorRetries` is 10, a test failure occurs when 5 consecutive cases with `local-cluster[3, 1, 1024]` complete within 1 second:
   
   1. case 1: uses worker directories `/app-202109102324-0000/0`, `/app-202109102324-0000/1`, `/app-202109102324-0000/2`
   2. case 2: retries 3 times, then uses worker directories `/app-202109102324-0000/3`, `/app-202109102324-0000/4`, `/app-202109102324-0000/5`
   3. case 3: retries 6 times, then uses worker directories `/app-202109102324-0000/6`, `/app-202109102324-0000/7`, `/app-202109102324-0000/8`
   4. case 4: retries 9 times, then uses worker directories `/app-202109102324-0000/9`, `/app-202109102324-0000/10`, `/app-202109102324-0000/11`
   5. case 5: retries more than **10** times, then **fails**
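
The retry arithmetic in the list above can be replayed with a small simulation. This is a simplified model in Python, not Spark's scheduling logic: it assumes executor IDs are handed out sequentially, that a launch fails whenever the directory already exists, and that the application fails once failed launches exceed the retry limit. The helper `run_case` is hypothetical.

```python
MAX_EXECUTOR_RETRIES = 10  # default of spark.deploy.maxExecutorRetries, per the text
EXECUTORS_PER_CASE = 3     # local-cluster[3, 1, 1024]


def run_case(existing_dirs: set, app_id: str):
    """Simulate one test case; return (launched ids, failed launches, succeeded)."""
    launched, failures, next_id = [], 0, 0
    while len(launched) < EXECUTORS_PER_CASE:
        executor_id = next_id
        next_id += 1
        path = f"/{app_id}/{executor_id}"
        if path in existing_dirs:
            # Directory left behind by an earlier case: the launch fails.
            failures += 1
            if failures > MAX_EXECUTOR_RETRIES:
                return launched, failures, False
            continue
        existing_dirs.add(path)
        launched.append(executor_id)
    return launched, failures, True


shared_dirs = set()  # all five cases share one work directory
results = [run_case(shared_dirs, "app-202109102324-0000") for _ in range(5)]
for number, (ids, failures, ok) in enumerate(results, start=1):
    print(f"case {number}: failed launches={failures}, executors={ids}, ok={ok}")
```

Under these assumptions, cases 1 to 4 succeed after 0, 3, 6, and 9 failed launches, while case 5 would need 12 and gives up first.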
   
   To avoid this issue, this PR changes tests configured with `local-cluster[n, c, m]` to use a temporary workdir.
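
The idea behind the fix can be illustrated outside Spark. The Python sketch below is not the change made in this PR. It only shows, under the assumption that each cluster gets its own temporary work root, why identical app IDs stop colliding. The helper `executor_dir` and the `worker-` prefix are hypothetical.

```python
import os
import shutil
import tempfile


def executor_dir(work_root: str, app_id: str, executor_id: int) -> str:
    """Create work_root/app_id/executor_id, failing if it already exists."""
    path = os.path.join(work_root, app_id, str(executor_id))
    os.makedirs(path)  # raises FileExistsError on a name collision
    return path


app_id = "app-20210908074432-0000"

# Each simulated cluster gets its own temporary work root.
root_a = tempfile.mkdtemp(prefix="worker-")
root_b = tempfile.mkdtemp(prefix="worker-")
dir_a = executor_dir(root_a, app_id, 0)
dir_b = executor_dir(root_b, app_id, 0)  # same app id, different root
print(dir_a != dir_b)

# Reusing one root for the same app id and executor id collides.
try:
    executor_dir(root_a, app_id, 0)
    collided = False
except FileExistsError:
    collided = True
print(collided)

# Clean up the temporary directories created above.
shutil.rmtree(root_a)
shutil.rmtree(root_b)
```

With separate roots, the same `app-.../0` path can be created once per cluster, whereas a shared root reproduces the failure.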
   
   
   
   ### Why are the changes needed?
   Avoid unit test failures caused by repeated workdir name collisions.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   
   - Pass GitHub Actions or Jenkins tests.
   - Manual test: **Additional information will be provided**
   
   
   


