LuciferYang commented on a change in pull request #33852:
URL: https://github.com/apache/spark/pull/33852#discussion_r706293393
##########
File path: core/src/main/scala/org/apache/spark/api/r/RBackendHandler.scala
##########
@@ -165,7 +165,7 @@ private[r] class RBackendHandler(server: RBackend)
// Write status bit
writeInt(dos, 0)
- writeObject(dos, ret.asInstanceOf[AnyRef], server.jvmObjectTracker)
Review comment:
After some investigation for
https://github.com/apache/spark/pull/33852#discussion_r699923891, I found the
reason for the random failures of these cases, summarized as follows (copied
from SPARK-36636):
First, several consecutive cases in `SparkContextSuite` use local-cluster
mode such as `local-cluster[3, 1, 1024]`. Each local-cluster starts a new
local standalone cluster and submits a new application with an appid of the
format `app-yyyyMMddHHmmss-0000`, like `app-202109102324-0000`.
Therefore, if cases configured with `local-cluster[i, c, m]` run back-to-back
within the same second, their worker directories collide. As evidence, I found
many logs like the following:
```
java.io.IOException: Failed to create directory /spark-mine/work/app-20210908074432-0000/1
    at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:578)
    at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
    at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
    at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
21/09/08 22:44:32.266 dispatcher-event-loop-0 INFO Worker: Asked to launch executor app-20210908074432-0000/0 for test
21/09/08 22:44:32.266 dispatcher-event-loop-0 ERROR Worker: Failed to launch executor app-20210908074432-0000/0 for test.
java.io.IOException: Failed to create directory /spark-mine/work/app-20210908074432-0000/0
    at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:578)
    at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
    at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
    at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```
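To make the collision mechanism concrete, here is a minimal sketch of the appid scheme described above. The object name `AppIdCollisionSketch` and the `"app-%s-%04d"` format string are my assumptions reconstructed from the description, not code taken from the Spark source; the point is only that a second-granularity timestamp plus a per-cluster counter starting at 0 yields identical ids for two clusters started in the same second:

```scala
import java.text.SimpleDateFormat
import java.util.Date

// Hypothetical reconstruction of the appid scheme described above:
// a second-granularity timestamp plus a per-cluster application counter.
object AppIdCollisionSketch {
  private val fmt = new SimpleDateFormat("yyyyMMddHHmmss")

  def newAppId(date: Date, counter: Int): String =
    "app-%s-%04d".format(fmt.format(date), counter)

  def main(args: Array[String]): Unit = {
    val now = new Date()
    // Two independent local standalone clusters started within the same
    // second each assign their first application counter 0, so the ids
    // (and hence the worker directories under them) collide.
    val idA = newAppId(now, 0)
    val idB = newAppId(now, 0)
    println(s"collision: ${idA == idB}")
  }
}
```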
Since the default value of `spark.deploy.maxExecutorRetries` is 10, the
following happens when 5 consecutive cases using `local-cluster[3, 1, 1024]`
complete within 1 second:
1. case 1: uses worker directories `/app-202109102324-0000/0`, `/app-202109102324-0000/1`, `/app-202109102324-0000/2`
2. case 2: retries 3 times, then uses `/app-202109102324-0000/3`, `/app-202109102324-0000/4`, `/app-202109102324-0000/5`
3. case 3: retries 6 times, then uses `/app-202109102324-0000/6`, `/app-202109102324-0000/7`, `/app-202109102324-0000/8`
4. case 4: retries 9 times, then uses `/app-202109102324-0000/9`, `/app-202109102324-0000/10`, `/app-202109102324-0000/11`
5. case 5: retries more than 10 times and fails
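The arithmetic in the list above can be sketched as follows. This is a simulation under my assumptions (3 executors per case, one retry per already-existing directory, all names like `WorkerDirRetrySketch` hypothetical), not Worker code:

```scala
// Sketch of the retry arithmetic: five consecutive cases share one appid,
// each case needs 3 executor directories, and every directory left behind
// by an earlier case costs one failed launch attempt (a "retry").
object WorkerDirRetrySketch {
  val MaxExecutorRetries = 10 // default of spark.deploy.maxExecutorRetries

  // Retries needed by the n-th case (1-based): one per directory already
  // created by the previous (n - 1) cases.
  def retriesForCase(n: Int, executorsPerCase: Int = 3): Int =
    (n - 1) * executorsPerCase

  def caseFails(n: Int): Boolean = retriesForCase(n) > MaxExecutorRetries

  def main(args: Array[String]): Unit = {
    for (n <- 1 to 5) {
      val r = retriesForCase(n)
      val status =
        if (caseFails(n)) s"$r retries needed > $MaxExecutorRetries, fails"
        else {
          val dirs = (r until r + 3).map(i => s"/app-202109102324-0000/$i")
          s"$r retries, uses ${dirs.mkString(", ")}"
        }
      println(s"case $n: $status")
    }
  }
}
```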
I tried setting the default value of `spark.deploy.maxExecutorRetries` to 50
(maybe 15 is enough), and the failures no longer occur.
It seems the Scala 2.13 build runs these cases faster than Scala 2.12, which
makes this problem more likely to surface.
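For reference, one way this override could be expressed in a test (a sketch only; the config key is real, but whether a per-suite override is the right fix is exactly the open question):

```scala
import org.apache.spark.SparkConf

// Hypothetical per-suite workaround, not a committed fix: raise the retry
// budget so colliding worker directories do not exhaust it.
val conf = new SparkConf()
  .setMaster("local-cluster[3, 1, 1024]")
  .set("spark.deploy.maxExecutorRetries", "50") // default is 10
```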
Do you have any good suggestions for fixing this issue? @srowen @HyukjinKwon
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]