LuciferYang commented on a change in pull request #33852:
URL: https://github.com/apache/spark/pull/33852#discussion_r706293393



##########
File path: core/src/main/scala/org/apache/spark/api/r/RBackendHandler.scala
##########
@@ -165,7 +165,7 @@ private[r] class RBackendHandler(server: RBackend)
 
         // Write status bit
         writeInt(dos, 0)
-        writeObject(dos, ret.asInstanceOf[AnyRef], server.jvmObjectTracker)

Review comment:
   After some investigation into https://github.com/apache/spark/pull/33852#discussion_r699923891, I found the reason for the random failures of these cases, summarized as follows (copied from SPARK-36636):
   
   First, several consecutive cases in `SparkContextSuite` use local-cluster mode such as `local-cluster[3, 1, 1024]`. Each local-cluster starts a new local standalone cluster and submits a new application whose appid has the format `app-yyyyMMddHHmmss-0000`, like `app-202109102324-0000`.
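
   For context, that second-granularity timestamp is exactly why back-to-back clusters can collide: each local-cluster brings up its own Master whose application counter starts from `0000` again. Below is a minimal sketch of that behaviour (a simplified paraphrase; `FakeMaster` and `AppIdCollisionDemo` are made-up names, not the real Master code):

   ```scala
   import java.text.SimpleDateFormat
   import java.util.{Date, Locale}

   // Simplified paraphrase: the standalone Master derives app ids from a
   // "yyyyMMddHHmmss" timestamp plus a per-Master counter starting at 0.
   class FakeMaster {
     private val createDateFormat = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US)
     private var nextAppNumber = 0

     def newApplicationId(submitDate: Date): String = {
       val appId = "app-%s-%04d".format(createDateFormat.format(submitDate), nextAppNumber)
       nextAppNumber += 1
       appId
     }
   }

   // Two local-clusters started within the same second: each has its own
   // Master, so both counters start at 0 and both yield the same app id,
   // hence the same worker work sub-directory on disk.
   object AppIdCollisionDemo extends App {
     val now = new Date()
     println(new FakeMaster().newApplicationId(now)) // e.g. app-20210908074432-0000
     println(new FakeMaster().newApplicationId(now)) // identical id from the second cluster
   }
   ```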
   
   Therefore, if cases configured with `local-cluster[i, c, m]` run back to back within the same second, their worker directories collide. The evidence is that I found many logs like the following:
   
   ```
   java.io.IOException: Failed to create directory /spark-mine/work/app-20210908074432-0000/1
        at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:578)
        at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
        at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
        at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   21/09/08 22:44:32.266 dispatcher-event-loop-0 INFO Worker: Asked to launch executor app-20210908074432-0000/0 for test
   21/09/08 22:44:32.266 dispatcher-event-loop-0 ERROR Worker: Failed to launch executor app-20210908074432-0000/0 for test.
   java.io.IOException: Failed to create directory /spark-mine/work/app-20210908074432-0000/0
        at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:578)
        at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
        at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
        at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   ```
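
   For what it's worth, the `Worker.scala:578` frame above is where the Worker sets up the per-executor working directory `<workDir>/<appId>/<execId>`. Roughly (a paraphrase with a made-up helper name, not the exact source), the check looks like the sketch below; `mkdirs()` returning false for an already-existing directory is what produces the IOException:

   ```scala
   import java.io.{File, IOException}

   object ExecutorDirSketch {
     // Hypothetical helper mirroring the check behind the error above: the
     // Worker lays out <workDir>/<appId>/<execId>, and File.mkdirs() returns
     // false when the directory already exists (e.g. left over from a previous
     // case that got the same appId), so the launch fails with an IOException.
     def createExecutorDir(workDir: File, appId: String, execId: Int): File = {
       val executorDir = new File(workDir, appId + "/" + execId)
       if (!executorDir.mkdirs()) {
         throw new IOException("Failed to create directory " + executorDir)
       }
       executorDir
     }
   }
   ```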
   
   Since the default value of `spark.deploy.maxExecutorRetries` is 10, the following happens when 5 consecutive cases with `local-cluster[3, 1, 1024]` complete within the same second (a quick sanity check of these numbers follows the list):
   
   1. case 1: uses worker directories `/app-202109102324-0000/0`, `/app-202109102324-0000/1`, `/app-202109102324-0000/2`
   2. case 2: retries 3 times, then uses worker directories `/app-202109102324-0000/3`, `/app-202109102324-0000/4`, `/app-202109102324-0000/5`
   3. case 3: retries 6 times, then uses worker directories `/app-202109102324-0000/6`, `/app-202109102324-0000/7`, `/app-202109102324-0000/8`
   4. case 4: retries 9 times, then uses worker directories `/app-202109102324-0000/9`, `/app-202109102324-0000/10`, `/app-202109102324-0000/11`
   5. case 5: needs more than 10 retries, so it fails
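
   A back-of-the-envelope check of the numbers above (a hypothetical illustration, not code from the suite): each case needs 3 executors, and case k first has to burn through the 3*(k-1) directories taken by earlier cases, so case 5 needs 12 failed launches and blows the default budget of 10:

   ```scala
   // Hypothetical illustration: 5 consecutive local-cluster[3, 1, 1024] cases
   // landing on the same appId, 3 executors per case; directories taken by
   // earlier cases show up as failed launch attempts (retries).
   object RetryBudgetDemo extends App {
     val executorsPerCase = 3
     val maxExecutorRetries = 10 // default value of spark.deploy.maxExecutorRetries

     (1 to 5).foreach { caseNo =>
       val retries = executorsPerCase * (caseNo - 1) // directories already occupied
       val outcome = if (retries > maxExecutorRetries) "fails" else "succeeds"
       println(s"case $caseNo: $retries retries before finding free executor dirs -> $outcome")
     }
   }
   ```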
   
   When I raise `spark.deploy.maxExecutorRetries` to 50 (maybe 15 is enough), there are no more failures.
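
   For reference, that experiment amounts to something like the following fragment in the test's `SparkConf` (just a sketch of where the key could be set, not a proposed patch):

   ```scala
   import org.apache.spark.SparkConf

   // Bump the retry budget so leftover worker directories from a previous
   // case no longer exhaust spark.deploy.maxExecutorRetries (default 10).
   val conf = new SparkConf()
     .setMaster("local-cluster[3, 1, 1024]")
     .setAppName("test")
     .set("spark.deploy.maxExecutorRetries", "50") // maybe 15 is enough
   ```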
   
   It seems that Scala 2.13 runs these cases faster than Scala 2.12, so it is more likely to hit this problem.
   
   Do you have any good suggestions for fixing this issue? @srowen @HyukjinKwon 
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


