Andrea Zito created SPARK-21991:
-----------------------------------

             Summary: [LAUNCHER] LauncherServer acceptConnections thread 
sometimes dies if machine has very high load
                 Key: SPARK-21991
                 URL: https://issues.apache.org/jira/browse/SPARK-21991
             Project: Spark
          Issue Type: Bug
          Components: Spark Submit
    Affects Versions: 2.2.0, 2.1.1, 2.1.0, 2.0.2
         Environment: Single node machine running Ubuntu 16.04.2 LTS 
(4.4.0-79-generic)
YARN 2.7.2
Spark 2.0.2
            Reporter: Andrea Zito
            Priority: Minor


The way the _LauncherServer_ _acceptConnections_ thread schedules client 
timeouts can cause the thread to die, non-deterministically, with the 
following exception when the machine is under very high load:

{noformat}
Exception in thread "LauncherServer-1" java.lang.IllegalStateException: Task already scheduled or cancelled
        at java.util.Timer.sched(Timer.java:401)
        at java.util.Timer.schedule(Timer.java:193)
        at org.apache.spark.launcher.LauncherServer.acceptConnections(LauncherServer.java:249)
        at org.apache.spark.launcher.LauncherServer.access$000(LauncherServer.java:80)
        at org.apache.spark.launcher.LauncherServer$1.run(LauncherServer.java:143)
{noformat}

The issue is related to the ordering of actions that the _acceptConnections_ 
thread uses to handle a client connection:

# create timeout action
# create client thread
# start client thread
# schedule timeout action

Under normal conditions the timeout action is scheduled before the client 
thread has a chance to start. However, if the machine is under very high 
load, the client thread can receive CPU time before the timeout action gets 
scheduled.

If this happens, the client thread cancels the timeout action (which has not 
yet been scheduled) and goes on. As soon as the _acceptConnections_ thread 
gets the CPU back, it tries to schedule the timeout action (which has already 
been canceled), and _Timer.schedule_ raises the exception.
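
The failure mode can be reproduced deterministically outside Spark. The sketch below is not the _LauncherServer_ code; the class and method names are hypothetical, and it only assumes the documented behaviour of java.util.Timer, which refuses to schedule a TimerTask that has already been cancelled:

```java
import java.util.Timer;
import java.util.TimerTask;

// Hypothetical demo class, not part of Spark.
public class CancelledTaskDemo {

    /**
     * Cancels a fresh TimerTask and then tries to schedule it, mimicking the
     * case where the client thread runs before the accept thread schedules
     * the timeout. Returns the exception message, or null if none was thrown.
     */
    static String scheduleAfterCancel() {
        Timer timer = new Timer("demo-timer", true); // daemon timer thread
        TimerTask timeout = new TimerTask() {
            @Override
            public void run() {
                // Would close the client connection on timeout.
            }
        };
        try {
            timeout.cancel();                 // what the client thread does first
            timer.schedule(timeout, 10_000L); // what the accept thread does afterwards
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        } finally {
            timer.cancel();
        }
    }

    public static void main(String[] args) {
        System.out.println(scheduleAfterCancel());
    }
}
```

The message returned here should match the one in the stack trace above, which suggests the cancel-before-schedule interleaving is enough to explain the crash.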

Changing the order in which the client thread gets started and the timeout gets 
scheduled seems to be sufficient to fix this issue.
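
A minimal sketch of the proposed ordering, under the assumption that the client thread's only interaction with the timeout is to cancel it. The names below are hypothetical and do not come from the Spark code base:

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class, not part of Spark.
public class ScheduleBeforeStartDemo {

    /**
     * Schedules the timeout first and only then starts the client thread,
     * so the client's cancel() always sees an already scheduled task.
     * Returns true if cancel() prevented the pending timeout from running.
     */
    static boolean handleClient(Timer timer, long timeoutMs) throws InterruptedException {
        AtomicBoolean cancelled = new AtomicBoolean(false);
        TimerTask timeout = new TimerTask() {
            @Override
            public void run() {
                // Would close the client connection on timeout.
            }
        };
        Thread client = new Thread(() -> cancelled.set(timeout.cancel()), "client");

        timer.schedule(timeout, timeoutMs); // 1. schedule timeout action
        client.start();                     // 2. start client thread
        client.join();
        return cancelled.get();
    }

    public static void main(String[] args) throws InterruptedException {
        Timer timer = new Timer("demo-timer", true);
        try {
            for (int i = 0; i < 1000; i++) {
                handleClient(timer, 10_000L);
            }
            System.out.println("no IllegalStateException");
        } finally {
            timer.cancel();
        }
    }
}
```

With this ordering, no matter how early the client thread gets CPU time, the task has already been handed to the timer, so Timer.schedule is never called on a cancelled task.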

As stated above, the issue is non-deterministic. I hit it multiple times on a 
single-node machine while submitting a large number of short jobs 
sequentially, but I couldn't easily create a test that reproduces it.



