[ https://issues.apache.org/jira/browse/SPARK-16379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362494#comment-15362494 ]

Sean Owen commented on SPARK-16379:
-----------------------------------

A little more context does show the role that Logging plays:

{code}
"Thread-15" #67 prio=5 os_prio=0 tid=0x00007f0abc004000 nid=0x12b0 waiting for monitor entry [0x00007f0ad37fd000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.log$lzycompute(CoarseGrainedSchedulerBackend.scala:43)
        - waiting to lock <0x00000000ed78bb68> (a org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.log(CoarseGrainedSchedulerBackend.scala:43)
        at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:48)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.logInfo(CoarseGrainedSchedulerBackend.scala:43)
        at org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.registered(MesosCoarseGrainedSchedulerBackend.scala:247)
{code}

{code}
"main" #1 prio=5 os_prio=0 tid=0x00007f0bc8011800 nid=0x1269 waiting on condition [0x00007f0bcfa24000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000ed7b8558> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
        at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerUtils$class.startScheduler(MesosSchedulerUtils.scala:135)
        - locked <0x00000000ed78bb68> (a org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend)
        at org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.startScheduler(MesosCoarseGrainedSchedulerBackend.scala:48)
        at org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.start(MesosCoarseGrainedSchedulerBackend.scala:157)
        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:155)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256)
{code}

It's not a classic deadlock, but a caller holding a lock while waiting on a callback that can't proceed without that lock: the main thread synchronizes on the backend in {{MesosSchedulerUtils.startScheduler}} and awaits a latch, while the {{registered}} callback is blocked on the same monitor just to initialize the lazy {{log}}, so it never gets to signal the latch.
I'd rather actually fix this problem. One simplistic solution is to remove the {{logInfo}} statement. Another is to remove the synchronized block in {{MesosSchedulerUtils.startScheduler}}; I'm not 100% clear why it's needed, when it seems like it's just protecting {{mesosDriver}}.

> Spark on mesos is broken due to race condition in Logging
> ---------------------------------------------------------
>
>                 Key: SPARK-16379
>                 URL: https://issues.apache.org/jira/browse/SPARK-16379
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Stavros Kontopoulos
>            Priority: Blocker
>         Attachments: out.txt
>
>
> This commit introduced a transient lazy log val: 
> https://github.com/apache/spark/commit/044971eca0ff3c2ce62afa665dbd3072d52cbbec
> This has caused problems in the past:
> https://github.com/apache/spark/pull/1004
> One commit before that, everything works fine.
> I spotted this when my CI started to fail:
> https://ci.typesafe.com/job/mit-docker-test-ref/191/
> You can easily verify it by installing Mesos on your machine and trying to connect with spark-shell from the bin dir:
> ./spark-shell --master mesos://zk://localhost:2181/mesos --conf spark.executor.url=$(pwd)/../spark-2.0.0-SNAPSHOT-bin-test.tgz
> It gets stuck at the point where it tries to create the SparkContext.
> Logging gets stuck here:
> I0705 12:10:10.076617  9303 group.cpp:700] Trying to get '/mesos/json.info_0000000152' in ZooKeeper
> I0705 12:10:10.076920  9304 detector.cpp:479] A new leading master ([email protected]:5050) is detected
> I0705 12:10:10.076956  9303 sched.cpp:326] New master detected at [email protected]:5050
> I0705 12:10:10.077057  9303 sched.cpp:336] No credentials provided. Attempting to register without authentication
> I0705 12:10:10.090709  9301 sched.cpp:703] Framework registered with 13553f8b-f42c-4f20-88cd-16f1cc153ede-0001
> I also verified it by changing @transient lazy val log to a def, and it works as expected.
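
For reference, the workaround described above, replacing the lazy val with a def, looks roughly like this (a paraphrased sketch of org.apache.spark.internal.Logging, not the exact upstream code):

{code}
import org.slf4j.{Logger, LoggerFactory}

trait Logging {
  protected def logName: String = this.getClass.getName.stripSuffix("$")

  // Before (the commit linked above, paraphrased):
  //   @transient lazy val log: Logger = LoggerFactory.getLogger(logName)
  // lazy-val initialization takes this.synchronized, which is what collides
  // with the synchronized block in MesosSchedulerUtils.startScheduler.
  //
  // After (the reporter's workaround): a plain def never touches the
  // instance monitor, at the cost of a getLogger lookup on each access.
  protected def log: Logger = LoggerFactory.getLogger(logName)

  protected def logInfo(msg: => String): Unit =
    if (log.isInfoEnabled) log.info(msg)
}
{code}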