[ https://issues.apache.org/jira/browse/SPARK-16379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15363199#comment-15363199 ]

Sean Owen commented on SPARK-16379:
-----------------------------------

I don't agree with that logic; it's entirely possible that code has a bug 
that's only revealed when some other legitimate change happens, and the right 
subsequent change is to fix the bug. I don't think we'd ban lazy vals either. 
Arguably it's "synchronized" that is the issue here, really.
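
To make that concrete, here is a minimal, self-contained sketch of the pattern 
(not the actual Spark/Mesos code; Backend, onRegistered and start are made-up 
names): in Scala 2, initializing a lazy val acquires the monitor of the 
enclosing instance, so a thread that holds that monitor while waiting for a 
callback can deadlock with the callback thread whose first logging call has to 
initialize the lazy val.

    import java.util.concurrent.CountDownLatch

    trait LazyLogging {
      // In Scala 2, the first access to this lazy val synchronizes on `this`.
      @transient lazy val log: String = "logger"
      def logInfo(msg: String): Unit = println(s"$log: $msg")
    }

    class Backend extends LazyLogging {
      private val registered = new CountDownLatch(1)

      // Stand-in for a callback fired from another thread (e.g. a driver
      // thread): it logs first, which forces lazy-val initialization and so
      // needs this instance's monitor, and only then releases the latch.
      def onRegistered(): Unit = {
        logInfo("registered")
        registered.countDown()
      }

      // Stand-in for the waiting side: it holds this instance's monitor while
      // blocking on the latch, so the callback thread can never get past the
      // lazy-val initialization above.
      def start(): Unit = synchronized {
        new Thread(() => onRegistered()).start()
        registered.await()   // never returns: a lock-ordering deadlock
      }
    }

    object Demo extends App {
      new Backend().start()   // hangs unless `log` was already initialized
    }

Any change that breaks one leg of that cycle (initializing the logger before 
the monitor is taken, not logging on the callback path, or not holding the 
monitor while waiting) avoids the hang, which is roughly how the options above 
differ.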

Indeed, reverting the last patch only 'fixes' it because the code contained a 
hack to avoid this condition. The previous code also involved acquiring a lock, 
and I'm guessing it _could_ still be a problem, though less likely to come up 
given that the locking only happens during the first call (well hopefully). 
Removing the logInfo actually removes the issue more directly than 
reintroducing the hack. Changing the startScheduler method is _probably_ the 
right-er fix, though that's less conservative.

I'm not against reverting the change, simply on the grounds that Logging is 
inherited in lots of places and so there's a risk of a repeat of this problem 
elsewhere, even if it may ultimately also be due to some other coding problem. 
I'd just rather not also reintroduce the hack.

> Spark on mesos is broken due to race condition in Logging
> ---------------------------------------------------------
>
>                 Key: SPARK-16379
>                 URL: https://issues.apache.org/jira/browse/SPARK-16379
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Stavros Kontopoulos
>            Priority: Blocker
>         Attachments: out.txt
>
>
> This commit introduced a transient lazy log val: 
> https://github.com/apache/spark/commit/044971eca0ff3c2ce62afa665dbd3072d52cbbec
> This has caused problems in the past:
> https://github.com/apache/spark/pull/1004
> One commit before that, everything works fine.
> I spotted it when my CI started to fail:
> https://ci.typesafe.com/job/mit-docker-test-ref/191/
> You can easily verify it by installing Mesos on your machine and trying to 
> connect with the Spark shell from the bin dir:
> ./spark-shell --master mesos://zk://localhost:2181/mesos --conf 
> spark.executor.url=$(pwd)/../spark-2.0.0-SNAPSHOT-bin-test.tgz
> It gets stuck at the point where it tries to create the SparkContext.
> Logging gets stuck here:
> I0705 12:10:10.076617  9303 group.cpp:700] Trying to get 
> '/mesos/json.info_0000000152' in ZooKeeper
> I0705 12:10:10.076920  9304 detector.cpp:479] A new leading master 
> ([email protected]:5050) is detected
> I0705 12:10:10.076956  9303 sched.cpp:326] New master detected at 
> [email protected]:5050
> I0705 12:10:10.077057  9303 sched.cpp:336] No credentials provided. 
> Attempting to register without authentication
> I0705 12:10:10.090709  9301 sched.cpp:703] Framework registered with 
> 13553f8b-f42c-4f20-88cd-16f1cc153ede-0001
> I also verified it by changing @transient lazy val log to def, and it works 
> as expected.
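
For reference, a rough sketch of the workaround described at the end of the 
quoted report (changing the @transient lazy val to a def). This is not the 
actual Logging trait, just the shape of the change, assuming slf4j as the 
underlying logger:

    import org.slf4j.{Logger, LoggerFactory}

    trait Logging {
      // Before (simplified): first access synchronizes on the enclosing
      // instance, which is what can deadlock against other locks held on it.
      // @transient lazy val log: Logger =
      //   LoggerFactory.getLogger(getClass.getName)

      // After (the reporter's workaround): a def takes no per-instance
      // monitor; the cost is a logger-factory lookup on every call.
      protected def log: Logger = LoggerFactory.getLogger(getClass.getName)

      protected def logInfo(msg: => String): Unit = {
        if (log.isInfoEnabled) log.info(msg)
      }
    }

slf4j implementations typically cache loggers by name, so the repeated lookup 
is cheap, though it is still more per-call work than a lazy val.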


