[https://issues.apache.org/jira/browse/SPARK-12267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15052085#comment-15052085]
Marcelo Vanzin commented on SPARK-12267:
----------------------------------------
Hi [~zsxwing], [~andrewor14],
I uploaded my code to https://github.com/vanzin/spark/tree/SPARK-12267. I
didn't file a PR because I'm going to be out tomorrow and can't really follow
up on it. But could one of you run the following test using Shixiong's PR and
see whether it works? His code is simpler, and if it fixes the problems, we
should use that for 1.6 instead of my change.
You'll need 4 terminals (adjust addresses as needed):
# run {{./bin/spark-class org.apache.spark.deploy.master.Master -p 7077
--webui-port 8080 --properties-file /tmp/ha.conf}}
# run {{./bin/spark-class org.apache.spark.deploy.master.Master -p 17077
--webui-port 18080 --properties-file /tmp/ha.conf}}
# run {{./bin/spark-class org.apache.spark.deploy.worker.Worker
spark://<ip>:7077,<ip>:17077}}
# run {{./bin/spark-shell --master spark://<ip>:7077,<ip>:17077}}
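For convenience, the four terminals above can be sketched as a single launcher script. {{SPARK_HOME}}, {{MASTER_IP}}, and {{HA_CONF}} are my assumptions here, not anything from the PRs; the script only prints the commands so they're easy to review (drop the {{echo}} and run each one in its own terminal for real):

```shell
#!/bin/sh
# Sketch of the HA test setup. Adjust the defaults to your environment.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"
MASTER_IP="${MASTER_IP:-127.0.0.1}"
HA_CONF="${HA_CONF:-/tmp/ha.conf}"

# Worker and shell point at both masters so failover can be exercised.
MASTERS="spark://${MASTER_IP}:7077,${MASTER_IP}:17077"

# Terminal 1 and 2: two masters sharing the same ZooKeeper config.
echo "$SPARK_HOME/bin/spark-class org.apache.spark.deploy.master.Master -p 7077 --webui-port 8080 --properties-file $HA_CONF"
echo "$SPARK_HOME/bin/spark-class org.apache.spark.deploy.master.Master -p 17077 --webui-port 18080 --properties-file $HA_CONF"
# Terminal 3: one worker registered with both masters.
echo "$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker $MASTERS"
# Terminal 4: the app.
echo "$SPARK_HOME/bin/spark-shell --master $MASTERS"
```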
My ha.conf:
{code}
spark.deploy.recoveryMode=ZOOKEEPER
spark.deploy.zookeeper.url=<zookeeper_host>:2181
spark.deploy.zookeeper.dir=/spark
{code}
Once the shell is up and working, kill the active Master (most likely the one
in terminal 1). Wait for the timeout and make sure the app registers with the
other master, and that that master's UI works. Then kill the worker and make
sure it goes away immediately in the active master's logs.
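Watching the master logs by hand for these transitions is tedious; a small polling helper (my own sketch, not anything in Spark) can block until a given line shows up. The log path and pattern in the example comment are illustrative and depend on how you launched the masters:

```shell
# wait_for_line FILE PATTERN TIMEOUT_SECONDS
# Polls FILE once per second until a line matching PATTERN appears,
# or gives up after TIMEOUT_SECONDS. Returns 0 on a match, 1 on timeout.
wait_for_line() {
  file=$1; pattern=$2; timeout=$3; elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    grep -q "$pattern" "$file" 2>/dev/null && return 0
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}

# Example (path and pattern are assumptions, adjust to your setup):
#   wait_for_line "$SPARK_HOME"/logs/spark-*-Master-*.out 'Registering app' 90
```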
We really need a unit test for this. As far as I can tell, none of the worker /
app re-registration code in Master.scala is exercised by unit tests.
> Standalone master keeps references to disassociated workers until they sent
> no heartbeats
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-12267
> URL: https://issues.apache.org/jira/browse/SPARK-12267
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.0
> Reporter: Jacek Laskowski
>
> While toying with Spark Standalone I've noticed the following messages
> in the logs of the master:
> {code}
> INFO Master: Registering worker 192.168.1.6:59919 with 2 cores, 2.0 GB RAM
> INFO Master: localhost:59920 got disassociated, removing it.
> ...
> WARN Master: Removing worker-20151210090708-192.168.1.6-59919 because
> we got no heartbeat in 60 seconds
> INFO Master: Removing worker worker-20151210090708-192.168.1.6-59919
> on 192.168.1.6:59919
> {code}
> Why does the message "WARN Master: Removing
> worker-20151210090708-192.168.1.6-59919 because we got no heartbeat in
> 60 seconds" appear when the worker should've been removed already (as
> pointed out in "INFO Master: localhost:59920 got disassociated,
> removing it.")?
> Could it be that the ids are different - 192.168.1.6:59919 vs localhost:59920?
> I started the master using {{./sbin/start-master.sh -h localhost}} and the
> workers using {{./sbin/start-slave.sh spark://localhost:7077}}.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)