What does your master log say? Normally the master should NEVER shut
down -- you should be able to spark-submit indefinitely with no issues. So
the high variance at startup is one issue, but the other thing that's
puzzling to me is why your master is ever down to begin with (assuming
you're not manually restarting it and I missed that part).
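
A quick way to pull up the master log, assuming the default standalone
layout under $SPARK_HOME/logs (the path here is a guess -- adjust it to
your install):

```shell
# Sketch, assuming the default standalone log layout; SPARK_HOME default is a guess.
SPARK_HOME="${SPARK_HOME:-/usr/local/spark}"
# The standalone master writes to spark-<user>-org.apache.spark.deploy.master.Master-*.out
logs=$(ls "$SPARK_HOME"/logs/spark-*-org.apache.spark.deploy.master.Master-*.out 2>/dev/null || true)
if [ -n "$logs" ]; then
  tail -n 100 $logs
else
  echo "no master log found under $SPARK_HOME/logs"
fi
```

If the master died, the tail should show either a clean shutdown message or
an abrupt cutoff -- the latter points at something external killing it.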

One thing that does occur to me, perhaps since you're running on the same
box -- is it possible that the OS's OOM killer is killing your process?
We had that happen with a long-running driver a few times. It's a nasty one
because it leaves no trace in the Spark logs; you have to know to look at
the system logs.
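
On Linux you can check for OOM-killer activity like this (log locations
vary by distro; the pattern below matches the usual kernel messages):

```shell
# Sketch: the kernel OOM killer logs to dmesg/syslog, not to Spark's own logs.
pattern='killed process|out of memory'
dmesg 2>/dev/null | grep -iE "$pattern" || true
grep -ihE "$pattern" /var/log/syslog /var/log/messages 2>/dev/null \
  || echo "no OOM-killer entries found in syslog"
```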

If you're running spark-submit by hand, you can verify via the web UI that
your master is up and has a worker connected prior to submitting. If it's
not, investigate why it went down.
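
A quick scriptable check before submitting (7077 is the default master RPC
port; the host here is just a placeholder):

```shell
# Sketch: confirm something is listening on the master port before spark-submit.
MASTER_HOST="${MASTER_HOST:-localhost}"
if nc -z "$MASTER_HOST" 7077 2>/dev/null; then
  echo "master is listening on $MASTER_HOST:7077"
else
  echo "nothing listening on $MASTER_HOST:7077 -- check the master log first"
fi
```

The master web UI (default http://<host>:8080) also shows whether a worker
has registered.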

Lastly, a random thought -- your "Memory per Node" shows up as 512.0 MB,
which seems really low on the executor side. I don't know how much memory
you have on that machine, but normally the executors do all the work -- I'd
try to give them a few GB if you can. Your worker shows 15 GB of memory;
I'd give your executor at least 4 GB.
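
For example, you could pass --executor-memory 4g on the spark-submit
command line, or set it once in spark-defaults.conf (the values here are
just a starting point for a 15G worker):

```
# conf/spark-defaults.conf (sketch; tune to your machine)
spark.executor.memory   4g
spark.driver.memory     2g
```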

On Wed, May 27, 2015 at 4:06 PM, Stephen Boesch <java...@gmail.com> wrote:

> Here is example after git clone-ing latest 1.4.0-SNAPSHOT.  The first 3
> runs (FINISHED) were successful and connected quickly.  Fourth run (ALIVE)
> is failing on connection/association.
>
>
> URL: spark://mellyrn.local:7077
> REST URL: spark://mellyrn.local:6066 (cluster mode)
> Workers: 1
> Cores: 8 Total, 0 Used
> Memory: 15.0 GB Total, 0.0 B Used
> Applications: 0 Running, 3 Completed
> Drivers: 0 Running, 0 Completed
> Status: ALIVE
> Workers
>
> Worker Id Address State Cores Memory
> worker-20150527122155-10.0.0.3-60847 10.0.0.3:60847 ALIVE 8 (0 Used) 15.0 GB (0.0 B Used)
> Running Applications
>
> Application ID Name Cores Memory per Node Submitted Time User State Duration
> Completed Applications
>
> Application ID Name Cores Memory per Node Submitted Time User State Duration
> app-20150527125945-0002 TestRunner: power-iteration-clustering 8 512.0 MB 2015/05/27 12:59:45 steve FINISHED 7 s
> app-20150527124403-0001 TestRunner: power-iteration-clustering 8 512.0 MB 2015/05/27 12:44:03 steve FINISHED 6 s
> app-20150527123822-0000 TestRunner: power-iteration-clustering 8 512.0 MB 2015/05/27 12:38:22 steve FINISHED 6 s
>
>
>
> 2015-05-27 11:42 GMT-07:00 Stephen Boesch <java...@gmail.com>:
>
> Thanks Yana,
>>
>>    My current experience here: after running some small spark-submit-based
>> tests, the Master once again stopped being reachable. No change in the
>> test setup. I restarted the Master/Worker and it is still not reachable.
>>
>> What might be the variables here that cause association with the
>> Master/Worker to stop succeeding?
>>
>> For reference, here are the Master/Worker processes:
>>
>>
>>   501 34465     1   0 11:35AM ??         0:06.50
>> /Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home/bin/java
>> -cp <classpath..> -Xms512m -Xmx512m -XX:MaxPermSize=128m
>> org.apache.spark.deploy.worker.Worker spark://mellyrn.local:7077
>>   501 34361     1   0 11:35AM ttys018    0:07.08
>> /Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home/bin/java
>> -cp <classpath..>  -Xms512m -Xmx512m -XX:MaxPermSize=128m
>> org.apache.spark.deploy.master.Master --ip mellyrn.local --port 7077
>> --webui-port 8080
>>
>>
>> 15/05/27 11:36:37 INFO SparkUI: Started SparkUI at
>> http://25.101.19.24:4040
>> 15/05/27 11:36:37 INFO SparkContext: Added JAR
>> file:/shared/spark-perf/mllib-tests/target/mllib-perf-tests-assembly.jar at
>> http://25.101.19.24:60329/jars/mllib-perf-tests-assembly.jar with
>> timestamp 1432751797662
>> 15/05/27 11:36:37 INFO AppClient$ClientActor: Connecting to master
>> akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
>> 15/05/27 11:36:37 WARN AppClient$ClientActor: Could not connect to
>> akka.tcp://sparkMaster@mellyrn.local:7077:
>> akka.remote.InvalidAssociation: Invalid address:
>> akka.tcp://sparkMaster@mellyrn.local:7077
>> 15/05/27 11:36:37 WARN Remoting: Tried to associate with unreachable
>> remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is
>> now gated for 5000 ms, all messages to this address will be delivered to
>> dead letters. Reason: Connection refused: mellyrn.local/25.101.19.24:7077
>> 15/05/27 11:36:57 INFO AppClient$ClientActor: Connecting to master
>> akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
>> 15/05/27 11:36:57 WARN AppClient$ClientActor: Could not connect to
>> akka.tcp://sparkMaster@mellyrn.local:7077:
>> akka.remote.InvalidAssociation: Invalid address:
>> akka.tcp://sparkMaster@mellyrn.local:7077
>> 15/05/27 11:36:57 WARN Remoting: Tried to associate with unreachable
>> remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is
>> now gated for 5000 ms, all messages to this address will be delivered to
>> dead letters. Reason: Connection refused: mellyrn.local/25.101.19.24:7077
>> 15/05/27 11:37:17 INFO AppClient$ClientActor: Connecting to master
>> akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
>> 15/05/27 11:37:17 WARN AppClient$ClientActor: Could not connect to
>> akka.tcp://sparkMaster@mellyrn.local:7077:
>> akka.remote.InvalidAssociation: Invalid address:
>> akka.tcp://sparkMaster@mellyrn.local:7077
>> 15/05/27 11:37:17 WARN Remoting: Tried to associate with unreachable
>> remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is
>> now gated for 5000 ms, all messages to this address will be delivered to
>> dead letters. Reason: Connection refused: mellyrn.local/25.101.19.24:7077
>> 15/05/27 11:37:37 ERROR SparkDeploySchedulerBackend: Application has been
>> killed. Reason: All masters are unresponsive! Giving up.
>> 15/05/27 11:37:37 WARN SparkDeploySchedulerBackend: Application ID is not
>> initialized yet.
>>
>>
>> Even when successful, the time for the Master to come up has a
>> surprisingly high variance. I am running on a single machine with plenty
>> of RAM. (Tight RAM was a problem in an earlier series of tests -- the
>> failure modes can be unpredictable then. But RAM is not an issue now:
>> there is plenty available for both Master and Worker.)
>>
>> Within the same hour, starting and stopping maybe a dozen times, the
>> startup time for the Master ranged from a few seconds up to several
>> minutes.
>>
>> 2015-05-20 7:39 GMT-07:00 Yana Kadiyska <yana.kadiy...@gmail.com>:
>>
>> But if I'm reading his email correctly he's saying that:
>>>
>>> 1. The master and slave are on the same box (so network hiccups are
>>> unlikely culprit)
>>> 2. The failures are intermittent -- i.e program works for a while then
>>> worker gets disassociated...
>>>
>>> Is it possible that the master restarted? We used to have problems like
>>> this where we'd restart the master process; it wouldn't be listening on
>>> 7077 for some time, but the worker process kept trying to connect, and by
>>> the time the master was up the worker had given up...
>>>
>>>
>>> On Wed, May 20, 2015 at 5:16 AM, Evo Eftimov <evo.efti...@isecc.com>
>>> wrote:
>>>
>>>> Check whether the name can be resolved in the /etc/hosts file (or DNS)
>>>> of the worker
>>>>
>>>>
>>>>
>>>> (the same btw applies for the Node where you run the driver app – all
>>>> other nodes must be able to resolve its name)
>>>>
>>>>
>>>>
>>>> *From:* Stephen Boesch [mailto:java...@gmail.com]
>>>> *Sent:* Wednesday, May 20, 2015 10:07 AM
>>>> *To:* user
>>>> *Subject:* Intermittent difficulties for Worker to contact Master on
>>>> same machine in standalone
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> What conditions would cause the following delays / failure for a
>>>> standalone machine/cluster to have the Worker contact the Master?
>>>>
>>>>
>>>>
>>>> 15/05/20 02:02:53 INFO WorkerWebUI: Started WorkerWebUI at
>>>> http://10.0.0.3:8081
>>>>
>>>> 15/05/20 02:02:53 INFO Worker: Connecting to master
>>>> akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
>>>>
>>>> 15/05/20 02:02:53 WARN Remoting: Tried to associate with unreachable
>>>> remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is
>>>> now gated for 5000 ms, all messages to this address will be delivered to
>>>> dead letters. Reason: Connection refused: mellyrn.local/10.0.0.3:7077
>>>>
>>>> 15/05/20 02:03:04 INFO Worker: Retrying connection to master (attempt #
>>>> 1)
>>>>
>>>> ..
>>>>
>>>> ..
>>>>
>>>> 15/05/20 02:03:26 INFO Worker: Retrying connection to master (attempt #
>>>> 3)
>>>>
>>>> 15/05/20 02:03:26 INFO Worker: Connecting to master
>>>> akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
>>>>
>>>> 15/05/20 02:03:26 WARN Remoting: Tried to associate with unreachable
>>>> remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is
>>>> now gated for 5000 ms, all messages to this address will be delivered to
>>>> dead letters. Reason: Connection refused: mellyrn.local/10.0.0.3:7077
>>>>
>>>
>>>
>>
>
