LIBPROCESS_IP didn't work for me, but I'm working off an older version
of Mesos (the one listed in these instructions:
https://github.com/mesos/spark/wiki/Running-spark-on-mesos), and I see
there was a recent fix in Mesos regarding LIBPROCESS_IP
(https://reviews.apache.org/r/4355/).

However, I found the culprit: /etc/hosts had an entry mapping the local
hostname to a loopback address:

# Added by cloud-init
127.0.1.1       ip-10-252-94-24.us-west-2.compute.internal ip-10-252-94-24

I removed it, and now everything works.  Thanks for your help!
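
For anyone hitting the same thing, here is a rough Python sketch of the check
that would have caught this. It is not part of Mesos; the function names
(loopback_host_entries, resolves_to_loopback) are made up for illustration.
It scans hosts-file text for lines that map a given hostname to a loopback
address, and separately asks the resolver what the hostname resolves to.

```python
import ipaddress
import socket


def loopback_host_entries(hosts_text, hostname):
    """Return (address, names) pairs from hosts-file text that map
    `hostname` to a loopback address such as 127.0.1.1."""
    matches = []
    for raw_line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        try:
            is_loopback = ipaddress.ip_address(address).is_loopback
        except ValueError:
            # Not a parseable IP address; skip the line.
            continue
        if is_loopback and hostname in names:
            matches.append((address, names))
    return matches


def resolves_to_loopback(hostname):
    """Return True if the resolver hands back a loopback IPv4 address
    for `hostname` (the gethostbyname behaviour described below)."""
    address = socket.gethostbyname(hostname)
    return ipaddress.ip_address(address).is_loopback
```

To try it on a slave, something like
loopback_host_entries(open("/etc/hosts").read(), socket.gethostname())
should list any offending lines, and resolves_to_loopback(socket.gethostname())
shows what the resolver actually returns.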

On Thu, Apr 19, 2012 at 9:56 PM, Matei Zaharia <[email protected]> wrote:
> Good point there. Maybe libprocess (our communication layer) is using the
> wrong address. I remember seeing that on Ubuntu -- if you try to call
> gethostbyname passing in the local hostname, you get back 127.0.1.1 instead
> of the external IP. Try setting the LIBPROCESS_IP environment variable on the
> slave to the "right" IP before you run mesos-slave.
>
> Matei
>
> On Apr 19, 2012, at 9:47 PM, Scott Smith wrote:
>
>> Well the logs say this:
>>
>> I0420 04:40:30.870983  8193 master.cpp:814] Attempting to register
>> slave 201204200437-0-162 at [email protected]:51851
>> I0420 04:40:30.871330  8193 master.cpp:1057] Master now considering a
>> slave at ip-10-252-94-24.us-west-2.compute.internal:51851 as active
>> I0420 04:40:30.871415  8193 master.cpp:1588] Adding slave
>> 201204200437-0-162 at ip-10-252-94-24.us-west-2.compute.internal with
>> cpus=1; mem=1024
>> I0420 04:40:30.871599  8193 simple_allocator.cpp:71] Added slave
>> 201204200437-0-162 with cpus=1; mem=1024
>> I0420 04:40:30.871680  8193 master.cpp:1143] Slave 201204200437-0-162
>> disconnected
>> I0420 04:40:30.871819  8193 simple_allocator.cpp:83] Removed slave
>> 201204200437-0-162
>>
>> tcpdump says this:
>>
>> POST /master/mesos.internal.RegisterSlaveMessage HTTP/1.0
>> User-Agent: libprocess/[email protected]:51851
>> Connection: Keep-Alive
>> Transfer-Encoding: chunked
>>
>> 87
>>
>> ..
>> *ip-10-252-94-24.us-west-2.compute.internal.*ip-10-252-94-24.us-west-2.compute.internal..
>> .cpus...              .......?..
>> .mem...               .......@ .?
>> 0
>>
>>
>>
>> So it looks like it's reporting both a valid hostname and a loopback
>> address.  Which will the master use?
>>
>> btw I have both machines in the same security group, and opened all
>> tcp inbound for the group to the group.
>>
>>
>> On Thu, Apr 19, 2012 at 9:42 PM, Matei Zaharia <[email protected]> wrote:
>>> What hostname and port does the slave report for itself (i.e. when the 
>>> master sees it connect, what message does it print)? It could be that the 
>>> master cannot connect back to that address. Maybe you need to open up 
>>> communication among machines in your EC2 security groups.
>>>
>>> Matei
>>>
>>> On Apr 19, 2012, at 9:10 PM, Scott Smith wrote:
>>>
>>>> Direct IP/port.  No zookeeper.
>>>> On Apr 19, 2012 7:35 PM, "John Sirois" <[email protected]> wrote:
>>>>
>>>>> How are your slaves connecting to the master?  Via zookeeper or via known
>>>>> hostname/ip ?
>>>>>
>>>>> On Thursday, April 19, 2012, Scott Smith wrote:
>>>>>
>>>>>> I'm trying to set up a cluster on ec2, but not using the canned
>>>>>> scripts/image.  I built the latest svn on Ubuntu 11.10 amd64, and copied
>>>>>> the build to a second node.  Both are c1.medium instances (not that it
>>>>>> should matter).  No other software is running (no hdfs, no hadoop, etc).
>>>>>>
>>>>>> The problem I have is the slave repeatedly (approx once per second)
>>>>>> connects, advertises its resources, gets added, and then disconnects.  No
>>>>>> reason is given for disconnecting.  There are no messages on the slave,
>>>>>> only 5 or 6 messages on the master.
>>>>>>
>>>>>> I'm not sure what the next diagnostic step should be; I was hoping
>>>>>> someone else ran into the same problem and could point out what I did
>>>>>> wrong.  Any advice?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> John Sirois
>>>>> 303-512-3301
>>>>>
>>>
>>
>>
>>
>> --
>>         Scott
>



-- 
        Scott
