LIBPROCESS_IP didn't work for me, but I'm working off an older version of Mesos (the one listed in these instructions: https://github.com/mesos/spark/wiki/Running-spark-on-mesos), and I see there was a recent fix in Mesos regarding LIBPROCESS_IP (https://reviews.apache.org/r/4355/).
However, I found the culprit: /etc/hosts had an entry for the local ip:

# Added by cloud-init
127.0.1.1 ip-10-252-94-24.us-west-2.compute.internal ip-10-252-94-24

I removed it, and now everything works. Thanks for your help!

On Thu, Apr 19, 2012 at 9:56 PM, Matei Zaharia <[email protected]> wrote:
> Good point there. Maybe libprocess (our communication layer) is using the
> wrong address. I remember seeing that on ubuntu -- if you try to call
> gethostbyname passing in the local hostname, you get back 127.0.1.1 instead
> of the external IP. Try setting the LIBPROCESS_IP environment variable on the
> slave to the "right" IP before you run mesos-slave.
>
> Matei
>
> On Apr 19, 2012, at 9:47 PM, Scott Smith wrote:
>
>> Well the logs say this:
>>
>> I0420 04:40:30.870983 8193 master.cpp:814] Attempting to register
>> slave 201204200437-0-162 at [email protected]:51851
>> I0420 04:40:30.871330 8193 master.cpp:1057] Master now considering a
>> slave at ip-10-252-94-24.us-west-2.compute.internal:51851 as active
>> I0420 04:40:30.871415 8193 master.cpp:1588] Adding slave
>> 201204200437-0-162 at ip-10-252-94-24.us-west-2.compute.internal with
>> cpus=1; mem=1024
>> I0420 04:40:30.871599 8193 simple_allocator.cpp:71] Added slave
>> 201204200437-0-162 with cpus=1; mem=1024
>> I0420 04:40:30.871680 8193 master.cpp:1143] Slave 201204200437-0-162
>> disconnected
>> I0420 04:40:30.871819 8193 simple_allocator.cpp:83] Removed slave
>> 201204200437-0-162
>>
>> tcp dump says this:
>>
>> POST /master/mesos.internal.RegisterSlaveMessage HTTP/1.0
>> User-Agent: libprocess/[email protected]:51851
>> Connection: Keep-Alive
>> Transfer-Encoding: chunked
>>
>> 87
>>
>> ..
>> *ip-10-252-94-24.us-west-2.compute.internal.*ip-10-252-94-24.us-west-2.compute.internal..
>> .cpus... .......?..
>> .mem... .......@ .?
>> 0
>>
>> so it looks like its reporting both a valid hostname and a loopback
>> addr. Which will the master use?
>>
>> btw I have both machines in the same security group, and opened all
>> tcp inbound for the group to the group.
>>
>>
>> On Thu, Apr 19, 2012 at 9:42 PM, Matei Zaharia <[email protected]>
>> wrote:
>>> What hostname and port does the slave report for itself (i.e. when the
>>> master sees it connect, what message does it print)? It could be that the
>>> master cannot connect back to that address. Maybe you need to open up
>>> communication among machines in your EC2 security groups.
>>>
>>> Matei
>>>
>>> On Apr 19, 2012, at 9:10 PM, Scott Smith wrote:
>>>
>>>> Direct IP/port. No zookeeper.
>>>> On Apr 19, 2012 7:35 PM, "John Sirois" <[email protected]> wrote:
>>>>
>>>>> How are your slaves connecting to the master? Via zookeeper or via known
>>>>> hostname/ip ?
>>>>>
>>>>> On Thursday, April 19, 2012, Scott Smith wrote:
>>>>>
>>>>>> I'm trying to set up a cluster on ec2, but not using the canned
>>>>>> scripts/image. I built the latest svn on Ubuntu 11.10 amd64, and copied
>>>>>> the build to a second node. Both are c1.medium instances (not that it
>>>>>> should matter). No other software is running (no hdfs, no hadoop, etc).
>>>>>>
>>>>>> The problem I have is the slave repeatedly (approx once per second)
>>>>>> connects, advertises its resources, gets added, and then disconnects. No
>>>>>> reason is given for disconnecting. There are no messages on the slave,
>>>>>> only 5 or 6 messages on the master.
>>>>>>
>>>>>> I'm not sure what the next diagnostic step should be; I was hoping someone
>>>>>> else ran into the same problem and could point out what I did wrong. Any
>>>>>> advice?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> John Sirois
>>>>> 303-512-3301
>>>>>
>>>
>>
>>
>> --
>> Scott
>

--
Scott
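
For anyone hitting the same symptom, here is a minimal Python sketch of the check described in this thread. It scans /etc/hosts-style text for loopback addresses mapped to a given hostname, which is the kind of entry that was removed above. The function name loopback_aliases is hypothetical and not part of Mesos; the sample entry in the usage note is the one quoted from /etc/hosts in this thread.

```python
import ipaddress
import socket


def loopback_aliases(hosts_text, hostname):
    """Return the loopback IPs that /etc/hosts-style text maps to hostname.

    Hypothetical helper for illustration; not part of Mesos or libprocess.
    """
    hits = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        try:
            addr = ipaddress.ip_address(ip)
        except ValueError:
            continue  # skip lines that do not start with a valid IP
        if addr.is_loopback and hostname in names:
            hits.append(ip)
    return hits


if __name__ == "__main__":
    # Show what the local hostname resolves to on this machine.
    name = socket.gethostname()
    try:
        print(name, "resolves to", socket.gethostbyname(name))
    except OSError as exc:
        print("could not resolve", name, ":", exc)
```

Calling loopback_aliases with the contents of /etc/hosts and the short hostname (here ip-10-252-94-24) returns a non-empty list when the hostname is pinned to a loopback address such as 127.0.1.1. In that case the slave may advertise an address the master cannot connect back to.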
