Well, the logs say this:

I0420 04:40:30.870983 8193 master.cpp:814] Attempting to register slave 201204200437-0-162 at [email protected]:51851
I0420 04:40:30.871330 8193 master.cpp:1057] Master now considering a slave at ip-10-252-94-24.us-west-2.compute.internal:51851 as active
I0420 04:40:30.871415 8193 master.cpp:1588] Adding slave 201204200437-0-162 at ip-10-252-94-24.us-west-2.compute.internal with cpus=1; mem=1024
I0420 04:40:30.871599 8193 simple_allocator.cpp:71] Added slave 201204200437-0-162 with cpus=1; mem=1024
I0420 04:40:30.871680 8193 master.cpp:1143] Slave 201204200437-0-162 disconnected
I0420 04:40:30.871819 8193 simple_allocator.cpp:83] Removed slave 201204200437-0-162
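In the excerpt above, the slave is added and then disconnected within well under a millisecond. To confirm that this pattern repeats (the thread describes roughly one reconnect per second), it can help to measure the gap between "Adding slave" and "disconnected" across a longer master log. The sketch below is a hypothetical helper, not part of Mesos. It assumes the glog line prefix shown above (severity, MMDD, time, thread id, `file:line]`) and slave IDs shaped like `201204200437-0-162`; the function names are illustrative only.

```python
import re
from datetime import timedelta

# Assumed glog prefix: severity, MMDD, hh:mm:ss.micros, thread id, file:line]
GLOG_LINE = re.compile(
    r"^([IWEF])(\d{4}) (\d{2}):(\d{2}):(\d{2})\.(\d{6})\s+(\d+) ([\w.]+):(\d+)\] (.*)$"
)
# Assumed slave ID shape, e.g. 201204200437-0-162
SLAVE_ID = re.compile(r"\b(\d{12}-\d+-\d+)\b")


def parse(line):
    """Return (time_of_day, source_file, message), or None for non-glog lines."""
    m = GLOG_LINE.match(line.strip())
    if not m:
        return None
    _sev, _mmdd, hh, mm, ss, us, _tid, src, _ln, msg = m.groups()
    # The MMDD date is ignored, so gaps that span midnight come out wrong.
    t = timedelta(hours=int(hh), minutes=int(mm), seconds=int(ss), microseconds=int(us))
    return t, src, msg


def connection_lifetimes(lines):
    """Map slave ID -> list of seconds between 'Adding slave' and 'disconnected'."""
    added = {}
    lifetimes = {}
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        t, _src, msg = parsed
        sid_match = SLAVE_ID.search(msg)
        if sid_match is None:
            continue
        sid = sid_match.group(1)
        if msg.startswith("Adding slave"):
            added[sid] = t
        elif msg.endswith("disconnected") and sid in added:
            gap = t - added.pop(sid)
            lifetimes.setdefault(sid, []).append(gap.total_seconds())
    return lifetimes
```

Fed the "Adding slave" and "disconnected" lines from the excerpt above, it reports one lifetime of 265 microseconds for slave 201204200437-0-162. Many short entries for the same ID in a longer log would match the reconnect loop described in the original report.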
tcpdump says this:

POST /master/mesos.internal.RegisterSlaveMessage HTTP/1.0
User-Agent: libprocess/[email protected]:51851
Connection: Keep-Alive
Transfer-Encoding: chunked

87
.. *ip-10-252-94-24.us-west-2.compute.internal.*ip-10-252-94-24.us-west-2.compute.internal.. .cpus... .......?.. .mem... .......@ .?
0

So it looks like it's reporting both a valid hostname and a loopback address. Which one will the master use?

BTW, I have both machines in the same security group, and I opened all inbound TCP traffic from the group to the group.

On Thu, Apr 19, 2012 at 9:42 PM, Matei Zaharia <[email protected]> wrote:
> What hostname and port does the slave report for itself (i.e. when the master
> sees it connect, what message does it print)? It could be that the master
> cannot connect back to that address. Maybe you need to open up communication
> among machines in your EC2 security groups.
>
> Matei
>
> On Apr 19, 2012, at 9:10 PM, Scott Smith wrote:
>
>> Direct IP/port. No zookeeper.
>> On Apr 19, 2012 7:35 PM, "John Sirois" <[email protected]> wrote:
>>
>>> How are your slaves connecting to the master? Via zookeeper or via known
>>> hostname/ip?
>>>
>>> On Thursday, April 19, 2012, Scott Smith wrote:
>>>
>>>> I'm trying to set up a cluster on ec2, but not using the canned
>>>> scripts/image. I built the latest svn on Ubuntu 11.10 amd64, and copied
>>>> the build to a second node. Both are c1.medium instances (not that it
>>>> should matter). No other software is running (no hdfs, no hadoop, etc).
>>>>
>>>> The problem I have is the slave repeatedly (approx once per second)
>>>> connects, advertises its resources, gets added, and then disconnects. No
>>>> reason is given for disconnecting. There are no messages on the slave,
>>>> only 5 or 6 messages on the master.
>>>>
>>>> I'm not sure what the next diagnostic step should be; I was hoping someone
>>>> else ran into the same problem and could point out what I did wrong. Any
>>>> advice?
>>>>
>>>> Thanks!
>>>>
>>>
>>>
>>> --
>>> John Sirois
>>> 303-512-3301
>>>
>

-- 
Scott
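Matei's theory is that the master cannot connect back to the address the slave reports, and the tcpdump suggests that a loopback address is involved. Both points can be checked with a small Python script using only the standard library (`socket`, `ipaddress`). It rests on an assumption that this thread does not confirm: when no IP is configured explicitly, the slave's libprocess picks the address it advertises by resolving the machine's own hostname. Under that assumption, a hostname that resolves only to 127.x.x.x would yield an address the master cannot reach. The script name `check_addr.py` and all function names are hypothetical.

```python
import ipaddress
import socket
import sys


def resolved_addresses(hostname):
    """Return the sorted IPv4 addresses the local resolver gives for hostname."""
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return []  # the name does not resolve at all
    return sorted({info[4][0] for info in infos})


def loopback_only(addresses):
    """True if there is at least one address and every one is in 127.0.0.0/8."""
    return bool(addresses) and all(
        ipaddress.ip_address(a).is_loopback for a in addresses
    )


def can_connect(host, port, timeout=2.0):
    """Attempt a plain TCP connect, as a remote peer would when connecting back."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or unresolvable
        return False


if __name__ == "__main__":
    name = socket.gethostname()
    addrs = resolved_addresses(name)
    print(f"{name} resolves to {addrs or 'nothing'}")
    if loopback_only(addrs):
        print("warning: hostname resolves only to loopback; other hosts cannot use it")
    if len(sys.argv) == 3:
        host, port = sys.argv[1], int(sys.argv[2])
        ok = can_connect(host, port)
        print(f"TCP connect to {host}:{port}: {'ok' if ok else 'failed'}")
```

Run on the slave with no arguments, it prints what the local hostname resolves to. Run on the master with the slave's hostname and port from the logs (for example `python3 check_addr.py ip-10-252-94-24.us-west-2.compute.internal 51851`), it also tests whether a plain TCP connection gets through the security group. A successful connect only shows that the port is reachable. It does not prove that the slave's libprocess will accept the master's messages.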
