Farm ID: 517

I just had a major catastrophe on my farm. I have 4 storage nodes that
run a Gluster filesystem for the other machines in my farm. It is
imperative that on HostInit each server mounts its appropriate EBS
volume so that the glusterfs daemon can start. I was very excited to
see the built-in Scalr feature that handles this; previously I was
using my own script that worked reasonably well.
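For context, my old HostInit script looked roughly like this (a minimal sketch, not my exact code: the role-name source, volume IDs, device, and mount point below are all placeholders):

```shell
#!/bin/sh
# Map a Scalr role name to its dedicated EBS volume ID.
# (Placeholder volume IDs, not my real ones.)
volume_for_role() {
  case "$1" in
    sto1-g2) echo "vol-11111111" ;;
    sto2-g2) echo "vol-22222222" ;;
    sto3-g2) echo "vol-33333333" ;;
    *)       echo "" ;;
  esac
}

# Hypothetical source of the role name; Scalr exposes this differently.
ROLE="$(cat /etc/scalr-role 2>/dev/null || echo unknown)"
VOL="$(volume_for_role "$ROLE")"
DEV=/dev/sdf
MNT=/mnt/gluster

if [ -n "$VOL" ]; then
  # Attach with the EC2 API tools, then wait for the device node
  # to appear before mounting and starting glusterfsd.
  INSTANCE_ID="$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"
  ec2-attach-volume "$VOL" -i "$INSTANCE_ID" -d "$DEV"
  while [ ! -e "$DEV" ]; do sleep 2; done
  mount "$DEV" "$MNT" && /etc/init.d/glusterfsd start
fi
```

The point is that each storage role gets exactly one well-known volume, which is why I care so much that the right volume is attached before glusterfsd starts.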

Today Scalr noticed that 3 of these nodes were down, so it created new
instances. However, when I logged in to see why none of my sites were
working, I went to the EBS Volumes page in Scalr and saw that the
volumes that are set to automatically mount for sto1-g2, sto2-g2, and
sto3-g2 were all listed as "Available". This means Scalr either failed
to mount or never tried to mount the appropriate volumes when new
instances of these roles came up after the old ones crashed. BTW, all
of these roles explicitly allow only 1 running instance at a time
because I need specific EBS volumes mounted to them.
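I double-checked the state from the command line too. A volume that is attached shows "in-use" rather than "available" in ec2-describe-volumes output; something like this hypothetical helper is all it takes to pull the status field out (I'm assuming the status is the sixth whitespace-separated column, as in the 2009 EC2 API tools output):

```shell
#!/bin/sh
# Extract the status column from one VOLUME line of
# ec2-describe-volumes output. (Hypothetical helper; column position
# is an assumption about the API tools' format.)
volume_status() {
  echo "$1" | awk '{ print $6 }'
}

# Example line in the shape the API tools print:
line="VOLUME vol-11111111 50 snap-aaaa us-east-1a available 2009-03-29T04:40:00+0000"
volume_status "$line"   # prints "available"
```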

Here is an example of what I found in my Event Log:

29-03-2009 04:43:15     INFO    Main Farm       i-00056169/trap-hostup.sh
10.251.199.116 UP. Scalr notified me that 10.251.199.116 of role base
(Custom role: sto1-g2) is up.

29-03-2009 04:42:09     INFO    Main Farm       i-59197d30/trap-hostdown.sh
10.251.75.181 DOWN: Scalr notified me that 10.251.75.181 of role base
(Custom role: sto1-g2, I'm first: 0) is down

29-03-2009 04:40:08     WARN    Main Farm       PollerProcess   Disaster: No
instances running in role sto1-g2!

29-03-2009 04:38:09     ERROR   Main Farm       PollerProcess   Failed to
retrieve LA on instance i-51e58138 for 20 minutes. terminating
instance. Try increasing 'Terminate instance if cannot retrieve it's
status' setting on sto1-g2 configuration tab.

and in the Scripting Log I have a bunch of these:

2009-03-26 18:12:47     OnHostUp        Main Farm       i-8f42d9e6      Script '/usr/local/bin/scalr-scripting.Gx28149/EBS_Mount' execution result (Execution time: 7 seconds).
stdout: MY ROLE: sto1-g2
My INSTANCE: i-8f42d9e6
Volume is already attached!

2009-03-26 15:42:59     OnHostUp        Main Farm       i-8f42d9e6      Script '/usr/local/bin/scalr-scripting.fn24850/EBS_Mount' execution result (Execution time: 8 seconds).
stdout: MY ROLE: sto1-g2
My INSTANCE: i-8f42d9e6
Volume is already attached!

2009-03-26 15:42:37     OnHostUp        Main Farm       i-8f42d9e6      Script '/usr/local/bin/scalr-scripting.tl24436/EBS_Mount' execution result (Execution time: 9 seconds).
stdout: MY ROLE: sto1-g2
My INSTANCE: i-8f42d9e6
Volume is already attached!
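That "Volume is already attached!" line comes from an idempotency guard in my old script, something along these lines (a sketch; the device path is a placeholder):

```shell
#!/bin/sh
# Skip the attach step if the expected device node already exists.
# (Sketch of the guard; /dev/sdf is a placeholder device path.)
already_attached() {
  [ -e "$1" ]
}

if already_attached /dev/sdf; then
  echo "Volume is already attached!"
else
  echo "volume not attached; would attach and mount here"
fi
```

So seeing that message in the Scripting Log means my old EBS_Mount script was still being executed on OnHostUp, even though (as I explain below) it is no longer enabled anywhere in my farm configuration.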

I then thought to myself: great, I forgot to turn off the old OnHostUp
EBS_Mount script and it is causing a conflict. Well, after visiting my
Farm Edit page I found that this was NOT the case: the EBS_Mount
script is not checked for any event on any role. My guess is that I
have stumbled on some kind of scripting cache bug in Scalr, and the
side effect is that my instances are unable to reattach their EBS
volumes using the new feature.