Farm ID: 517

I just had a major catastrophe on my farm. I have 4 storage nodes that run a Gluster filesystem for the other machines in my farm. It is imperative that on HostInit each server mounts its appropriate EBS volume so that the glusterfs daemon can start. I was very excited to see the built-in Scalr feature that handles this; previously I was using my own script that worked reasonably well.
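For context, the kind of HostInit hook I mean does roughly the following. This is a minimal sketch, not my actual script: the volume ID, device, and mount point are placeholders, and it assumes the EC2 API tools (`ec2-attach-volume`) are installed and configured on the instance.

```shell
#!/bin/sh
# Sketch of a HostInit hook: attach this role's EBS volume, wait for the
# device node, then mount it so glusterfsd has its data at boot.
# VOL_ID, DEVICE, and MNT below are placeholders, not real values.

attach_and_mount() {
  VOL_ID="$1"; DEVICE="$2"; MNT="$3"

  # Ask the EC2 metadata service who we are.
  INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

  # EC2 API tools call; exits non-zero if the volume is already attached,
  # which we tolerate here.
  ec2-attach-volume "$VOL_ID" -i "$INSTANCE_ID" -d "$DEVICE" || true

  # Wait (up to ~60s) for the kernel to create the block device.
  i=0
  while [ ! -b "$DEVICE" ] && [ "$i" -lt 30 ]; do
    sleep 2
    i=$((i + 1))
  done

  mkdir -p "$MNT"
  mount "$DEVICE" "$MNT"
}

# On HostInit for role sto1-g2, for example:
# attach_and_mount vol-00000000 /dev/sdf /export/gluster
```

The point is that the attach must complete and the device must exist before glusterfsd starts, which is exactly what the built-in feature is supposed to take care of.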
Today Scalr noticed that 3 of these nodes were down, so it created new instances. However, when I logged in to see why none of my sites were working, I went to the EBS Volumes page in Scalr and saw that the volumes that are set to automatically mount for sto1-g2, sto2-g2, and sto3-g2 were all listed as "Available". This means Scalr was unable to mount, or didn't try to mount, the appropriate volumes when new instances of these roles came up after the old ones crashed. BTW, all these roles explicitly allow only 1 running instance at a time, because I need specific EBS volumes mounted to them.

Here is an example of what I found in my Event Log:

29-03-2009 04:43:15 INFO Main Farm i-00056169/trap-hostup.sh 10.251.199.116 UP. Scalr notified me that 10.251.199.116 of role base (Custom role: sto1-g2) is up.
29-03-2009 04:42:09 INFO Main Farm i-59197d30/trap-hostdown.sh 10.251.75.181 DOWN: Scalr notified me that 10.251.75.181 of role base (Custom role: sto1-g2, I'm first: 0) is down
29-03-2009 04:40:08 WARN Main Farm PollerProcess Disaster: No instances running in role sto1-g2!
29-03-2009 04:38:09 ERROR Main Farm PollerProcess Failed to retrieve LA on instance i-51e58138 for 20 minutes. terminating instance. Try increasing 'Terminate instance if cannot retrieve it's status' setting on sto1-g2 configuration tab.

And in the Scripting Log I have a bunch of these:

2009-03-26 18:12:47 OnHostUp Main Farm i-8f42d9e6 Script '/usr/local/bin/scalr-scripting.Gx28149/EBS_Mount' execution result (Execution time: 7 seconds). stdout: MY ROLE: sto1-g2 My INSTANCE: i-8f42d9e6 Volume is already attached!
2009-03-26 15:42:59 OnHostUp Main Farm i-8f42d9e6 Script '/usr/local/bin/scalr-scripting.fn24850/EBS_Mount' execution result (Execution time: 8 seconds). stdout: MY ROLE: sto1-g2 My INSTANCE: i-8f42d9e6 Volume is already attached!
2009-03-26 15:42:37 OnHostUp Main Farm i-8f42d9e6 Script '/usr/local/bin/scalr-scripting.tl24436/EBS_Mount' execution result (Execution time: 9 seconds).
stdout: MY ROLE: sto1-g2 My INSTANCE: i-8f42d9e6 Volume is already attached!

I then thought to myself: great, I forgot to turn off the old OnHostUp EBS_Mount script and it is causing a conflict. Well, after visiting my Farm Edit page I found that this was NOT the case. The EBS_Mount script is not checked for any event on any role. I am guessing that I just stumbled on some type of scripting cache bug in Scalr, and the side effect is that my instances are not able to reattach their EBS volumes using the new feature.

--
You received this message because you are subscribed to the Google Groups "scalr-discuss" group.
For more options, visit this group at http://groups.google.com/group/scalr-discuss?hl=en
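One more data point on that "Volume is already attached!" output: the volume was actually listed as "Available" at the same time, so whatever the script checks must be keying off something stale (an old marker or device node) rather than the volume's real state. A correct check has to branch on the status EC2 actually reports. A sketch of that decision, as a hypothetical helper rather than the actual Scalr script, where "available" and "in-use" are the standard EBS volume states:

```shell
#!/bin/sh
# Decide what an EBS_Mount-style script should do, given the volume status
# reported by EC2 and the instance it is currently attached to (if any).
# Hypothetical helper for illustration; the status would come from
# ec2-describe-volumes output, not from any local state.
ebs_action() {
  status="$1"; attached_to="$2"; me="$3"
  case "$status" in
    available)
      # Volume is free: safe to attach it to myself.
      echo attach ;;
    in-use)
      if [ "$attached_to" = "$me" ]; then
        # Already attached to me: just mount it.
        echo mount
      else
        # Still attached to a dead/old instance: must detach first.
        echo conflict
      fi ;;
    *)
      # attaching/detaching/etc.: poll again shortly.
      echo wait ;;
  esac
}
```

In my failure above, the volume was "available" but the script printed "Volume is already attached!", i.e. it took the "mount" branch when it should have taken the "attach" one.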
