As mentioned previously, I'm still having a problem: when I ask
Scalr to launch one additional instance for a role (for which there is
1 instance running and min=max=1), it can spin up 2 or more
additional ones (for example, it launched 8 today) before deciding,
after ~45 mins, to terminate all but the one additional instance
requested.

Having performed several runs with additional debug logging inserted
and viewing the logs, I have a theory as to what is happening.
However, I'm not 100% certain (and reading the logs is a little
confusing, as the entries often don't appear in timestamp order).

Here is the sequence of events I believe may happen:

1. From the farm roles_view page I request one additional instance.
1a. the POST request handler increments the min count for the role
1b. Scalr::RunInstance is called
1c. EC2 RunInstances is called
1d. A new instance is added to the instances DB table with status
'pending'

All of that seems to take 2-3 minutes to complete.  So, while that is
happening:

2. The Poller cronjob is run (scheduled every 1min, so can run several
times before the instance launches):
2a. It notices that the number of instances running is < min count
(and sets need_new_instance=true)
2b. It starts a new instance - *which it should not*

Again, the call to RunInstances in 2b only adds the DB entry for the
instance in pending state *after* the call to EC2 RunInstances
completes - which may take 2-3 mins, during which time the cron job
executes again, still sees too few instances, and starts yet another
one.
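
To make the suspected race concrete, here is a minimal Python sketch.
It is not Scalr code: the function name, the minute-by-minute model and
the default 3-minute launch delay are assumptions chosen to mirror
steps 1 and 2 above, where the DB row only appears once the EC2 call
has completed.

```python
def simulate(launch_delay=3, poll_interval=1, min_count=2, horizon=10):
    """Count launches under the hypothesised race (hypothetical model).

    launch_delay  -- minutes between starting a launch and its DB row appearing
    poll_interval -- minutes between Poller cron runs
    min_count     -- min instance count after the manual request (1 + 1)
    """
    db_rows = 1        # the one instance that is already running
    in_flight = []     # minute at which each unfinished launch gets its DB row
    launches = 0

    for t in range(horizon):
        # Launches whose EC2 call has completed now get their 'pending' row.
        db_rows += sum(1 for done in in_flight if done <= t)
        in_flight = [done for done in in_flight if done > t]

        if t == 0:
            # Step 1: the manual request from the roles_view page.
            in_flight.append(t + launch_delay)
            launches += 1
        elif t % poll_interval == 0 and db_rows < min_count:
            # Step 2: the Poller sees too few rows and launches another one.
            in_flight.append(t + launch_delay)
            launches += 1

    return launches
```

With these assumptions, simulate() returns 3 (the requested instance
plus 2 unwanted ones), while simulate(poll_interval=4) returns 1. So
this simple model reproduces the extra launches but not, on its own, a
cascade as large as 8.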


As I said, I'm not 100% sure of the reasoning above.  Next time I find time
to sit down and have another debugging session, I'll update this
thread if I find something different.  I can't quite explain how those
steps manage to cascade to cause 8 instances to be launched.
I'm also wondering if this is related to the DNS Zone update failure
I'm seeing (which claims it can't update the DNS Zone because Cron has
locked the zone - which might be true if there are several Poller
cronjobs running in parallel if they're scheduled every 1 min, but
taking 2-3 mins to execute).
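
If overlapping Poller runs do turn out to be part of the problem, one
common guard is a non-blocking lock taken at the start of each run.
The Python sketch below only illustrates that idea; it is not how
Scalr's cron jobs work, and try_lock, demo and the lock file name are
hypothetical. It relies on fcntl.flock, so it is Unix-only.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Return an open file holding an exclusive lock, or None if it is taken."""
    fh = open(path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting.
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        fh.close()
        return None
    return fh


def demo():
    """Simulate two overlapping runs followed by a later, non-overlapping one."""
    path = os.path.join(tempfile.mkdtemp(), "poller.lock")  # hypothetical name
    first = try_lock(path)     # first run acquires the lock
    second = try_lock(path)    # overlapping run is refused
    got_first = first is not None
    refused_second = second is None
    first.close()              # first run finishes; closing releases the lock
    third = try_lock(path)     # a later run can acquire it again
    got_third = third is not None
    if third is not None:
        third.close()
    return (got_first, refused_second, got_third)
```

A run that gets None back would simply exit and leave the work to the
run that already holds the lock.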

If any of the developers can shed any light on the logic here, that
would be welcome.
I plan to change my Poller cron to run every 4mins to see if that
affects the issue.
Once I understand it fully I'll be in a position to suggest a fix.

Thanks,
-Cenji.

