[ https://issues.apache.org/jira/browse/CLOUDSTACK-3163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13712294#comment-13712294 ]
Wido den Hollander commented on CLOUDSTACK-3163:
------------------------------------------------
So, I took a quick look at how this works, and I see it makes about 13 calls in
my setup, 11 of which call "vmdata.sh" with different parameters.
I think those 11 can be reduced to one call, bringing the total (in my setup)
down to 3 instead of 13.
I'll see if I can find the time to test this out.
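A minimal Python sketch of the batching idea, assuming the per-file VM data can be shipped to the router as one JSON payload handled by a single remote command. `apply_vmdata` and `apply_vmdata_batch` are hypothetical command names used only for illustration, and the (folder, file, content) item shape is an assumption; the real vmdata.sh arguments are not shown in this thread.

```python
import json
import shlex

# Hypothetical VM data item: (folder, file name, content).
VmDataItem = tuple[str, str, str]


def per_item_commands(vm_ip: str, items: list[VmDataItem]) -> list[str]:
    """One remote command per data item, i.e. one ssh round trip each."""
    return [
        # "apply_vmdata" is a made-up name, not a real CloudStack script
        "apply_vmdata " + " ".join(shlex.quote(arg) for arg in (vm_ip, folder, name, content))
        for folder, name, content in items
    ]


def batched_command(vm_ip: str, items: list[VmDataItem]) -> str:
    """A single remote command carrying every item as one JSON argument."""
    payload = json.dumps({vm_ip: [list(item) for item in items]})
    # "apply_vmdata_batch" is likewise hypothetical
    return "apply_vmdata_batch " + shlex.quote(payload)


items = [
    ("metadata", "instance-id", "i-2-10-VM"),
    ("metadata", "local-hostname", "vm1"),
    ("userdata", "user-data", ""),
]
print(len(per_item_commands("10.1.1.5", items)))  # 3 round trips
print(batched_command("10.1.1.5", items))          # 1 round trip
```

For large user-data, piping the payload to the remote command on stdin rather than passing it as an argument would sidestep command-line length limits.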
> KVM Virtual Router startup time is painfully long
> -------------------------------------------------
>
> Key: CLOUDSTACK-3163
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-3163
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public (Anyone can view this level - this is the default.)
> Components: KVM
> Affects Versions: pre-4.0.0
> Environment: CloudPlatform 3.0.3, but I don't see any changes to the relevant code (I think) on master
> Reporter: Andrew Bayer
> Priority: Critical
>
> When you've got a couple thousand instances spread across 10 or so pods,
> virtual router startup time is nearly crippling. In fact, if you don't enable
> the option to populate virtual routers only with instances in their own pod,
> it *is* crippling: the virtual routers don't finish starting before the
> management server decides they've timed out and tries to start a new one.
> This seems to be the result of a few painful inefficiencies:
> - The same code path is followed whether you're adding a new instance to an
> already running VR or adding two hundred already running instances to a new
> VR. So each ssh/scp/sed/cp/chmod/etc. command is repeated for every instance,
> rather than being batched across the whole set of instances.
> - But what really eats up the time is populating the VM data. For each piece
> of VM data (from a rough look at the code, something like 10 or 11 data
> files), there are about 7 ssh calls and one scp call. So per instance we have
> somewhere around 80 to 90 ssh/scp calls, plus the single ssh call for
> dhcp_entry.sh. With 200 instances, that's roughly 16,000 to 18,000 ssh/scp
> calls on a single VR, with all the overhead of opening that many ssh
> connections, starting bash, etc. In my experience a VR with ~200 instances
> takes ~90 minutes to start up (I may be misremembering slightly: it could be
> that ~200 instances take closer to 60 minutes and ~300 take closer to 90).
> That works out to roughly 0.2 to 0.3 seconds per ssh/scp call, which doesn't
> seem implausible to me.
> So, this shouldn't be this way. At a minimum, there's no reason not to move
> the whole process out of a script on the host that makes repeated ssh calls
> to the VR and into a script on the VR that the host calls once, possibly a
> temporary one generated on the fly and copied over to the VR. That alone
> would probably eliminate most of the VR startup time, just by dropping the
> number of ssh/scp connections per instance from 80-90 to 3 (the
> dhcp_entry.sh call, the scp of the temporary script, and the execution of
> the temporary script).
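A minimal Python sketch of the two ideas in the report: the round-trip arithmetic for the per-file approach versus three calls per instance, and rendering one temporary script that writes every VM data file on the VR so it can be copied with one scp and run with one ssh. The per-file counts (7 ssh plus 1 scp per file, one dhcp_entry.sh call per instance) come from the report; the (folder, file, content) item shape, the `/var/www/html` base directory, and the `vr_host` name are assumptions for illustration, not CloudStack's actual layout or scripts.

```python
import shlex

# Hypothetical VM data item: (folder, file name, content).
VmDataItem = tuple[str, str, str]


def ssh_calls_current(instances: int, files_per_instance: int) -> int:
    """Per-file approach: 7 ssh + 1 scp per file, plus one dhcp_entry.sh call."""
    return instances * (files_per_instance * (7 + 1) + 1)


def ssh_calls_proposed(instances: int) -> int:
    """dhcp_entry.sh call, scp of the temporary script, execution of the script."""
    return instances * 3


def build_vr_script(vm_data: dict[str, list[VmDataItem]],
                    base_dir: str = "/var/www/html") -> str:
    """Render one bash script that writes every VM data file on the VR.

    vm_data maps an instance IP to its data items; base_dir is an assumed layout.
    """
    lines = ["#!/bin/bash", "set -e"]
    created: set[str] = set()
    for ip, items in sorted(vm_data.items()):
        for folder, name, content in items:
            directory = f"{base_dir}/{folder}/{ip}"
            if directory not in created:  # one mkdir per directory, not per file
                lines.append(f"mkdir -p {shlex.quote(directory)}")
                created.add(directory)
            path = shlex.quote(f"{directory}/{name}")
            # printf '%s' writes the content verbatim, without a trailing newline
            lines.append(f"printf '%s' {shlex.quote(content)} > {path}")
    return "\n".join(lines) + "\n"


def copy_and_run(vr_host: str, script_path: str) -> list[list[str]]:
    """argv lists for the two remote steps; nothing is executed here."""
    return [
        ["scp", script_path, f"{vr_host}:{script_path}"],
        ["ssh", vr_host, "bash", script_path],
    ]


if __name__ == "__main__":
    for files in (10, 11):
        print(files, ssh_calls_current(200, files), ssh_calls_proposed(200))
    print(build_vr_script({"10.1.1.5": [("metadata", "instance-id", "i-2-10-VM")]}))
```

With 200 instances this gives 16,200 (10 files) or 17,800 (11 files) round trips for the per-file approach against 600 for three calls per instance. Passing several instances to `build_vr_script` at once goes a step further, toward the "efficiencies across the whole set of instances" mentioned in the first bullet. `copy_and_run` only builds the argv lists; actually running them (for example with `subprocess.run`) and any ssh port or key options the VR needs are left out.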