Hi,

Sorry for taking such a long time to chime in, but these mails were sadly missed. Please see my inline comments below.

My original concerns regarding the revert of the service were as follows:

1. What do we do about existing installations? This support was added at the end of Havana and it is in production.

2. I had concerns regarding the way in which the image cache would be maintained - that is, each compute node has its own cache directory, so this may have had datastore issues.

Over the last few weeks I have encountered some serious problems with the multi-VC support. This is causing production setups to break (https://review.openstack.org/108225 is an example; it is due to https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L3368). The underlying issue is that the node may be updated at random places in the nova manager code (these may be bugs, but they do not work well with the multi-cluster support). There are too many edge cases here and the code is not robust enough.
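To make the node problem concrete, here is a deliberately simplified, hypothetical sketch (plain Python; the Instance class and buggy_sync function are invented for illustration and are not the actual Nova code path) of the kind of single-nodename-per-host assumption that breaks once one compute service reports several cluster nodenames:

# Hypothetical illustration only - NOT the actual Nova code.
# One service managing several clusters reports one nodename per
# cluster; any code path that assumes a single nodename per host
# can rewrite instance.node and point instances at the wrong cluster.

class Instance(object):
    def __init__(self, uuid, node):
        self.uuid = uuid
        self.node = node

def buggy_sync(instances, nodenames):
    # Wrong assumption: "the" nodename for this host is nodenames[0].
    for inst in instances:
        if inst.node != nodenames[0]:
            inst.node = nodenames[0]  # clobbers instances on other clusters

nodenames = ['domain-c7(Cluster1)', 'domain-c9(Cluster2)']
instances = [Instance('uuid-a', 'domain-c7(Cluster1)'),
             Instance('uuid-b', 'domain-c9(Cluster2)')]
buggy_sync(instances, nodenames)
print([i.node for i in instances])  # both now claim to be on Cluster1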
If we do decide to go ahead with dropping the support, then we need to do the following:

1. Upgrade path: we need to have a well-defined upgrade path that will enable an existing setup to upgrade from I to J (I do not think that we should leave this till K, as there are too many pain points with the node management).

2. We need to make a few tweaks to the image cache path. My original concern was that each compute node has its own cache directory. After giving it some thought, this will be OK as long as we have each compute host using the same cache directory. The reason for this is that the locking for image handling is done externally on the file system (https://github.com/openstack/nova/blob/master/nova/virt/vmwareapi/vmops.py#L319). So if we have multiple compute processes running on the same host then we are good. In addition to this, we can make use of a shared file system and then have all compute nodes use the shared file system for the locking - win win :).
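As a rough illustration of why the external locks make this safe, here is a minimal sketch of file-system based locking (plain Python with fcntl; the lock directory and image id below are made up, and Nova itself goes through its lockutils helpers rather than anything like this):

# Illustration only - not the Nova code. The point is that the lock
# lives on disk, so every process using the same lock path serializes
# with the others; on a shared file system that can extend to every
# compute host, provided the file system honours the lock primitive.

import fcntl
import os
from contextlib import contextmanager

LOCK_DIR = '/vmware_base/locks'  # hypothetical shared lock directory

@contextmanager
def image_lock(image_id):
    try:
        os.makedirs(LOCK_DIR)
    except OSError:
        pass  # directory already exists
    path = os.path.join(LOCK_DIR, 'nova-%s.lock' % image_id)
    with open(path, 'w') as f:
        fcntl.lockf(f, fcntl.LOCK_EX)  # blocks until the current holder is done
        try:
            yield
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)

# However many compute processes run this concurrently, only one at a
# time will actually fetch/copy a given image into the cache.
with image_lock('image-1234'):
    pass  # fetch the image into the cache directory here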
If anyone gets to this stage in the thread then please see a fix for object support and aging (https://review.openstack.org/111996 - the object updates made earlier in the cycle caused a few problems, but I guess that the gate does not wait 24 hours to purge instances).

In short, I am in favor of removing the multi cluster support, but we need to do the following:

1. Upgrade path
2. Investigate memory issues with nova-compute
3. Tweak the image cache path

Thanks
Gary

On 7/15/14, 11:36 AM, "Matthew Booth" <[email protected]> wrote:

>On 14/07/14 09:34, Vaddi, Kiran Kumar wrote:
>> Hi,
>>
>> In the Juno summit, it was discussed that the existing approach of
>> managing multiple VMware Clusters using a single nova compute service is
>> not preferred and the approach of one nova compute service representing
>> one cluster should be looked into.
>>
>> We would like to retain the existing approach (till we have resolved the
>> issues) for the following reasons:
>>
>> 1. Even though a single service is managing all the clusters,
>> logically it is still one compute per cluster. To the scheduler each
>> cluster is represented as an individual compute. Even in the driver each
>> cluster is represented separately.

This is something that would not change with dropping the multi cluster support. The only change here is that additional processes will be running (please see below).

>> 2. Since ESXi does not allow running the nova-compute service on the
>> hypervisor, unlike KVM, the service has to be run externally on a
>> different server. It's easier from an administration perspective to manage a
>> single service than multiple.

Yes, you have a good point here, but I think that at the end of the day we need a robust service, and that service will be managed by external tools, for example Chef, Puppet etc. - unless it is a very small cloud.

>> 3. Every connection to vCenter uses up ~140MB in the driver. If we
>> were to manage each cluster by an individual service, the memory consumed
>> for 32 clusters will be high (~4GB). The newer versions support 64
>> clusters!

I think that this is a bug and it needs to be fixed. I understand that this may affect a decision from today to tomorrow, but it is not an architectural issue and can be resolved (and really should be resolved ASAP). I think that we need to open a bug for this and we should start to investigate - fixing this will enable whoever is running the service to use those resources elsewhere :)

>> 4. There are existing customer installations that use the existing
>> approach; therefore the new approach should not be enforced until it is simple
>> to manage and not resource intensive.
>>
>> If the admin wants to use one service per cluster, it can be done with
>> the existing driver. In the conf the admin has to specify a single
>> cluster instead of a list of clusters. Therefore it's better to give the
>> admins the choice rather than enforcing one type of deployment.

This is a real pain point which we should address. I think that we have more serious issues than that - things break with the current support. One example is https://review.openstack.org/108225. In short, if an admin is running more than one compute node with a different cluster configured in each compute node, then things start to break.
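For completeness, the "one service per cluster" deployment that Kiran mentions really is just a matter of each nova-compute conf naming a single cluster - roughly along these lines (hypothetical excerpts; option names are as I recall them in the current driver, so please double-check against your release):

# nova.conf for one service managing several clusters (today's model):
[vmware]
host_ip = <vcenter-ip>
cluster_name = Cluster1
cluster_name = Cluster2

# nova.conf for one of several services, each pinned to a single cluster:
[vmware]
host_ip = <vcenter-ip>
cluster_name = Cluster1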
>
>Does anybody recall the detail of why we wanted to remove this? There
>was unease over use of instance's node field in the db, but I don't
>recall why.
>
>Matt
>
>--
>Matthew Booth
>Red Hat Engineering, Virtualisation Team
>
>Phone: +442070094448 (UK)
>GPG ID: D33C3490
>GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
>
>_______________________________________________
>OpenStack-dev mailing list
>[email protected]
>http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev