Yeah, I just meant that was a workaround. As you pointed out, that workaround 
doesn’t make use of the migrateVirtualMachineWithVolume API command, though.

On 7/16/18, 5:23 PM, "Rafael Weingärtner" <rafaelweingart...@gmail.com> wrote:

    Thanks for the answers, Mike. I will not be able to do it today, but I will
    manage to do it this week. There is only one last question.
    
    [Mike] At least for KVM, you can shut the VM down and perform an offline
    migration of the volume from managed storage to non-managed storage. It’s
    possible we may support similar behavior with other hypervisor types in the
    future.
    
    [Rafael] I guess that we can shut down XenServer VMs and then migrate the
    volumes later, right? However, the method in question here
    (migrateVirtualMachineWithVolume) is not supposed to execute such steps, is
    it?
    
    
    On Mon, Jul 16, 2018 at 8:17 PM, Tutkowski, Mike <mike.tutkow...@netapp.com>
    wrote:
    
    >        - So, managed storage can be cluster and zone wide. Is that correct?
    >
    > [Mike] Correct
    >
    >        - If I want to migrate a VM across clusters, but at least one of its
    >        volumes is placed in cluster-wide managed storage, the migration is not
    >        allowed. Is that it?
    >
    > [Mike] Correct
    >
    >        - A volume placed in managed storage can never (at least not using this
    >        migrateWithVolume method) be migrated out of the storage pool it resides
    >        in. Is this statement right? Do you have an alternative execution flow
    >        for this scenario?
    >
    > [Mike] At least for KVM, you can shut the VM down and perform an offline
    > migration of the volume from managed storage to non-managed storage. It’s
    > possible we may support similar behavior with other hypervisor types in the
    > future.
    >
    >        - When migrating a VM that does not have volumes in managed storage, it
    >        should be possible to migrate it across clusters. Therefore, we should
    >        try to use the volume allocators to find a suitable storage pool for its
    >        volumes in the target cluster.
    >
    > [Mike] It’s OK here if one or more of the volumes is on managed storage.
    > The “trick” is
    > that it needs to be on zone-wide managed storage that is visible to both
    > the source and
    > destination compute clusters. You cannot specify a new storage pool for
    > any of these volumes
    > (each must remain on its current, zone-wide primary storage).
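    >
    > A rough sketch of the kind of guard these constraints imply (a hypothetical
    > helper, not actual ACS code; the isManaged()/getScope()/getClusterId() usage
    > is assumed from the existing pool and host types):
    >
    > private void validateManagedStorageForMigration(VolumeVO volume, StoragePoolVO currentPool,
    >         StoragePoolVO targetPool, Host destHost) {
    >     if (!currentPool.isManaged()) {
    >         return;
    >     }
    >     // Cluster-wide managed storage cannot follow the VM to another cluster.
    >     if (ScopeType.CLUSTER.equals(currentPool.getScope())
    >             && !currentPool.getClusterId().equals(destHost.getClusterId())) {
    >         throw new CloudRuntimeException(String.format("Volume [%s] is on cluster-wide managed "
    >                 + "storage and cannot follow its VM to another cluster.", volume.getUuid()));
    >     }
    >     // A managed volume must remain on its current (zone-wide) primary storage.
    >     if (targetPool != null && targetPool.getId() != currentPool.getId()) {
    >         throw new CloudRuntimeException(String.format("Volume [%s] is on managed storage and "
    >                 + "can only be 'migrated' to itself.", volume.getUuid()));
    >     }
    > }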
    >
    > If you can add these new constraints into the code, I can review them
    > later. I’m a bit
    > pressed for time this week, so it might not be possible to do so right
    > away. Thanks!
    >
    > On 7/16/18, 3:52 PM, "Rafael Weingärtner" <rafaelweingart...@gmail.com>
    > wrote:
    >
    >     Thanks for your feedback, Mike. I actually did not want to change the
    >     “migrateVirtualMachineWithVolume” API method. Everything started when we
    >     wanted to create a feature to allow volume placement overrides, that is,
    >     allowing root admins to place/migrate a volume to a storage pool that might
    >     not be “allowed” (according to its current disk offering). This feature was
    >     later expanded to allow changing the disk offering while executing a storage
    >     migration (which allows changing the volume’s QoS), thus creating a mechanism
    >     within ACS to allow disk offering replacement (as opposed to DB intervention,
    >     which was the way it had been done so far). The rationale behind these
    >     extensions/enhancements is that root admins are experts (at least we expect
    >     them to be); therefore, they know what they are doing when overriding or
    >     replacing a user’s disk offering.
    >
    >     So, why am I changing the “migrateVirtualMachineWithVolume” API method?
    >     When we allowed that override procedure, it broke the migration of VMs that
    >     had volumes initially placed on NFS and then moved (via override) to local
    >     storage. It had something to do with the way ACS was detecting whether the
    >     VM has local storage. Then, when I went to the method to fix it, it was very
    >     convoluted to read and understand. Therefore, I re-wrote it, and I missed
    >     your use case. I am sorry for that. Moreover, I do intend to keep the
    >     current code, as we already have other features developed on top of it, and
    >     this code is well documented and unit tested. It is only a matter of adding
    >     your requirement there.
    >
    >     Now, let’s fix the problem. I will not point to code here; I only want to
    >     understand the idea for now.
    >
    >        - So, managed storage can be cluster and zone wide. Is that correct?
    >        - If I want to migrate a VM across clusters, but at least one of its
    >        volumes is placed in cluster-wide managed storage, the migration is not
    >        allowed. Is that it?
    >        - A volume placed in managed storage can never (at least not using this
    >        migrateWithVolume method) be migrated out of the storage pool it resides
    >        in. Is this statement right? Do you have an alternative execution flow
    >        for this scenario?
    >        - When migrating a VM that does not have volumes in managed storage, it
    >        should be possible to migrate it across clusters. Therefore, we should
    >        try to use the volume allocators to find a suitable storage pool for its
    >        volumes in the target cluster.
    >
    >     Are these all of the use cases that were left behind?
    >
    >     On Mon, Jul 16, 2018 at 5:36 PM, Tutkowski, Mike <
    > mike.tutkow...@netapp.com>
    >     wrote:
    >
    >     > For your feature, Rafael, are you trying to support the migration of
    > a VM
    >     > that has local storage from one cluster to another or is
    > intra-cluster
    >     > migration of local storage sufficient?
    >     >
    >     > There is the migrateVolume API (you can pass in “live migrate”
    > parameter):
    >     >
    >     > http://cloudstack.apache.org/api/apidocs-4.11/apis/migrateVolume.html
    >     >
    >     > There is also the migrateVirtualMachineWithVolume (one or more
    > volumes).
    >     > This is especially useful for moving a VM with its storage from one
    > cluster
    >     > to another:
    >     >
    >     > http://cloudstack.apache.org/api/apidocs-4.11/apis/migrateVirtualMachineWithVolume.html
    >     >
    >     > On 7/16/18, 2:20 PM, "Tutkowski, Mike" <mike.tutkow...@netapp.com>
    > wrote:
    >     >
    >     >     Actually, I think I answered both of your questions with these
    > two
    >     > prior e-mails. Please let me know if you need further clarification.
    > Thanks!
    >     >
    >     >     On 7/16/18, 2:17 PM, "Tutkowski, Mike" <
    > mike.tutkow...@netapp.com>
    >     > wrote:
    >     >
    >     >         Allow me to correct what I said here:
    >     >
    >     >         “If getDefaultMappingOfVolumesAndStoragePoolForMigration is
    >     > invoked, we silently ignore the (faulty) input (which is a new
    > storage
    >     > pool) from the user and keep the volume in its same managed storage
    > pool
    >     > (the user may wonder why it wasn’t migrated if they don’t get an
    > error
    >     > message back telling them this is not allowed).”
    >     >
    >     >         I should have said the following:
    >     >
    >     >         If getDefaultMappingOfVolumesAndStoragePoolForMigration is
    >     > invoked on a VM that is using managed storage that is only at the
    > cluster
    >     > level (managed storage can be at either the zone or cluster level)
    > and we
    >     > are trying to migrate the VM from one cluster to another, this
    > operation
    >     > should fail (as the old code detects). The new code tries to keep the
    >     > volume in the same storage pool (but that storage pool will not be
    > visible
    >     > to the hosts in the destination compute cluster).
    >     >
    >     >         On 7/16/18, 2:10 PM, "Tutkowski, Mike" <
    > mike.tutkow...@netapp.com>
    >     > wrote:
    >     >
    >     >             Let me answer the questions in two separate e-mails.
    >     >
    >     >             This answer deals with what you wrote about this code:
    >     >
    >     >                 > if (destPool.getId() == currentPool.getId()) {
    >     >                 >     volumeToPoolObjectMap.put(volume, currentPool);
    >     >                 > } else {
    >     >                 >     throw new CloudRuntimeException(
    >     >                 >             "Currently, a volume on managed storage can only be 'migrated' to itself.");
    >     >                 > }
    >     >                 >
    >     >
    >     >             The code above is invoked if the user tries to migrate a
    >     > volume that’s on managed storage to another storage pool. At
    > present, such
    >     > volumes can be migrated when a VM is migrated from one compute
    > cluster to
    >     > another, but those volumes have to remain on the same managed
    > storage.
    >     >
    >     >             Here’s an example:
    >     >
    >     >             Let’s say VM_1 is in Cluster_1. VM_1 has a root (or
    > data) disk
    >     > on managed storage. We try to migrate the VM from Cluster_1 to
    > Cluster_2
    >     > and specify a new storage pool for the volume. This case should
    > fail. To
    >     > make it work, you need to either 1) not specify a new storage pool
    > or 2)
    >     > specify the same storage pool the volume is already in. If the
    > managed
    >     > storage in question is zone wide, then it can be used from both
    > Cluster_1
    >     > and Cluster_2.
    >     >
    >     >             The new code might call getDefaultMappingOfVolumesAndStoragePoolForMigration
    >     >             (if no storage pools at all are passed in to the API) or it might call
    >     >             createMappingVolumeAndStoragePoolEnteredByUser.
    >     >
    >     >             If getDefaultMappingOfVolumesAndStoragePoolForMigration
    > is
    >     > invoked, we silently ignore the (faulty) input (which is a new
    > storage
    >     > pool) from the user and keep the volume in its same managed storage
    > pool
    >     > (the user may wonder why it wasn’t migrated if they don’t get an
    > error
    >     > message back telling them this is not allowed).
    >     >
    >     >             If createMappingVolumeAndStoragePoolEnteredByUser is
    > invoked,
    >     > we seem to have a bigger problem (code is below):
    >     >
    >     >             I do not believe you are required to pass in a new
    > storage
    >     > pool for each and every volume of the VM. If the VM has, say, three
    >     > volumes, you may only try to migrate two of the volumes to new
    > storage
    >     > pools. This logic seems to assume if you want to migrate one of the
    > VM’s
    >     > volumes, then you necessarily want to migrate all of the VM’s
    > volumes. I
    >     > believe it’s possible for targetPool to come back null and later
    > throw a
    >     > NullPointerException. The old code walks through each volume of the
    > VM and
    >     > checks if there is a new storage pool specified for it. If so, do one
    >     > thing; else, do something else.
    >     >
    >     >                 private Map<Volume, StoragePool> createMappingVolumeAndStoragePoolEnteredByUser(
    >     >                         VirtualMachineProfile profile, Host host, Map<Long, Long> volumeToPool) {
    >     >                     Map<Volume, StoragePool> volumeToPoolObjectMap = new HashMap<Volume, StoragePool>();
    >     >                     for (Long volumeId : volumeToPool.keySet()) {
    >     >                         VolumeVO volume = _volsDao.findById(volumeId);
    >     >
    >     >                         Long poolId = volumeToPool.get(volumeId);
    >     >                         StoragePoolVO targetPool = _storagePoolDao.findById(poolId);
    >     >                         StoragePoolVO currentPool = _storagePoolDao.findById(volume.getPoolId());
    >     >
    >     >                         if (_poolHostDao.findByPoolHost(targetPool.getId(), host.getId()) == null) {
    >     >                             throw new CloudRuntimeException(String.format("Cannot migrate the volume [%s] to the "
    >     >                                     + "storage pool [%s] while migrating VM [%s] to target host [%s]. The host does "
    >     >                                     + "not have access to the storage pool entered.", volume.getUuid(),
    >     >                                     targetPool.getUuid(), profile.getUuid(), host.getUuid()));
    >     >                         }
    >     >                         if (currentPool.getId() == targetPool.getId()) {
    >     >                             s_logger.info(String.format("The volume [%s] is already allocated in storage pool [%s].",
    >     >                                     volume.getUuid(), targetPool.getUuid()));
    >     >                         }
    >     >                         volumeToPoolObjectMap.put(volume, targetPool);
    >     >                     }
    >     >                     return volumeToPoolObjectMap;
    >     >                 }
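    >     >
    >     >             For illustration, a minimal sketch of the per-volume handling described
    >     >             above (hypothetical code, not the actual old implementation): volumes the
    >     >             user listed get the pool they asked for, an unknown pool id fails fast
    >     >             instead of turning into a NullPointerException later, and the remaining
    >     >             volumes fall back to the default mapping.
    >     >
    >     >                 for (VolumeVO volume : _volsDao.findUsableVolumesForInstance(profile.getId())) {
    >     >                     Long poolId = volumeToPool.get(volume.getId());
    >     >                     if (poolId == null) {
    >     >                         // Volume not mentioned by the user: leave it to the default mapping logic.
    >     >                         continue;
    >     >                     }
    >     >                     StoragePoolVO targetPool = _storagePoolDao.findById(poolId);
    >     >                     if (targetPool == null) {
    >     >                         throw new CloudRuntimeException(String.format("Storage pool [%s] entered for "
    >     >                                 + "volume [%s] does not exist.", poolId, volume.getUuid()));
    >     >                     }
    >     >                     volumeToPoolObjectMap.put(volume, targetPool);
    >     >                 }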
    >     >
    >     >             On 7/16/18, 5:13 AM, "Rafael Weingärtner" <
    >     > rafaelweingart...@gmail.com> wrote:
    >     >
    >     >                 Ok, I see what happened there with the cross-cluster migration. When
    >     >                 I re-did the code I did not have this case, and therefore, in the old
    >     >                 code, I was not seeing this use case (convoluted code, lack of
    >     >                 documentation, and so on; we all know the story). I will fix it.
    >     >
    >     >                 Regarding the managed storage issue, can you
    > describe the
    >     > “special
    >     >                 handling” you need?
    >     >
    >     >                 Are you talking about this:
    >     >
    >     >                 > if (destPool.getId() == currentPool.getId()) {
    >     >                 >     volumeToPoolObjectMap.put(volume, currentPool);
    >     >                 > } else {
    >     >                 >     throw new CloudRuntimeException(
    >     >                 >             "Currently, a volume on managed storage can only be 'migrated' to itself.");
    >     >                 > }
    >     >                 >
    >     >
    >     >
    >     >                 That is a simple validation, right? A validation to
    > throw
    >     > an exception if
    >     >                 the user tries to migrate the volume to some other
    > storage
    >     > pool. Is that
    >     >                 it? If that is the case, the default method
    >     >                 "getDefaultMappingOfVolumesAndStoragePoolForMigration" already takes
    >     >                 care of this. Meaning, it will not try to move the volume to another
    >     >                 storage pool.
    >     >
    >     >                 On the other hand, we need to add a validation in the
    >     >                 "createMappingVolumeAndStoragePoolEnteredByUser" method then.
    >     >                 I will wait for your feedback before starting to
    > code.
    >     > Thanks for spotting
    >     >                 this issue.
    >     >
    >     >                 On Sun, Jul 15, 2018 at 9:11 PM, Tutkowski, Mike <
    >     > mike.tutkow...@netapp.com>
    >     >                 wrote:
    >     >
    >     >                 > Hi Rafael,
    >     >                 >
    >     >                 > Thanks for your time on this.
    >     >                 >
    >     >                 > Here is an example where the new code deviates
    > from the
    >     > old code in a
    >     >                 > critical fashion (code right below is new):
    >     >                 >
    >     >                 >     private Map<Volume, StoragePool> getDefaultMappingOfVolumesAndStoragePoolForMigration(
    >     >                 >             VirtualMachineProfile profile, Host targetHost) {
    >     >                 >         Map<Volume, StoragePool> volumeToPoolObjectMap = new HashMap<Volume, StoragePool>();
    >     >                 >         List<VolumeVO> allVolumes = _volsDao.findUsableVolumesForInstance(profile.getId());
    >     >                 >         for (VolumeVO volume : allVolumes) {
    >     >                 >             StoragePoolVO currentPool = _storagePoolDao.findById(volume.getPoolId());
    >     >                 >             if (ScopeType.HOST.equals(currentPool.getScope())) {
    >     >                 >                 createVolumeToStoragePoolMappingIfNeeded(profile, targetHost,
    >     >                 >                         volumeToPoolObjectMap, volume, currentPool);
    >     >                 >             } else {
    >     >                 >                 volumeToPoolObjectMap.put(volume, currentPool);
    >     >                 >             }
    >     >                 >         }
    >     >                 >         return volumeToPoolObjectMap;
    >     >                 >     }
    >     >                 >
    >     >                 > What happens in the new code (above) is that if the user
    >     > didn’t pass in a
    >     >                 > storage pool to migrate the virtual disk to (but
    > the VM
    >     > is being migrated
    >     >                 > to a new cluster), this code just assigns the
    > virtual
    >     > disk to its current
    >     >                 > storage pool (which is not going to be visible to
    > any of
    >     > the hosts in the
    >     >                 > new compute cluster).
    >     >                 >
    >     >                 > In the old code (I’m looking at 4.11.3 here), you
    > could
    >     > look around line
    >     >                 > 2337 for the following code (in the
    >     >                 > VirtualMachineManagerImpl.getPoolListForVolumesForMigration method):
    >     >                 >
    >     >                 >                     // Find a suitable pool for the volume. Call the
    >     >                 >                     // storage pool allocator to find the list of pools.
    >     >                 >
    >     >                 >                     final DiskProfile diskProfile = new DiskProfile(volume, diskOffering,
    >     >                 >                             profile.getHypervisorType());
    >     >                 >                     final DataCenterDeployment plan = new DataCenterDeployment(host.getDataCenterId(),
    >     >                 >                             host.getPodId(), host.getClusterId(), host.getId(), null, null);
    >     >                 >
    >     >                 >                     final List<StoragePool> poolList = new ArrayList<>();
    >     >                 >                     final ExcludeList avoid = new ExcludeList();
    >     >                 >
    >     >                 >                     for (final StoragePoolAllocator allocator : _storagePoolAllocators) {
    >     >                 >                         final List<StoragePool> poolListFromAllocator = allocator.allocateToPool(diskProfile,
    >     >                 >                                 profile, plan, avoid, StoragePoolAllocator.RETURN_UPTO_ALL);
    >     >                 >
    >     >                 >                         if (poolListFromAllocator != null && !poolListFromAllocator.isEmpty()) {
    >     >                 >                             poolList.addAll(poolListFromAllocator);
    >     >                 >                         }
    >     >                 >                     }
    >     >                 >
    >     >                 > This old code would find an applicable storage
    > pool in
    >     > the destination
    >     >                 > cluster (one that can be seen by the hosts in that
    >     > compute cluster).
    >     >                 >
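    >     >                 > Roughly, the missing branch could look something like the sketch below
    >     >                 > (illustrative only; findSuitablePoolViaAllocators is a hypothetical helper
    >     >                 > standing in for the allocator loop above):
    >     >                 >
    >     >                 >     if (_poolHostDao.findByPoolHost(currentPool.getId(), targetHost.getId()) != null) {
    >     >                 >         // The destination host can already see the current pool, so the volume can stay put.
    >     >                 >         volumeToPoolObjectMap.put(volume, currentPool);
    >     >                 >     } else {
    >     >                 >         // Otherwise, ask the storage pool allocators for a pool in the target cluster.
    >     >                 >         StoragePool newPool = findSuitablePoolViaAllocators(volume, profile, targetHost);
    >     >                 >         volumeToPoolObjectMap.put(volume, newPool);
    >     >                 >     }
    >     >                 >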
    >     >                 > I think the main error in the new logic is the
    >     > assumption that a VM can
    >     >                 > only be migrated to a host in the same compute cluster.
    >     > For XenServer
    >     >                 > (perhaps for other hypervisor types?), we support
    >     > cross-cluster VM
    >     >                 > migration.
    >     >                 >
    >     >                 > The other issue I noticed is that there is no
    > logic in
    >     > the new code that
    >     >                 > checks for managed-storage use cases. If you look
    > in the
    >     >                 > VirtualMachineManagerImpl.getPoolListForVolumesForMigration method in the
    >     >                 > old code, there is special handling for managed
    > storage.
    >     > I don’t see this
    >     >                 > reproduced in the new logic.
    >     >                 >
    >     >                 > I sympathize with your point that all tests passed
    > yet
    >     > this issue was not
    >     >                 > uncovered. Unfortunately, I suspect we have a
    > fairly low
    >     > % coverage of
    >     >                 > automated tests on CloudStack. If we ever did get
    > to a
    >     > high % of automated
    >     >                 > test coverage, we might be able to spin up new
    > releases
    >     > more frequently. As
    >     >                 > the case stands today, however, there are probably
    > many
    >     > un-tested use cases
    >     >                 > when it comes to our automated suite of tests.
    >     >                 >
    >     >                 > Thanks again!
    >     >                 > Mike
    >     >                 >
    >     >                 > On 7/15/18, 4:19 PM, "Rafael Weingärtner" <
    >     > rafaelweingart...@gmail.com>
    >     >                 > wrote:
    >     >                 >
    >     >                 >     Mike, are you able to pin-point in the old/replaced
    >     >                 >     code the bit that was handling your use case? I took
    >     >                 >     great care not to break anything.
    >     >                 >     Also, isn't your test case in the ACS integration
    >     >                 >     test suite? In theory, all tests passed when we merged the PR.
    >     >                 >
    >     >                 >     I sure can take a look at it. Can you detail your
    >     >                 >     use case? I mean, the high-level execution flow: what API
    >     >                 >     methods you call, what you expected to happen, and what is
    >     >                 >     happening today.
    >     >                 >
    >     >                 >     On Sun, Jul 15, 2018 at 3:25 AM, Tutkowski,
    > Mike <
    >     >                 > mike.tutkow...@netapp.com>
    >     >                 >     wrote:
    >     >                 >
    >     >                 >     > It looks like this is the problematic PR:
    >     >                 >     >
    >     >                 >     > https://github.com/apache/cloudstack/pull/2425/
    >     >                 >     >
    >     >                 >     > On 7/15/18, 12:20 AM, "Tutkowski, Mike" <
    >     > mike.tutkow...@netapp.com>
    >     >                 > wrote:
    >     >                 >     >
    >     >                 >     >     Hi,
    >     >                 >     >
    >     >                 >     >     While running managed-storage regression
    > tests
    >     > tonight, I
    >     >                 > noticed a
    >     >                 >     > problem that is not related to managed
    > storage.
    >     >                 >     >
    >     >                 >     >     CLOUDSTACK-10240 is a ticket asking that
    > we
    >     > allow the migration
    >     >                 > of a
    >     >                 >     > virtual disk that’s on local storage to
    > shared
    >     > storage. In the
    >     >                 > process of
    >     >                 >     > enabling this feature, the
    >     >                 >     > VirtualMachineManagerImpl.getPoolListForVolumesForMigration method was
    >     > re-written in a way
    >     >                 > that
    >     >                 >     > completely breaks at least one use case:
    > Migrating
    >     > a VM across
    >     >                 > compute
    >     >                 >     > clusters (at least supported in XenServer).
    > If,
    >     > say, a virtual disk
    >     >                 > resides
    >     >                 >     > on shared storage in the source compute
    > cluster,
    >     > we must be able to
    >     >                 > copy
    >     >                 >     > this virtual disk to shared storage in the
    >     > destination compute
    >     >                 > cluster.
    >     >                 >     >
    >     >                 >     >     As the code is currently written, this
    > is no
    >     > longer possible. It
    >     >                 > also
    >     >                 >     > seems that the managed-storage logic has been
    >     > dropped for some
    >     >                 > reason in
    >     >                 >     > the new implementation.
    >     >                 >     >
    >     >                 >     >     Rafael – It seems that you worked on this
    >     > feature. Would you be
    >     >                 > able
    >     >                 >     > to look into this and create a PR?
    >     >                 >     >
    >     >                 >     >     Thanks,
    >     >                 >     >     Mike
    >     >                 >     >
    >     >                 >     >
    >     >                 >     >
    >     >                 >
    >     >                 >
    >     >                 >     --
    >     >                 >     Rafael Weingärtner
    >     >                 >
    >     >                 >
    >     >                 >
    >     >
    >     >
    >     >                 --
    >     >                 Rafael Weingärtner
    >     >
    >     >
    >     >
    >     >
    >     >
    >     >
    >     >
    >     >
    >     >
    >
    >
    >     --
    >     Rafael Weingärtner
    >
    >
    >
    
    
    -- 
    Rafael Weingärtner
    
