[DISCUSS] NFS cache storage issue on object_store

Edison Su Mon, 03 Jun 2013 16:18:47 -0700

Let's start a new thread about the NFS cache storage issues on object_store.
First, I'll go through how NFS storage works on the master branch, then how it 
works on the object_store branch, and then let's talk about the "issues".


0.       Why do we need NFS secondary storage? NFS secondary storage is used as 
a place to store templates, snapshots, etc. It's zone-wide, and it's supported 
by most hypervisors (except HyperV). NFS storage has existed in CloudStack 
since 1.x. With the rise of object storage such as S3/Swift, CloudStack added 
support for Swift in 3.x and for S3 in 4.0. You may wonder: if S3/Swift is used 
as the place to store templates/snapshots, why do we still need NFS secondary 
storage?

There are two reasons for that:

a.       The CloudStack storage code is tightly coupled with NFS secondary 
storage, so when Swift/S3 support was added, it was tempting to take a 
shortcut and leave NFS secondary storage as it is.

b.      Certain hypervisors, and certain storage-related operations, cannot 
operate directly on object storage.
Examples:

b.1 When backing up a snapshot (taken from the XenServer hypervisor) from 
primary storage to S3 on XenServer

If there are snapshot chains on the volume, and we want to coalesce the 
snapshot chain into a new disk and then copy it to S3, we have to coalesce the 
chain either on primary storage or on an extra storage repository (SR) that is 
supported by XenServer.

If we coalesce it on primary storage, we may blow up the primary storage, as 
the coalesced new disk may need a lot of space (think about it: the new disk 
will contain all the content from the leaf snapshot all the way up to the base 
template), but the primary storage is not planned for this operation (the 
CloudStack mgt server is unaware of this operation, so the mgt server may think 
the primary storage still has enough space to create volumes).

XenServer doesn't have an API to coalesce snapshots directly to S3, so we 
have to use other storage that is supported by XenServer; that's why NFS 
storage is used during snapshot backup. So what we do is first call the 
XenServer API to coalesce the snapshot to NFS storage, then copy the newly 
created file into S3. This is what we do on both the master branch and the 
object_store branch.

b.2 When creating a volume from a snapshot, if the snapshot is stored on S3

If the snapshot is a delta snapshot, we need to coalesce the chain into a new 
volume. We can't coalesce snapshots directly on S3, AFAIK, so we have to 
download the snapshot and its parents somewhere, then coalesce them with 
XenServer's tools. Again, there are two options: download all the snapshots 
into primary storage, or download them into NFS storage.

If we download all the snapshots into primary storage directly from S3, we 
first need to find a way to import a snapshot from S3 into primary storage (if 
primary storage is a block device, this needs extra care) and then coalesce 
them. If we go this way, we need to find a primary storage with enough space, 
and even worse, if the primary storage is not zone-wide, then later on we may 
need to copy the volume from one primary storage to another, which is time 
consuming.

If we download all the snapshots into NFS storage from S3, we coalesce them 
there and then copy the volume to primary storage. As the NFS storage is 
zone-wide, you can copy the volume into whatever primary storage you want, 
without an extra copy. This is what we do on the master branch and the 
object_store branch.

b.3 Some hypervisors, or some storage systems, do not support importing a 
template directly into primary storage from a URL. For example, if Ceph is 
used as primary storage, importing a template into RBD requires transforming 
the Qcow2 image into a RAW disk, then into RBD format 2. To transform an image 
from Qcow2 into a RAW disk, you need an extra file system, either a local file 
system (this is what other stacks do, which does not look scalable to me) or 
NFS storage (this is what can be done on both master and object_store). Or one 
can modify the hypervisor or the storage to support importing a template 
directly from S3 into RBD. Here is the link that Wido posted: 
http://www.mail-archive.com/[email protected]/msg14411.html

Anyway, there are many combinations of hypervisors and storage. For some 
hypervisors with zone-wide, file-system-based storage (e.g. KVM + gluster/NFS 
as primary storage), you don't need extra NFS storage. Also, if you are using 
VMware or HyperV, which can import a template from a URL regardless of which 
storage you are using, you don't need extra NFS storage. But if you are using 
XenServer, you will need NFS storage in order to create a volume from a delta 
snapshot, and if you are using KVM + Ceph, you may also need NFS storage.

For the above reasons, NFS cache storage is needed in certain cases if S3 is 
used as secondary storage. The combinations of hypervisors and storage are 
quite complicated, so whether to use cache storage should be decided case by 
case. As long as CloudStack provides a framework that gives people the choice 
to enable or disable cache storage on their own, I think the framework is good 
enough.
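
To make the case-by-case point concrete, here is a minimal sketch in Java of 
how the examples above could be written down as a rule. The enums and the 
method nfsCacheNeed are hypothetical names made up for this mail, not 
CloudStack classes, and the rules only restate the examples given above; in 
practice the choice would be left to the admin or to the configured strategy.

```java
// Hypothetical sketch: none of these types exist in CloudStack.
enum Hypervisor { XENSERVER, KVM, VMWARE, HYPERV }

enum PrimaryStorageKind { ZONE_WIDE_FILE_SYSTEM, CEPH_RBD, OTHER }

enum CacheNeed { NOT_NEEDED, MAY_BE_NEEDED, NEEDED }

final class CacheNeedSketch {
    // Restates the examples from the text; anything not covered there
    // falls back to MAY_BE_NEEDED (decide case by case).
    static CacheNeed nfsCacheNeed(Hypervisor hv, PrimaryStorageKind ps) {
        if (hv == Hypervisor.VMWARE || hv == Hypervisor.HYPERV) {
            // These can import a template from a URL, whatever the storage.
            return CacheNeed.NOT_NEEDED;
        }
        if (hv == Hypervisor.XENSERVER) {
            // Creating a volume from a delta snapshot needs NFS storage.
            return CacheNeed.NEEDED;
        }
        // KVM: zone-wide file-system-based primary storage (gluster/NFS)
        // needs no extra NFS storage; KVM + Ceph may need it.
        if (ps == PrimaryStorageKind.ZONE_WIDE_FILE_SYSTEM) {
            return CacheNeed.NOT_NEEDED;
        }
        return CacheNeed.MAY_BE_NEEDED;
    }
}
```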


1.       Then let's talk about how NFS storage works on the master branch, 
with or without S3.
If S3 is not used, here is how NFS storage is used:

1.1   Register a template/ISO: CloudStack downloads the template/ISO into NFS 
storage.

1.2   Backup snapshot: CloudStack sends a command to the XenServer hypervisor, 
which issues the vdi.copy command to copy the snapshot to NFS; for KVM, it 
directly uses "cp" or "qemu-img convert" to copy the snapshot into NFS storage.

1.3   Create volume from snapshot: if the snapshot is a delta snapshot, 
coalesce the chain on NFS storage, then vdi.copy it from NFS to primary 
storage. If it's KVM, use "cp" or "qemu-img convert" to copy the snapshot from 
NFS storage to primary storage.


If S3 is used:

1.4   Register a template/ISO: the template/ISO is downloaded into NFS storage 
first; then a background thread regularly uploads templates/ISOs from NFS 
storage into S3. When the template is in Ready state, it only means the 
template is stored on NFS storage; the admin doesn't know whether the template 
is stored on S3 or not. Even worse, if there are multiple zones, CloudStack 
will copy the template from one zone-wide NFS storage into the NFS storage of 
another zone, even though a region-wide S3 is already available. As the 
template is not uploaded directly to S3 when it is registered, it takes 
several copies to spread the template region-wide.

1.5   Backup snapshot: CloudStack sends a command to the XenServer hypervisor 
to copy the snapshot to NFS storage, then immediately uploads the snapshot 
from NFS storage into S3. When the snapshot is in Backedup state, it means the 
snapshot is not only in NFS storage but also stored on S3.

1.6   Create volume from snapshot: download the snapshot and its parent 
snapshots from S3 into NFS storage, then coalesce and vdi.copy the volume from 
NFS to primary storage.



2.       Then let's talk about how it works on object_store:
If S3 is not used, there is ZERO change from the master branch. NFS secondary 
storage works on object_store the same way it did before.
If S3 is used, and NFS cache storage is also used (which is the default):
   2.1 Register a template/ISO: the template/ISO is uploaded directly to S3; 
there is no extra copy to NFS storage. When the template is in "Ready" state, 
it means the template is stored on S3. This implies that the template is 
available in the region as soon as it's in Ready state. The admin can also 
clearly see the status of the template on S3: what percentage has been 
uploaded, and whether the upload failed or succeeded. Also, if registering the 
template fails for some reason, the admin can issue the register template 
command again. I would say the new way of registering a template into S3 is 
far better than what we did on the master branch.
   2.2 Backup snapshot: same as the master branch: send a command to the 
XenServer host to copy the snapshot into NFS, then upload it to S3.
   2.3 Create volume from snapshot: same as the master branch: download the 
snapshot and its parent snapshots from S3 into NFS, then copy it from NFS to 
primary storage.
From the few typical use cases above, you can see how S3 and NFS cache 
storage are used, and what the difference is between the object_store branch 
and the master branch: basically, we only changed the way a template is 
registered, nothing else.
If S3 is used, and no NFS cache storage is used (this is possible, depending 
on which datamotion strategy is used):
    2.4 Register a template/ISO: same as 2.1.
    2.5 Backup snapshot: export the snapshot from primary storage directly into 
S3.
    2.6 Create volume from snapshot: download the snapshots from S3 directly 
into primary storage, then coalesce them and create the volume.
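
The difference between 2.2/2.3 and 2.5/2.6 can be summarized with a small 
sketch. This is only an illustration in Java: the Operation enum, the 
CopyPathSketch class and the step strings are hypothetical names, not part of 
the object_store code, and the steps are just the ones listed above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: these names are illustrative, not CloudStack APIs.
enum Operation { BACKUP_SNAPSHOT, CREATE_VOLUME_FROM_SNAPSHOT }

final class CopyPathSketch {
    // Returns the ordered data movements for an operation when S3 is the
    // secondary storage, with or without an NFS cache in between.
    static List<String> steps(Operation op, boolean useNfsCache) {
        if (op == Operation.BACKUP_SNAPSHOT) {
            return useNfsCache
                ? Arrays.asList("primary -> NFS cache", "NFS cache -> S3") // 2.2
                : Arrays.asList("primary -> S3");                          // 2.5
        }
        // CREATE_VOLUME_FROM_SNAPSHOT
        return useNfsCache
            ? Arrays.asList("S3 -> NFS cache", "coalesce on NFS cache",
                            "NFS cache -> primary")                        // 2.3
            : Arrays.asList("S3 -> primary", "coalesce on primary");       // 2.6
    }
}
```

Registering a template is left out on purpose, since 2.1 and 2.4 are the same 
path (directly to S3) in both setups.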

I hope the above explanation shows how the system really works on 
object_store, and clears up the misconceptions about the object_store branch. 
Even though the change is huge, we still maintain backward compatibility. If 
you don't want to use S3 and only want to use existing NFS storage, that's 
definitely OK; it works the same as before. If you want to use S3, we provide 
a better S3 implementation for registering templates/ISOs. If you want to use 
S3 without NFS storage, that's also definitely OK; the framework is flexible 
enough to accommodate different solutions.

OK, let's talk about the NFS cache storage issues.
The NFS cache storage issue has been discussed back and forth in several 
threads. All in all, NFS cache storage is only one use case out of the three 
supported by the object_store branch. It's not something where, if it has an 
issue, nothing works.
2.2 and 2.3 above show how the NFS cache storage is involved in 
snapshot-related operations. The complaints that there is no aging policy and 
no capacity planner for NFS cache storage apply when downloading a snapshot 
from S3 into NFS, copying a snapshot from primary storage into NFS, or 
downloading a template from S3 into NFS. Yes, it's an issue: the NFS cache 
storage can be used up if there is no capacity planner and no aging-out 
policy. But can it be fixed? Is it a design issue?
Let's look at the code. There is not much code related to NFS cache storage; 
only one class depends on it: 
https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=engine/storage/datamotion/src/org/apache/cloudstack/storage/motion/AncientDataMotionStrategy.java;h=a01d2d30139f70ad8c907b6d6bc9759d47dcc2d6;hb=refs/heads/object_store
Take copyVolumeFromSnapshot as an example, which is called when creating a 
volume from a snapshot. It first calls cacheSnapshotChain, which calls 
cacheMgr.createCacheObject to download the snapshot into NFS cache storage. 
StorageCacheManagerImpl->createCacheObject is the only place that creates 
objects on NFS cache storage; the code is at 
https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=engine/storage/cache/src/org/apache/cloudstack/storage/cache/manager/StorageCacheManagerImpl.java;h=cb5ea106fed3e5d2135dca7d98aede13effcf7d9;hb=refs/heads/object_store
createCacheObject first finds a cache storage, in case there are multiple 
cache storages available in a scope:
DataStore cacheStore = this.getCacheStorage(scope);
getCacheStorage calls StorageCacheAllocator to find a proper NFS cache 
storage. So StorageCacheAllocator is the place to choose an NFS cache storage 
based on certain criteria. The current implementation just chooses one of them 
at random; we can add a new allocator algorithm based on capacity, etc.
Regarding capacity reservation, there is already a table called 
op_host_capacity which has an entry for NFS secondary storage. We can reuse 
this entry to store capacity information about NFS cache storages (such as 
total size, available/used capacity, etc.). So on every call to 
createCacheObject, we can call StorageCacheAllocator to find a proper NFS 
storage based on first-fit criteria, then increase the used capacity in the 
op_host_capacity table. If creating the cache object fails, return the 
capacity to op_host_capacity.
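
As a rough illustration of the first-fit plus reservation idea, here is a 
minimal in-memory sketch in Java. It is not the object_store code: CacheStore 
stands in for one capacity row (what op_host_capacity would hold), and 
CacheCapacitySketch, reserve, release and the BooleanSupplier "copy" callback 
are hypothetical names. A real version would update the database table in a 
transaction instead of a field.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for one capacity row of an NFS cache storage.
final class CacheStore {
    final String name;
    final long totalBytes;
    long usedBytes;

    CacheStore(String name, long totalBytes, long usedBytes) {
        this.name = name;
        this.totalBytes = totalBytes;
        this.usedBytes = usedBytes;
    }

    long freeBytes() {
        return totalBytes - usedBytes;
    }
}

// Hypothetical sketch of a capacity-aware allocator with reservation.
final class CacheCapacitySketch {
    private final List<CacheStore> stores;

    CacheCapacitySketch(List<CacheStore> stores) {
        this.stores = stores;
    }

    // First fit: reserve space on the first store that can hold the object.
    synchronized Optional<CacheStore> reserve(long sizeBytes) {
        for (CacheStore store : stores) {
            if (store.freeBytes() >= sizeBytes) {
                store.usedBytes += sizeBytes;
                return Optional.of(store);
            }
        }
        return Optional.empty();
    }

    // Give previously reserved space back to the store.
    synchronized void release(CacheStore store, long sizeBytes) {
        store.usedBytes -= sizeBytes;
    }

    // Reserve, run the copy, and return the capacity if the copy fails.
    boolean createCacheObject(long sizeBytes, BooleanSupplier copy) {
        Optional<CacheStore> store = reserve(sizeBytes);
        if (!store.isPresent()) {
            return false; // no cache storage has enough free space
        }
        boolean ok;
        try {
            ok = copy.getAsBoolean();
        } catch (RuntimeException e) {
            ok = false;
        }
        if (!ok) {
            release(store.get(), sizeBytes);
        }
        return ok;
    }
}
```

Reserving before the copy starts is the point of the design: two concurrent 
downloads cannot both count on the same free space.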

Regarding the aging-out policy, we can start a background thread on the mgt 
server which scans all the objects created on NFS cache storage (in the tables 
snapshot_store_ref, template_store_ref and volume_store_ref). Each entry in 
these tables has a column called "updated"; every time the object's state 
changes, the "updated" column is updated as well. When does the object's state 
change? Every time the object is used in some context (such as copying the 
snapshot on NFS cache storage to somewhere else), the object's state changes 
accordingly, e.g. to "Copying", which means the object is being copied to some 
place. That is exactly the information we need to implement an LRU algorithm.
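
A minimal sketch of the selection step of such an LRU policy, in Java, could 
look like this. CacheEntry and LruAgingSketch are hypothetical names; an entry 
stands for one row of the *_store_ref tables with its "updated" value. 
Skipping objects that are currently in use (e.g. in "Copying" state) is my 
assumption here, not something the current code does.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for one row of snapshot_store_ref,
// template_store_ref or volume_store_ref on an NFS cache storage.
final class CacheEntry {
    final String id;
    final long sizeBytes;
    final long updatedMillis; // value of the "updated" column
    final boolean inUse;      // assumption: e.g. state is "Copying"

    CacheEntry(String id, long sizeBytes, long updatedMillis, boolean inUse) {
        this.id = id;
        this.sizeBytes = sizeBytes;
        this.updatedMillis = updatedMillis;
        this.inUse = inUse;
    }
}

final class LruAgingSketch {
    // Picks the least recently updated entries, oldest first, until the
    // selected entries cover at least bytesToFree (or none are left).
    static List<CacheEntry> pickVictims(List<CacheEntry> entries, long bytesToFree) {
        List<CacheEntry> candidates = new ArrayList<>();
        for (CacheEntry entry : entries) {
            if (!entry.inUse) {
                candidates.add(entry);
            }
        }
        candidates.sort(Comparator.comparingLong((CacheEntry e) -> e.updatedMillis));

        List<CacheEntry> victims = new ArrayList<>();
        long freed = 0;
        for (CacheEntry entry : candidates) {
            if (freed >= bytesToFree) {
                break;
            }
            victims.add(entry);
            freed += entry.sizeBytes;
        }
        return victims;
    }
}
```

The background thread would call something like pickVictims periodically, 
delete the returned objects from the NFS cache storage, and return their 
capacity.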

What do you guys think about the fix? If you have a better solution, please 
let me know.
