On Wednesday, February 23, 2005 01:08:29 PM -0800 Mike Fedyk <[EMAIL PROTECTED]> wrote:
Jeffrey Hutzelman wrote:
On Wednesday, February 23, 2005 11:44:17 AM -0800 Mike Fedyk <[EMAIL PROTECTED]> wrote:
1) r/w volumes pause all activity during a replication release
2) backups need to show only the blocks changed since the last backup (instead of entire files)
3) (it seems) volume releases copy the entire volume image instead of the changes since the last release
It looks like all of these issues would be solved in part or completely with the introduction of "Multiple volume versions" (as listed on the openafs.org web site under projects).
1) would be solved by creating a clone before a release and releasing from that.
That already happens. But the cloning operation takes time, and the source volume is indeed busy during that time.
Interesting. Is there documentation on the AFS format and how it does COW? I'm familiar with Linux LVM; presumably AFS uses similar concepts, except that doing COW at the filesystem level can be more powerful and complicated than at the block device layer.
In LVM basically a snapshot/clone just requires a small volume for block pointers, and incrementing the user count on the PEs (physical extents). How does AFS do this, and why is it taking a noticeable amount of time (also what is the AFS equivalent of PE block size)?
There is not particularly much documentation on the on-disk structure of AFS fileserver data; you pretty much need to UTSL.
AFS does copy-on-write at the per-vnode layer. Each vnode has metadata which is kept in the volume's vnode indices; among other things, this includes the identifier of the physical file which contains the vnode's contents (for the inode fileserver, this is an inode number; for namei it's a 64-bit "virtual inode number" which can be used to derive the filename). The underlying inode has a link count (in the filesystem for inode; in the link table for namei) which reflects how many vnodes have references to that inode. When you write to a vnode whose underlying inode has more than one reference, the fileserver allocates a new one for the vnode you're writing to, and copies the contents.
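The vnode-level COW described above can be sketched as a toy model. This is purely illustrative (real AFS fileserver code is C, and the actual structures differ); the `Inode`/`Vnode` classes here are hypothetical stand-ins for the physical file plus link count and the per-file metadata entry:

```python
# Toy model of vnode-level copy-on-write via inode link counts.
# Hypothetical names; not actual AFS data structures.

class Inode:
    """Stands in for the physical file holding a vnode's contents."""
    def __init__(self, data):
        self.data = data
        self.nlink = 1   # how many vnodes reference this physical file

class Vnode:
    """Per-file metadata entry; points at an underlying inode."""
    def __init__(self, inode):
        self.inode = inode

    def clone(self):
        # Cloning shares the inode and bumps its link count; no data is copied.
        self.inode.nlink += 1
        return Vnode(self.inode)

    def write(self, data):
        if self.inode.nlink > 1:
            # Inode is shared with a clone: allocate a fresh inode and
            # copy the old contents before modifying (copy-on-write).
            self.inode.nlink -= 1
            self.inode = Inode(self.inode.data)
        self.inode.data = data

# usage: a write to the base volume leaves the clone's view untouched
base = Vnode(Inode(b"hello"))
snap = base.clone()
base.write(b"world")
assert snap.inode.data == b"hello"
assert base.inode.data == b"world"
```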
A cloned volume has its own vnode indices. The cloning process basically involves creating new indices and incrementing the link count on all of the underlying inodes. Unfortunately, usually you are either updating or removing an existing clone, which means decrementing the link counts on all of its vnodes, and possibly actually freeing the associated data. On a volume with lots of files, this turns out to be time-consuming.
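Why this is time-consuming follows directly from the description: both creating and removing a clone must walk the entire vnode index, touching one link count per file. A hedged sketch (the dict-based index and `link_table` are stand-ins for the on-disk vnode index and namei link table, not real AFS interfaces):

```python
# Hypothetical sketch of clone creation and removal over a vnode index.
# Illustrates why updating/removing a clone touches every file in the volume.

def create_clone(volume_index, link_table):
    """Clone = a new index referencing the same inodes, each nlink bumped."""
    clone_index = dict(volume_index)          # copy vnode -> inode mapping
    for inode in clone_index.values():
        link_table[inode] += 1                # one increment per file
    return clone_index

def remove_clone(clone_index, link_table, free_inode):
    """Removal decrements every link count and frees orphaned inodes."""
    for inode in clone_index.values():
        link_table[inode] -= 1
        if link_table[inode] == 0:
            free_inode(inode)                 # actually frees the data
    clone_index.clear()
```

Both operations are linear in the number of vnodes, which is why a volume with many files is busy for a noticeable time during a release.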
2) would be solved by creating a clone each time there is a backup and comparing it to the previous backup clone.
3) would be a similar process with volume releases.
This is not a bad idea. Of course, you still have problems if the start time for the dump you want doesn't correspond to a clone you already have, but that situation can probably be avoided in practice.
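The clone-comparison idea above falls out of the COW scheme: a vnode can only have been modified since the previous clone if the write replaced its underlying inode, so comparing inode identifiers between two clone indices finds everything an incremental dump must carry. A minimal sketch, assuming dict-based indices as before (not a real AFS dump format):

```python
# Hypothetical sketch: compute an incremental dump by diffing two clone
# indices. Under copy-on-write, a vnode whose inode identifier is unchanged
# between clones cannot have been modified in between.

def changed_vnodes(old_index, new_index):
    """Return (changed-or-new vnodes, deleted vnodes) since old_index."""
    changed = []
    for vnode, inode in new_index.items():
        if old_index.get(vnode) != inode:
            changed.append(vnode)        # new, or rewritten since old clone
    deleted = [v for v in old_index if v not in new_index]
    return changed, deleted
```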
Yes, and that is why I said it would be a specific and separate clone just for the incremental backups, so that its timing isn't tied to the backup clone, for instance.
Yes, but that doesn't necessarily solve the problem. Real backup scenarios may involve multiple levels of backups, which necessitates multiple clones. And, situations can arise in which the clone you have does not have precisely the right date; for example, the backup system may lose a dump for some reason (lost/damaged tape; the backup machine crashed before syncing the backup database to disk, etc). In a well-designed backup system, these cases should be rare, but they will occur.
It should be noted that I do not consider the bu* tools that come with AFS to have anything to do with a well-designed backup system.
An interesting thought would be to clone a replicated volume on another machine that has more, but slower, storage than the fileserver holding the active r/w volumes, and to run the backups from that machine to keep load down on the r/w fileserver.
An interesting idea, except that you forget how the replicated volume gets there - by a release process which involves creating a temporary release clone.
Replicated volumes are designed to provide load-balancing and availability for data which is infrequently written but either frequently read or important to have available (often both properties apply to the same data), like software. They are not designed to provide backup copies of frequently-written data like user directories. Within their design goals, they work quite well:
- volumes can be replicated over many servers to meet demand
- making the R/W unavailable for a while during a release is not a big deal because it is never accessed by normal users
- not having failover is unimportant because if the server containing your R/W volumes is down, then you have a fileserver down and should be fixing it instead of updating R/W volumes.
The first step can just allow the admin to mount the clone volumes where they want them. OAFS allows you to have volumes that are not connected to the AFS tree that users see, doesn't it?
Yes, you can have volumes that are not mounted anywhere. The issue is how to name and refer to these clones. The "normal" RO and BK clones appear in the same VLDB entry as the base volume, and the cache manager and other tools know to strip off the .backup or .readonly suffix to get the name they must look up in the VLDB.
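The suffix convention described above can be sketched briefly. The volume names below are made up for illustration, and the real cache manager is C code; this only shows the naming rule, and why arbitrary snapshots don't fit it:

```python
# Sketch of the well-known clone-suffix convention: tools strip
# ".backup" or ".readonly" to find the base volume's VLDB entry.

KNOWN_SUFFIXES = (".backup", ".readonly")

def base_volume_name(name):
    """Derive the VLDB lookup name from a volume name."""
    for suffix in KNOWN_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name

# usage (hypothetical volume names)
assert base_volume_name("user.mfedyk.backup") == "user.mfedyk"
assert base_volume_name("root.cell.readonly") == "root.cell"
```

A snapshot named, say, "user.mfedyk.2005-02-23" carries no such well-known suffix, so multiple user-visible snapshots would need a new naming and lookup mechanism.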
I guess what I'm trying to say is that the item in the roadmap is not "be able to have multiple clones of a volume", because we have had that for quite some time. The roadmap item _is_ to have a user-visible volume snapshotting mechanism where you can find and access multiple snapshots of a volume.
--
Jeffrey T. Hutzelman (N3NHS) <[EMAIL PROTECTED]>
Sr. Research Systems Programmer
School of Computer Science - Research Computing Facility
Carnegie Mellon University - Pittsburgh, PA
_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
