We have had a very positive experience with OpenAFS + ZFS. We run it on x86
hardware (different vendors) on top of Solaris 11 x86. We've managed to
save lots of money thanks to ZFS built-in compression, while improving
performance, reliability, manageability, etc.

See my slides at
http://conferences.inf.ed.ac.uk/eakc2012/slides/AFS_on_Solaris_ZFS.pdf
or at
http://www.ukoug.org/what-we-offer/library/openafs-on-solaris-11-x86-robert-milkowski/afs-on-solaris-zfs-losug.pdf


Some quick comments/recommendations when running OpenAFS on ZFS:

        - enable compression

                - LZJB - you should usually get up to a 3x compression ratio
                  with no performance impact, and often performance will
                  actually improve

                - GZIP - much better compression; on modern 2-socket x86
                  servers you can still easily saturate more than 2 Gbps
                  when writing, and much more when reading

                - set the record size to 1MB (Solaris 11 only) or leave it
                  at the default 128KB. This usually improves the
                  compression ratio and often improves performance as well,
                  unless you have a workload with lots of small random reads
                  or writes and a working set much bigger than the amount of
                  RAM available

                        - make sure you have the patch so AFS doesn't sync
                          its special files after each metadata update but
                          rather at the end of an operation (volume
                          restores, etc.). This has been integrated for some
                          time; I'm not sure in which OpenAFS release though
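As a sketch of the compression and record-size settings above (the pool name `afspool` is just an example; adjust to your environment):

```shell
# Enable LZJB (fast) or gzip (denser) compression on the pool;
# file systems created in the pool inherit the setting.
zfs set compression=lzjb afspool      # or: zfs set compression=gzip afspool

# 1MB records on Solaris 11; on older releases keep the 128KB default.
zfs set recordsize=1M afspool

# Check how well your data actually compresses.
zfs get compressratio afspool
```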

        - do RAID in ZFS if possible

          This allows ZFS not only to detect data corruption (yes, it does
          happen) but also to fix it, in a way that is entirely transparent
          to OpenAFS.

          For really important data and RW volumes, perhaps do RAID on a
          disk array and then mirror in ZFS across two disk arrays.

                - if you want to use one of the RAID-Z levels in ZFS, and
                  you do lots of random physical reads to lots of files,
                  then make sure you run Solaris 11 (pool version 29 or
                  newer, with the RAID-Z/mirror hybrid allocator), or
                  consider a different RAID level, or HW RAID-5 with ZFS on
                  top of it
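For example (device names are placeholders), a pool mirrored in ZFS across LUNs from two different disk arrays might look like:

```shell
# Each mirror vdev pairs a LUN from array 1 (c1t*) with one from
# array 2 (c2t*), so ZFS can detect *and* repair corruption from
# the surviving copy even if a whole array misbehaves.
zpool create afspool \
    mirror c1t0d0 c2t0d0 \
    mirror c1t1d0 c2t1d0

# Verify the mirror layout and check for checksum errors.
zpool status afspool
```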

        - disable access time updates

          OpenAFS doesn't rely on them, so you will save some unnecessary
          I/O. You can disable it for the entire pool, and by default all
          file systems within the pool will inherit the setting:

                  zfs set atime=off pool

        - create multiple vicep partitions in each ZFS pool

          This is due to poor OpenAFS scalability (a single thread per
          vicep partition for the initial pre-attachment, etc.). Having
          multiple partitions allows you to better saturate the available
          I/O; this is especially true if your underlying storage is fast
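A minimal sketch (names are examples): several ZFS file systems in one pool, each mounted as a separate vicep partition:

```shell
# One file system per vicep partition; all of them inherit
# pool-level settings such as compression and atime=off.
for p in a b c d; do
    zfs create -o mountpoint=/vicep$p afspool/vicep$p
done

# Confirm the layout.
zfs list -r afspool
```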

        - ZFS on disk arrays

          By default ZFS sends a SCSI command to flush the cache when it
          closes a transaction (with a special bit set, so a disk array
          should flush its cache only if it is not currently protected).
          Unfortunately, some disk arrays flush the cache every time,
          regardless of whether the bit is set, which can affect
          performance *very* badly. In most cases when running ZFS on top
          of a disk array it makes sense to disable sending the SCSI flush
          command entirely - in Solaris you can do this either per LUN or
          for the entire host. Most disk arrays will automatically go into
          pass-through mode if the cache is not protected (dead battery,
          broken mirroring, etc.). Depending on your workload, disabling
          cache flushes can dramatically improve performance.

          Add to /etc/system (and reboot the server, or change it live via
          mdb):

                set zfs:zfs_nocacheflush = 1
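To apply the change on a running system without a reboot, the same tunable can be poked via mdb (standard Solaris kernel-debugger syntax; requires root):

```shell
# Write 1 into the kernel variable in the live kernel (-w enables writes).
echo "zfs_nocacheflush/W 1" | mdb -kw

# Read back the current value to verify.
echo "zfs_nocacheflush/X" | mdb -k
```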


        - increase the DNLC size

          If you store millions of files in AFS, increase the DNLC size on
          Solaris by adjusting the ncsize tunable in /etc/system (requires
          a reboot).
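For example, in /etc/system (the value here is only an illustration; size it to your file count and available RAM):

```
* Grow the directory name lookup cache to roughly 4 million entries.
set ncsize = 4000000
```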

        - put in as much RAM as you can

          This is an inexpensive way of greatly improving performance,
          often due to:

                - caching the entire working set in RAM - you will
                  essentially only see writes to your disks. Not only does
                  this make reads much faster, it also makes writes faster,
                  as there is no I/O needed to read data in first
                - ZFS always stores uncompressed data in RAM, so serving
                  the most frequently used data doesn't require any CPU
                  cycles to decompress it
                - ZFS compresses data asynchronously when closing its
                  transaction (by default every 5s). OpenAFS writes data in
                  async mode to all underlying files (except for the
                  special files), so if there is enough RAM to cache all of
                  the writes when restoring a volume or writing some data
                  to AFS, then from a client's perspective there is no
                  performance penalty at all from using compression,
                  especially if your workload is bursty. In most cases you
                  are more likely to hit a bottleneck on your network than
                  on your CPUs anyway

        - if running OpenAFS 1.6, disable the sync thread (the one which
          syncs data every 10s). It is pointless on ZFS (and most other
          file systems) and all it usually does is negatively impact your
          performance; ZFS will sync all data every 5s anyway.

          There is a patch to OpenAFS 1.6.x to make this tunable. I don't
          remember which release it is in.



All the tuning described above comes down to:

        Create more than one vicep partition in each ZFS pool
        zfs set atime=off pool
        zfs set recordsize=1M pool
        Add to /etc/system: set zfs:zfs_nocacheflush = 1
                            set ncsize = 4000000 (or whatever value
                                makes sense for your server)



The above suggestions might not be best for you, as it all depends on your
specific workload. The compression ratios of course depend on your data,
but I suggest at least trying it on a sample of your data - you might be
pleasantly surprised.


Depending on your workload and configuration you may also benefit from using
SSDs with ZFS.



-- 
Robert Milkowski
http://milek.blogspot.com



> -----Original Message-----
> From: [email protected] [mailto:openafs-info-
> [email protected]] On Behalf Of Douglas E. Engert
> Sent: 17 June 2013 14:56
> To: [email protected]
> Subject: Re: [OpenAFS] OpenAFS on ZFS (Was: Salvaging user volumes)
> 
> In June of 2010, we were running Solaris AFS file servers on Solaris
> with ZFS for partitions on a SATAbeast.
> 
> AFS reported I/O errors from read() that were ZFS checksum errors.
> 
> Turned out the hardware logs on that SATAbeast were reporting problems
> but would continue to serve up the bad data.
> 
> Since ZFS is doing checksums when it writes and then again when it
> reads, ZFS was catching intermittent errors which other systems might
> not catch.
> 
> Here is a nice explanation of how and why ZFS does checksum.
> It also points out other source of corruption that can occur on a SAN.
> 
> http://blogs.sun.com/bonwick/entry/zfs_end_to_end_data
> 
> And this one that sounds a lot like our problem!!
> 
> http://blogs.sun.com/elowe/entry/zfs_saves_the_day_ta
> 
> >> And this is one of the reasons why ZFS is so cool :)
> 
> Yes it is cool!
> 
> >>
> > _______________________________________________
> > OpenAFS-info mailing list
> > [email protected]
> > https://lists.openafs.org/mailman/listinfo/openafs-info
> >
> 
> --
> 
>   Douglas E. Engert  <[email protected]>
>   Argonne National Laboratory
>   9700 South Cass Avenue
>   Argonne, Illinois  60439
>   (630) 252-5444
> _______________________________________________
> OpenAFS-info mailing list
> [email protected]
> https://lists.openafs.org/mailman/listinfo/openafs-info
