On 20 Nov 07, at 1756, David Lang wrote:


> however, an fsync on a journaled filesystem just means the data needs
> to be written to the journal; it doesn't mean that the journal needs
> to be flushed to disk.
>
> on ext3, if you have data=journal, then your data is in the journal
> as well, and all the system needs to do on an fsync is write things
> to the journal (a nice sequential write).

That assumes the journal is on a distinct device, and that the distinct device can take the load. On ZFS it isn't, although work on separate log devices is in progress. One of the many benefits of the sadly underrated Solaris DiskSuite product was its metatrans devices, which at least permitted metadata updates to go to a distinct device; when the UFS logging code went into core Solaris (the ON integration), that facility was dropped. My Pillar NFS server does data logging to distinct disk groups, but mostly, like such boxes tend to do, it relies on 12GB of RAM and a battery.

A sequential write is only a benefit if the head is in the right place, the platter is at the right rotational position, and the write is well matched to the transfer rate of the spindle: if the spindle is doing large sequential writes while also servicing reads and writes elsewhere, or can't keep up with writing tracks flat out, the problems compound.
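For concreteness, the pattern we're discussing is nothing more exotic than write() followed by fsync(). A minimal C sketch (the path and payload are invented for illustration):

  /* Write-then-fsync: on a journaled filesystem the fsync() may
   * return once the transaction is committed to the journal (e.g.
   * ext3 data=journal), not when the data reaches its final on-disk
   * location. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      static const char msg[] = "queued message\n";
      int fd = open("/tmp/spool.example", O_WRONLY | O_CREAT | O_TRUNC, 0600);

      if (fd < 0) { perror("open"); return 1; }
      if (write(fd, msg, sizeof msg - 1) < 0) { perror("write"); return 1; }
      /* Durability barrier: data must survive a crash once this returns. */
      if (fsync(fd) < 0) { perror("fsync"); return 1; }
      return close(fd) < 0;
  }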

> for cyrus you should have the same sort of requirements that you
> would have for a database server, including the fact that without a
> battery-backed disk cache (or solid-state drive) to handle your
> updates, you end up being throttled by your disk rotation rate (you
> can only do a single fsync write per rotation, and that only if you
> don't have to seek). RAID 5/6 arrays are even worse, as almost all
> systems will require a read of the entire stripe before writing a
> single block (and its parity block) back out, and since the stripe
> is frequently larger than the OS readahead, the OS throws much of
> the data away immediately.
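The rotation arithmetic is easy to check: 7,200 RPM is 120 revolutions a second, so a bare write-plus-fsync loop with no write cache in the path should top out somewhere near 120/sec. A rough C probe (loop count and filename are arbitrary; point it at the filesystem you care about):

  /* Crude fsync-rate probe: rewrites one byte at offset 0 and fsyncs
   * it in a loop. A bare 7,200 RPM spindle should report on the order
   * of 120/sec; a battery-backed cache or SSD should report far more. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/time.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
      const char *path = argc > 1 ? argv[1] : "fsync.probe";
      int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
      struct timeval t0, t1;
      int i, n = 500;
      double secs;

      if (fd < 0) { perror("open"); return 1; }
      gettimeofday(&t0, NULL);
      for (i = 0; i < n; i++) {
          if (pwrite(fd, "x", 1, 0) != 1) { perror("pwrite"); return 1; }
          if (fsync(fd) < 0)              { perror("fsync");  return 1; }
      }
      gettimeofday(&t1, NULL);
      secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
      printf("%d fsyncs in %.2fs: %.0f/sec\n", n, secs, n / secs);
      return close(fd) < 0;
  }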

> if we can identify the files that are the bottlenecks it would be
> very interesting to see the result of putting them on a solid-state
> drive.

I've split the meta-data out into separate partitions (imapd.conf sketch after the fsstat figures). The meta-data is stored in ZFS filesystems in a pool which is a four-disk RAID 0+1 group of SAS drives; the message data comes out of the lowest QoS band on my Pillar. A ten-second fsstat sample of VM operations shows that, by request (this measures filesystem activity, not the implied disk activity), it's the meta partitions taking the pounding:

  map addmap delmap getpag putpag pagio
    0      0      0     45      0     0 /var/imap
   11     11     11     17      0     0 /var/imap/meta-partition-1
  290    290    290    463      5     0 /var/imap/meta-partition-2
  139    139    139    183      3     0 /var/imap/meta-partition-3
   66     66     66    106     10     0 /var/imap/meta-partition-7
  347    347    342    454     16     0 /var/imap/meta-partition-8
   57     57     57     65      5     0 /var/imap/meta-partition-9
    4      4      8      4      0     0 /var/imap/partition-1
   11     11     22     14      0     0 /var/imap/partition-2
    1      1      2      1      0     0 /var/imap/partition-3
    6      6     12     49     10     0 /var/imap/partition-7
   15     15     28    457      0     0 /var/imap/partition-8
    1      1      2      2      0     0 /var/imap/partition-9

Similarly, by non-VM operation:

 new  name   name  attr  attr lookup rddir  read read  write write
 file remov  chng   get   set    ops   ops   ops bytes   ops bytes
    0     0     0 2.26K     0  6.15K     0     0     0    45 1.22K /var/imap
    0     0     0   356     0    707     0     0     0     6 3.03K /var/imap/meta-partition-1
    3     0     3   596     0    902     0     6  135K    90  305K /var/imap/meta-partition-2
    0     0     0   621     0  1.08K     0     0     0     3 1.51K /var/imap/meta-partition-3
    3     0     3 1.04K     0  1.70K     0     6  149K    36  650K /var/imap/meta-partition-7
    0     0     0 2.28K     0  4.24K     0     0     0     7 1.87K /var/imap/meta-partition-8
    0     0     0    18     0     32     0     0     0     2   176 /var/imap/meta-partition-9
    2     2     2    22     0     30     0     1 2.37K     2 7.13K /var/imap/partition-1
    3     4    12    84     0    157     0     1   677     3 7.51K /var/imap/partition-2
    1     1     1 1.27K     0  2.16K     0     0     0     1 3.75K /var/imap/partition-3
    2     2     4    35     0     56     0     1 3.97K    36  279K /var/imap/partition-7
    1     2     1   256     0    514     0     0     0     1 3.75K /var/imap/partition-8
    0     0     0     0     0      0     0     0     0     0     0 /var/imap/partition-9
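For the record, the meta/data split itself is just the Cyrus 2.3 metapartition support in imapd.conf; per partition it looks roughly like the below (the metapartition_files list is from memory, so treat it as a sketch and check imapd.conf(5)):

  partition-1: /var/imap/partition-1
  metapartition-1: /var/imap/meta-partition-1
  metapartition_files: header index cache expunge squat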


And looking at the real IO load, ten seconds of zpool iostat (for the meta-data and /var/imap)

                 capacity     operations    bandwidth
pool           used  avail   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
pool1         51.6G  26.4G      0    142  54.3K  1001K
  mirror      25.8G  13.2G      0     68  38.4K   471K
    c0t0d0s4      -      -      0     36  44.7K   471K
    c0t1d0s4      -      -      0     36      0   471K
  mirror      25.8G  13.2G      0     73  15.9K   530K
    c0t2d0s4      -      -      0     40  28.4K   531K
    c0t3d0s4      -      -      0     39  6.39K   531K
------------  -----  -----  -----  -----  -----  -----

is very different from ten seconds of sar for the NFS devices:

09:46:34   device        %busy   avque   r+w/s  blks/s  avwait  avserv

[...]
           nfs73             1     0.0       3     173     0.0     4.2
           nfs86             3     0.1      12     673     0.0     6.5
           nfs87             0     0.0       0       0     0.0     0.0
           nfs89             0     0.0       0       0     0.0     0.0
           nfs96             0     0.0       0       0     0.0     1.8
           nfs101            1     0.0       1      25     0.0     8.0
           nfs102            0     0.0       0       4     0.0     9.4

The machine has a _lot_ of memory (32GB), so it's likely that mail which is delivered and then read within ten minutes is never read back from the message store: seen from the server, the NFS load is almost entirely writes.

ian
