On Tue, 23 Jun 2009, milosz wrote:

> is this a direct write to a zfs filesystem or is it some kind of zvol export?

This is a direct write to a ZFS filesystem implemented as six mirrors of 15K RPM 300GB drives on a Sun StorageTek 2500. This setup tests very well under iozone and performs remarkably well when extracting from large tar files.
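
For reference, the pool layout is essentially six two-way mirrors, i.e. something like the following (the device names here are made up for illustration; the real pool uses twelve 300GB drives presented by the 2500):

zpool create Sun_2540 \
    mirror c4t0d0 c4t1d0 \
    mirror c4t2d0 c4t3d0 \
    mirror c4t4d0 c4t5d0 \
    mirror c4t6d0 c4t7d0 \
    mirror c4t8d0 c4t9d0 \
    mirror c4t10d0 c4t11d0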

> anyway, sounds similar to this:
>
> http://opensolaris.org/jive/thread.jspa?threadID=105702&tstart=0

Yes, this does sound very similar. It looks to me like data from recently read files is clogging the ARC, so that there is no room left to buffer writes when ZFS periodically goes to commit unwritten data. The "Perfmeter" tool shows that almost all disk I/O occurs during a brief interval of time. The storage array is capable of writing at high rates, but ZFS hits it with huge periodic writes which are surely much larger than what the array's internal buffering can handle.

What is clear to me is that my drive array is "loafing". The application runs much slower than expected, and ZFS is to blame: the observed write rate could be sustained by a single fast disk drive. In fact, if I direct the output to a single SAS drive formatted with UFS, the observed performance is fairly similar, except that there are no stalls until iostat reports the drive as extremely busy (close to 99%). When the UFS-formatted drive is reported 60% busy (at 48MB/second), application execution is very smooth. If a similar rate is sent to the ZFS pool (52.9MB/second according to zpool iostat), with the individual drives in the pool reported 5 to 33% busy (24-31% on a 60-second average), then execution stutters for three seconds at a time as the 1.5GB to 3GB of "written" data which has been batched up is suddenly flushed.
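
If the txg batching really is the culprit, then bounding the ARC and the per-txg write limit should smooth this out. A minimal sketch of the /etc/system settings I have in mind (the values are illustrative assumptions, not tested recommendations, and a reboot is needed for them to take effect):

* Cap the ARC at 4GB so cached read data cannot crowd out buffered writes.
set zfs:zfs_arc_max = 0x100000000
* Cap the dirty data batched into a single txg at 512MB so each periodic
* flush stays closer to what the array's internal cache can absorb.
set zfs:zfs_write_limit_override = 0x20000000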

Something else interesting I notice is that performance is not consistent over time:

% zpool iostat Sun_2540 60
                capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
Sun_2540     460G  1.18T    368    447  45.7M  52.9M
Sun_2540     463G  1.18T    336    400  42.1M  47.5M
Sun_2540     465G  1.17T    341    400  42.6M  47.2M
Sun_2540     469G  1.17T    280    473  34.8M  55.9M
Sun_2540     472G  1.17T    286    449  35.5M  52.5M
Sun_2540     474G  1.17T    338    391  42.1M  45.7M
Sun_2540     477G  1.16T    332    400  41.3M  47.0M
Sun_2540     479G  1.16T    300    356  37.5M  41.4M
Sun_2540     482G  1.16T    314    381  39.3M  43.8M
Sun_2540     485G  1.15T    520    479  63.0M  55.9M
Sun_2540     490G  1.15T    564    722  67.3M  84.7M
Sun_2540     494G  1.15T    586    539  70.4M  63.1M
Sun_2540     499G  1.14T    549    698  66.9M  81.9M
Sun_2540     504G  1.14T    547    749  65.6M  87.7M
Sun_2540     507G  1.13T    584    495  70.8M  57.8M
Sun_2540     512G  1.13T    544    822  64.9M  91.1M
Sun_2540     516G  1.13T    596    527  72.0M  60.4M
Sun_2540     521G  1.12T    561    759  68.0M  87.2M
Sun_2540     526G  1.12T    548    779  65.9M  88.6M

A 2X variation in minute-to-minute performance while performing consistently similar operations is remarkable. Also notice that the write data rates are gradually increasing (on average) even though the task being performed remains the same.

Here is a Perfmeter graph showing what is happening in normal operation:

http://www.simplesystems.org/users/bfriesen/zfs-discuss/perfmeter-stalls.png

and here is one which shows what happens if fsync() is used to force the file data entirely to disk immediately after each file has been written:

http://www.simplesystems.org/users/bfriesen/zfs-discuss/perfmeter-fsync.png
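
For the curious, the fsync() variant is essentially the following per-file pattern (a minimal C sketch with error handling trimmed; the real application is more involved):

#include <fcntl.h>
#include <unistd.h>

/* Write one output file and force its data to disk before returning.
 * This is the write-then-fsync pattern behind the second graph: each
 * file is pushed out immediately instead of accumulating in memory
 * for the next periodic transaction-group commit. */
static int
write_file_synced(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        (void) close(fd);
        return -1;
    }
    return close(fd);
}

The tradeoff is a synchronous wait on every file, but it keeps gigabytes of dirty data from piling up before ZFS decides to flush.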

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/