Re: [zfs-discuss] Sync Write - ZIL log performance - Feedback for ZFS developers?

2010-06-14 Thread Roy Sigurd Karlsbakk





On 04/10/10, Neil Perrin wrote:

- If synchronous writes are large (over 32K) and block aligned, then the 
blocks are written directly to the pool and only a small record is 
written to the log. Later, when the txg commits, the blocks are just 
linked into the txg. However, this processing is not done if there are 
any slogs, because I found it didn't perform as well. It probably ought 
to be re-evaluated. 
Won't this affect NFS/iSCSI performance pretty badly where the ZIL is crucial? 

Best regards 

roy 
-- 
Roy Sigurd Karlsbakk 
(+47) 97542685 
r...@karlsbakk.net 
http://blogg.karlsbakk.net/ 
-- 
In all pedagogy it is essential that the curriculum be presented 
intelligibly. It is an elementary imperative for all pedagogues to avoid 
excessive use of idioms of foreign origin. In most cases, adequate and 
relevant synonyms exist in Norwegian. 


[zfs-discuss] Sync Write - ZIL log performance - Feedback for ZFS developers?

2010-04-10 Thread Edward Ned Harvey
Neil or somebody?  Actual ZFS developers?  Taking feedback here?   ;-)

While I was putting my poor little server through cruel and unusual
punishment as described in my post a moment ago, I noticed something
unexpected:

I expected that while I'm stressing my log device by infinite sync writes,
my primary storage devices would also be busy(ish).  Not really busy, but
not totally idle either.  Since the primary storage is a stripe of spindle
mirrors, obviously it can handle much more sustainable throughput than the
individual log device, but the log device can respond with smaller latency.
What I noticed was this:

For several seconds, *only* the log device is busy.  Then it stops, and for
maybe 0.5 secs *only* the primary storage disks are busy.  Repeat, recycle.

I expected to see the log device busy nonstop.  And the spindle disks
blinking lightly.  As long as the spindle disks are idle, why wait for a
larger TXG to be built?  Why not flush out smaller TXG's as long as the
disks are idle?  But worse yet ... During the 1-second (or 0.5 second) that
the spindle disks are busy, why stop the log device?  (Presumably also
stopping my application that's doing all the writing.)

This means, if I'm doing zillions of *tiny* sync writes, I will get the best
performance with the dedicated log device present.  But if I'm doing large
sync writes, I would actually get better performance without the log device
at all.  Or else ... add just as many log devices as I have primary storage
devices.  Which seems kind of crazy.
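
For reference, the load in question amounts to a loop of tiny
synchronous writes.  A minimal sketch is below; the pool path and the
4K write size are illustrative stand-ins, not the exact test:

/*
 * Minimal sync-write load generator (illustrative sketch only).
 * Every write is O_DSYNC, so each one must reach stable storage
 * before pwrite() returns; with a slog, that means the log device.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	char buf[4096];		/* one "tiny" synchronous write */
	int fd = open("/tank/syncwrite.dat",
	    O_WRONLY | O_CREAT | O_DSYNC, 0644);

	if (fd < 0) {
		perror("open");
		return (1);
	}
	(void) memset(buf, 'x', sizeof (buf));

	/* Overwrite the same block forever: pure sync-write pressure. */
	for (;;) {
		if (pwrite(fd, buf, sizeof (buf), 0) !=
		    (ssize_t)sizeof (buf)) {
			perror("pwrite");
			break;
		}
	}
	return (close(fd) != 0);
}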



Re: [zfs-discuss] Sync Write - ZIL log performance - Feedback for ZFS developers?

2010-04-10 Thread Bob Friesenhahn

On Sat, 10 Apr 2010, Edward Ned Harvey wrote:


For several seconds, *only* the log device is busy.  Then it stops, 
and for maybe 0.5 secs *only* the primary storage disks are busy.  
Repeat, recycle.


I expected to see the log device busy nonstop.  And the spindle 
disks blinking lightly.  As long as the spindle disks are idle, why 
wait for a larger TXG to be built?  Why not flush out smaller TXG’s 
as long as the disks are idle?  But worse yet … During the 1-second 
(or 0.5 second) that the spindle disks are busy, why stop the log 
device?  (Presumably also stopping my application that’s doing all 
the writing.)


What you are seeing is expected, and it is good.  The intent log 
allows synchronous writes to be turned into lazy ordinary writes (like 
async writes) in the next TXG cycle.  Since the intent log is on an 
SSD, the pressure of serving that function is taken off the primary 
disks, so you will not see as many IOPS to them.


This means, if I’m doing zillions of *tiny* sync writes, I will get 
the best performance with the dedicated log device present.  But if 
I’m doing large sync writes, I would actually get better performance 
without the log device at all.  Or else … add just as many log 
devices as I have primary storage devices.  Which seems kind of 
crazy.


If this is really a problem for you, then you should be able to 
somewhat resolve it by placing a smaller cap on the maximum size of a 
TXG.  Then the system will write more often.  However, the maximum 
synchronous bulk write rate will still be limited by the bandwidth of 
your intent log devices.  Huge synchronous bulk writes are pretty rare, 
since usually the bottleneck is elsewhere, such as the Ethernet.
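
To put a number on that ceiling (illustrative figures, not
measurements): if the slog sustains roughly 100 MB/s of sequential
writes while the striped mirrors sustain roughly 600 MB/s, then

    max sync bulk rate = min(B_slog, B_pool)
                       = min(100 MB/s, 600 MB/s) = 100 MB/s

so the extra pool bandwidth helps only asynchronous and txg traffic.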


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] Sync Write - ZIL log performance - Feedback for ZFS developers?

2010-04-10 Thread Daniel Carosone
On Sat, Apr 10, 2010 at 11:50:05AM -0500, Bob Friesenhahn wrote:
 Huge synchronous bulk writes are pretty rare since usually the 
 bottleneck is elsewhere, such as the ethernet.

Also, large writes can go straight to the pool, and the ZIL only logs
the intent to commit those blocks (i.e., link them into the ZFS data
structure).  I don't recall what the threshold for this is, but I
think it's one of those Evil Tunables.

--
Dan.




Re: [zfs-discuss] Sync Write - ZIL log performance - Feedback for ZFS developers?

2010-04-10 Thread Neil Perrin

On 04/10/10 09:28, Edward Ned Harvey wrote:


Neil or somebody?  Actual ZFS developers?  Taking feedback here?   ;-)

While I was putting my poor little server through cruel and unusual 
punishment as described in my post a moment ago, I noticed something 
unexpected:

I expected that while I'm stressing my log device by infinite sync 
writes, my primary storage devices would also be busy(ish).  Not 
really busy, but not totally idle either.  Since the primary storage 
is a stripe of spindle mirrors, obviously it can handle much more 
sustainable throughput than the individual log device, but the log 
device can respond with smaller latency.  What I noticed was this:

For several seconds, *only* the log device is busy.  Then it stops, 
and for maybe 0.5 secs *only* the primary storage disks are busy.  
Repeat, recycle.




These are the txgs getting pushed out.

I expected to see the log device busy nonstop.  And the spindle disks 
blinking lightly.  As long as the spindle disks are idle, why wait for 
a larger TXG to be built?  Why not flush out smaller TXG's as long as 
the disks are idle?


Sometimes it's more efficient to batch up requests: fewer blocks are 
written.  As you mentioned, you weren't stressing the system heavily. 
ZFS will perform differently when under pressure.  It will shorten the 
time between txgs if the data arrives more quickly.


  But worse yet ... During the 1-second (or 0.5 second) that the 
spindle disks are busy, why stop the log device?  (Presumably also 
stopping my application that's doing all the writing.)


Yes, this has been observed by many people.  There are two sides to 
this problem, related to the CPU and I/O used while pushing a txg:

6806882 need a less brutal I/O scheduler
6881015 ZFS write activity prevents other threads from running in a 
timely manner


The CPU side (6881015) was fixed relatively recently in snv_129.

This means, if I'm doing zillions of *tiny* sync writes, I will get 
the best performance with the dedicated log device present.  But if 
I'm doing large sync writes, I would actually get better performance 
without the log device at all.  Or else ... add just as many log 
devices as I have primary storage devices.  Which seems kind of crazy.


Yes, you're right: there are times when it's better to bypass the slog 
and use the pool disks, which can deliver better bandwidth.


The algorithm for where and what the ZIL writes has become quite complex:

- There was another change recently to bypass the slog if 1MB had 
already been sent to it and another 2MB were waiting to be sent.
- There's a new property, logbias, which when set to throughput directs 
the ZIL to send all of its writes to the main pool devices, thus freeing 
the slog for more latency-sensitive work (ideal for database data files).
- If synchronous writes are large (over 32K) and block aligned, then the 
blocks are written directly to the pool and only a small record is 
written to the log.  Later, when the txg commits, the blocks are just 
linked into the txg.  However, this processing is not done if there are 
any slogs, because I found it didn't perform as well.  It probably ought 
to be re-evaluated.  (See the sketch after this list.)
- There are further tweaks being suggested which might make it to a 
ZIL near you soon.
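
A rough sketch of that placement logic follows, as C pseudocode; the
identifiers are paraphrased for illustration and are not the actual
OpenSolaris zfs_log_write() source:

/*
 * Illustrative paraphrase of the ZIL write-placement policy described
 * above; simplified, not the real zfs_log_write().
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef enum {
	WR_INDIRECT,	/* data blocks go to the pool; log keeps a pointer */
	WR_COPIED	/* data is copied into the log record itself */
} wr_state_t;

#define	IMMEDIATE_WRITE_SZ	(32 * 1024)	/* zfs_immediate_write_sz */

static wr_state_t
zil_write_placement(size_t resid, bool block_aligned,
    bool have_slog, bool logbias_throughput)
{
	/* logbias=throughput: send everything to the main pool devices. */
	if (logbias_throughput)
		return (WR_INDIRECT);

	/*
	 * Large, block-aligned writes can go straight to the pool with
	 * only a small record in the log; but, as noted above, this
	 * path is currently skipped whenever a slog is present.
	 */
	if (!have_slog && resid > IMMEDIATE_WRITE_SZ && block_aligned)
		return (WR_INDIRECT);

	/* Everything else is written through the log (slog if present). */
	return (WR_COPIED);
}

int
main(void)
{
	/* A 64K aligned sync write with a slog still goes through the log. */
	wr_state_t ws = zil_write_placement(64 * 1024, true, true, false);
	(void) printf("%s\n", ws == WR_INDIRECT ? "WR_INDIRECT" : "WR_COPIED");
	return (0);
}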


Neil.


Re: [zfs-discuss] Sync Write - ZIL log performance - Feedback for ZFS developers?

2010-04-10 Thread Neil Perrin

On 04/10/10 14:55, Daniel Carosone wrote:

On Sat, Apr 10, 2010 at 11:50:05AM -0500, Bob Friesenhahn wrote:
  
Huge synchronous bulk writes are pretty rare since usually the 
bottleneck is elsewhere, such as the ethernet.



Also, large writes can go straight to the pool, and the ZIL only logs
the intent to commit those blocks (i.e., link them into the ZFS data
structure).  I don't recall what the threshold for this is, but I
think it's one of those Evil Tunables.
  

This is zfs_immediate_write_sz, which is 32K by default.  However, this 
currently only happens if you don't have any slogs.  If logbias is set to 
throughput, then all writes go straight to the pool regardless of 
zfs_immediate_write_sz.

Neil.