Re: [zfs-discuss] NFS performance near zero on a very full pool

2010-09-09 Thread Neil Perrin

Arne,

NFS often demands its transactions be stable before returning.
This forces ZFS to handle the system call synchronously. Usually the
ZIL (code) allocates and writes a new block in the intent log chain to
achieve this.

If it ever fails to allocate a block (of the size requested), it is forced
to close the txg containing the system call. Yes, this can be extremely
slow, but there is no other option for the ZIL. I'm surprised the wait is
30 seconds. I would expect much less, but finding room for the rest of the
txg data and metadata would also be a challenge.
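
Whether the log block allocation is actually failing could be checked from DTrace. Below is an untested sketch; it assumes the fbt provider can see zio_alloc_zil on this build (the name is taken from the OpenSolaris ZFS source and may not match Sol10 U8 exactly):

#!/usr/sbin/dtrace -s
/* Count intent-log block allocation failures; a nonzero return is an errno. */
fbt::zio_alloc_zil:return
/arg1 != 0/
{
        @fails[arg1] = count();
}

tick-10s
{
        printa("errno %d: %@d ZIL allocation failures\n", @fails);
        trunc(@fails);
}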

Most (maybe all?) file systems perform badly when out of space. I believe
we give a recommended maximum fill level, and I thought it was 90%.

Neil.

On 09/09/10 09:00, Arne Jansen wrote:

Hi,

currently I'm trying to debug a very strange phenomenon on a nearly full
pool (96%). Here are the symptoms: over NFS, a find on the pool takes
a very long time, up to 30s (!) for each file. Locally, the performance
is quite normal.
What I found out so far: It seems that every nfs write (rfs3_write) blocks
until the txg is flushed. This means a write takes up to 30 seconds. During
this time, the nfs calls block, occupying all NFS server threads. With all
server threads blocked, all other OPs (LOOKUP, GETATTR, ...) have to wait
until the writes finish, bringing the performance of the server effectively
down to zero.
It may be that the trigger for this behavior is around 95%. I managed to bring
the pool down to 95%, and now the writes are served continuously as they should be.

What is the explanation for this behaviour? Is it intentional and can the
threshold be tuned? I experienced this on Sol10 U8.

Thanks,
Arne


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS performance near zero on a very full pool

2010-09-09 Thread Neil Perrin

I should also have mentioned that if the pool has a separate log device
then this shouldn't happen. Assuming the slog is big enough, it should have
enough blocks to not be forced into using main pool device blocks.


Neil.

On 09/09/10 10:36, Neil Perrin wrote:

Arne,

NFS often demands its transactions be stable before returning.
This forces ZFS to handle the system call synchronously. Usually the
ZIL (code) allocates and writes a new block in the intent log chain to
achieve this.

If it ever fails to allocate a block (of the size requested), it is forced
to close the txg containing the system call. Yes, this can be extremely
slow, but there is no other option for the ZIL. I'm surprised the wait is
30 seconds. I would expect much less, but finding room for the rest of the
txg data and metadata would also be a challenge.

Most (maybe all?) file systems perform badly when out of space. I believe
we give a recommended maximum fill level, and I thought it was 90%.

Neil.

On 09/09/10 09:00, Arne Jansen wrote:

Hi,

currently I'm trying to debug a very strange phenomenon on a nearly full
pool (96%). Here are the symptoms: over NFS, a find on the pool takes
a very long time, up to 30s (!) for each file. Locally, the performance
is quite normal.
What I found out so far: It seems that every nfs write (rfs3_write) blocks
until the txg is flushed. This means a write takes up to 30 seconds. During
this time, the nfs calls block, occupying all NFS server threads. With all
server threads blocked, all other OPs (LOOKUP, GETATTR, ...) have to wait
until the writes finish, bringing the performance of the server effectively
down to zero.
It may be that the trigger for this behavior is around 95%. I managed to bring
the pool down to 95%, and now the writes are served continuously as they should be.


What is the explanation for this behaviour? Is it intentional and can the
threshold be tuned? I experienced this on Sol10 U8.

Thanks,
Arne




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS performance near zero on a very full pool

2010-09-09 Thread Arne Jansen

Hi Neil,

Neil Perrin wrote:

NFS often demands its transactions be stable before returning.
This forces ZFS to handle the system call synchronously. Usually the
ZIL (code) allocates and writes a new block in the intent log chain to
achieve this.

If it ever fails to allocate a block (of the size requested), it is forced
to close the txg containing the system call. Yes, this can be extremely
slow, but there is no other option for the ZIL. I'm surprised the wait is
30 seconds. I would expect much less, but finding room for the rest of the
txg data and metadata would also be a challenge.


I think this is not what we saw, for two reasons:
 a) we have a mirrored slog device. According to zpool iostat -v only 16MB
    out of 4GB were in use.
 b) it didn't seem like the txg was being closed early. Rather, it kept to
    approximately the 30-second intervals.

Internally we came up with a different explanation, without any evidence that
it is correct: when the pool reaches 96%, ZFS goes into a 'self defense'
mode. Instead of allocating blocks from the ZIL, every write turns synchronous
and has to wait for the txg to finish naturally. The reasoning behind this might
be that even if the ZIL is available, there might not be enough space left to
commit the ZIL to the pool. To prevent this, ZFS doesn't use the ZIL when the
pool is above 96%. While this might be proper for small pools, on large pools 4%
can still be several TB of free space, so there should be an upper limit of maybe
10GB on this hidden reserve.
Also, this sudden switch of behavior is completely unexpected and at least
under-documented.
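
One way to check this hypothesis would be to watch whether zil_commit ends up waiting for the whole txg instead of writing log blocks. A rough, untested DTrace sketch (it assumes the fbt provider can see zil_commit and txg_wait_synced on Sol10 U8; the names are taken from the OpenSolaris source):

#!/usr/sbin/dtrace -s
/* Flag threads that are currently inside zil_commit(). */
fbt::zil_commit:entry   { self->inzil = 1; }
fbt::zil_commit:return  { self->inzil = 0; }

/* Count commits that fall back to waiting for the whole txg to sync. */
fbt::txg_wait_synced:entry
/self->inzil/
{
        @forced["zil_commit forced to wait for txg sync"] = count();
}

tick-30s
{
        printa(@forced);
        trunc(@forced);
}

If the counter stays near zero below 96% and jumps once the pool crosses the threshold, that would support the theory.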



Most (maybe all?) file systems perform badly when out of space. I believe
we give a recommended maximum fill level, and I thought it was 90%.


In this situation, not only did writes suffer, but as a side effect reads also
came to a nearly complete halt.

--
Arne




Neil.

On 09/09/10 09:00, Arne Jansen wrote:

Hi,

currently I'm trying to debug a very strange phenomenon on a nearly full
pool (96%). Here are the symptoms: over NFS, a find on the pool takes
a very long time, up to 30s (!) for each file. Locally, the performance
is quite normal.
What I found out so far: It seems that every nfs write (rfs3_write) blocks
until the txg is flushed. This means a write takes up to 30 seconds. During
this time, the nfs calls block, occupying all NFS server threads. With all
server threads blocked, all other OPs (LOOKUP, GETATTR, ...) have to wait
until the writes finish, bringing the performance of the server effectively
down to zero.
It may be that the trigger for this behavior is around 95%. I managed to bring
the pool down to 95%, and now the writes are served continuously as they should be.


What is the explanation for this behaviour? Is it intentional and can the
threshold be tuned? I experienced this on Sol10 U8.

Thanks,
Arne




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS performance near zero on a very full pool

2010-09-09 Thread Richard Elling
On Sep 9, 2010, at 10:09 AM, Arne Jansen wrote:

 Hi Neil,
 
 Neil Perrin wrote:
 NFS often demands its transactions be stable before returning.
 This forces ZFS to handle the system call synchronously. Usually the
 ZIL (code) allocates and writes a new block in the intent log chain to
 achieve this.
 If it ever fails to allocate a block (of the size requested), it is forced
 to close the txg containing the system call. Yes, this can be extremely
 slow, but there is no other option for the ZIL. I'm surprised the wait is
 30 seconds. I would expect much less, but finding room for the rest of the
 txg data and metadata would also be a challenge.
 
 I think this is not what we saw, for two reasons:
 a) we have a mirrored slog device. According to zpool iostat -v only 16MB
    out of 4GB were in use.
 b) it didn't seem like the txg was being closed early. Rather, it kept to
    approximately the 30-second intervals.

 Internally we came up with a different explanation, without any evidence that
 it is correct: when the pool reaches 96%, ZFS goes into a 'self defense'
 mode. Instead of allocating blocks from the ZIL, every write turns synchronous
 and has to wait for the txg to finish naturally. The reasoning behind this
 might be that even if the ZIL is available, there might not be enough space
 left to commit the ZIL to the pool. To prevent this, ZFS doesn't use the ZIL
 when the pool is above 96%. While this might be proper for small pools, on
 large pools 4% can still be several TB of free space, so there should be an
 upper limit of maybe 10GB on this hidden reserve.

I do not believe this is correct. At 96% the first-fit algorithm changes to
best-fit and ganging can be expected. This has nothing to do with the ZIL.
There is already a reserve set aside for metadata and the ZIL so that you can
remove files when the file system is 100% full. This reserve is 32 MB or 1/64
of the pool size.

 Also, this sudden switch of behavior is completely unexpected and at least
 under-documented.

Methinks you are just seeing the change in performance from the allocation
algorithm change.
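
If so, gang activity should be directly observable. A rough, untested sketch (it assumes the fbt provider can probe zio_write_gang_block, a function name from the OpenSolaris ZFS source that may differ on Sol10 U8):

#!/usr/sbin/dtrace -s
/* Count gang-block writes; these should spike once the allocator switches mode. */
fbt::zio_write_gang_block:entry
{
        @gangs["gang block writes"] = count();
}

tick-30s
{
        printa(@gangs);
        trunc(@gangs);
}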

 
 Most (maybe all?) file systems perform badly when out of space. I believe we
 give a recommended maximum fill level, and I thought it was 90%.
 
 In this situation, not only did writes suffer, but as a side effect reads also
 came to a nearly complete halt.

If you have atime=on, then reads create writes.
 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
ZFS and performance consulting
http://www.RichardElling.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS performance near zero on a very full pool

2010-09-09 Thread Arne Jansen

Richard Elling wrote:

On Sep 9, 2010, at 10:09 AM, Arne Jansen wrote:


Hi Neil,

Neil Perrin wrote:

NFS often demands its transactions be stable before returning.
This forces ZFS to handle the system call synchronously. Usually the
ZIL (code) allocates and writes a new block in the intent log chain to
achieve this.
If it ever fails to allocate a block (of the size requested), it is forced
to close the txg containing the system call. Yes, this can be extremely
slow, but there is no other option for the ZIL. I'm surprised the wait is
30 seconds. I would expect much less, but finding room for the rest of the
txg data and metadata would also be a challenge.

I think this is not what we saw, for two reasons:
a) we have a mirrored slog device. According to zpool iostat -v only 16MB
   out of 4GB were in use.
b) it didn't seem like the txg was being closed early. Rather, it kept to
   approximately the 30-second intervals.

Internally we came up with a different explanation, without any evidence that
it is correct: when the pool reaches 96%, ZFS goes into a 'self defense'
mode. Instead of allocating blocks from the ZIL, every write turns synchronous
and has to wait for the txg to finish naturally. The reasoning behind this might
be that even if the ZIL is available, there might not be enough space left to
commit the ZIL to the pool. To prevent this, ZFS doesn't use the ZIL when the
pool is above 96%. While this might be proper for small pools, on large pools 4%
can still be several TB of free space, so there should be an upper limit of maybe
10GB on this hidden reserve.


I do not believe this is correct. At 96% the first-fit algorithm changes to
best-fit and ganging can be expected. This has nothing to do with the ZIL.
There is already a reserve set aside for metadata and the ZIL so that you can
remove files when the file system is 100% full. This reserve is 32 MB or 1/64
of the pool size.


Maybe it is some side effect of this change of allocation scheme. But I'm very
sure about what I saw. The change was drastic and abrupt. I had a dtrace script
running that measured the time for rfs3_write to complete. With the pool at 96%
I saw a burst of writes every 30 seconds, with completion times of up to 30s.
With the pool below 96%, I saw a continuous stream of writes with completion
times of mostly a few microseconds.
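
Such a measurement can be made with something along these lines (an untested sketch; it assumes the fbt provider exposes rfs3_write on this kernel):

#!/usr/sbin/dtrace -s
/* Distribution of rfs3_write() completion times, printed every 30 seconds. */
fbt::rfs3_write:entry
{
        self->ts = timestamp;
}

fbt::rfs3_write:return
/self->ts/
{
        @lat["rfs3_write latency (ns)"] = quantize(timestamp - self->ts);
        self->ts = 0;
}

tick-30s
{
        printa(@lat);
        trunc(@lat);
}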




In this situation, not only did writes suffer, but as a side effect reads also
came to a nearly complete halt.


If you have atime=on, then reads create writes.


atime is off. The impact on reads/lookups/getattr was, IMHO, because all server
threads had been occupied by blocking writes for a prolonged time.

I'll try to reproduce this on a test machine.

--
Arne

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss