Hi Neil,

Neil Perrin wrote:
NFS often demands that its transactions are stable before returning.
This forces ZFS to execute the system call synchronously. Usually the
ZIL (code) allocates and writes a new block in the intent log chain to
achieve this. If it ever fails to allocate a block of the size requested,
it is forced to close the txg containing the system call. Yes, this can be
extremely slow, but there is no other option for the ZIL. I'm surprised the
wait is 30 seconds; I would expect much less, but finding room for the rest
of the txg data and metadata would also be a challenge.
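
If I understand correctly, the control flow you describe boils down to
something like the sketch below. All the names are invented stand-ins for
this sketch, not the real zil.c functions:

#include <stdbool.h>
#include <stddef.h>

/* Invented stand-ins for the ZIL machinery; not the real zil.c API. */
static bool
zil_alloc_log_block(size_t size)
{
        return (size <= 128 * 1024);    /* stub: fails for large requests */
}

static void
zil_write_log_block(const void *buf, size_t size)
{
        (void)buf; (void)size;          /* stub: one intent-log I/O */
}

static void
txg_wait_synced(void)
{
        /* stub: in reality this can take a full txg interval (~30s) */
}

/* One synchronous write, e.g. an NFS WRITE that must be stable
 * before the call returns. */
void
sync_write(const void *buf, size_t size)
{
        if (zil_alloc_log_block(size)) {
                /* fast path: chain one new block onto the intent log
                 * and return as soon as that single I/O is stable */
                zil_write_log_block(buf, size);
        } else {
                /* slow path: no log block of the requested size could
                 * be allocated, so the only remaining way to make the
                 * data stable is to wait out the whole txg */
                txg_wait_synced();
        }
}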

I think this is not what we saw, for two reasons:
 a) we have a mirrored slog device. According to zpool iostat -v, only 16MB
    out of 4GB were in use.
 b) it didn't seem like the txg was being closed early; rather, it kept to
    approximately its 30-second intervals.

Internally we came up with a different explanation, though we have nothing
to back it up: when the pool reaches 96% full, ZFS goes into a 'self-defense'
mode. Instead of allocating blocks from the ZIL, every write turns synchronous
and has to wait for the txg to finish naturally. The reasoning behind this
might be that even if ZIL space is available, there might not be enough space
left to commit the ZIL contents to the pool. To prevent this, ZFS doesn't use
the ZIL when the pool is above 96% full. While this might be appropriate for
small pools, on large pools 4% can still be several terabytes of free space,
so there should be an upper limit of maybe 10GB on this hidden reserve.
Also, this sudden switch of behavior is completely unexpected and at least
underdocumented.
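
In pseudo-C, what we imagine happening, plus the cap we would suggest. The
threshold, names, and numbers here are purely our guesses, nothing is taken
from the actual ZFS source:

#include <stdbool.h>
#include <stdint.h>

typedef struct {
        uint64_t size;          /* total pool size in bytes */
        uint64_t free;          /* free bytes */
} pool_t;

/* Hypothesized current behavior: a fixed percentage cutoff, i.e. stop
 * using the ZIL once less than 4% of the pool is free. */
static bool
zil_disabled_hypothesis(const pool_t *p)
{
        return (p->free * 100 < p->size * 4);
}

/* Suggested refinement: cap the hidden reserve at an absolute size
 * (say 10GB) so a large pool doesn't hold back terabytes. */
static bool
zil_disabled_capped(const pool_t *p)
{
        uint64_t reserve = p->size * 4 / 100;

        if (reserve > (10ULL << 30))
                reserve = 10ULL << 30;  /* 10 GB upper limit */
        return (p->free < reserve);
}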


Most (maybe all?) file systems perform badly when nearly out of space. I
believe we give a recommended maximum fill level, and I thought it was 90%.

In this situation not only did writes suffer; as a side effect, reads also
came to a nearly complete halt.

--
Arne



Neil.

On 09/09/10 09:00, Arne Jansen wrote:
Hi,

currently I'm trying to debug a very strange phenomenon on a nearly full
pool (96%). Here are the symptoms: over NFS, a find on the pool takes
a very long time, up to 30s (!) for each file. Locally, the performance
is quite normal.
What I found out so far: it seems that every NFS write (rfs3_write) blocks
until the txg is flushed, which means a single write can take up to 30
seconds. During this time the NFS calls block, occupying all NFS server
threads. With all server threads blocked, all other ops (LOOKUP, GETATTR,
...) have to wait until the writes finish, bringing the performance of the
server effectively down to zero.
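
The starvation effect itself is easy to model with a plain pthreads toy
program. Nothing below is actual nfsd or ZFS code; the thread count and
timings are made up:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 4              /* fixed pool of "NFS server threads" */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;
static int txg_synced;

/* Stand-in for txg_wait_synced(): block until the txg "syncs". */
static void
wait_for_txg(void)
{
        pthread_mutex_lock(&lock);
        while (!txg_synced)
                pthread_cond_wait(&cv, &lock);
        pthread_mutex_unlock(&lock);
}

static void *
server_thread(void *arg)
{
        long id = (long)(intptr_t)arg;

        printf("thread %ld: WRITE arrives, blocking on txg sync\n", id);
        wait_for_txg();         /* every worker ends up parked here */
        printf("thread %ld: WRITE stable, free again\n", id);
        return (NULL);
}

int
main(void)
{
        pthread_t tid[NTHREADS];

        for (long i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, server_thread,
                    (void *)(intptr_t)i);

        sleep(3);               /* stand-in for the ~30s txg interval; a
                                 * GETATTR arriving now finds no free
                                 * thread and simply queues up */
        printf("GETATTR queued: all %d threads busy\n", NTHREADS);

        pthread_mutex_lock(&lock);
        txg_synced = 1;         /* the txg finally syncs */
        pthread_cond_broadcast(&cv);
        pthread_mutex_unlock(&lock);

        for (int i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        printf("GETATTR served only after the sync completed\n");
        return (0);
}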
It may be that the trigger for this behavior is around 95%. I managed to
bring the pool down to 95%, and now writes are served continuously, as they
should be.

What is the explanation for this behaviour? Is it intentional and can the
threshold be tuned? I experienced this on Sol10 U8.

Thanks,
Arne
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

