On Sep 9, 2010, at 10:09 AM, Arne Jansen wrote:

> Hi Neil,
>
> Neil Perrin wrote:
>> NFS often demands its transactions are stable before returning.
>> This forces ZFS to do the system call synchronously. Usually the
>> ZIL (code) allocates and writes a new block in the intent log chain to
>> achieve this.
>> If it ever fails to allocate a block (of the size requested), it is
>> forced to close the txg containing the system call. Yes, this can be
>> extremely slow, but there is no other option for the ZIL. I'm surprised
>> the wait is 30 seconds; I would expect much less, but finding room for
>> the rest of the txg data and metadata would also be a challenge.
>
> I think this is not what we saw, for two reasons:
> a) we have a mirrored slog device. According to zpool iostat -v, only
>    16MB out of 4GB were in use.
> b) it didn't seem like the txg had been closed early. Rather, it kept to
>    approximately the 30-second intervals.
>
> Internally we came up with a different explanation, without any backing
> that it might be correct: when the pool reaches 96%, ZFS goes into a
> 'self-defense' mode. Instead of allocating blocks from the ZIL, every
> write turns synchronous and has to wait for the txg to finish naturally.
> The reasoning behind this might be that even if the ZIL is available,
> there might not be enough space left to commit the ZIL to the pool. To
> prevent this, ZFS doesn't use the ZIL when the pool is above 96%. While
> this might be proper for small pools, on large pools 4% is still some TB
> of free space, so there should be an upper limit of maybe 10GB on this
> hidden reserve.

I do not believe this is correct. At 96% the first-fit allocation algorithm
changes to best-fit, and ganging can be expected. This has nothing to do
with the ZIL. There is already a reserve set aside for metadata and the ZIL
so that you can remove files when the file system is 100% full. This
reserve is 32 MB or 1/64 of the pool size.
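For those unfamiliar with the terms, here is a minimal sketch of the two
strategies over a free-segment list. This is a hypothetical illustration,
not the actual ZFS metaslab allocator; the seg_t structure and function
names are made up:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical free-segment record; not a real ZFS structure. */
typedef struct seg {
	uint64_t	start;	/* offset of the free region */
	uint64_t	size;	/* length of the free region */
	struct seg	*next;
} seg_t;

/*
 * First-fit: take the first segment that is large enough. Cheap,
 * because the scan usually stops early.
 */
static seg_t *
first_fit(seg_t *list, uint64_t want)
{
	for (seg_t *s = list; s != NULL; s = s->next)
		if (s->size >= want)
			return (s);
	return (NULL);	/* nothing big enough: gang blocks needed */
}

/*
 * Best-fit: walk the entire list for the tightest match. Fragments
 * less when space is scarce, but every allocation now scans all free
 * segments, which is one reason writes slow down sharply near full.
 */
static seg_t *
best_fit(seg_t *list, uint64_t want)
{
	seg_t *best = NULL;

	for (seg_t *s = list; s != NULL; s = s->next)
		if (s->size >= want &&
		    (best == NULL || s->size < best->size))
			best = s;
	return (best);	/* NULL again means ganging */
}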
> Also this sudden switch of behavior is completely unexpected and at least
> under-documented.

Methinks you are just seeing the change in performance from the allocation
algorithm change.

>> Most (maybe all?) file systems perform badly when out of space. I
>> believe we give a recommended free size, and I thought it was 90%.
>
> In this situation, not only writes suffered; as a side effect, reads
> also came to a nearly complete halt.

If you have atime=on, then reads create writes.
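If you suspect those atime-driven writes, the property can be checked and
turned off per dataset. These are standard zfs(1M) commands; the dataset
name here is made up:

	# is access-time updating on? (dataset name is hypothetical)
	zfs get atime tank/export

	# stop reads from generating access-time writes
	zfs set atime=off tank/export

 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
ZFS and performance consulting
http://www.RichardElling.com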