http://oss.sgi.com/archives/xfs/2007-12/msg00641.html

On Tue, Dec 18, 2007 at 06:44:05PM +1100, David Chinner wrote:
> On Tue, Dec 18, 2007 at 03:12:02PM +1100, Lachlan McIlroy wrote:
> > Since I have been able to reproduce some of our NAS/NFS performance
> > problems without NFS (that is, demonstrate that the problems are in
> > XFS), it makes some sense to fix these in XFS. I have observed that
> > for some non-NFS workloads we see a significant reduction in log
> > traffic with the OFC in XFS, so for reasons beyond NFS there may be
> > a need to reactivate the refcache code. For the moment we are still
> > analysing the pros/cons.
>
> Reactivating the ref cache is fundamentally the wrong thing to do.
> Most of these problems come from the mismatch of inode life cycles
> between Linux and XFS, and this is the basic problem we need to solve.
>
> For example - do the open-write-close related performance issues go
> away if you remove the xfs_free_eofblocks() call in xfs_release()?
> i.e. are we just being stupid about the way we deal with closing
> of file descriptors?
>
> This should work because the Linux inode will remain around with a
> ref-count of 1 on the unused list due to the dentry pinning it
> in place. Only when the dentry gets reclaimed (e.g. memory pressure,
> unlink, unmount, etc.) will the truncate occur, and hence repeated
> single-file open-write-close based workloads (like the nfsd) don't
> issue a truncate transaction and trash the EOF preallocation on
> every close....
>
> And look at the code - the *only thing* the refcache does is avoid
> the truncate in xfs_release(). So, the patch below is the equivalent
> of re-introducing the refcache into XFS but uses the Linux inode
> life cycle to keep references around.
>
> FWIW, this means that EOF pre-allocations will not get trimmed
> immediately, which may have disk usage implications for users with
> small quotas, those that create lots of small files, or when there
> are lots of written inodes with preallocated space cached in memory
> when a crash occurs.

FYI - numbers to back this up.

As an example of where the failure to truncate EOF blocks (i.e.
speculative preallocation) is bad, try creating several thousand small
files (say 1 byte each) and seeing how long they take to sync to disk.

With EOF truncation, all the data blocks get allocated adjacently, so
the elevator merges them together and we see large I/Os going to disk
(i.e. 512k I/Os where 128 different file data writes have been merged).

Without EOF truncation, these files retain their speculative
allocation (default 64k), so when written out we get a stream of 4k
I/Os separated by 64k. That is, one seek per inode written out instead
of large sequential I/O covering 128 files per I/O.

To demonstrate, sequential creation of 1-byte files in a 30s period,
followed by a (timed) sync:

With EOF truncation:

                   Creates                 |           Deletes
 Loads  Files   rate   usr    sys  intr  csw/s | rate   usr    sys  intr  csw/s
 -----  -----  -----  ----  -----  ----  ----- | ----  ----  -----  ----  -----
     1  39312   1070   6.8   91.8   4.3   1959 | 1572   0.9  107.1   0.4   1109
     2  68458   1627   9.9  155.4   2.6   1316 | 2535   1.7  207.5   0.7   1157

Without EOF truncation:

                   Creates                 |           Deletes
 Loads  Files   rate   usr    sys  intr  csw/s | rate   usr    sys  intr  csw/s
 -----  -----  -----  ----  -----  ----  ----- | ----  ----  -----  ----  -----
     1  42691    461   3.0   37.7   2.2   1535 | 1579   1.0  123.2   0.6   1105
     2  72785    530   3.3   44.4   2.8   1684 | 1774  37.8  179.7   5.1   2754

Note that without EOF truncation we create 5-10% more files in the 30s
period this test ran for (due to it being CPU bound and not issuing
empty EOF truncation transactions), but the overall rate includes the
time it takes to write the data to disk as well. The data write is far
slower without EOF truncation....
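[Editor's note: the test described above can be reproduced with a short
script. This is a minimal sketch, not the tool Dave used (which is not
named in the message); the file count and directory are placeholder
choices, and the I/O-pattern difference is only visible when the
directory actually lives on an XFS filesystem.]

```python
import os
import tempfile
import time

NUM_FILES = 2000   # placeholder; the email says "several thousand"
FILE_SIZE = 1      # 1-byte files, as in the described test

# Placeholder directory; point this at an XFS mount to see the effect.
tmpdir = tempfile.mkdtemp(prefix="eofblocks-test-")

# Phase 1: sequential creation of tiny files (the "Creates" side).
t0 = time.monotonic()
for i in range(NUM_FILES):
    path = os.path.join(tmpdir, f"file{i:06d}")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    os.write(fd, b"x" * FILE_SIZE)
    os.close(fd)   # on XFS, close is where xfs_release() may trim EOF blocks
create_secs = time.monotonic() - t0

# Phase 2: timed sync - this is where the allocation layout determines
# whether the writeback is merged large I/Os or one seek per inode.
t0 = time.monotonic()
os.sync()
sync_secs = time.monotonic() - t0

print(f"created {NUM_FILES} files in {create_secs:.2f}s, "
      f"sync took {sync_secs:.2f}s")
```

Comparing `sync_secs` across runs with and without EOF trimming on
close is the create+sync measurement the tables above report; on a
non-XFS filesystem the script still runs but shows nothing specific.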
Hence we see that the overall create+data-write rate suffers *greatly*
due to the lack of EOF truncation, which is why avoiding EOF truncation
on file close for local I/O is generally considered a bad thing.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group