Re: calling all fs experts

2011-12-11 Thread Julian H. Stacey
Hi,
Reference:
 From: Maksim Yevmenkin maksim.yevmen...@gmail.com 

Maksim Yevmenkin wrote:
> Hello,
>
> I have a question for fs wizards.

There is a list for them:
freebsd...@freebsd.org

Cheers,
Julian
-- 
Julian Stacey, BSD Unix Linux C Sys Eng Consultants Munich http://berklix.com
 Reply below not above, cumulative like a play script, indent with > .
 Format: Plain text. Not HTML, multipart/alternative, base64, quoted-printable.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: calling all fs experts

2011-12-11 Thread Kostik Belousov
On Sat, Dec 10, 2011 at 05:42:01PM -0800, Maksim Yevmenkin wrote:
> Hello,
>
> I have a question for fs wizards.
>
> Suppose I can persuade a modern spinning disk to do large reads (say
> 512K to 1M) at a time. Also, suppose the file system on such a drive
> is used to store large files (tens to hundreds of megabytes). Is
> there any way I can tweak the file system parameters (block size,
> layout, etc.) to help it get as close to the disk's sequential read
> rate as possible? I understand that I will not be able to get the
> full 100 MB/sec single-client sequential read rate, but can I get it
> to a sustained 40-50 MB/sec? Also, can I reduce the performance
> impact caused by small reads, such as directory accesses?

If you wanted to get responses from experts only, sorry in advance.

The fs (AKA UFS) uses clustering provided by the block cache. The clustering
code, mainly located in kern/vfs_cluster.c, coalesces sequences of reads or
writes that target consecutive blocks into a single physical read or write
of at most MAXPHYS bytes. The current definition of MAXPHYS is 128KB.
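The coalescing pass described above can be sketched in a few lines. This is a Python toy model, not the kernel code: the block size is an assumption, and the real vfs_cluster.c does far more bookkeeping.

```python
MAXPHYS = 128 * 1024  # maximal physical transfer size, per the message above
BSIZE = 16 * 1024     # assumed UFS block size for this sketch

def coalesce(blocks):
    """Coalesce consecutive logical block numbers into clusters of
    (start, nblocks), each capped at MAXPHYS bytes of data."""
    max_blocks = MAXPHYS // BSIZE  # 8 blocks per cluster here
    clusters = []
    for b in sorted(blocks):
        last = clusters[-1] if clusters else None
        if last and b == last[0] + last[1] and last[1] < max_blocks:
            clusters[-1] = (last[0], last[1] + 1)  # extend current cluster
        else:
            clusters.append((b, 1))                # start a new cluster
    return clusters

# 20 consecutive 16K blocks become three transfers: 8 + 8 + 4 blocks
print(coalesce(range(100, 120)))  # [(100, 8), (108, 8), (116, 4)]
```

The point of the cap is that no matter how long the consecutive run is, each physical i/o stays within MAXPHYS, which is why the 128KB limit matters to the question above.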

Clustering also allows the filesystem to improve the layout of files by
calling VOP_REALLOCBLKS() to redo the allocation, making the sequence of
blocks being written sequential if it is not.

Even if a file is not laid out ideally, or the i/o pattern is random, most
scheduled writes are asynchronous, and for reads the system tries to
schedule read-ahead for some limited number of blocks. This allows the
lower layers, i.e. geom and the disk drivers, to reorder the i/o queue and
coalesce requests that are consecutive on disk but not adjacent in the queue.
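That lower-layer reordering is easy to model as well. A toy sketch (Python, not the geom code) of sorting a pending queue and merging runs that turn out to be consecutive on disk:

```python
def sort_and_merge(queue):
    """Sketch of what a disk scheduler can do with a pending queue:
    sort requests by starting block, then merge runs that are
    consecutive on disk even if they were not adjacent in the queue.
    Each request is (start_block, length_in_blocks)."""
    merged = []
    for start, length in sorted(queue):
        if merged and start == merged[-1][0] + merged[-1][1]:
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((start, length))
    return merged

# Requests issued out of order still collapse into two physical i/os:
print(sort_and_merge([(8, 4), (0, 4), (4, 4), (100, 2)]))  # [(0, 12), (100, 2)]
```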

BTW, some time ago I was interested in the effect on fragmentation in UFS
of a semi-abandoned patch which could make fragmentation worse. I wrote a
tool that calculated the percentage of non-consecutive spots in the whole
filesystem. Apparently, even under a hard load consisting of writing a lot
of files under a megabyte in size, UFS managed to keep the number of spots
under 2-3% on a sufficiently free volume.
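The "spots" metric can be illustrated with a toy version of such a tool. This is a guessed reconstruction for a single block list, not the actual tool mentioned above:

```python
def frag_percent(blocks):
    """Percentage of block-to-block transitions in a block list that
    are not consecutive on disk -- a rough 'spots' metric, assumed to
    resemble the measurement described above."""
    if len(blocks) < 2:
        return 0.0
    spots = sum(1 for a, b in zip(blocks, blocks[1:]) if b != a + 1)
    return 100.0 * spots / (len(blocks) - 1)

# Two breaks (12->40 and 43->90) out of seven transitions:
print(frag_percent([10, 11, 12, 40, 41, 42, 43, 90]))  # ~28.6
print(frag_percent([1, 2, 3, 4]))                      # 0.0
```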




Re: calling all fs experts

2011-12-11 Thread Pedro F. Giffuni

--- Sun 11/12/11, Kostik Belousov kostik...@gmail.com wrote:

 
> If you wanted to get responses from experts only, sorry in advance.


I am no fs expert, but I just thought I'd mention some things
based on my playing with the BSD ext2fs ...
 
> The fs (AKA UFS) uses clustering provided by the block cache. The
> clustering code, mainly located in kern/vfs_cluster.c, coalesces
> sequences of reads or writes that target consecutive blocks into a
> single physical read or write of at most MAXPHYS bytes. The current
> definition of MAXPHYS is 128KB.


The clustering code is really cool; the idea is that it gives UFS the
advantages of an extent-based fs. I haven't seen benchmarks on UFS2,
but on ext2 it didn't seem to work as well as it should.

One issue is that ext2 doesn't support fragments, and as a consequence
ext2 will not use big block sizes. This is a limitation in the ext2
design that UFS doesn't have, but Linux's ext2fs still outperforms our
UFS in async mode (we do shine in sync mode).

It was never clear exactly why this happens, but it would appear there
is a bottleneck in geom that makes it less efficient at writing many
contiguous blocks.

> Clustering also allows the filesystem to improve the layout of files
> by calling VOP_REALLOCBLKS() to redo the allocation, making the
> sequence of blocks being written sequential if it is not.
>
> Even if a file is not laid out ideally, or the i/o pattern is random,
> most scheduled writes are asynchronous, and for reads the system
> tries to schedule read-ahead for some limited number of blocks. This
> allows the lower layers, i.e. geom and the disk drivers, to reorder
> the i/o queue and coalesce requests that are consecutive on disk but
> not adjacent in the queue.
>
> BTW, some time ago I was interested in the effect on fragmentation in
> UFS of a semi-abandoned patch which could make fragmentation worse. I
> wrote a tool that calculated the percentage of non-consecutive spots
> in the whole filesystem. Apparently, even under a hard load
> consisting of writing a lot of files under a megabyte in size, UFS
> managed to keep the number of spots under 2-3% on a sufficiently
> free volume.
 

Yes, the realloc_blk code is very efficient at that. In fact it is so
good that it actually hides some inefficient operations in UFS. Bruce
had a patch for this that I cc'd to Kirk, but the difference was not
big because the realloc_blk code does its job in memory.

Zheng Liu did the reallocation work for ext2fs, and it gave better
results than preallocation, but the results are not as spectacular as
in UFS (the UFS code takes advantage of fragments there too). I expect
to commit it (kern/159233) once my mentor reviews and approves it.

cheers,

Pedro.



calling all fs experts

2011-12-10 Thread Maksim Yevmenkin
Hello,

I have a question for fs wizards.

Suppose I can persuade a modern spinning disk to do large reads (say
512K to 1M) at a time. Also, suppose the file system on such a drive
is used to store large files (tens to hundreds of megabytes). Is there
any way I can tweak the file system parameters (block size, layout,
etc.) to help it get as close to the disk's sequential read rate as
possible? I understand that I will not be able to get the full 100
MB/sec single-client sequential read rate, but can I get it to a
sustained 40-50 MB/sec? Also, can I reduce the performance impact
caused by small reads, such as directory accesses?

thanks,
max
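A rough service-time model shows why the read size in the question above matters so much for the 40-50 MB/sec target. The seek and media-rate numbers below are illustrative assumptions, not measurements of any particular drive:

```python
def seq_throughput_mb_s(io_size_kb, seek_ms=10.0, media_mb_s=100.0):
    """Back-of-envelope model: each request pays one seek/rotational
    delay plus media transfer time. Returns effective MB/sec."""
    io_mb = io_size_kb / 1024.0
    xfer_ms = io_mb / media_mb_s * 1000.0
    return io_mb / ((seek_ms + xfer_ms) / 1000.0)

for size_kb in (128, 512, 1024):
    print(size_kb, round(seq_throughput_mb_s(size_kb), 1))
# 128  -> ~11.1 MB/sec
# 512  -> ~33.3 MB/sec
# 1024 -> 50.0 MB/sec
```

Under these assumed numbers, 128K transfers (one MAXPHYS cluster per seek) land near 11 MB/sec, while 512K-1M transfers reach the 33-50 MB/sec range, which is exactly the band asked about.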