Murali Vilayannur wrote:
Hi Phil,
First of all, great work!
There are 2 other parameters that I thought could also make an impact:
a) Choice of file system (this was investigated by Nathan last year, I
think), as well as choice of journaling modes.
We looked at this a little bit again recently, but not in the context of
buffer cache saturation. I don't have numbers handy, but I can share
some general impressions. It still looks like data=writeback is
probably the fastest mode in general, although it isn't entirely clear
to me what the filesystem integrity tradeoff is. I had high hopes for
data=journal after reading about experiences other people had (outside
of the PVFS2 world), but it was a bust. data=journal did better than
the default options for berkeley db and metadata intensive access, but
absolutely stunk for I/O throughput.
All of these tests were done with default ext3 options.
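For reference, the journaling mode is selected at mount time; a quick sketch of the three ext3 modes discussed (the device and mount point are placeholders, not the actual test setup):

```shell
# Placeholder device/mount point -- substitute your storage space location.
mount -t ext3 -o data=writeback /dev/sdb1 /mnt/pvfs2-storage  # fastest overall in these tests
mount -t ext3 -o data=ordered   /dev/sdb1 /mnt/pvfs2-storage  # the ext3 default
mount -t ext3 -o data=journal   /dev/sdb1 /mnt/pvfs2-storage  # helped metadata, hurt I/O throughput
```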
b) In the case of storage spaces created directly on top of IDE/SCSI disks
or attached to local RAID controllers, there must be a way to enable disk
parameters like write caching/TCQ at the disks (hdparm -W /dev/?, ...),
or they may already be on by default (although that might conflict with
the goal of stability of data on disks).
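As a sketch, the write-cache flag can be toggled like this (the device name is a placeholder; querying with a bare -W works on newer hdparm versions):

```shell
hdparm -W /dev/hda   # query the current write-cache setting (newer hdparm)
hdparm -W1 /dev/hda  # enable write caching -- faster, but data can be lost on power failure
hdparm -W0 /dev/hda  # disable it if on-disk data stability matters more than speed
```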
It is interesting that time (i.e.
dirty_writeback_centisecs/dirty_expire_centisecs) is not as important a
criterion as the VM ratio for such write-intensive workloads.
I was surprised by this too. I was just ad hoc turning these values up
and down, so there is a good chance that I missed something, but I
couldn't get any of those values to improve performance.
A. Is the AIO interface causing delays?
B. Is the linux kernel waiting too long to start writing out its
buffer cache?
C. Is the linux kernel disk scheduler appropriate for PVFS2?
We can change this behavior by adjusting the /proc/sys/vm/dirty* files.
They are documented in the Documentation/filesystems/proc.txt file in
the linux kernel source. The only one that really ended up being
interesting for us (after trial and error) is the dirty_ratio file. The
explanation given in the documentation is: "Contains, as a percentage of
total system memory, the number of pages at which a process which is
generating disk writes will itself start writing out dirty data.". It
defaults to 40, but some of the results below show what happens when it
is set to 1. There is also a dirty_background_ratio file, which
controls when pdflush decides to write out data in the background. That
would seem to be the more desirable tweak, but it didn't have the effect
that dirty_ratio did for some reason.
Could this be because the pdflush daemon does not wake up regularly enough
to start flushing things out? Or does pdflush wake up
a) when the timeout passes, (or)
b) when the ratio is reached?
I am guessing it only wakes up when the timeout passes, but I don't
really know. I tried cranking down the dirty_background_ratio in
conjunction with reducing those *centisecs values so it would wake up
quicker and start writing out, but it still didn't help like the
dirty_ratio did.
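The combination described above might look something like this (the values are illustrative, not the exact ones used in the tests):

```shell
# Lower the background threshold and shorten the pdflush timers together.
echo 5   > /proc/sys/vm/dirty_background_ratio     # start background writeback earlier
echo 100 > /proc/sys/vm/dirty_writeback_centisecs  # wake pdflush every 1s instead of 5s
echo 500 > /proc/sys/vm/dirty_expire_centisecs     # treat data as "old" after 5s instead of 30s
```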
Hey, one other thing which struck me. How much memory was there on this
machine? Is this an IA-32 machine running a kernel with CONFIG_HIGHMEM?
Thanks!
Murali
This was actually a dual proc (appears to be 4 procs with
hyperthreading) xeon box running in x86_64 mode, with 4 GB of memory.
-Phil
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers