On Mar 15, 2006 20:27 +0100, Andreas Schäfer wrote:
> If it was that easy... The problem for openMosix is that most devices
> fetch data in 4k blocks via copy_from_user(). For migrated processes,
> openMosix intercepts these calls and forwards them to the node which
> currently hosts the process. This forwarding yields a high latency
> penalty.
>
> Obviously there are two ways to get rid of this problem:
>
> * modify _every_ Linux device driver to use a
> _lot_more_than_4k_at_a_time_ approach, or
>
> * implement a second "read ahead" buffer which fetches large blocks via
> the network in the background and answers calls to copy_from_user()
> directly from the local buffer
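For reference, here is a minimal userspace sketch of that second
approach: a local read-ahead buffer that turns many 4k reads into one
big network fetch. fetch_remote() is a made-up stand-in for the RPC
to the node hosting the data; none of this is openMosix or Lustre code.

/* Toy model: serve 4k reads from a local buffer that is refilled
 * in large chunks over the network. */
#include <stddef.h>
#include <string.h>
#include <stdio.h>

#define PAGE  4096
#define CHUNK (256 * 1024)      /* fetch 256k per RPC instead of 4k */

struct ra_buf {
        char   data[CHUNK];
        size_t base;            /* file offset of data[0] */
        size_t len;             /* valid bytes in data[] */
};

/* Hypothetical stand-in for the remote fetch RPC. */
static size_t fetch_remote(size_t off, char *dst, size_t n)
{
        memset(dst, 'x', n);    /* pretend data arrived */
        printf("RPC: fetched %zu bytes at offset %zu\n", n, off);
        return n;
}

/* Serve a small read; go to the network only on a miss. */
static size_t ra_read(struct ra_buf *b, size_t off, char *dst, size_t n)
{
        if (off < b->base || off + n > b->base + b->len) {
                b->base = off - (off % CHUNK);
                b->len  = fetch_remote(b->base, b->data, CHUNK);
        }
        memcpy(dst, b->data + (off - b->base), n);
        return n;
}

int main(void)
{
        struct ra_buf b = { .len = 0 };
        char page[PAGE];

        /* 64 sequential 4k reads -> a single remote fetch. */
        for (size_t off = 0; off < 64 * PAGE; off += PAGE)
                ra_read(&b, off, page, PAGE);
        return 0;
}

With a 256k chunk, 64 sequential 4k reads cost one RPC instead of 64,
which is exactly the latency win being described.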
Or you can use a network filesystem like Lustre that handles this
itself ;-). Sadly, though, it has to do both of these to get
good performance, via {sub,per}version of the VFS/VM.
Clients do delayed writes (a writeback cache, with write credits
granted by the server to account for space) to avoid small RPCs.
They also do large amounts of readahead (in large chunks) to improve
reads for applications, since the VM breaks all reads up into 4kB
chunks.
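To make the credit mechanism concrete, here is a toy userspace model
of writeback caching with server-issued write credits. server_grant()
and server_commit() are hypothetical stand-ins for the RPCs, not the
real Lustre protocol:

#include <stdio.h>
#include <stddef.h>

static size_t server_free = 1 << 20;    /* space the server can promise */

/* Server reserves up to 'want' bytes for this client. */
static size_t server_grant(size_t want)
{
        size_t g = want < server_free ? want : server_free;
        server_free -= g;
        return g;
}

/* Server receives one large write RPC; the reserved space is now used. */
static void server_commit(size_t n)
{
        printf("RPC: committed %zu bytes in one write\n", n);
}

struct client {
        size_t grant;   /* bytes the server has reserved for us */
        size_t dirty;   /* bytes buffered locally, not yet sent */
};

/* Buffer a small write; send one big RPC only when credit runs out. */
static void client_write(struct client *c, size_t n)
{
        if (c->dirty + n > c->grant) {
                server_commit(c->dirty);        /* flush dirty data */
                c->grant -= c->dirty;           /* that credit is spent */
                c->dirty  = 0;
                c->grant += server_grant(256 * 1024);   /* refill */
        }
        c->dirty += n;
}

int main(void)
{
        struct client c = { .grant = server_grant(64 * 1024) };

        /* 100 x 4k writes become two large RPCs plus a final flush. */
        for (int i = 0; i < 100; i++)
                client_write(&c, 4096);
        server_commit(c.dirty);
        return 0;
}

The point of the grant is that the client can dirty pages locally
without risking a late ENOSPC surprise, because the server has
already reserved the space it promised.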
Servers also do batch block allocation and then large direct writes
instead of going through the VFS/VM. There are still a number of
device drivers that break up bios into chunks smaller than 1MB, and
that hurts performance.
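As an illustration of the batch-allocation side, here is a toy bitmap
allocator that claims all the blocks for a 1MB write in one scan,
rather than one call per 4k block. This is a simplification, not
ldiskfs code:

#include <stdio.h>
#include <string.h>

#define NBLOCKS 4096
static unsigned char used[NBLOCKS];     /* 1 = allocated */

/* Find and claim 'want' contiguous free blocks; return start or -1. */
static int alloc_batch(int want)
{
        int run = 0;
        for (int i = 0; i < NBLOCKS; i++) {
                run = used[i] ? 0 : run + 1;
                if (run == want) {
                        int start = i - want + 1;
                        memset(used + start, 1, want);
                        return start;
                }
        }
        return -1;      /* no contiguous run: caller falls back */
}

int main(void)
{
        /* A 1MB write = 256 x 4k blocks, claimed in a single
         * allocator pass instead of 256 separate calls. */
        int start = alloc_batch(256);
        if (start >= 0)
                printf("blocks %d..%d allocated in one pass\n",
                       start, start + 255);
        return 0;
}

Once the blocks are contiguous, the data can go to disk as one large
direct write instead of 256 small buffered ones.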
Having a generic delayed/batch allocation mechanism is definitely
the right way to go, and from my reading of linux-fsdevel some folks
at IBM have this underway. Since we have to support customers on
kernels dating back to 2.4.21, it will be a while before we can move
over to the newer APIs once they are available.
> BTW: how are you guys planning to solve this 4k issue? Will you revert
> to small blocks or will you "pretend" to perform 4k transfers and
> assemble those in the background to, again, process large chunks at
> once? If yes, wouldn't this seriously increase CPU usage due to
> (most likely) unnecessary data duplication?
It doesn't result in data duplication, per se, since the pages are
copied into kernel space only once. What it does mean is that there
needs to be a duplication of infrastructure in order to reassemble
and track all of these pages.
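Roughly, that tracking infrastructure amounts to the following toy
model: collect the 4k pages the VM hands over, and submit one large
I/O whenever contiguity breaks or the batch fills up. submit_large_io()
is a made-up name for the submission path, not Lustre code:

#include <stdio.h>
#include <stddef.h>

#define PAGE  4096
#define BATCH 256               /* up to 1MB per submission */

struct batch {
        size_t start;           /* offset of first page in the batch */
        int    npages;          /* pages collected so far */
};

/* Hypothetical submission path (network or disk). */
static void submit_large_io(struct batch *b)
{
        if (b->npages)
                printf("submit %d pages (%d kB) at offset %zu\n",
                       b->npages, b->npages * 4, b->start);
        b->npages = 0;
}

/* Called once per 4k page, in the order the VM sends them. */
static void add_page(struct batch *b, size_t off)
{
        /* Flush when the page is not contiguous or the batch is full. */
        if (b->npages &&
            (off != b->start + (size_t)b->npages * PAGE ||
             b->npages == BATCH))
                submit_large_io(b);
        if (!b->npages)
                b->start = off;
        b->npages++;
}

int main(void)
{
        struct batch b = { 0, 0 };

        for (size_t off = 0; off < 256 * PAGE; off += PAGE)
                add_page(&b, off);              /* 256 contiguous pages */
        for (size_t off = 300 * PAGE; off < 400 * PAGE; off += PAGE)
                add_page(&b, off);              /* gap, then 100 more */
        submit_large_io(&b);                    /* final flush */
        return 0;
}

Here 356 pages arriving one at a time become two large submissions,
at the cost of exactly the bookkeeping described above.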
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.