Hi,

On Wed, 3 Nov 1999 17:43:18 +0100 (MET), Ingo Molnar
<[EMAIL PROTECTED]> said:

> .. which is exactly what the RAID5 code was doing ever since. It _has_ to
> do it to get 100% recovery anyway. This is one reason why access to caches
> is so important to the RAID code. (we do not snapshot clean buffers) 

And that's exactly why it will break --- the dirty bit on a buffer does
not reliably tell you whether it is clean, because the standard
behaviour throughout the VFS is to modify the buffer before calling
mark_buffer_dirty().  There have always been windows in which you can
observe a buffer as clean when it is in fact dirty, and the writable
mmap case in 2.3 makes this worse.
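
To make the window concrete, the usual shape of such an update looks
roughly like this (a generic sketch of 2.3-era filesystem code, not a
quote from any particular driver):

/* The data changes first; the dirty bit only goes on afterwards.
 * Anything that samples the buffer in between (a raid5 resync, say)
 * sees modified contents on a buffer that still claims to be clean. */
void update_block(struct buffer_head *bh, const char *src, int len)
{
        memcpy(bh->b_data, src, len);
        /* <-- window: the buffer is dirty in fact, clean in the flags */
        mark_buffer_dirty(bh);
}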

> no, RAID marks buffers dirty which it found in a clean and not locked
> state in the buffer-cache. It's perfectly legal to do so - or at least it
> was perfectly legal until now,

No it wasn't, because swap has always been able to bypass the buffer
cache.  I'll not say it was illegal either --- let's just say that the
situation was undefined --- but we have had writes outside the buffer
cache for _years_, and you simply can't say that it has always been
legal to assume that the buffer cache was the sole synchronisation
mechanism for IO.

I think we can make a pretty good compromise here, however.  We can
mandate that any kernel component which bypasses the buffer cache is
responsible for ensuring that the buffer cache is invalidated
beforehand.  That lets raid do the right thing regarding parity
calculations.  
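
As a rough sketch of what that mandate amounts to (using the existing
get_hash_table()/bforget() calls; the wrapper and its name are mine):

/* Before writing a block outside the buffer cache, throw away any
 * cached alias of it, so raid never computes parity from a stale
 * cached copy of data that is about to change underneath it. */
static void drop_cached_alias(kdev_t dev, int block, int size)
{
        struct buffer_head *bh = get_hash_table(dev, block, size);

        if (bh)
                bforget(bh);    /* discard contents, drop the reference */
}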

For this to work, however, the raid resync must not be allowed to
repopulate the buffer cache and create a new cache incoherency.  It
would not be desperately hard to lock the resync code against other IOs
in progress, so that resync is entirely atomic with respect to things
like swap.

I can live with this for jfs: I can certainly bforget() journal
descriptor blocks after IO so that there is no cache incoherency if the
next pass over the log writes to the same block using a temporary
buffer_head.  Similarly, raw IO can maintain buffer cache coherency if
necessary (but that will be a big performance drag if the device is in
fact shared).

The one thing that I really don't want to have to deal with is the raid
resync code doing its read/wait/write thing while I'm writing new data
via temporary buffer_heads, as that _will_ corrupt the device in a way
that I can't avoid.  There is no way for me to do metadata journal
writes through the buffer cache without copying data (because the block
cannot be in the buffer cache twice), so I _have_ to use cache-bypass
here to avoid an extra copy.
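
For reference, the cache-bypass trick amounts to something like this
(my names, 2.3-era buffer_head fields, end_io setup omitted --- a
sketch, not the actual jfs code):

/* Write the same in-memory block to a second location on disk by
 * wrapping it in a throwaway buffer_head aimed at the journal.  No
 * data copy, and the block never appears in the buffer-cache hash
 * under two different block numbers. */
struct buffer_head *journal_alias(struct buffer_head *bh, int log_block)
{
        struct buffer_head *tmp = kmalloc(sizeof(*tmp), GFP_KERNEL);

        if (!tmp)
                return NULL;
        memset(tmp, 0, sizeof(*tmp));
        tmp->b_data    = bh->b_data;        /* share the data, no copy */
        tmp->b_size    = bh->b_size;
        tmp->b_dev     = bh->b_dev;
        tmp->b_blocknr = log_block;         /* ...but aimed at the log */
        tmp->b_state   = (1 << BH_Uptodate) | (1 << BH_Dirty);
        return tmp;   /* caller submits via ll_rw_block(WRITE, 1, &tmp) */
}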

> ok, i see your point. I guess i'll have to change the RAID code to do the
> following:

> #define raid_dirty(bh)        (buffer_dirty(bh) && (buffer_count(bh) > 1))

> because nothing is allowed to change a clean buffer without having a
> reference to it. And nothing is allowed to release a physical index before
> dirtying it. Does this cover all cases?

It should do --- will you then do parity calculation and a writeback on
a snapshot of such buffers?  If so, we should be safe.
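
For what it's worth, the sort of thing I mean by a snapshot is roughly
this (the parity and raw-write helpers below are hypothetical
placeholders, not existing raid5 functions):

/* Copy the cached buffer into a private block, compute parity from
 * the copy, and write the copy out.  A concurrent modification of the
 * live buffer then cannot slip in between the parity calculation and
 * the write that is supposed to match it. */
static void resync_write_snapshot(struct buffer_head *live, char *snap)
{
        memcpy(snap, live->b_data, live->b_size);
        compute_parity_from(snap, live->b_size);        /* hypothetical */
        write_block_raw(live->b_dev, live->b_blocknr,
                        snap, live->b_size);            /* hypothetical */
}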

>> No.  How else can I send one copy of data to two locations in
>> journaling, or bypass the cache for direct IO?  This is exactly what I
>> don't want to see happen.

> for direct IO (with the example i have given to you) you are completely
> bypassing the cache, you are not bypassing the index! You are doing
> zero-copy, and the buffer does not stay cached.

Registering and deregistering every 512-byte block for raw IO is a CPU
nightmare, but I can do it for now.
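
To give an idea of where the CPU goes, per-sector raw IO ends up doing
something like this (simplified, error handling omitted; the
register/unregister helpers are hypothetical stand-ins for the
per-block setup and teardown):

/* One temporary buffer_head is set up, indexed and torn down again
 * for every 512 bytes transferred.  The per-sector setup/teardown is
 * the CPU nightmare, not the IO itself. */
int raw_rw_range(int rw, kdev_t dev, long first, char *buf, int nr_sectors)
{
        struct buffer_head *bh[64];     /* assume nr_sectors <= 64 here */
        int i;

        for (i = 0; i < nr_sectors; i++)
                bh[i] = register_temp_bh(dev, first + i,
                                         buf + 512 * i, 512); /* hypothetical */
        ll_rw_block(rw, nr_sectors, bh);
        for (i = 0; i < nr_sectors; i++) {
                wait_on_buffer(bh[i]);
                unregister_temp_bh(bh[i]);                    /* hypothetical */
        }
        return 0;
}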

However, what happens when we start wanting to pass kiobufs directly
into ll_rw_block()?  For performance, we really want to be able to send
chunks larger than a single disk block to the driver in one go.

--Stephen
