Hi,

On Tue, 2 Nov 1999 14:12:00 +0100 (MET), Ingo Molnar
<[EMAIL PROTECTED]> said:

> yes but this means that the block was not cached. 

OK... but raid resync _will_ block forever as it currently stands.

>> > 2.3 removes physical indexing of cached blocks, 
>> 
>> 2.2 never guaranteed that IO was from cached blocks in the first place.
>> Swap and paging both bypass the buffer cache entirely. [..]

> no, paging (named mappings) writes do not bypass the buffer-cache, and
> thats the issue. 

Mappings bypass it for read in 2.2, and for read _and_ write in 2.3.  I
wasn't talking about writes: I'm talking about IO in general.  IO is not
limited to the buffer cache, and the places where the buffer cache is
bypassed are growing, not shrinking.

> I agree that swapping is a problem (bug) even in 2.2, thanks for pointing
> it out. (It's not really hard to fix because the swap cache is more or
> less physically indexed.) 

Yep.  Journaling will have the same problem.  The block device interface
has never required that the writes come from the buffer cache.
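
That is worth spelling out: ll_rw_block() takes whatever buffer_heads
you hand it, and nothing says they came out of the buffer cache.  The
2.2 swap path does essentially this (a condensed sketch, not the
literal brw_page() code):

    struct buffer_head bh, *bhp = &bh;

    /* A throwaway buffer_head, never entered into the buffer-cache
     * hash: roughly how swap/page IO drives the block layer. */
    memset(&bh, 0, sizeof(bh));
    bh.b_dev     = dev;                  /* target device          */
    bh.b_blocknr = block;                /* physical block number  */
    bh.b_size    = PAGE_SIZE;
    bh.b_data    = (char *) page_address(page);
    set_bit(BH_Uptodate, &bh.b_state);
    set_bit(BH_Dirty,    &bh.b_state);   /* WRITE skips clean bhs  */

    ll_rw_block(WRITE, 1, &bhp);         /* the cache never sees it */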

>> But you cannot rely on the buffer cache.  If I "dd" to a swapfile and do
>> a swapon, then the swapper will start to write to that swapfile using
>> temporary buffer_heads.  If you do IO or checksum optimisation based on
>> the buffer cache you'll risk plastering obsolete data over the disks.  

> i dont really mind how it's called. It's a physical index of all dirty &
> cached physical device contents which might get written out directly to
> the device at any time. In 2.2 this is the buffer-cache.

No it isn't.  The buffer cache is a partial cache at best.  It does not
record all writes, and certainly doesn't record all reads, even on 2.2.
Most importantly, data in the buffer cache cannot be written arbitrarily
to disk at any time by the raid code: you'll totally wreck any write
ordering attempts by higher level code.
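
To make the ordering point concrete, think of a journaling commit
(hand-waved sketch; the journal_bhs/data_bhs arrays are illustrative,
not code from any real fs):

    ll_rw_block(WRITE, njournal, journal_bhs);  /* 1: journal blocks  */
    for (i = 0; i < njournal; i++)
        wait_on_buffer(journal_bhs[i]);         /*    ... made stable */
    ll_rw_block(WRITE, ndata, data_bhs);        /* 2: home locations  */

    /*
     * If raid flushes one of data_bhs on its own before step 1
     * completes, a crash can leave a home location updated with no
     * journal record of it: the commit ordering is silently gone.
     */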

> Think about it, it's not a hack, it's a solid concept. The RAID code
> cannot even create its own physical index if the cache is completely
> private. Should the RAID code re-read blocks from disk when it
> calculates parity, just because it cannot access already cached data
> in the pagecache?

It can't access the page cache in 2.2.

> The RAID code is not just a device driver, it's also a cache
> manager. Why do you think it's inferior to access cached data along a
> physical index?

Ask Linus, he's pushing this point much more strongly than I am!  The
buffer cache will become less and less of a cache as time goes on in his
grand plan: it is to become little more than an IO buffer layer.

Basically, for the raid code to poke around in higher layers is a huge
layering violation.  We are heading towards doing things like adding
kiobuf interfaces to ll_rw_block (in which the IO descriptor that the
driver receives will have no reference to the buffer cache), and raw,
unbuffered access to the drivers for raw devices and O_DIRECT.
Raw IO is already there and bypasses the buffer cache.  So does swap.
So does journaling.  So does page-in (in 2.2) and page-out (in 2.3).
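
For illustration, the kiobuf descriptor in question carries bare pages
rather than buffer_heads; abbreviated (locking and completion fields
trimmed), it looks roughly like:

    struct kiobuf {
        int           nr_pages;    /* pages actually mapped       */
        int           array_len;   /* space available in maplist  */
        int           offset;      /* byte offset into first page */
        int           length;      /* number of valid bytes       */
        struct page **maplist;     /* the pages themselves        */
        /* ... */
    };

    /* Nothing in here refers back to the buffer cache: that is
     * precisely the point. */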

I'll not pretend that this doesn't pose difficulties for raid, but
neither do I believe that raid should have the right to be a cache
manager, deciding on its own when to flush stuff to disk.

There is a huge off-list discussion in progress with Linus about this
right now, looking at the layering required to add IO ordering.  We have
a proposal for per-device IO barriers, for example.  If raid ever thinks
it can write to disk without a specific ll_rw_block() request, we are
lost.  Sorry.  You _must_ observe write ordering.
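
To give the flavour of it (the function name below is invented for
illustration; nothing here is a committed interface yet):

    ll_rw_block(WRITE, nr_before, bhs_before);
    blk_queue_barrier(dev);      /* hypothetical per-device barrier */
    ll_rw_block(WRITE, nr_after, bhs_after);

    /*
     * Semantics: every request queued before the barrier completes
     * before any request queued after it.  Drivers, raid included,
     * may reorder freely within each side but may never move a
     * request across the barrier.
     */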

Peeking into the buffer cache for reads is a much more benign behaviour.
It is still going to be a big problem for things like journaling and raw
IO, and is a potential swap corrupter, but we can fix these things by
being religious about removing or updating any buffer-cache copies of
disk blocks we're about to write while bypassing the buffer cache.
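
In 2.2 terms that means something like the following before every such
bypassing write (a sketch against the 2.2 buffer-cache API; locking
and error handling omitted):

    /* Before writing 'block' on 'dev' from outside the buffer cache,
     * kill any alias the cache holds, so a stale cached copy can
     * never be flushed over the newer on-disk data later. */
    static void invalidate_cached_alias(kdev_t dev, int block, int size)
    {
        struct buffer_head *bh = get_hash_table(dev, block, size);

        if (bh) {
            mark_buffer_clean(bh);   /* never write this copy back  */
            bforget(bh);             /* drop it entirely if we can; */
        }                            /* acts like brelse when busy  */
    }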

Right now the raid resync can clearly write buffers without being asked
to do so, and that needs to be fixed.  It should be possible to do so
without redesigning the whole of software raid.  Can we assume that
apart from that, raid*.c never writes data without being asked, even if
it does use the buffer cache to compute parity?  
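
Using cached data read-only for parity is fine by me: parity is just
the byte-wise XOR across the stripe, so the pass writes nothing back.
Illustrative sketch:

    static void compute_parity(unsigned char *parity,
                               unsigned char **blocks,
                               int nblocks, int size)
    {
        int i, j;

        memset(parity, 0, size);         /* start from zero        */
        for (i = 0; i < nblocks; i++)    /* XOR in each data block */
            for (j = 0; j < size; j++)
                parity[j] ^= blocks[i][j];
    }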

> well, we are not talking about non-cached IO here. We are talking about a
> new kind of (improved) page cache that is not physically indexed. _This_
> is the problem. If the page-cache was physically indexed then i could look
> it up from the RAID code just fine. If the page-cache was physically
> indexed (or more accurately, the part of the pagecache that is already
> mapped to a device in one way or another, which is 90+% of it.) then the
> RAID code could obey all the locking (and additional delaying) rules
> present there. 

It is insane to think that a device driver (which raid *is*, from the
point of view of the page cache) should have a right to poke about in a
virtually indexed cache.

> i think your problem is that you do not accept the fact that the RAID code
> is a cache manager/cache user. 

Correct, I don't.  It absolutely must not be a cache manager.  I can see
why you want it to be a cache user, and as long as we restrict that to
the physical cache, fair enough.  As soon as raid thinks that it can
manage the cache, we're in big trouble: there are lots of other users of
IO wanting to send data to drivers.  O_DIRECT is a good example: you can
bet that we'll shortly be seeing filesystems which perform file IO
straight to ll_rw_block without use of the buffer cache (as raw IO does
for devices already).
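
From user space the bypassing style of access would look something
like this (hypothetical for Linux files as yet; raw devices already
behave this way, and buffer address/length must be aligned to the
device block size, e.g. via posix_memalign()):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    static int read_direct(const char *path, void *buf, size_t len)
    {
        ssize_t n;
        int fd = open(path, O_RDONLY | O_DIRECT);

        if (fd < 0)
            return -1;
        n = read(fd, buf, len);   /* buffer cache never touched */
        close(fd);
        return n < 0 ? -1 : 0;
    }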

--Stephen
