On Tue, 2 Nov 1999, Stephen C. Tweedie wrote:
> OK... but raid resync _will_ block forever as it currently stands.
{not forever, but until the transaction is committed. (it's not even
necessary for the RAID resync to wait for locked buffers; it could just as
well skip over locked & dirty buffers - those will be written out anyway.) }
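a rough sketch of the 'skip' idea, using existing 2.2 buffer-cache calls
(get_hash_table(), buffer_locked(), buffer_dirty(), brelse());
resync_should_skip() is just an illustrative name, not anything in the
current RAID code:

#include <linux/fs.h>

/*
 * if the block the resync pass is about to reconstruct is locked or
 * dirty in the buffer cache, the normal writeout path will cover it
 * anyway, so resync can move on instead of blocking on it.
 */
static int resync_should_skip(kdev_t dev, int block, int size)
{
        struct buffer_head *bh = get_hash_table(dev, block, size);
        int skip = 0;

        if (bh) {
                if (buffer_locked(bh) || buffer_dirty(bh))
                        skip = 1;
                brelse(bh);
        }
        return skip;
}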
> > no, paging (named mappings) writes do not bypass the buffer-cache, and
> > thats the issue.
>
> Mappings bypass it for read in 2.2, and for read _and_ write on 2.3. I
> wasn't talking about writes: I'm talking about IO in general. IO is not
> limited to the buffer cache, and the places where the buffer cache is
> bypassed are growing, not shrinking.
i said writes; reads are irrelevant in this context. in our case we were
mainly worried about cache coherency issues, and obviously only dirty data
(i.e. writes) is interesting in that sense. OTOH i do agree that for
better generic performance the RAID code would like to see cached reads as
well, not only dirty data, and this is a problem on 2.2 as well.
i'd like to repeat that i do not mind what the mechanism is called,
buffer-cache or physical-cache or pagecache-II or whatever. The fact that
the number of 'out of the blue sky' caches and IO paths is growing is a
_design bug_, i believe. It is not too hard to fix at this point, and i'm
aware of the possible speed impact, which i'd like to minimize as much as
possible. I do not mind invisible but coherent dirty-caches, we always had
those: eg. the inode cache, or not-yet-mapped dirty pages (not really
present right now, but possible with lazy allocation).
But whenever a given page is known to be 'bound' to a given physical block
on a disk/device and is representative of its content, i'd like to have
ways to access/manage these cache elements. ('manage' obviously does not
include 'changing contents', it means changing the state of the cache
element in a defined way, eg. marking it dirty, marking it clean,
unmapping it, remapping it, etc.)
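a rough sketch of the kind of interface meant here - every name below is
hypothetical, it only illustrates managing the _state_ of a cache element
through a physical index, never its contents:

#include <linux/fs.h>

struct phys_cache_elem;         /* one cached block, physically indexed */

/* look up a cache element by (device, physical block) */
struct phys_cache_elem *phys_index_lookup(kdev_t dev, unsigned long block);

/* state management only - contents are never changed through this path */
void phys_elem_mark_dirty(struct phys_cache_elem *elem);  /* needs writeout       */
void phys_elem_mark_clean(struct phys_cache_elem *elem);  /* matches the disk     */
void phys_elem_unmap(struct phys_cache_elem *elem);       /* drop the dev binding */
void phys_elem_remap(struct phys_cache_elem *elem,
                     kdev_t dev, unsigned long block);    /* bind to a new block  */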
> > I agree that swapping is a problem (bug) even in 2.2, thanks for pointing
> > it out. (It's not really hard to fix because the swap cache is more or
> > less physically indexed.)
>
> Yep. Journaling will have the same problem. The block device interface
> has never required that the writes come from the buffer cache.
yes, i do not require it either. But i'd like to see a way to access
cached contents 'from the physical side' as well - in cases where this is
possible. (and all other cases should -and currently do- stay coherent
explicitly)
> > i dont really mind how it's called. It's a physical index of all dirty &
> > cached physical device contents which might get written out directly to
> > the device at any time. In 2.2 this is the buffer-cache.
>
> No it isn't. The buffer cache is a partial cache at best. It does not
> record all writes, and certainly doesn't record all reads, even on 2.2.
it's a partial cache, i agree, but in 2.2 it is a good and valid way to
ensure data coherency (except in the swapping-to-a-swap-device case,
which is a bug in the RAID code).
> Most importantly, data in the buffer cache cannot be written arbitrarily
> to disk at any time by the raid code: you'll totally wreck any write
> ordering attempts by higher level code.
sure it can. In 2.2 the defined thing that prevents dirty blocks from
being written out arbitrarily (by bdflush) is the buffer lock. bdflush is
a cache manager similar to the RAID code! 'Write ordering attempts by
higher level code' first have to be defined, Stephen, and sure, if/when
these write ordering requirements become part of the buffer-cache then the
RAID code will obey them. But you cannot blame a 3-year-old concept for
not working with 6-month-old new code.
i'd like to re-ask the question of why locking buffers is not a good way
to keep transactions pending. This solves the bdflush and RAID issue
without _any_ change to the buffer-cache. There might be
practical/theoretical reasons for this not being possible/desired, please
enlighten me if this is the case. Again, i do not mind having another
'write ordering' mechanism either, but it should first be defined and
agreed on.
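to illustrate what 'keeping transactions pending via the buffer lock'
could look like with nothing but existing 2.2 primitives - the journal_*
wrappers are made up, only lock_buffer()/unlock_buffer() are real:

#include <linux/fs.h>
#include <linux/locks.h>

/*
 * pin: the buffer stays locked while its transaction is uncommitted,
 * so neither bdflush nor a RAID resync that skips locked buffers will
 * write it out prematurely.
 */
static void journal_pin_buffer(struct buffer_head *bh)
{
        lock_buffer(bh);
        /* ... attach bh to the running transaction ... */
}

/* commit: the log record is safely on disk, normal writeout may resume */
static void journal_commit_buffer(struct buffer_head *bh)
{
        unlock_buffer(bh);
}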
> It can't access the page cache in 2.2.
(yes, i'd like to fix this in the 2.3 RAID code, in addition to being
nice to the journalling code.)
> > The RAID code is not just a device driver, it's also a cache
> > manager. Why do you think it's inferior to access cached data along a
> > physical index?
>
> Ask Linus, he's pushing this point much more strongly than I am! The
> buffer cache will become less and less of a cache as time goes on in his
> grand plan: it is to become little more than an IO buffer layer.
we should not be too focused on the buffer-cache. The buffer-cache has
many 'legacy' features (eg. <PAGE_SIZE buffer sizes), i agree. But it's
not that bad anymore: Davem's SMP-threading rewrite/redesign simplified it
a lot, and now it's much less scary than eg. the 2.0 buffer-cache code.
> I'll not pretend that this doesn't pose difficulties for raid, but
> neither do I believe that raid should have the right to be a cache
> manager, deciding on its own when to flush stuff to disk.
this is a misunderstanding! RAID will not and does not flush anything to
disk that is illegal to flush.
> There is a huge off-list discussion in progress with Linus about this
> right now, looking at the layering required to add IO ordering. We have
> a proposal for per-device IO barriers, for example. If raid ever thinks
> it can write to disk without a specific ll_rw_block() request, we are
> lost. Sorry. You _must_ observe write ordering.
it _WILL_ listen to any defined rule. It will, however, not be able to go
telepathic and guess future interfaces.
> Peeking into the buffer cache for reads is a much more benign behaviour.
> It is still going to be a big problem for things like journaling and raw
> IO, and is a potential swap corrupter, but we can fix these things by
> being religious about removing or updating any buffer-cache copies of
> disk blocks we're about to write to bypassing the buffer cache.
I'd like to have ways to access mapped & valid cached data from the
physical index side.
> [...] Can we assume that
> apart from that, raid*.c never writes data without being asked, even if
> it does use the buffer cache to compute parity?
think of RAID as a normal user process: as such it can do, and wants to
do, anything that is within the rules. Why shouldn't we give access to
certain caches if that is beneficial and doesn't impact the generic case
too much?
> > well, we are not talking about non-cached IO here. We are talking about a
> > new kind of (improved) page cache that is not physically indexed. _This_
> > is the problem. If the page-cache was physically indexed then i could look
> > it up from the RAID code just fine. If the page-cache was physically
> > indexed (or more accurately, the part of the pagecache that is already
> > mapped to a device in one way or another, which is 90+% of it.) then the
> > RAID code could obey all the locking (and additional delaying) rules
> > present there.
>
> It is insane to think that a device driver (which raid *is*, from the
> point of view of the page cache) should have a right to poke about in a
> virtually indexed cache.
RAID is implemented through kernel threads, and as such it's very capable
of handling arbitrary high-level data structures (and it will obey the
rules of those data structures). I'd agree with your points if RAID were a
normal device driver living in IRQ context, but that is not the case.
> > i think your problem is that you do not accept the fact that the RAID code
> > is a cache manager/cache user.
>
> Correct, I don't. It absolutely must not be a cache manager. I can see
> why you want it to be a cache user, and as long as we restrict that to
> the physical cache, fair enough.
I'm talking about _DUAL_ (physical and virtual) indices, not dual caches.
Same data block, two ways to access it. This is not really an aliased
cache, it's two fundamental indices: the physical and the virtual. And the
number of indices is not going to increase. Most high-level code will use
its own virtual (and exclusive) index to a page.
There is not even any huge cost associated with this: every pagecache
page already has mappings (and bhs) established to the physical device,
and these can serve as the physical index just fine. Additionally, in 2.4
we will be doing some kind of unmap_metadata_block() anyway (which is
exactly the physical index), and flushpage removes ->buffers from the
physical index as well. So i'd venture that right now a complete physical
index would be almost invisible.
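a sketch of why the cost would be small, assuming the 2.3 structures: the
bh ring hanging off page->buffers already records the (dev, block) binding
of each mapped part of a pagecache page, so it could double as the
physical-index entry. phys_index_insert() is hypothetical, it only marks
where such a hook would sit:

#include <linux/mm.h>
#include <linux/fs.h>

/* hypothetical hook - not a real kernel function */
extern void phys_index_insert(kdev_t dev, unsigned long block,
                              struct buffer_head *bh);

static void phys_index_page(struct page *page)
{
        struct buffer_head *bh, *head;

        if (!page->buffers)
                return;         /* not (yet) mapped to a device */

        /* walk the per-page bh ring; each bh already carries dev/blocknr */
        bh = head = page->buffers;
        do {
                phys_index_insert(bh->b_dev, bh->b_blocknr, bh);
                bh = bh->b_this_page;
        } while (bh != head);
}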
> [...] As soon as raid thinks that it can
> manage the cache, we're in big trouble: there are lots of other users of
> IO wanting to send data to drivers. O_DIRECT is a good example: you can
> bet that we'll shortly be seeing filesystems which perform file IO
> straight to ll_rw_block without use of the buffer cache (as raw IO does
> for devices already).
O_DIRECT is not a problem either, i believe. For O_DIRECT to work, that
particular physical index block has to be replaced by the directly-written
block. 'replace' is a flush-and-add cache operation. The flush will almost
never trigger these days, and the add will be fast as well. Presence in
the physical index ensures that even if O_DIRECT is delayed/blocked for
whatever reason, the data block is already the authoritative content for
that particular physical block. Once direct IO finishes, the block is
removed from the physical index. The block does not have to be virtually
indexed at all. The RAID code synchronizes with the physical index, so
coherency is assured. Synchronizing O_DIRECT with the physical index does
not make O_DIRECT any less direct.
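the 'replace' step described above, as a sketch with hypothetical
phys_index_* helpers (none of this is existing code):

#include <linux/fs.h>

struct phys_index_entry;        /* hypothetical */

/* flush whatever the index knows about (dev, block) and make new_data
 * the authoritative content for that block until the entry is removed */
extern struct phys_index_entry *phys_index_replace(kdev_t dev,
                                                   unsigned long block,
                                                   char *new_data);
extern void phys_index_remove(struct phys_index_entry *entry);

static void o_direct_write_block(kdev_t dev, unsigned long block, char *data)
{
        struct phys_index_entry *entry;

        entry = phys_index_replace(dev, block, data);   /* flush-and-add    */
        /* ... submit the direct IO and wait for completion ... */
        phys_index_remove(entry);                       /* no longer cached */
}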
the physical index and the IO layer should, i think, be tightly
integrated. This has other advantages as well, not just for RAID or LVM:
- 'physical-index driven write clustering': the IO layer could 'discover'
that a neighboring physical block is dirty as well, and might merge with
it. (strictly only if defined rules permit it, so it's _not_
unconditional - a rough sketch follows after this list.)
another example:
- 'partly physical-index driven readahead': the page-cache might map
'potentially interesting' pages into the physical index, without actually
starting readahead on them. The IO layer, whenever starting/continuing a
read operation, could 'snoop' the physical index for 'potential' readahead
requests much more intelligently, and basically seamlessly 'merge' real
reads with readahead.
or:
- 'physical-index driven defragmentation': the physical index could as
well collect usage information, and feed this back into filesystems to do
usage-based block reordering.
(i do not claim that these are necessarily useful, but the concept does
appear to be solid to me)
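to make the write-clustering example above a bit more concrete, a sketch
(all phys_* names and request_append() are made up; the point is only that
the merge happens strictly within the defined rules):

#include <linux/fs.h>

struct request;                 /* the IO request being built        */
struct phys_cache_elem;         /* hypothetical physical-index elem  */

extern struct phys_cache_elem *phys_index_lookup(kdev_t dev,
                                                 unsigned long block);
extern int  phys_elem_dirty(struct phys_cache_elem *elem);
extern int  phys_elem_may_flush(struct phys_cache_elem *elem); /* ordering rules */
extern void request_append(struct request *req,
                           struct phys_cache_elem *elem);

static void cluster_neighbour(struct request *req, kdev_t dev,
                              unsigned long block)
{
        struct phys_cache_elem *next = phys_index_lookup(dev, block + 1);

        /* merge only if the neighbour is dirty AND the defined rules permit it */
        if (next && phys_elem_dirty(next) && phys_elem_may_flush(next))
                request_append(req, next);
}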
having a pure virtual index to cached data will work fine on boxes that
do not have to worry about IO performance, but implementing any advanced
IO feature (including RAID) will, i believe, be twice as hard, and
impossible/impractical in some cases.
-- mingo