On Wed, 3 Nov 1999, Stephen C. Tweedie wrote:

> You can get around this problem by snapshotting the buffer cache and
> writing it to the disk, of course, [...]

.. which is exactly what the RAID5 code has been doing all along. It _has_ to
do it to get 100% recovery anyway. This is one reason why access to the caches
is so important to the RAID code. (We do not snapshot clean buffers.)
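
To make this concrete, a rough and purely illustrative sketch of the snapshot
step (the function name and the bounce buffer are made up for the example,
2.2-era buffer-cache calls are assumed - this is not the actual raid5 code):

#include <linux/fs.h>
#include <linux/locks.h>
#include <linux/string.h>
#include <asm/bitops.h>

/*
 * Copy the cached (dirty) data into a buffer the RAID code owns, and
 * drive parity and the disk write from that private copy, so a later
 * modification of the cached buffer cannot make parity and data
 * disagree on disk.  Allocation of the bounce buffer is not shown.
 */
static void snapshot_and_write(struct buffer_head *cached,
                               struct buffer_head *bounce)
{
        memcpy(bounce->b_data, cached->b_data, cached->b_size);

        /* parity for the stripe would be computed from bounce->b_data here */

        /*
         * Set the bit directly: ll_rw_block skips clean buffers on WRITE,
         * and mark_buffer_dirty() would also queue this private buffer
         * on the global dirty lists, which we do not want.
         */
        set_bit(BH_Dirty, &bounce->b_state);
        ll_rw_block(WRITE, 1, &bounce);
        wait_on_buffer(bounce);
}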

> > sure it can. In 2.2 the defined thing that prevents dirty blocks from
> > being written out arbitrarily (by bdflush) is the buffer lock. 
> 
> Wrong semantics --- the buffer lock is supposed to synchronise actual
> physical IO (ie. ll_rw_block) and temporary states of the buffer.  It
> is not intended to have any relevance to bdflush.

Because it synchronizes IO, it _obviously_ also synchronizes bdflush access,
because bdflush does nothing but keep the buffer lists balanced and start IO!
You/we might want to make this mechanism more explicit; I don't mind.
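
A tiny sketch of what "use the buffer lock" means here, assuming the 2.2-era
helpers (the function itself is hypothetical):

#include <linux/fs.h>
#include <linux/locks.h>

/*
 * While we hold the buffer lock, ll_rw_block - and therefore bdflush,
 * which does its writes through ll_rw_block - cannot start I/O on this
 * buffer, because it has to take the same lock first.
 */
static void examine_locked(struct buffer_head *bh)
{
        lock_buffer(bh);        /* also waits for any in-flight I/O */
        /* inspect or copy bh->b_data safely here */
        unlock_buffer(bh);
}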

> > this is a misunderstanding! RAID will not and does not flush anything to
> > disk that is illegal to flush.
> 
> raid resync currently does so by writing back buffers which are not
> marked dirty.

No, RAID marks dirty those buffers which it finds in a clean and unlocked
state in the buffer cache. It is perfectly legal to do so - or at least it
was perfectly legal until now - but we can make the rule more explicit. I
always just _suggested_ using the buffer lock.
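
In code, the rule I mean is roughly this (hypothetical helper name; 2.2/2.3
buffer flags assumed):

#include <linux/fs.h>
#include <linux/locks.h>

/*
 * During resync, a buffer found clean and unlocked in the buffer cache
 * is first marked dirty and only then written out - nothing gets
 * written back while still flagged clean.  Whether "clean and unlocked"
 * is a strong enough test is exactly the open question here.
 */
static int resync_claim_cached(struct buffer_head *bh)
{
        if (buffer_locked(bh) || buffer_dirty(bh))
                return 0;       /* in flux, or already queued for writeback */
        mark_buffer_dirty(bh);  /* one-argument form assumed; 2.2 takes an extra flag */
        return 1;
}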

> > I'd like to have ways to access mapped & valid cached data from the
> > physical index side.
> 
> You can't.  You have never been able to assume that safely.
> 
> Think about ext2 writing to a file in any kernel up to 2.3.xx.  We do
> the following:
> 
>       getblk()
>       ll_rw_block(READ) if it is a partial write
>       copy_from_user()
>       mark_buffer_dirty()
>       update_vm_cache()
> 
> copy_from_user has always been able to block.  Do you see the problem?
> We have wide-open windows in which the contents of the buffer cache have
> been modified but the buffer is not marked dirty. [...]

Thanks, I now see the problem. Thinking about it, though, I do not see any
conceptual problem. Right now the page cache is careless about keeping the
physical-index state correct, because it can assume exclusive access to
that state through higher-level locks.
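
To make the window visible, an annotated sketch of the quoted 2.2-era
sequence (names and parameters are illustrative; error handling is omitted):

#include <linux/fs.h>
#include <linux/locks.h>
#include <asm/uaccess.h>

static void sketch_partial_write(kdev_t dev, int block, int blocksize,
                                 const char *ubuf, int offset, int bytes,
                                 struct inode *inode, unsigned long pos)
{
        struct buffer_head *bh = getblk(dev, block, blocksize);

        if (!buffer_uptodate(bh) && bytes < blocksize) {
                /* partial write: read the old contents first */
                ll_rw_block(READ, 1, &bh);
                wait_on_buffer(bh);
        }

        copy_from_user(bh->b_data + offset, ubuf, bytes);       /* may block */

        /*
         * <-- the window: bh->b_data is already modified, but the buffer
         * is not marked dirty yet, so a parity calculation that trusts
         * the dirty bit would be working from stale information here.
         */

        mark_buffer_dirty(bh);          /* one-argument form assumed */
        update_vm_cache(inode, pos, bh->b_data + offset, bytes);
        brelse(bh);
}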

> There are also multiple places in ext2 where we call mark_buffer_dirty()
> on more than one buffer_head after an update.  mark_buffer_dirty() can
> block, so there again you have a window where you risk viewing a
> modified but not dirty buffer.
>
> So, what semantics, precisely, do you need in order to calculate parity?
> I don't see how you can do it reliably if you don't know if the
> in-memory buffer_head matches what is on disk.

OK, I see your point. I guess I'll have to change the RAID code to do the
following:

#define raid_dirty(bh)  (buffer_dirty(bh) && (buffer_count(bh) > 1))

because nothing is allowed to change a clean buffer without holding a
reference to it, and nothing is allowed to release a physical index before
dirtying it. Does this cover all cases?
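
A sketch of how resync would apply that test (buffer_count() is taken to
read bh->b_count, the resync code itself is assumed to hold one of those
references, and snapshot_and_write() refers to the hypothetical sketch
earlier in this mail):

/*
 * Hypothetical use of raid_dirty() in the resync loop: a buffer the
 * macro flags is treated as in flux and copied first; anything else is
 * taken as a stable source for the parity calculation.
 */
static void resync_handle_cached(struct buffer_head *bh,
                                 struct buffer_head *bounce)
{
        if (raid_dirty(bh)) {
                /* possibly still being written to: use a private copy */
                snapshot_and_write(bh, bounce);
                return;
        }
        /* stable: parity can be computed straight from bh->b_data */
}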

> That's fine, but we're disagreeing about what the rules are.  Everything
> else in the system assumes that the rule for device drivers is that
> ll_rw_block defines what they are allowed to do, nothing else.  If you
> want to change that, then we really need to agree exactly what the
> required semantics are.

Agreed.

> > O_DIRECT is not a problem either i believe. 
> 
> Indeed, the cache coherency can be worked out.  The memory mapped
> writable file seems a much bigger problem.

Yep.

> > the physical index and IO layer should i think be tightly integrated. This
> > has other advantages as well, not just RAID or LVM: 
> 
> No.  How else can I send one copy of data to two locations in
> journaling, or bypass the cache for direct IO?  This is exactly what I
> don't want to see happen.

For direct IO (with the example I gave you) you are completely bypassing
the cache, but you are not bypassing the index! You are doing zero-copy IO,
and the buffer does not stay cached.
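
As a very rough sketch of what I mean (2.2-era buffer_head fields assumed;
the real raw-IO code needs more setup and error handling than shown here):

#include <linux/fs.h>
#include <linux/locks.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/string.h>

/*
 * A throw-away buffer_head carries the physical index (device plus
 * block number) and points straight at the caller's memory, so the data
 * moves zero-copy and never enters the buffer cache - yet the physical
 * index of the IO stays fully visible to layers like RAID.
 */
static int direct_read_block(kdev_t dev, int block, int size, char *data)
{
        struct buffer_head *bh = kmalloc(sizeof(*bh), GFP_KERNEL);
        int err;

        if (!bh)
                return -ENOMEM;
        memset(bh, 0, sizeof(*bh));
        bh->b_dev = dev;                /* the physical index ... */
        bh->b_blocknr = block;          /* ... is still all there */
        bh->b_size = size;
        bh->b_data = data;              /* zero-copy: no cache page involved */

        ll_rw_block(READ, 1, &bh);
        wait_on_buffer(bh);

        err = buffer_uptodate(bh) ? 0 : -EIO;
        kfree(bh);
        return err;
}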

-- mingo
