Alexander Viro wrote:
> On Wed, 26 Jul 2000, Stephen C. Tweedie wrote:
> > On Wed, Jul 26, 2000 at 03:19:46PM -0400, Alexander Viro wrote:
> >
> > > Erm? Consider that: huge lseek() + write past the end of file. Woops - got
> > > to unmerge the tail (it's an internal block now) and we've got no
> > > knowledge of IO going on the page. Again, IO may be asynchronous - no
> > > protection from i_sem for us. After that page becomes a regular one,
> > > right? Looks like a change of state to me...
> >
> > Naturally, and that change of state must be made atomically by the
> > filesystem.
> 
> Yep. Which is the point - there _are_ dragons. I believe that it's doable,
> but I really want to repeat: Daniel, watch out for races at the moments
> when page state changes, it needs a more accurate approach than the usual
> pagecache-using fs. It can be done, but it will take some reading (and
> yes, Stephen, I know that _you_ know it ;-)

That's apparent, and I feel that Stephen could probably implement the entire
tail merge as described so far in a few days.  But that wouldn't be as useful as
having me, and perhaps some other interested observers, go all the way through
the exercise of figuring out the so-far unwritten rules of the
buffercache/pagecache duo.

The exact same accurate work is required for Tux2, which makes massive use of
copy-on-write.  Right now, buffer issues are the main thing standing in the way
of making a development code release for Tux2.  So there is no question in my
mind about whether such issues have to be dealt with: they do.

I dove into the 2.4.0 cache code for the first time last night (using lxr - try
it, you'll like it) and I'm almost at the point where I have some relevant
questions to ask.  I notice that buffer.c has increased in size by almost 50%
and is far and away the largest module in the VFS.  Worse, buffer.c is massively
cross-coupled to the mm subsystem and the page cache, as we know too well. 
Buffer.c is right at the core of the issues we're talking about.

Bearing that in mind, instead of just jumping in and starting to code, I'll try
the methodical approach :-)  My immediate objective is to try to clarify a few
things that aren't immediately obvious from the source, in the following areas:

  - States and transitions for the main objects:
    - Buffer heads
    - Buffer data
    - Page heads
    - Page data
    - Other?

  - Existing concurrency controls:
    - Semaphores/Spinlocks
    - Big kernel lock
    - Filesystem locks
    - Posix locks?
    - Other?

  - Planned additions/deletions of concurrency controls

I will also try to make a list of the main internal functions in the VFS (and
some related ones from the mm and drivers modules) and examine
function-by-function what the intended usage is, what the issues/caveats are,
and maybe even how we can expect them to evolve in the future.

I think we need even more than this in terms of documentation in order to work
effectively, but this at least will be a good start.  It will be more than what
we have now.  If it gets to the point where we can actually answer questions
about race conditions by consulting the docs then we really will have
accomplished something.  Yes, I know that the code is going to keep evolving and
sometimes will break the docs, but I also have confidence that the docs can keep
up with such evolution given some interested volunteer doc maintainers willing
to hang out on the devel list and keep asking questions.

Even in 2.2.x I felt that there was a lot of understated elegance in Linux's
buffer cache design.  In 2.4.0 it seems to be getting more elegant, although
it's hard to say exactly, because of the sparse (read: nonexistent)
documentation.  This is a problem that can be easily fixed.

To get through this I will have to ask a lot of naive-sounding questions. 
Hopefully I'll have the first batch ready this afternoon (morning, your time).

-- 
Daniel
