Chris Mason wrote:
> --On 10/05/00 13:49:31 +0200 Daniel Phillips wrote:
> > Chris Mason wrote:
> >>
> >> For the most part, reiserfs can play nice with bdflush.  I give it blocks
> >> when I've decided they are ready to get to disk, and I keep blocks away
> >> from it when they aren't allowed to be written.
> >
> > But why not give them straight to ll_rw_block?
> 
> Because I don't want them sent to disk yet ;-)  Let them age a while in the
> bdflush dirty list first.

I'm just trying to get it straight.  You *can* write them now, but
you're not necessarily in a big hurry to, somebody might be able to
write them again if they hang around, and the VM can age them instead of
you, right?

In Tux2's case nobody is allowed to write to a dirty buffer once it has
entered the recording phase (otherwise you would contaminate the
recorded tree) so it doesn't make sense to do anything else than feed it
straight to ll_rw_block.
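To make the rule concrete, here is a toy userspace model (Python, not kernel code; all names are illustrative stand-ins, `submit` playing the role of ll_rw_block): once a dirty buffer enters the recording phase it is frozen and submitted immediately, and any later write attempt is rejected.

```python
# Toy model of the Tux2 rule above: a buffer entering the recording
# phase becomes immutable and goes straight to the I/O layer.

class Buffer:
    def __init__(self, data):
        self.data = data
        self.recording = False   # set when the buffer enters the recording phase

    def write(self, data):
        if self.recording:
            raise RuntimeError("buffer is frozen: recording phase")
        self.data = data

def enter_recording_phase(buf, submit):
    """Freeze the buffer and hand it straight to the I/O layer."""
    buf.recording = True
    submit(buf)

submitted = []
b = Buffer(b"dirty")
b.write(b"still mutable")            # fine: still in the branching phase
enter_recording_phase(b, submitted.append)
try:
    b.write(b"too late")             # would contaminate the recorded tree
except RuntimeError:
    pass
```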

> > Maybe the real question
> > is, where does the elevator scheduling happen, in ll_rw_block or in
> > bdflush?  I haven't checked.
> 
> bdflush/kupdate decides when to send things to the elevator, the elevator
> queues, merges, and sorts them for the disk.

Yes, that sounds correct.  In Tux2's case, bdflush just doesn't know
what it needs to know to decide when to send them, and it can't really
do useful aging for me, so it's basically a useless appendage - worse
than that, it can cause damage, which I think is the case for you too,
since you keep some dirty buffers away from it.

> >> There have been threads on i/o ordering recently, and that would really
> >> clean things up.  Stephen, I'm assuming you have io ordering in mind for
> >> your queue of 2.5 changes, I'm more than willing to help code something.
> >
> > I/O ordering constraints are complex for journalling filesystems, simple
> > for Tux2.  Tux2 blocks are always partitioned into two groups, plus two
> > metaroots for ordering purposes, and the relationship is simple:  write
> > all of the first group; then its metaroot; let the second group become
> > the first group; wait for a new second group to appear; repeat as
> > necessary.  No outside mechanism is needed to assist this.
> 
> Do you have to wait for the metaroot to reach disk before you can allow the
> second group to become the first group?

Yes, assuming you mean:

  first group <= branching phase
  second group <= recording phase

So I have the priority ordering:

  blocks(i) -> root(i) -> blocks(i+1) -> root(i+1) -> etc

And it would be possible to compress that slightly to:

  root(i-1) + blocks(i) -> root(i) + blocks(i+1) -> etc

But this requires a surprisingly large amount of extra bookkeeping so I
don't bother.  It doesn't matter either, because phase blocks greatly
outnumber metaroots, and all the writes end up back-to-back anyway. 
Nothing is saved by overlapping the root write, not even seek time (in
theory) because I have a set of metaroot locations distributed across
the partition to choose from.  I'll try to choose a metaroot location
that the elevator algorithm will like.
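The uncompressed ordering above can be sketched as a trivial flush loop (a toy simulation with assumed names, not Tux2 code): write the whole group for phase i, then its metaroot, then move on to the next phase. A completion barrier between each group and its metaroot is the only constraint.

```python
# Simulate the write ordering: blocks(i) -> root(i) -> blocks(i+1) -> ...

def flush_phases(phases):
    """phases: list of (block_list, metaroot). Returns the on-disk write order."""
    log = []
    for blocks, root in phases:
        for b in blocks:
            log.append(b)       # write the whole group first...
        log.append(root)        # ...then its metaroot, only after the group
    return log

order = flush_phases([
    (["b0", "b1", "b2"], "root0"),
    (["b3", "b4"], "root1"),
])
# order: ['b0', 'b1', 'b2', 'root0', 'b3', 'b4', 'root1']
```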

> >> > What we need is a sensible method/callback/library arrangement for the
> >> > sync like we now have for read/write/mmap.  What we have now is far
> >> > from sensible.  Syncing should be done one superblock at a time, not
> >> > across the entire system like it is now.  IOW, it's currently sliced
> >> > horizontally while it really needs to be sliced vertically.  We need
> >> > a sync_filesystem method and it should default to a
> >> > generic_sync_super that does the current dumb sync.  You should then
> >> > put your improvements in as a method override, not just make the
> >> > current messy arrangement even messier.
> >>
> >> I don't entirely disagree, but reiserfs could actually sync slower if it
> >> was done an FS at a time.  write_super will commit the current
> >> transaction, which will dirty a whole bunch of metadata buffers for
> >> writing.  So, by calling write_super on every FS first, you have the
> >> chance to make better use of the underlying devices.
> >
> > I don't see how you make better use of anything.
> 
> Which is better:
> 
> ll_rw_block(buffer1) ;
> some things that can schedule/take a long time
> ll_rw_block(buffer2) ;
> more things that can schedule/take a long time
> ll_rw_block(buffer3) ;
> etc.
> 
> Or:
> 
> ll_rw_block(buffer1) ;
> ll_rw_block(buffer2) ;
> ll_rw_block(buffer3) ;
> 
> things that schedule/take a long time
> 
> For reiserfs, the current code for fsync_dev and friends will result in the
> second example, 

Exactly what I was thinking: this situation is the result of fsync_dev
and friends imposing their dumb-filesystem-friendly view of the world on
everybody.  Wouldn't it be *way* better to start a kernel thread for
each sb:

  for each sb: kernel_thread (sb->sync, sb, threadflags);

And then wait for the threads to complete?  Lots of things would be
cleaner then, and one *big* thing would happen: VFS can detect and
handle a stuck filesystem intelligently for a change.
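A userspace sketch of the idea (Python threads standing in for kernel threads; the `Superblock` class and its `sync` method are hypothetical placeholders): spawn one worker per superblock, then join them all, so one slow or stuck filesystem no longer serializes the rest.

```python
# Sketch of per-superblock sync: one worker thread per sb, then wait.

import threading

class Superblock:
    def __init__(self, name):
        self.name = name
        self.synced = False

    def sync(self):
        self.synced = True       # stand-in for a real filesystem sync

def sync_all(superblocks, timeout=None):
    threads = [threading.Thread(target=sb.sync) for sb in superblocks]
    for t in threads:
        t.start()                # all filesystems sync concurrently
    for t in threads:
        t.join(timeout)          # a timeout here is where a stuck fs could be caught

sbs = [Superblock(n) for n in ("sda1", "sdb1")]
sync_all(sbs)
```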

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]