Re: Disk sync at shutdown and fusefs filesystems

2007-12-11 Thread Karsten Behrmann
On Tue, 11 Dec 2007 12:22:35 -0800 (PST), Doug Barton wrote:
 
 I suppose this is mostly a style difference, but I like to avoid all those 
 subshells if we can. I also think it might be a good idea to wait a second 
 between unmounts, just to be paranoid. How about:
 
 mount | while read dev d1 mountpoint d2; do
   case $dev in
   /dev/fuse[0-9]*) umount $mountpoint ; sleep 1 ;;
   esac
 done
 sleep 1

Hmm, if you truly want to be paranoid, you probably should be unmounting
those in reverse order, because someone might be mounting one fuse-fs
inside another ;)

just my 2 cents,
  Karsten

-- 
Open source is not about suing someone who sells your software. It is
about being able to walk behind him, grinning, and waving free CDs with
the equivalent of what he is trying to sell.



Re: unionfs kqueue?

2007-12-09 Thread Karsten Behrmann
Heya,

 Does unionfs work with kqueue? When I run `tail -f` on a file residing
 on unionfs with cd9660 underneath and md+ufs over it, it doesn't detect
 changes. The changes are immediately visible, just not with tail -f.

Hmm. When you start the tail -f, does the file reside on the cd9660 or
already on the md?
See if tail -F does a better job.
My guess would be that, since you cannot modify a file on any filesystem
except the top one, unionfs must change the semantics of open so that even
opening for writing or appending silently creates a new copy of the file
on the top filesystem (if the file didn't reside there already).
As tail -f still has the lower-layer file open, it never notices that
there is a new file by that name.

(this behavior is the same as

echo foo > foo
tail -f foo
# in another terminal
echo bar > bar
mv bar foo

which also fails to notice the new data)

So Far,
  Karsten "BearPerson" Behrmann

p.s.: this is probably why the -F option was added to tail

-- 
Open source is not about suing someone who sells your software. It is
about being able to walk behind him, grinning, and waving free CDs with
the equivalent of what he is trying to sell.



Re: Pluggable Disk Scheduler Project

2007-10-19 Thread Karsten Behrmann
 I wouldn't focus on just anticipation, but also other types of
 schedulers (I/O scheduling influenced by nice value?)

One note on this subject: it will be nice for things like bg-fsck,
but it also brings up a point that has been overlooked so far:
priority propagation.

As far as I am aware, there are plenty of locks that a process can
hold while waiting for I/O to complete (directories, snapshots,
probably some others I missed).
Now, normally when a high-priority thread needs a lock owned by
another thread, it'll bump up that thread's priority and yield,
hopefully freeing the lock up quickly.
This is obviously not quite so easy for I/O. I haven't quite
understood the code involved yet; as far as I can tell, turnstiles
panic when the lock-owning thread is sleeping.

What we'll probably need to do is make priority propagation wake
up a waiting-for-I/O process, which then needs to dig up its I/O
request (which may be anywhere in geom, but potentially held back
by the scheduler) and make sure it's [re]queued with higher
priority.
If we don't do this, then on an I/O-busy system we'll get funny
effects such as a bg-fsck blocking some high-priority process
indefinitely, because the fsck happens to be waiting for I/O while
holding the snapshot lock.
If we do this, we may get into significant fun with cutting into
geom to allow requeuing, or waste some CPU with polling from the
queuing geom.
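
To make that a bit more concrete, here is a very rough sketch of what
the requeue side could look like inside a scheduling geom. None of this
exists; sched_softc and gsched_boost_thread are made up for illustration,
and it assumes the scheduler stashes the issuing thread in bio_caller1
when it queues a request (locking omitted):

/*
 * Hypothetical sketch: bump a held-back request to the front of the
 * scheduler's queue when priority is propagated to the thread that
 * issued it.  Assumes the scheduler recorded the issuing thread in
 * bio_caller1 when the bio was queued.
 */
#include <sys/param.h>
#include <sys/queue.h>
#include <sys/bio.h>
#include <sys/proc.h>

struct sched_softc {
	struct bio_queue_head sc_queue;	/* requests held back by the scheduler */
};

static void
gsched_boost_thread(struct sched_softc *sc, struct thread *td)
{
	struct bio *bp;

	TAILQ_FOREACH(bp, &sc->sc_queue.queue, bio_queue) {
		if (bp->bio_caller1 == td) {
			/* move it to the head so it gets dispatched next */
			bioq_remove(&sc->sc_queue, bp);
			bioq_insert_head(&sc->sc_queue, bp);
			break;
		}
	}
}

The hard part, as said above, is getting from the sleeping thread to its
bio in the first place, and doing that without racing the normal dispatch
path.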

This point may not be immediately obvious to people coming from the
IO/filesystem field, but it is something we should keep in mind.

So Far,
  Karsten Behrmann



Re: Pluggable Disk Scheduler Project

2007-10-16 Thread Karsten Behrmann
 Hi,
 is anybody working on the `Pluggable Disk Scheduler Project' from
 the ideas page?
I've been kicking the idea around in my head, but I'm probably newer to
everything involved than you are, so feel free to pick it up. If you want,
we can toss some ideas and code to each other, though I don't really
have anything on the latter.

[...]
 After reading [1], [2] and its follow-ups the main problems that
 need to be addressed seem to be:
 
 o is working on disk scheduling worth at all?
Probably; one of the main applications would be to make the background
fsck a little better behaved.

 o Where is the right place (in GEOM) for a disk scheduler?
I have spent some time at EuroBSDCon talking to Kirk and phk about
this, and the result was that I now know strong proponents both for
putting it into the disk drivers and for putting it into geom ;-)

Personally, I would put it into geom. I'll go into more detail on
this later, but basically, geom seems a better fit for high-level
code than a device driver, and if done properly the performance
penalty should be negligible.

 o How can anticipation be introduced into the GEOM framework?
I wouldn't focus on just anticipation, but also other types of
schedulers (I/O scheduling influenced by nice value?)

 o What can be an interface for disk schedulers?
Good question, but geom seems like a good start ;)

 o How to deal with devices that handle multiple request per time?
Bad news first: this is most disks out there, in a way ;)
SCSI has tagged queuing, ATA has native command queuing or
whatever the ata people came up with over their morning coffee today.
I'll mention a bit more about this further down.

 o How to deal with metadata requests and other VFS issues?
Like any other disk request, though for priority-respecting
schedulers this may get rather interesting.

[...]
 The main idea is to allow the scheduler to enqueue the requests
 having only one (other small fixed numbers can be better on some
 hardware) outstanding request and to pass new requests to its
 provider only after the service of the previous one ended.
You'll want to queue at least two requests at once. The reason for
this is performance:
Currently, drivers queue their own I/O. This means that as soon
as a request completes (on devices that don't have in-device
queues), they can fairly quickly grab a new request from their
internal queue and push it down to the device from the interrupt
handler or some other fast path.
Having the device idle while the response percolates up the geom
stack and a new request comes back down would likely be rather
wasteful.
For disks with queuing, I'd recommend trying to keep the queue
reasonably full (unless the queuing strategy says otherwise);
for disks without queuing I'd say we want to push at least one
more request down. Personally, I think the sanest design would
be to have device drivers return a temporary I/O error along
the lines of EAGAIN, meaning their queue is full.
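
Just to illustrate that interface, a dispatch loop in the queuing geom
could then look roughly like this. g_try_io_request() is invented here,
standing in for whatever call ends up letting the driver report its
queue-full condition; the rest is the standard bioq machinery (the usual
sys/bio.h and geom/geom.h includes are omitted):

/*
 * Sketch only: push queued requests down until the driver signals
 * "queue full".  g_try_io_request() does not exist in GEOM today.
 */
static void
gsched_dispatch(struct bio_queue_head *queue, struct g_consumer *cp)
{
	struct bio *bp;

	while ((bp = bioq_first(queue)) != NULL) {
		bioq_remove(queue, bp);
		if (g_try_io_request(bp, cp) == EAGAIN) {
			/* driver queue full: put it back, retry on completion */
			bioq_insert_head(queue, bp);
			break;
		}
	}
}

The idea being that every request completion pokes the scheduler to run
this loop again, so the device never has to sit idle waiting for a full
round trip up and down the geom stack.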

 The example scheduler in the draft takes the following approach:
 
 o a scheduling GEOM class is introduced.  It can be stacked on
   top of disk geoms, and schedules all the requests coming
   from its consumers.  I'm not absolutely sure that a new class
   is really needed but I think that it can simplify testing and
   experimenting with various solutions on the scheduler placement.
Probably, though we'll want to make sure that they stack on top of
(or are inside of?) the geoms talking to the disks, because it rarely
makes sense to put a queuing geom on top of, say, a disklabel geom.

The advantage of making it a full geom is configurability. You would
be able to swap out a scheduler at runtime, select different schedulers
for different disks, and potentially even test new schedulers without
rebooting (though you wouldn't want to do that for benchmarks).

 o  Requests coming from consumers are passed down immediately
   if there is no other request under service, otherwise they
   are queued in a bioq.
This is specific to the anticipatory scheduler. I would say in more
general terms:
- A queuing geom is to push every request that it wants serviced down
towards the disk, until the disk reports a full queue. A queuing geom
is allowed to hold back requests even when the driver queue is not
full yet, if it does not want the disk to attempt such I/O yet (such
as the anticipatory scheduler waiting for another disk request near
the last one, or the process-priority scheduler holding back a
low-priority request that would potentially cause a long seek, until
I/O has been idle for a while); a sketch of such a check follows below.
This dispels phk's anti-geom argument that it will be inefficient
because it will take longer for a new request to get to the driver:
if the queuing strategy had wanted the request to be sent to the
drive, it would already have sent it. (The exception is that the disk
will have its internal queue a little emptier than it could be -
not a big issue with queue sizes of 8 or 16.)
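
For the anticipatory example above, the "does the strategy want to send
this yet" check could be as simple as something like the following.
as_softc, sc_last_offset and sc_anticipating are made-up scheduler state,
not anything GEOM keeps for us, and MAXPHYS is just an arbitrary notion
of "nearby":

/*
 * Sketch only: decide whether an anticipatory strategy holds a request
 * back even though the driver queue has room.
 */
struct as_softc {
	off_t	sc_last_offset;		/* end of the last completed request */
	int	sc_anticipating;	/* still hoping for a nearby request */
};

static int
as_wants_to_hold(struct as_softc *sc, struct bio *bp)
{
	off_t dist;

	/* a request close to the last completed one goes down right away */
	dist = bp->bio_offset - sc->sc_last_offset;
	if (dist >= 0 && dist <= MAXPHYS)
		return (0);
	/* a far-away request waits as long as we still expect a closer one */
	return (sc->sc_anticipating);
}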

[...]