Re: [HACKERS] Analysis of ganged WAL writes

Hannu Krosing Mon, 07 Oct 2002 13:32:27 -0700

On Tue, 2002-10-08 at 01:27, Tom Lane wrote:
>
> The scheme we now have (with my recent patch) essentially says that the
> commit delay seen by any one transaction is at most two disk rotations.
> Unfortunately it's also at least one rotation :-(, except in the case
> where there is no contention, ie, no already-scheduled WAL write when
> the transaction reaches the commit stage.  It would be nice to be able
> to say "at most one disk rotation" instead --- but I don't see how to
> do that in the absence of detailed information about disk head position.
> 
> Something I was toying with this afternoon: assume we have a background
> process responsible for all WAL writes --- not only filled buffers, but
> the currently active buffer.  It periodically checks to see if there
> are unwritten commit records in the active buffer, and if so schedules
> a write for them.  If this could be done during each disk rotation,
> "just before" the disk reaches the active WAL log block, we'd have an
> ideal solution.  And it would not be too hard for such a process to
> determine the right time: it could measure the drive rotational speed
> by observing the completion times of successive writes to the same
> sector, and it wouldn't take much logic to empirically find the latest
> time at which a write can be issued and have a good probability of
> hitting the disk on time.  (At least, this would work pretty well given
> a dedicated WAL drive, else there'd be too much interference from other
> I/O requests.)
> 
> However, this whole scheme falls down on the same problem we've run into
> before: user processes can't schedule themselves with millisecond
> accuracy.  The writer process might be able to determine the ideal time
> to wake up and make the check, but it can't get the Unix kernel to
> dispatch it then, at least not on most Unixen.  The typical scheduling
> slop is one time slice, which is comparable to if not more than the
> disk rotation time.


Standard for Linux has been 100Hz time slice, but it is configurable for
some time.

The latest RedHat (8.0) is built with 500Hz that makes about 4
slices/rev for 7200 rpm disks (2 for 15000rpm)
 
> ISTM aio_write only improves the picture if there's some magic in-kernel
> processing that makes this same kind of judgment as to when to issue the
> "ganged" write for real, and is able to do it on time because it's in
> the kernel.  I haven't heard anything to make me think that that feature
> actually exists.  AFAIK the kernel isn't much more enlightened about
> physical head positions than we are.

At least for open source kernels it could be possible to

1. write a patch to kernel 

or 

2. get the authors of kernel aio interested in doing it.

or

3. the third possibility would be using some real-time (RT) OS or mixed
RT/conventional OS where some threads can be scheduled for hard-RT .
In an RT os you are supposed to be able to do exactly what you describe.


I think that 2 and 3 could be "outsourced" (the respective developers
talked into supporting it) as both KAIO and RT Linuxen/BSDs are probably
also inetersted in high-profile applications so they could boast that
"using our stuff enabled PostgreSQL database run twice as fast".

Anyway, getting to near-harware speeds for database will need more
specific support from OS than web browsing or compiling.

---------------
Hannu








---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])

Re: [HACKERS] Analysis of ganged WAL writes

Reply via email to