Mike Matrigali wrote:
Øystein Grøvlen wrote:
Mike Matrigali wrote:
I think my main issue is that I don't see that it is important to
optimize writing the cached dirty data. Especially since the order
in which you are proposing to write the dirty data is exactly the wrong
order for the current cache performance goal of minimizing the total
number of I/Os the system is going to do (a page that is the oldest
written exists in
a busy cache most likely because it has been written many times -
otherwise the standard background I/O thread would have written
it already).
I think your logic is flawed if you are talking about checkpointing
(and not the background writer). If you want to guarantee a certain
recovery time, you will need to write the oldest page. Otherwise, you
will not be able to advance the starting point for recovery. This
approach to checkpointing should reduce the number of I/Os since you
are not writing a busy page until it is absolutely necessary. The
current checkpointing writes a lot of pages that do nothing to make it
possible to garbage-collect the log. Those pages should be left
to the background writer, which can use its own criteria for which
pages are optimal to write.
I guess I was not clear. I agree with you:
checkpoint - wants to write oldest page, I agree this is necessary
to move the redo low water mark.
background - wants to write least used, probably not oldest page.
Which pages that the current checkpoint process writes are you saying
are not necessary? Are they the ones that go from clean
to dirty after the checkpoint starts? It seems that in current
checkpoint all pages dirty at the start are necessary to move the redo
low water mark.
It is not necessary to write all pages to be able to move the redo low
water mark forward. It is necessary to write all pages to move the redo
low water mark all the way up to the new checkpoint log record. However,
that will probably give a much lower recovery time than what we are
aiming for. Hence, we can skip writing the newer pages and still be
within the requested recovery time.
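
To make the point above concrete, here is a minimal Java sketch of the
selection rule. It is not taken from the existing checkpoint code: the
class, the DirtyPage type, and the idea that the requested recovery time
has already been translated into a target log position (targetLsn) are
all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from the real code base. */
final class CheckpointSketch {

    /** A dirty cached page. firstDirtyLsn is the log position of its oldest unwritten change. */
    static final class DirtyPage {
        final long pageNumber;
        final long firstDirtyLsn;

        DirtyPage(long pageNumber, long firstDirtyLsn) {
            this.pageNumber = pageNumber;
            this.firstDirtyLsn = firstDirtyLsn;
        }
    }

    /**
     * Returns the pages a checkpoint has to write so that the redo low water
     * mark can advance to targetLsn, oldest first. Pages first dirtied at or
     * after targetLsn are skipped and left to the background writer.
     */
    static List<DirtyPage> pagesToWrite(List<DirtyPage> dirty, long targetLsn) {
        List<DirtyPage> result = new ArrayList<>();
        for (DirtyPage p : dirty) {
            if (p.firstDirtyLsn < targetLsn) {
                result.add(p);
            }
        }
        result.sort((DirtyPage a, DirtyPage b) -> Long.compare(a.firstDirtyLsn, b.firstDirtyLsn));
        return result;
    }

    /**
     * Redo low water mark given the pages that are still dirty: the oldest
     * firstDirtyLsn among them, or logEnd when nothing is dirty.
     */
    static long redoLowWaterMark(List<DirtyPage> stillDirty, long logEnd) {
        long mark = logEnd;
        for (DirtyPage p : stillDirty) {
            mark = Math.min(mark, p.firstDirtyLsn);
        }
        return mark;
    }
}
```

With three dirty pages first dirtied at log positions 100, 300 and 500
and a target of 400, only the first two are written, and the redo low
water mark then sits at 500 rather than at the new checkpoint log record.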
...
I think we SHOULD sync for every I/O, but not the way we do today. By
opening the files with "rwd", we should be able to do this pretty
efficiently already today. (At least on some systems. I am not sure
about non-POSIX systems like Windows.) Syncing for every I/O gives us
much more control over the I/O, and we will not be vulnerable to
queuing effects that we do not control.
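
As a rough illustration of what opening with "rwd" looks like, here is a
small Java sketch using java.io.RandomAccessFile. The class and method
names and the page layout (page number times page size) are assumptions
for the example, not the existing container code. In "rwd" mode each
update to the file's content is written synchronously to the underlying
storage device, so write() does not return until the data has been
handed to the device.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical sketch; not the real page-writing code. */
final class SyncWriteSketch {

    /**
     * Writes one page at offset pageNumber * page.length, with the file
     * opened in "rwd" mode so the content update is synchronous.
     * A real implementation would keep the file open rather than reopen
     * it per write; it is reopened here only to keep the sketch short.
     */
    static void writePageSynchronously(File file, long pageNumber, byte[] page)
            throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rwd")) {
            raf.seek(pageNumber * page.length);
            raf.write(page);
        }
    }
}
```

How efficient this is in practice depends on the platform and file
system, so it would need to be measured rather than assumed.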
Do you think we should sync for every I/O in the non-checkpoint case
also? The case I am most interested in is where a user transaction
needs to wait for a page in the cache and the only way to give that
page is by writing another page in the cache out. Currently this write
is async; are you proposing to change this to a sync write?
This scenario should be very rare. If it is not rare, async writing will
probably just lead you into trouble over time since you will allow user
threads to proceed at a rate that the file system will not be able to
sustain in the long run. Also, see my reply to Suresh where I discuss a
way this could be handled so it is still async with respect to user threads.
--
Øystein