I agree, it would be nice to have a lighter checkpoint that does not need
to flush the whole cache to disk, especially when the cache is configured
to be very large. I also remember reading about this somewhere a long
time ago. I think the basic idea is to keep track of the highest LSN of
the pages that have been flushed to disk, and at checkpoint time flush
any pages still in the cache with an LSN lower than that. This might be
achieved by keeping the first LSN that updated the page, in addition to
the last LSN that is currently used to flush the log when a page is
written to disk. The main difference between the current checkpoint and
this one is that the REDO low-water mark can be long before the
checkpoint log record. In the worst case it will be the same as the
current checkpoint.
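The idea above can be sketched as follows. This is a minimal illustration with hypothetical names (`Page`, `firstLSN`, `pagesToFlush`), not Derby code: each dirty page remembers the first LSN that dirtied it since it was last clean, and a light checkpoint flushes only pages whose first LSN is older than a chosen target, leaving busy, recently dirtied pages in the cache.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

class LightCheckpoint {
    static final class Page {
        final long pageNum;
        final long firstLSN; // first LSN that dirtied the page since it was last clean
        final long lastLSN;  // most recent LSN that updated the page (already tracked today)
        Page(long pageNum, long firstLSN, long lastLSN) {
            this.pageNum = pageNum;
            this.firstLSN = firstLSN;
            this.lastLSN = lastLSN;
        }
    }

    // Flush only pages first dirtied before targetLSN; busy pages that were
    // dirtied recently stay in the cache. The REDO low-water mark then
    // advances to the minimum firstLSN of the pages left dirty.
    static List<Page> pagesToFlush(Collection<Page> dirty, long targetLSN) {
        List<Page> out = new ArrayList<>();
        for (Page p : dirty) {
            if (p.firstLSN < targetLSN) {
                out.add(p);
            }
        }
        return out;
    }
}
```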
Changing the page writes to synchronous using "rwd" is a good idea when
the cache is large. With small cache sizes, like the default of 1000
pages, it might be a problem because a user request for an empty page is
likely to trigger foreground synchronous writes.
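For reference, the "rwd" mode mentioned here is Java's RandomAccessFile open mode that makes each write() synchronous for file content, so no separate full-file sync is needed for the data itself ("rws" additionally syncs metadata such as file length). A small sketch, not Derby's page writer:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

class RwdWrite {
    // Write one zero-filled page with "rwd": write() returns only after the
    // data has been forced to the underlying device, so no file-wide sync
    // of the content is needed afterwards.
    static long writePageSync(File f, int pageSize) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "rwd")) {
            raf.seek(0);
            raf.write(new byte[pageSize]);
            return raf.length();
        }
    }
}
```

The trade-off discussed above follows directly: every write pays the sync cost up front, which is fine for background checkpoint writes but hurts when a foreground user thread has to clean a page to get a free buffer.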
Thanks
-suresht
Øystein Grøvlen wrote:
I would like to see the changes to checkpointing that Raymond suggests.
The main reason I like this, is that it provides separation of
concerns. It cleanly separates the work to reduce recovery time
(checkpointing) from the work to make sure that a sufficient part of the
cache is clean (background writer). I think what Raymond suggests is
similar to the way ARIES propose to do checkpointing. As far as I
recall, ARIES goes a step further since checkpointing does not involve
any writing of pages at all. It just updates the control file based on
the oldest dirty page.
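The ARIES variant can be sketched even more simply. In a fuzzy checkpoint no pages are written at all; the checkpoint record snapshots the dirty page table, and REDO starts at the minimum recLSN it contains. The names here (`dirtyPageTable`, `redoStartLSN`) are illustrative, not from any implementation:

```java
import java.util.Map;

class FuzzyCheckpoint {
    // dirtyPageTable maps pageNum -> recLSN (first LSN that dirtied the page).
    // Recovery REDO must start at the oldest such LSN; if nothing is dirty,
    // it can start at the checkpoint record itself.
    static long redoStartLSN(Map<Long, Long> dirtyPageTable, long checkpointLSN) {
        long min = checkpointLSN;
        for (long recLSN : dirtyPageTable.values()) {
            min = Math.min(min, recLSN);
        }
        return min;
    }
}
```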
Mike Matrigali wrote:
I think my main issue is that I don't see that it is important to
optimize writing the cached dirty data. Especially since the order in
which you propose to write the dirty data is exactly wrong for the
current cache performance goal of minimizing the total number of I/Os
the system does (a page that is the oldest written exists in a busy
cache most likely because it has been written many times; otherwise the
standard background I/O thread would have written it already).
I think your logic is flawed if you are talking about checkpointing (and
not the background writer). If you want to guarantee a certain recovery
time, you will need to write the oldest page. Otherwise, you will not
be able to advance the starting point for recovery. This approach to
checkpointing should reduce the number of I/Os since you are not writing
a busy page until it is absolutely necessary. The current checkpointing
writes a lot of pages which does not do anything to make it possible to
garbage-collect log. Those pages should be left to the background
writer, which can use its own criteria for which pages are optimal to
write.
Raymond suggests using his queue also for the background writer. This
is NOT a good idea! The background writer should write those pages that
are least likely to be accessed in the near future since they are the
best candidates to be replaced in the cache. Currently a clock
algorithm is used for this. I am not convinced that is the best
approach. I suspect that an LRU-based algorithm would be much better.
(But this is separate discussion.)
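The LRU policy floated here can be illustrated with the standard-library building block Java provides for it. This is only a sketch of the replacement policy, not Derby's cache: LinkedHashMap in access-order mode keeps the least recently used entry eldest, so it is the natural eviction victim.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: entries are ordered by access, and the eldest
// (least recently used) entry is evicted once capacity is exceeded.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // true = access order, not insertion order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entry
    }
}
```

A background writer built on this would write (and allow replacement of) the pages at the eldest end, i.e., exactly the pages least likely to be touched again soon.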
If we knew derby was the only process on the machine, then an approach
such as you suggest might be reasonable, i.e., we own all resources so
we should max out the use of all of them. But as a zero-admin embedded
db I think derby should be more conservative in its resource usage.
I agree, and I think that an incremental approach makes that easier. You
are more free to pause the writing activity without significantly
impacting recovery time. With the current checkpointing, slowing down
the writing will more directly increase the recovery time.
If we have determined that 20 MB of log will give a decent recovery
time, we can write the pages at the head of the queue at a rate that
tries to keep the amount of active log around 20 MB. This should spread
checkpointing I/O more evenly over time instead of the bursty behavior
we have today.
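The pacing described above can be sketched as a simple feedback rule (the names and the per-page estimate are illustrative assumptions, not measured values): whenever the active log exceeds the target, write enough pages from the head of the queue to bring it back, rather than flushing in one burst.

```java
class CheckpointPacer {
    // bytesFreedPerPage: estimated log bytes reclaimed when the oldest dirty
    // page is flushed and the REDO low-water mark advances past its records.
    // Returns how many head-of-queue pages to write right now to bring the
    // active log back down to the target (e.g., ~20 MB).
    static int pagesToWrite(long activeLogBytes, long targetBytes, long bytesFreedPerPage) {
        long excess = activeLogBytes - targetBytes;
        if (excess <= 0) {
            return 0; // under target: no checkpoint I/O needed this round
        }
        return (int) ((excess + bytesFreedPerPage - 1) / bytesFreedPerPage); // round up
    }
}
```

Calling this periodically from the checkpoint thread spreads the writes over time in proportion to log growth, instead of producing the bursty behavior of a full-cache flush.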
I agree that your incremental approach optimizes recovery time; I just
don't think that any runtime performance hit is worth it (or even the
extra complication of the checkpoint algorithms at no runtime cost).
The system should, as you propose, attempt to guarantee a maximum
recovery time, but I see no need to work hard (i.e., use extra
resources) to guarantee better than that. Recovery is an edge case; it
should not be optimized for.
I agree, and that is why I think that we should write as few pages as
possible with recovery time in mind (i.e., during checkpointing). In
other words, we should only write pages that will actually advance the
starting point for recovery.
Also note that the current checkpoint does 2 operations to ensure each
page is on disk; you cannot assume the page has hit disk until both are
complete. It first uses a Java write (which is async by default), and
then it forces the entire file. The second step is a big overhead on
some systems, so it is not appropriate to do for each write (where the
overhead is CPU time linear in the size of the file rather than in the
number of dirty pages).
I think we SHOULD sync for every I/O, but not the way we do today. By
opening the files with "rwd", we should be able to do this pretty
efficiently already today. (At least on some systems; I am not sure
about non-POSIX systems like Windows.) Syncing for every I/O gives us
much more control over the I/O, and we will not be vulnerable to queuing
effects that we do not control.
This has been discussed previously on the list. As has been pointed
out, the most efficient method of writing out a number of pages is to
somehow queue a small number of writes asynchronously, and then wait
for all of them to finish before queueing the next set. Unfortunately,
standard OS mechanisms to do this don't exist yet in Java; they are
being proposed in some new JSRs. I have been waiting for patches from
others, but if one doesn't come I will change the current checkpoint
before the next release to queue a small number of writes, wait for the
estimated time of executing those writes, and then continue to queue
more writes. This should solve 90% of the checkpoint I/O flood issue.
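The throttling proposed above might look like the following. This is a hypothetical sketch, not the Derby implementation: issue a small batch of writes, pause for the estimated time those writes take to drain, then queue the next batch.

```java
import java.util.function.IntConsumer;

class ThrottledFlush {
    // Write pageCount pages in batches of batchSize, sleeping pauseMillis
    // between batches so the checkpoint does not flood the I/O system.
    // writePage stands in for the actual page-write operation.
    static void flush(int pageCount, int batchSize, long pauseMillis,
                      IntConsumer writePage) throws InterruptedException {
        for (int start = 0; start < pageCount; start += batchSize) {
            int end = Math.min(start + batchSize, pageCount);
            for (int p = start; p < end; p++) {
                writePage.accept(p);
            }
            if (end < pageCount) {
                Thread.sleep(pauseMillis); // let the OS drain the queued writes
            }
        }
    }
}
```

Since Java (at the time of this thread) offers no way to wait on a set of async writes, sleeping for an estimated service time is the crude but workable substitute the paragraph above describes.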
I have been planning to address this for a while, but have not been able
to do that so far. I was planning to experiment a bit with the syncing
I describe above to see if there are scenarios where such an approach
would not give sufficient throughput. If that is the case, we would
need to parallelize the writing. If I do not have time to do that, I
would go for something simpler as you describe above.
--
Øystein