I agree, it would be nice to have a lighter checkpoint that does not need
to flush the whole cache to disk, especially when the cache is configured
to be very large. I also remember reading about this somewhere a long
time ago. I think the basic idea is to keep track of the highest LSN of
the pages that have been flushed to disk, and at checkpoint time flush
any pages still in the cache with an LSN lower than that. This might be
achieved by keeping the first LSN that updated the page, in addition to
the last LSN that is currently used to flush the log when a page is
written to disk. The main difference between the current checkpoint and
this one is that the REDO low-water mark can be long before the
checkpoint log record. In the worst case it will be the same as the
current checkpoint.
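The idea above can be sketched as follows. This is a minimal illustration with hypothetical names (`Page`, `firstLSN`, `pagesToFlush`), not Derby code: each dirty page remembers the first LSN that dirtied it since it was last clean, and a light checkpoint flushes only pages whose first LSN is older than a chosen target, leaving busy, recently dirtied pages in the cache.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

class LightCheckpoint {
    static final class Page {
        final long pageNum;
        final long firstLSN; // first LSN that dirtied the page since it was last clean
        final long lastLSN;  // most recent LSN that updated the page (already tracked today)
        Page(long pageNum, long firstLSN, long lastLSN) {
            this.pageNum = pageNum;
            this.firstLSN = firstLSN;
            this.lastLSN = lastLSN;
        }
    }

    // Flush only pages first dirtied before targetLSN; busy pages that were
    // dirtied recently stay in the cache. The REDO low-water mark then
    // advances to the minimum firstLSN of the pages left dirty.
    static List<Page> pagesToFlush(Collection<Page> dirty, long targetLSN) {
        List<Page> out = new ArrayList<>();
        for (Page p : dirty) {
            if (p.firstLSN < targetLSN) {
                out.add(p);
            }
        }
        return out;
    }
}
```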
Changing the page writes to synchronous using "rwd" is a good idea when
the cache is large. With small cache sizes, like the default of 1000
pages, it might be a problem because a user request for an empty page is
likely to trigger foreground synchronous writes.
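For reference, the "rwd" mode mentioned here is Java's RandomAccessFile open mode that makes each write() synchronous for file content, so no separate full-file sync is needed for the data itself ("rws" additionally syncs metadata such as file length). A small sketch, not Derby's page writer:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

class RwdWrite {
    // Write one zero-filled page with "rwd": write() returns only after the
    // data has been forced to the underlying device, so no file-wide sync
    // of the content is needed afterwards.
    static long writePageSync(File f, int pageSize) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "rwd")) {
            raf.seek(0);
            raf.write(new byte[pageSize]);
            return raf.length();
        }
    }
}
```

The trade-off discussed above follows directly: every write pays the sync cost up front, which is fine for background checkpoint writes but hurts when a foreground user thread has to clean a page to get a free buffer.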
Thanks
-suresht
Øystein Grøvlen wrote:
I would like to see the changes to checkpointing that Raymond suggests.
The main reason I like this, is that it provides separation of
concerns. It cleanly separates the work to reduce recovery time
(checkpointing) from the work to make sure that a sufficient part of the
cache is clean (background writer). I think what Raymond suggests is
similar to the way ARIES propose to do checkpointing. As far as I
recall, ARIES goes a step further since checkpointing does not involve
any writing of pages at all. It just updates the control file based on
the oldest dirty page.
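The ARIES variant can be sketched even more simply. In a fuzzy checkpoint no pages are written at all; the checkpoint record snapshots the dirty page table, and REDO starts at the minimum recLSN it contains. The names here (`dirtyPageTable`, `redoStartLSN`) are illustrative, not from any implementation:

```java
import java.util.Map;

class FuzzyCheckpoint {
    // dirtyPageTable maps pageNum -> recLSN (first LSN that dirtied the page).
    // Recovery REDO must start at the oldest such LSN; if nothing is dirty,
    // it can start at the checkpoint record itself.
    static long redoStartLSN(Map<Long, Long> dirtyPageTable, long checkpointLSN) {
        long min = checkpointLSN;
        for (long recLSN : dirtyPageTable.values()) {
            min = Math.min(min, recLSN);
        }
        return min;
    }
}
```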
Mike Matrigali wrote:
I think my main issue is that I don't see that it is important to
optimize writing the cached dirty data. Especially since the order in
which you propose to write the dirty data is exactly wrong for the
current cache performance goal of minimizing the total number of I/Os
the system does (a page that is the oldest written exists in a busy
cache most likely because it has been written many times; otherwise the
standard background I/O thread would have written it already).
I think your logic is flawed if you are talking about checkpointing (and
not the background writer). If you want to guarantee a certain recovery
time, you will need to write the oldest page. Otherwise, you will not
be able to advance the starting point for recovery. This approach to
checkpointing should reduce the number of I/Os since you are not writing
a busy page until it is absolutely necessary. The current checkpointing
writes a lot of pages which does not do anything to make it possible to
garbage-collect log. Those pages should be left to the background
writer, which can use its own criteria for which pages are optimal to
write.
Raymond suggests using his queue also for the background writer. This
is NOT a good idea! The background writer should write those pages that
are least likely to be accessed in the near future since they are the
best candidates to be replaced in the cache. Currently a clock
algorithm is used for this. I am not convinced that is the best
approach. I suspect that an LRU-based algorithm would be much better.
(But this is separate discussion.)
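The LRU policy floated here can be illustrated with the standard-library building block Java provides for it. This is only a sketch of the replacement policy, not Derby's cache: LinkedHashMap in access-order mode keeps the least recently used entry eldest, so it is the natural eviction victim.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: entries are ordered by access, and the eldest
// (least recently used) entry is evicted once capacity is exceeded.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // true = access order, not insertion order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entry
    }
}
```

A background writer built on this would write (and allow replacement of) the pages at the eldest end, i.e., exactly the pages least likely to be touched again soon.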
If we knew derby was the only process on the machine, then an approach
such as you suggest might be reasonable, i.e., we own all resources so
we should max out the use of all of them. But as a zero-admin embedded
db I think derby should be more conservative in its resource usage.
I agree, and I think that an incremental approach makes that easier. You
are more free to pause the writing activity without significantly
impacting recovery time. With the current checkpointing, slowing down
the writing will more directly increase the recovery time.
If we have determined that 20 MB of log will give a decent recovery
time, we can write the pages at the head of the queue at a rate that
tries to keep the amount of active log around 20 MB. This should spread
checkpointing I/O more evenly over time instead of the bursty behavior
we have today.
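The pacing described above can be sketched as a simple feedback rule (the names and the per-page estimate are illustrative assumptions, not measured values): whenever the active log exceeds the target, write enough pages from the head of the queue to bring it back, rather than flushing in one burst.

```java
class CheckpointPacer {
    // bytesFreedPerPage: estimated log bytes reclaimed when the oldest dirty
    // page is flushed and the REDO low-water mark advances past its records.
    // Returns how many head-of-queue pages to write right now to bring the
    // active log back down to the target (e.g., ~20 MB).
    static int pagesToWrite(long activeLogBytes, long targetBytes, long bytesFreedPerPage) {
        long excess = activeLogBytes - targetBytes;
        if (excess <= 0) {
            return 0; // under target: no checkpoint I/O needed this round
        }
        return (int) ((excess + bytesFreedPerPage - 1) / bytesFreedPerPage); // round up
    }
}
```

Calling this periodically from the checkpoint thread spreads the writes over time in proportion to log growth, instead of producing the bursty behavior of a full-cache flush.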
I agree that your incremental approach optimizes recovery time; I just
don't think that any runtime performance hit is worth it (or even the
extra complication of the checkpoint algorithms at no runtime cost).
The system should, as you propose, attempt to guarantee a maximum
recovery time, but I see no need to work hard (i.e., use extra
resources) to guarantee better than that. Recovery is an edge case; it
should not be optimized for.
I agree, and that is why I think that we should write as few pages as
possible with recovery time in mind (i.e., during checkpointing). In
other words, we should only write pages that will actually advance the
starting point for recovery.
Also note that the current checkpoint does 2 operations to ensure each
page is on disk; you cannot assume the page has hit disk until both are
complete. It first uses a Java write (which is async by default), and
then it forces the entire file. The second step is a big overhead on
some systems, so it is not appropriate to do for each write (where the
overhead is CPU time linear in the size of the file rather than in the
number of dirty pages).
I think we SHOULD sync for every I/O, but not the way we do today. By
opening the files with "rwd", we should be able to do this pretty
efficiently already today. (At least on some systems; I am not sure
about non-POSIX systems like Windows.) Syncing for every I/O gives us
much more control over the I/O, and we will not be vulnerable to queuing
effects that we do not control.
This has been discussed previously on the list. As has been pointed
out, the most efficient method of writing out a number of pages is to
somehow queue a small number of writes asynchronously, and then wait
for all of them to finish before queueing the next set. Unfortunately,
standard OS mechanisms to do this don't exist yet in Java; they are
being proposed in some new JSRs. I have been waiting for patches from
others, but if one doesn't come I will change the current checkpoint
before the next release to queue a small number of writes, wait for the
estimated time of executing those writes, and then continue to queue
more writes. This should solve 90% of the checkpoint I/O flood issue.
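The throttling proposed above might look like the following. This is a hypothetical sketch, not the Derby implementation: issue a small batch of writes, pause for the estimated time those writes take to drain, then queue the next batch.

```java
import java.util.function.IntConsumer;

class ThrottledFlush {
    // Write pageCount pages in batches of batchSize, sleeping pauseMillis
    // between batches so the checkpoint does not flood the I/O system.
    // writePage stands in for the actual page-write operation.
    static void flush(int pageCount, int batchSize, long pauseMillis,
                      IntConsumer writePage) throws InterruptedException {
        for (int start = 0; start < pageCount; start += batchSize) {
            int end = Math.min(start + batchSize, pageCount);
            for (int p = start; p < end; p++) {
                writePage.accept(p);
            }
            if (end < pageCount) {
                Thread.sleep(pauseMillis); // let the OS drain the queued writes
            }
        }
    }
}
```

Since Java (at the time of this thread) offers no way to wait on a set of async writes, sleeping for an estimated service time is the crude but workable substitute the paragraph above describes.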
I have been planning to address this for a while, but have not been able
to do that so far. I was planning to experiment a bit with the syncing
I describe above to see if there are scenarios where such an approach
would not give sufficient throughput. If that is the case, we would
need to parallelize the writing. If I do not have time to do that, I
would go for something simpler as you describe above.
--
Øystein