Mike, thank you for your comments. They really help me a lot. I would
like to discuss the issue further.
RR 2. During initialization of Derby, we run some measurement that
RR determines the performance of the system and maps the
RR recovery time into some X megabytes of log.
MM What do you mean by initialization? Once per boot, once per db
MM creation, something else?
Initialization here means once per boot.
RR 3. establish a dirty page list in which dirty pages are sorted in
RR ascending order of the time when they were first updated. When one
RR dirty page is flushed out to disk, it will be released from the
RR list. (this step needs further discussion, whether we need to
RR establish such a list)
MM I am not convinced of a need for such a list, and do not want to see
MM such a list slow down non-checkpoint related activity. From other
MM reported Derby issues it is clear we actually want to "slow" the
MM checkpoint down rather than optimize it.
Actually, the list is not designed to speed up the checkpoint process; it
is for the incremental checkpoint described in step 5. With this list, we
can guarantee that the oldest dirty pages are written out. We can think of
the incremental checkpoint as dividing the current checkpoint process into
several pieces, and trying to make each piece an intact checkpoint in
itself. In other words, the incremental checkpoint process never stops; it
just uses the time when the system is not so busy to do a piece of the
checkpoint. If we search the entire cache space to write out dirty pages,
as we do now, we can't guarantee the oldest dirty pages are written out,
and the redoLWM may be much older than the last checkpoint mark.
MM The downside with the
MM current
MM algorithm is that a page that is made dirty after the checkpoint
MM starts
MM will be written, and if it gets written again before the next
MM checkpoint
MM we have done 1 too many I/O's. I think I/O optimization may benefit
MM more by working on optimizing the background I/O thread than working
MM on the checkpoint.
If the background I/O thread can refer to this list, I think it can help
solve the problem you mentioned. I am not very familiar with the background
I/O thread, so if I am wrong, please point it out.
In the list, the dirty pages are sorted in ascending order of the time when
they were first updated, which means the oldest dirty page is at the head of
the list and the most recently updated dirty page is at the end of the list.
The operations on the list are:
- When a page is updated and it is not in the list, we will append it to
the end of the list.
- When a dirty page in the list is written out to disk, it will be released
from the list.
Let's look at your problem:
if a page is made dirty after the checkpoint starts,
1) if the page was already dirty before this update, it is supposed to be
   in the list already, so we don't need to add it again.
   When the checkpoint process writes this dirty page out to disk, it will
   be released from the list, and if the background I/O thread refers to
   the list, it will know there is no need to write this page out again.
2) if the page was updated for the first time, it will be appended to the
   end of the list. If the background I/O thread refers to the list, it
   knows there is no need to write this page out so soon, since it has
   just been updated.
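To make the two operations concrete, here is a minimal Java sketch of such a list. The class and method names (DirtyPageList, pageDirtied, pageFlushed) are made up for illustration and are not actual Derby code:

```java
import java.util.LinkedHashSet;

// Sketch of the dirty page list: pages are kept in the order they were
// FIRST dirtied, so the head is always the oldest dirty page. A
// LinkedHashSet preserves first-insertion order and gives O(1)
// membership test, append, and removal.
public class DirtyPageList {
    private final LinkedHashSet<Long> pages = new LinkedHashSet<>();

    // Called when a page is updated: append only if not already dirty.
    // Re-dirtying a page that is already in the list does NOT reorder it.
    public synchronized void pageDirtied(long pageId) {
        pages.add(pageId); // no-op if pageId is already present
    }

    // Called when a dirty page is written out to disk.
    public synchronized void pageFlushed(long pageId) {
        pages.remove(pageId);
    }

    // The oldest dirty page, i.e. the next checkpoint candidate.
    public synchronized Long oldest() {
        return pages.isEmpty() ? null : pages.iterator().next();
    }

    public synchronized int size() {
        return pages.size();
    }
}
```

The background I/O thread could consult oldest() to decide which page is worth writing, which is the behavior described in cases 1) and 2) above.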
RR 4. A checkpoint is made and controlled in combined consideration
RR of
RR -the acceptable log length which we get in step 2
RR -the current IO performance
RR 5. We do incremental checkpoint. That means:
RR From the beginning of the dirty page list established in step 3
RR (the earliest updated dirty page), to the end of the list (the
RR latest updated dirty page), we do checkpoint. If data reads or
RR log writes (if the log is in the default location) start to have
RR longer response times than an appropriate value, we pause the
RR checkpoint process and update the log control file to let Derby
RR know where we are. When the data read or log write times return
RR to an acceptable value, we continue the checkpoint.
MM This sounds like you are looking to address DERBY-799. I thought
MM Oystein was going to work on this, but there is no owner so I am
MM not sure. You may at least want to consult with him on his
MM findings.
I wrote a mail to Oystein and hope he would give me some comments.
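For concreteness, the pause-and-resume loop of step 5 could look roughly like the sketch below. All names are illustrative, not Derby internals; ioResponseMillis stands in for whatever read/write latency probe we would actually use, and "flushing" a page here just removes it from the dirty page list:

```java
import java.util.Deque;
import java.util.function.LongSupplier;

// Hypothetical sketch of the incremental checkpoint loop in step 5:
// walk the dirty page list oldest-first, pausing whenever IO response
// times exceed the acceptable value chosen in step 4.
public class IncrementalCheckpoint {
    private final long acceptableMillis;        // threshold from step 4
    private final LongSupplier ioResponseMillis; // current IO latency probe
    int pagesFlushed = 0;
    int pauses = 0;

    public IncrementalCheckpoint(long acceptableMillis, LongSupplier probe) {
        this.acceptableMillis = acceptableMillis;
        this.ioResponseMillis = probe;
    }

    public void run(Deque<Long> dirtyPages) {
        while (!dirtyPages.isEmpty()) {
            if (ioResponseMillis.getAsLong() > acceptableMillis) {
                pauses++;   // a real version would sleep here and update
                continue;   // the log control file before resuming
            }
            dirtyPages.pollFirst(); // write the oldest dirty page
            pagesFlushed++;
        }
    }
}
```

Updating the log control file at each pause is what lets recovery know how far the incremental checkpoint has progressed.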
RR This is just an outline. I would like to discuss the details with
RR everyone later. If anyone has any suggestion, please let me know.
RR
RR Now, I am going to design the 2nd step first, to map the recovery
RR time into some X megabytes of log. A simple approach is that we
RR can design a test log file. For the log file, we can let Derby
RR create a temporary database, do a bunch of tests to get the
RR necessary disk IO information, and then delete the temporary
RR database. When Derby boots up, we let it do recovery from the
RR test log file. Does anyone have other suggestions on it?
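To make the quoted idea more concrete, here is a rough Java sketch of how the measurement could map a recovery-time target to an acceptable log length. All names are made up, and the assumption that redo time is dominated by a sequential scan of the log is mine, not something we have verified:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

// Sketch of step 2: time a sequential scan of a scratch file to estimate
// redo throughput, then turn a target recovery time into X megabytes of
// log. The 8K buffer size is an arbitrary assumption.
public class RecoveryCalibration {

    // Time a sequential scan of a scratch file and return MB/s.
    static double sequentialReadMBps(Path file) throws IOException {
        byte[] buf = new byte[8192];
        long start = System.nanoTime();
        long total = 0;
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            int n;
            while ((n = raf.read(buf)) > 0) total += n;
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return (total / 1e6) / seconds;
    }

    // Acceptable log length in MB for a target recovery time in seconds.
    static double acceptableLogMB(double targetSeconds, double scanMBps) {
        return targetSeconds * scanMBps;
    }
}
```

For example, if the scan measures 10 MB/s and we want recovery to finish in 30 seconds, the checkpoint should be triggered before the log grows past roughly 300 MB.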
MM I'll think about this, it is not straight forward. My guess would
MM be that recovery time is dominated by 2 factors:
MM 1) I/O from log disk
MM 2) I/O from data disk
MM
MM Item 1 is pretty easy to get a handle on. During redo it is pretty
MM much
MM a straight scan from beginning to end doing page based I/O. Undo is
MM harder as it jumps back and forth for each xact. I would probably
MM just
MM ignore it for estimates.
MM
MM Item 2 is totally dependent on the cache hit rate you are going to
MM expect, and the number of log records.
MM The majority of log records deal with a single page, it will read
MM the page into cache if it doesn't exist and then it will do a quick
MM operation on that page. Again undo is slightly more complicated as
MM it
MM could involve logical lookups in the index.
MM
MM Another option rather than do any sort of testing is to come up with
MM an
MM initial default time based on size of log file. And then on each
MM subsequent recovery event dynamically change the estimate based on
MM how
MM long that recovery on that db took. This way each estimate will be
MM based on the actual work generated by the application, and over time
MM should become better and better.
Thank you for your suggestion. I will think about it more carefully and
discuss it later.
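As I understand your last suggestion, the estimate could be refined after each real recovery event, for example with a simple weighted blend. The sketch below is just my reading of the idea; the seconds-per-megabyte model and the smoothing factor are assumptions:

```java
// Sketch of the adaptive estimate: start from a default
// seconds-per-MB-of-log figure and blend in the observed rate after
// each actual recovery, so the estimate tracks the work the
// application really generates.
public class RecoveryEstimate {
    private double secondsPerMB;              // current estimate
    private static final double ALPHA = 0.5;  // weight of the newest sample

    public RecoveryEstimate(double initialSecondsPerMB) {
        this.secondsPerMB = initialSecondsPerMB;
    }

    // Predicted recovery time for a log of the given size.
    public double predictSeconds(double logMB) {
        return secondsPerMB * logMB;
    }

    // After an actual recovery, fold the observed rate into the estimate.
    public void recordRecovery(double logMB, double observedSeconds) {
        double observedRate = observedSeconds / logMB;
        secondsPerMB = ALPHA * observedRate + (1 - ALPHA) * secondsPerMB;
    }

    public double secondsPerMB() { return secondsPerMB; }
}
```

Over many boots the estimate should converge toward the database's real recovery behavior, as you described.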
RR
RR I am wondering whether I need to establish some relationship
RR between the data read time and the data write time. I mean, under
RR a certain average data read time, approximately how long would
RR the average data write time be? Since what we get from step 2 is
RR just under a certain system condition, when the system condition
RR changes (becomes busier), the value should change too. If I can
RR establish such a relationship, then I can make accurate
RR adjustments to the checkpoint process.
RR
RR Raymond