Suresh,

Thanks for the information about log switches and checkpoints. I had to use this server for some other work, but now I have restored the crashed Derby database and started to look into the failing log files. See comments inline.

Suresh Thalamati wrote:
Olav Sandstaa (JIRA) wrote:
[ https://issues.apache.org/jira/browse/DERBY-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466475 ]
Olav Sandstaa commented on DERBY-2254:
--------------------------------------

Mike, thank you for commenting on this and for explaining what you
think are the issues behind this problem. I agree that the best
solution seems to be to fix this so that the background thread can do
log file switching without waiting for the checkpoint to complete.

Checkpoint and log switch can run in parallel (not sure we ever tested this case, though). I think the problem here is that we are dumping both requests onto the same thread, so it is more of a performance issue. Maybe in the future log switches/checkpoints/... should be processed on different threads when the cache is configured with a large size.

I think some log size calculations are wrong; if this bug is fixed, Derby should still work by correctly switching the log file before hitting the MAX limit.



Some questions somebody may know the answer to:

  -What is the reason that most of the log files have the default
   size of 1 MB, but a few (4) log files grow to 268 MB before
   switching?




The default log file size that triggers a log switch is 1 MB, and a checkpoint is triggered every 10 MB of log. Normally log switch requests are processed by the rawStoreDaemon (checkpoint daemon) thread, without blocking user threads until the switch completes.

A log switch happens in the following four scenarios:

1) On a log flush call: if the current log file size > configured log
   file size, a log switch request is placed on the rawstore daemon
   thread.

2) As part of a checkpoint: a log switch is done before performing the
   checkpoint if the log file size > configured log file size.

3) While writing log records, if the log file size has reached the
   MAX_LIMIT.

4) During backup, to back up full log files instead of partial portions.
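As a rough sketch of the decision logic described above (the names, signature, and queue structure are illustrative and taken from this thread's description, not from Derby's actual code):

```python
# Hypothetical sketch of the log switch / checkpoint triggering described
# above; names and structure are illustrative, not Derby's actual code.
LOG_SWITCH_INTERVAL = 1 * 1024 * 1024        # default: switch at 1 MB
CHECKPOINT_INTERVAL = 10 * 1024 * 1024       # default: checkpoint every 10 MB

def on_log_flush(current_file_size, log_written_since_checkpoint,
                 checkpoint_in_progress, daemon_queue):
    """Decide what work to hand to the rawStoreDaemon thread."""
    if current_file_size > LOG_SWITCH_INTERVAL:
        # Scenario 1: the user thread does not wait; the daemon
        # performs the switch later.
        daemon_queue.append("log switch")
    if (log_written_since_checkpoint > CHECKPOINT_INTERVAL
            and not checkpoint_in_progress):
        # A checkpoint itself first switches the log if the file is too
        # big (scenario 2), then writes the checkpoint.
        daemon_queue.append("checkpoint")
    return daemon_queue
```

The point of this bug report is that both kinds of request land on the same daemon queue, so a slow checkpoint delays every log switch queued behind it.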

Looking at the log file sizes you posted, my guess is that LOG SWITCH
requests are getting stuck behind the checkpoint request. I would
think that with an 18 GB cache, a checkpoint must be taking a long time.

Another case where log files grow bigger than the configured size is when
transactions are long and nothing in the database triggers a flush call. I am guessing that is not the case here; your application must be doing frequent commits, otherwise it is unlikely that there are 1 MB log files.

You are right. The test produces small transactions consisting of only five operations, so it is not the length of the transactions that makes the log files grow very large.


  -The crash occurs after having produced about 1 GB of log. Is there
   a hard limit on 1 GB before the checkpoint must be completed?


I have not seen any such limit. Unless you have changed the defaults, a checkpoint is requested every 10 MB of log, unless a checkpoint is already in progress. I think the checkpoints are taking a long time while parallel activity is producing a lot of log. By the way, how did you come up with the 1 GB number?

I have not changed any defaults. The reason for asking if there was something related to 1 GB of log was probably that I misunderstood a comment Mike made (quote from DERBY-2254: "my best guess is that the act of switching the log file at 1 gig has become stuck behind finishing the checkpoint").

I see only a few 1 MB log files and huge files (268 MB) in
between... My guess is that a checkpoint is in progress while
those large files are being created.

If you want to confirm this hypothesis, you can dump the whole log on recovery, or just modify the code to print only checkpoints from the log. You can dump the log by setting the following debug flags:

derby.debug.true=DumpLogOnly
derby.storage.logDumpStart=49

I have tried running with these settings and the resulting 3.2 GB derby.log file contains the dump of about 16 million log records. The last entries in the log are:

DEBUG LogTrace OUTPUT: scanned 12632271 : EndXact null Committed : transactionStatus = Committed instant = (1211,268435242) logEnd = (1211,268435277) logIn at 15 available 4
DEBUG LogTrace OUTPUT: scanned 12632266 : Page Operation: Page(33,Container(0, 960)) pageVersion 176243 : Update Slot=8 recordId=14 instant = (1211,268435277) logEnd = (1211,268435346) logIn at 24 available 29
DEBUG LogTrace OUTPUT: scanned 12632266 : EndXact null Committed : transactionStatus = Committed instant = (1211,268435346) logEnd = (1211,268435381) logIn at 15 available 4
DEBUG LogTrace OUTPUT: scanned Null : Checksum Operation algorithm = 1 value = 2138849209 data length= 38 instant = (1211,268435381) logEnd = (1211,268435418)
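For what it is worth, the second fields of those last instants sit just under 2^28 bytes, which matches the 268 MB file sizes mentioned earlier. A quick check (assuming the second field of an instant tuple is a byte offset within the current log file; that reading is an assumption based on the numbers, not confirmed against Derby's code):

```python
# Offsets taken from the last instants in the dump above.
last_offsets = [268435242, 268435277, 268435346, 268435381, 268435418]

# 2^28 bytes = 256 MiB = 268,435,456 bytes, i.e. the "268 MB" file size
# observed for the oversized log files.
MAX_LIMIT = 2 ** 28
print(MAX_LIMIT)                              # 268435456
print([MAX_LIMIT - o for o in last_offsets])  # all within ~220 bytes of the limit
```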

As you suggested, I tried to look for any sign of whether a checkpoint was in progress or not, but from the dump of the log I cannot find any sign of checkpoints in the entire log. Do you have any suggestions for how a checkpoint would identify itself in the log dump?

If I try to categorize the entries in the log dump they fall into the following types (with the number of occurrences in the first column):

11422649   Page Operation:
2340723    BeginXact
2340734    EndXact
420947     Checksum Operation
29         detected proper log end on log file
29         switched to next log file
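The categorization above can be reproduced with a short script. The substring matching is heuristic, based only on the dump excerpt earlier in this mail, not on any official format for Derby's log trace output:

```python
from collections import Counter

# Record types observed in the dump excerpt above; matching is heuristic.
KINDS = ("Page Operation", "BeginXact", "EndXact", "Checksum Operation",
         "detected proper log end on log file", "switched to next log file")

def categorize(lines):
    """Count log record types in a Derby log dump by substring matching."""
    counts = Counter()
    for line in lines:
        for kind in KINDS:
            if kind in line:
                counts[kind] += 1
                break
    return counts

# Example with two synthetic dump lines:
sample = [
    "DEBUG LogTrace OUTPUT: scanned 1 : BeginXact ...",
    "DEBUG LogTrace OUTPUT: scanned 1 : EndXact null Committed ...",
]
print(dict(categorize(sample)))  # {'BeginXact': 1, 'EndXact': 1}
```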

None of these seems to indicate that a checkpoint has started?

Regards,
Olav

