Mike Matrigali wrote:

>I think that some fix to this issue should be implemented for the next
>release.  The order of my preference is #2, #1, #3.
>  
>

I believe option #2 (checksumming log records in the log buffers before
writing them to disk) is a good fix for this problem.
If there are no objections to this approach, I will start working on
this.


-suresht


>I think that option #2 can be implemented in the logging system and
>requires few if any changes to the rest of the system's processing
>of log records.  Log record offsets remain efficient, i.e. they can use
>LSNs directly.  Only the boot-time recovery code needs to look for the
>new log record and do the work to verify checksums; online abort is
>unaffected.
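>
>A minimal sketch of that recovery-side check (hypothetical names, and
>assuming the standard java.util.zip.CRC32 checksum discussed below;
>Derby's actual scan code would differ): when the boot-time scan reaches
>a checksum log record, it verifies the bytes that record covers before
>replaying them.
>
>    import java.util.zip.CRC32;
>
>    // Sketch: the checksum log record tells the scan how many of the
>    // following bytes it covers and what their CRC should be.  A
>    // mismatch means the group was only partially written; recovery
>    // discards it and treats the previous group as the end of the log.
>    static boolean groupIsComplete(byte[] log, int offset,
>                                   int coveredLength, long storedCrc) {
>        CRC32 crc = new CRC32();
>        crc.update(log, offset, coveredLength);
>        return crc.getValue() == storedCrc;
>    }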
>
>I would like to see some performance numbers on the checksum overhead,
>and if it is measurable then maybe some discussion of checksum choice.
>An obvious first choice would seem to be the standard Java-provided one
>used on the data pages.  If I had it to do over, I would probably have
>used a different approach on the data pages.  The point of the checksum
>on a data page is not to catch data sector write errors (the system
>expects the device to catch those); the only point is to catch
>inconsistent sector writes (i.e. 1st and 2nd 512-byte sectors written
>but not 3rd and 4th), and for that the current checksum is overkill.
>One need not checksum every byte on the page; a consistent write can be
>guaranteed with 1 bit per sector in the page.
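>
>To make that concrete, here is a minimal sketch of the 1-bit-per-sector
>idea (hypothetical layout, not the current page format): flip a one-bit
>write generation before each write and stamp it into a reserved byte of
>every 512-byte sector, and a torn write then shows up as sectors
>carrying different stamps.
>
>    // Sketch only; assumes a reserved byte at the end of each 512-byte
>    // sector, which is not how Derby pages are actually laid out.
>    static final int SECTOR = 512;
>
>    static void stampPage(byte[] page, byte generation) {
>        for (int off = SECTOR - 1; off < page.length; off += SECTOR) {
>            page[off] = generation;   // same stamp in every sector
>        }
>    }
>
>    static boolean wasWrittenConsistently(byte[] page) {
>        byte first = page[SECTOR - 1];
>        for (int off = 2 * SECTOR - 1; off < page.length; off += SECTOR) {
>            if (page[off] != first) {
>                return false;         // mixed stamps => torn write
>            }
>        }
>        return true;
>    }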
>
>In the future we may want to revisit #3 if it looks like the stream log
>is an I/O bottleneck which can't be addressed by striping or some other
>hardware help like smart caching controllers.  I see it as a performance
>project rather than a correctness project.  It is also a lot more work
>and risk.  Note that this could be a good project for someone wanting to
>do some research in this area, as the log is implemented as a Derby
>module where an alternate implementation could be dropped in if
>available.
>
>While I believe that we should address this issue, I should also note
>that in all my time working on Cloudscape/Derby I have never received a
>problem database (in that time any log-related error would have come
>through me) that resulted from this out-of-order/incomplete log
>write issue.  This of course does not mean it has not happened, just
>that it was not reported to us and/or did not affect the database in a
>noticeable way.  We have also never seen an out-of-order write on
>the data pages; we have seen a few checksum errors, but all of
>those were caused by a bad disk.
>
>On the upgrade issue, it may be time to start an upgrade thread.  Here
>are just some thoughts.  If doing option #2, it would be nice if the
>new code could still read the old log files and then optionally
>write the new log record or not.  Then if users wanted to run a
>release in a "soft" upgrade mode, where they needed to be able to
>go back to the old software, they could; they just would not get
>this fix.  On a "hard" upgrade the software should continue to read
>the old log files as they are currently formatted, and for any new
>log files it should begin writing the new log record.  Once the new
>log record makes its way into the log file, accessing the db with the
>old software is unsupported (it will throw an error, as it won't know
>what to do with the new log record).
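>
>A rough sketch of how that failure could fall out of ordinary
>format-id dispatch during the log scan (the ids and exception below
>are made up, not Derby's actual loggable formats):
>
>    // Sketch: the old software's scan has no case for the new format
>    // id, so it lands in the default branch and raises an error rather
>    // than misreading the record.  Ids here are hypothetical.
>    static final int LOG_RECORD = 1;
>    static final int CHECKSUM_LOG_RECORD = 2;  // new in this release
>
>    static void dispatch(int formatId) {
>        switch (formatId) {
>            case LOG_RECORD:          /* replay as before */       break;
>            case CHECKSUM_LOG_RECORD: /* verify (new code only) */ break;
>            default:
>                throw new IllegalStateException(
>                        "unknown log record format id: " + formatId);
>        }
>    }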
>
>Suresh Thalamati (JIRA) wrote:
>
>  
>
>>    [ http://nagoya.apache.org/jira/browse/DERBY-96?page=comments#action_56482 ]
>>     
>>Suresh Thalamati commented on DERBY-96:
>>---------------------------------------
>>
>>Some thoughts on how this problem could be solved:
>>
>>To identify partial writes, some form of checksum has to be added to
>>the log data written to the file.  On recovery, partially written log
>>records can then be identified using the checksum information and
>>thrown away.  The checksum has to be included with the log data before
>>it is written to the disk.  The issue is when to calculate the
>>checksum and write it to the disk.
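>>
>>For illustration, a round-trip sketch with the standard Java CRC32
>>(whether this is the right checksum is discussed elsewhere in this
>>thread; the sizes and the torn-write simulation are made up):
>>
>>    import java.util.Arrays;
>>    import java.util.Random;
>>    import java.util.zip.CRC32;
>>
>>    // The checksum is computed before the write and recomputed at
>>    // recovery; zeroing a middle "sector" simulates an out-of-order
>>    // partial write that length fields alone would miss.
>>    public class TornWriteDemo {
>>        public static void main(String[] args) {
>>            byte[] buf = new byte[1536];            // three 512-byte sectors
>>            new Random(42).nextBytes(buf);
>>            CRC32 crc = new CRC32();
>>            crc.update(buf, 0, buf.length);
>>            long stored = crc.getValue();           // saved with the log data
>>
>>            Arrays.fill(buf, 512, 1024, (byte) 0);  // middle sector lost
>>
>>            CRC32 recheck = new CRC32();
>>            recheck.update(buf, 0, buf.length);
>>            System.out.println("partial write detected: "
>>                    + (recheck.getValue() != stored));  // prints true
>>        }
>>    }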
>>
>>Following are some logical points at which the checksum can be
>>calculated and written along with the log information:
>>
>>1) Calculate a checksum for each log record and store it as part of
>>the log record data structure.  The disadvantage of this approach is
>>that storing a checksum with every log record could be expensive, both
>>in space and in the time spent calculating it.
>>
>>2) Calculate a checksum for the group of log records in a log buffer
>>before writing the buffer to the disk, and also write an additional
>>log record that carries the checksum and the length of the data.  This
>>log record (LogCheckSum) will be prefixed to the log buffer.  The
>>reason the checksum log record is written at the beginning is that it
>>makes it easy to determine how much data has to be read during
>>recovery to verify the checksum.
>>
>>Log data is written only when a log buffer is full, or earlier when a
>>flush is needed to keep the WAL protocol from being violated.  The
>>amount of data covered by one checksum can potentially be 32K, or
>>whatever the log buffer size is.  The overhead of this approach is
>>lower than that of the first.  A sketch of the buffer layout follows.
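>>
>>A minimal sketch of that layout (hypothetical sizes and encoding; the
>>real LogCheckSum record would be a regular formatted log record):
>>space is reserved at the head of the buffer, log records are appended
>>behind it, and the length and checksum are back-filled just before the
>>write, so recovery knows after reading the prefix exactly how many
>>bytes to read and verify.
>>
>>    import java.nio.ByteBuffer;
>>    import java.util.zip.CRC32;
>>
>>    // Sketch: assumes a heap buffer whose first PREFIX bytes were
>>    // reserved before any log records were appended.
>>    static final int PREFIX = 12;  // 4-byte length + 8-byte CRC (made up)
>>
>>    static ByteBuffer sealBuffer(ByteBuffer logBuffer) {
>>        int dataLen = logBuffer.position() - PREFIX; // bytes after prefix
>>        CRC32 crc = new CRC32();
>>        crc.update(logBuffer.array(), PREFIX, dataLen);
>>        logBuffer.putInt(0, dataLen);          // recovery reads this first ...
>>        logBuffer.putLong(4, crc.getValue());  // ... then verifies this CRC
>>        logBuffer.flip();                      // ready to write to the file
>>        return logBuffer;
>>    }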
>>
>>
>>3) Block-based log I/O: the idea is to group log record data into
>>4K/8K pages with a checksum on each page.  During recovery the
>>checksum is recalculated for each page and matched against the one on
>>the disk; if they do not match, the page is possibly a partial write.
>>
>>This approach is likely to have more overhead than the second one, but
>>it also has the benefit of making log writes aligned.  It is not yet
>>clear whether there is any performance gain from doing so (please see
>>the Aligned vs. Non-Aligned e-mail thread on the derby list).
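>>
>>A rough sketch of one such log page (the header layout and sizes are
>>made up; they only illustrate the per-page checksum and the aligned,
>>full-page write):
>>
>>    import java.nio.ByteBuffer;
>>    import java.util.zip.CRC32;
>>
>>    static final int PAGE_SIZE = 4096;
>>    static final int HEADER = 12;  // 8-byte CRC + 4-byte used-bytes count
>>
>>    // Seal a page before writing it; the whole PAGE_SIZE bytes,
>>    // including any unused tail, are written so every write stays
>>    // block-aligned.
>>    static void sealPage(byte[] page, int usedBytes) {
>>        CRC32 crc = new CRC32();
>>        crc.update(page, HEADER, usedBytes);  // checksum the payload only
>>        ByteBuffer.wrap(page).putLong(0, crc.getValue()).putInt(8, usedBytes);
>>    }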
>>
>>I should also point out that this approach will likely require more
>>changes than #1 and #2, for these reasons:
>>
>>a) The current system assumes an LSN maps directly to a file offset.
>>If the data is written in page format, that will no longer be true.
>>b) To stick to the WAL protocol, it may be necessary to write a page
>>before it is full.  If this unfilled page happens to contain COMMITTED
>>log records, it cannot simply be rewritten in place: if the rewrite is
>>incomplete, log records with committed information will be thrown
>>away.  To avoid this issue, either partially filled log pages are
>>never rewritten, which could lead to unused space in the log file, or
>>a safe-write mechanism (the ping-pong algorithm) has to be
>>implemented; a sketch follows.
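>>
>>A minimal sketch of one ping-pong variant (hypothetical slot layout;
>>the recovery-side arbitration between slots via checksums is not
>>shown): the partially filled page is never overwritten in place; the
>>new version goes to a spare slot and is synced before the home slot is
>>touched, so a crash during either write leaves at least one intact
>>copy.
>>
>>    import java.io.IOException;
>>    import java.io.RandomAccessFile;
>>
>>    // Slot A holds the current version of the last, partially filled
>>    // log page; refilling it writes to slot B first.  On recovery,
>>    // whichever slot carries a valid checksum wins.
>>    static void safeRewrite(RandomAccessFile log, long slotA, long slotB,
>>                            byte[] newPage) throws IOException {
>>        log.seek(slotB);
>>        log.write(newPage);   // 1. new version goes to the spare slot
>>        log.getFD().sync();   // 2. force it to disk
>>        log.seek(slotA);
>>        log.write(newPage);   // 3. only now overwrite the home slot
>>        log.getFD().sync();
>>    }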
>>
>>
>>Upgrade:
>>Irrespective of which approach is used to solve this problem, I
>>believe a new type of information (the checksum) has to be written to
>>the disk, which will not be understood by old versions.
>>
>>
>>Any comments/suggestions?
>>
>>
>>-suresh
>>
>>
>>
>>    
>>
>>>partial log record writes that occur because of out-of-order writes
>>>need to be handled by recovery.
>>>---------------------------------------------------------------------------------------------------
>>>
>>>        Key: DERBY-96
>>>        URL: http://nagoya.apache.org/jira/browse/DERBY-96
>>>    Project: Derby
>>>       Type: New Feature
>>> Components: Store
>>>   Versions: 10.0.2.1
>>>   Reporter: Suresh Thalamati
>>>      
>>>
>>    
>>
>>>An incomplete log record write that occurs because of
>>>an out-of-order partial write gets recognized as complete during
>>>recovery if the first sector and last sector happen to get written.
>>>The current system recognizes incompletely written log records by
>>>checking the length of the record, which is stored at the beginning
>>>and at the end.  The format in which log records are written to disk
>>>is:
>>> +--------+------------+--------+
>>> | length | LOG RECORD | length |
>>> +--------+------------+--------+
>>>This mechanism works fine if sectors are written in sequential order
>>>or the log record is smaller than two sectors.  I believe that on
>>>SCSI-type disks the order is not necessarily sequential; SCSI disk
>>>drives may sometimes reorder the sectors to optimize performance.  If
>>>a log record that spans multiple disk sectors is being written to a
>>>SCSI-type device, it is possible that the first and last sectors are
>>>written before a crash.  If this occurs, the recovery system will
>>>incorrectly interpret the log record as completely written and replay
>>>the record.  This could lead to recovery errors or data corruption.
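>>>
>>>To illustrate, a sketch of that length-bracket check (a hypothetical
>>>reader; the real Derby code differs): it only compares the leading
>>>and trailing length fields, so a record whose first and last sectors
>>>made it to disk passes even when the middle did not.
>>>
>>>    import java.nio.ByteBuffer;
>>>
>>>    // Returns true when the length stored before the record matches
>>>    // the one stored after it.  With out-of-order sector writes the
>>>    // middle of LOG RECORD can be stale while both length fields are
>>>    // present, so this check passes anyway.
>>>    static boolean looksComplete(ByteBuffer log, int offset) {
>>>        int lenBefore = log.getInt(offset);
>>>        int lenAfter  = log.getInt(offset + 4 + lenBefore);
>>>        return lenBefore == lenAfter;
>>>    }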
>>>-
>>>This problem also will not occur if the disk drive has a write cache
>>>with battery backup, which makes sure the I/O request will complete.
>>>      
>>>
>>    
>>
>
>  
>
