Mike Matrigali wrote:

> I think that some fix to this issue should be implemented for the next
> release. The order of my preference is #2, #1, #3.

I believe option #2 (checksumming log records in the log buffers before
writing them to disk) is a good fix for this problem. If there are no
objections to this approach, I will start to work on this.

-suresht

> I think that option #2 can be implemented in the logging system and
> requires very few, if any, changes to the rest of the system's processing
> of log records. Log record offsets remain efficient, i.e. they can use
> LSNs directly. Only the boot-time recovery code needs to look for the
> new log record and do the work to verify checksums; online abort is
> unaffected.
>
> I would like to see some performance numbers on the checksum overhead,
> and if it is measurable then maybe some discussion of checksum choice.
> An obvious first choice would seem to be the standard Java-provided one
> used on the data pages. If I had it to do over, I would probably have
> used a different approach on the data pages. The point of the checksum
> on the data page is not to catch data sector write errors (the system
> expects the device to catch those); the only point is to catch
> inconsistent sector writes (i.e. 1st and 2nd 512-byte sectors written
> but not 3rd and 4th), and for that the current checksum is overkill.
> One need not checksum every byte on the page; a consistent write can be
> guaranteed with 1 bit per sector in the page.
>
> In the future we may want to revisit #3 if it looks like the stream log
> is an I/O bottleneck which can't be addressed by striping or some other
> hardware help like smart caching controllers. I see it as a performance
> project rather than a correctness project. It is also a lot more work
> and risk. Note that this could be a good project for someone wanting to
> do some research in this area, as it is implemented as a Derby module
> where an alternate implementation could be dropped in if available.
>
> While I believe that we should address this issue, I should also note
> that in all my time working on Cloudscape/Derby I have never received a
> problem database (in that time any log-related error would have come
> through me) that resulted from this out-of-order/incomplete log write
> issue. This of course does not mean it has not happened, just that it
> was not reported to us and/or did not affect the database in a
> noticeable way. We have also never actually seen an out-of-order write
> on the data pages; we have seen a few checksum errors, but all of those
> were caused by a bad disk.
>
> On the upgrade issue, it may be time to start an upgrade thread. Here
> are just some thoughts. If doing option #2, it would be nice if the
> new code could still read the old log files and then optionally
> write the new log record or not. Then if users wanted to run a
> release in a "soft" upgrade mode where they needed to be able to
> go back to the old software, they could; they just would not get
> this fix. On a "hard" upgrade the software should continue to read
> the old log files as they are currently formatted, and for any new
> log files it should begin writing the new log record. Once the new
> log record makes its way into a log file, accessing the db with the
> old software is unsupported (it will throw an error, as it won't know
> what to do with the new log record).
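For reference, the "standard Java-provided one" used on the data pages is
presumably java.util.zip.CRC32. A minimal sketch of what checksumming a
filled log buffer before the write (option #2) could look like follows;
the class and method names are hypothetical, for illustration only, not
existing Derby code:

    import java.util.zip.CRC32;

    // Hypothetical sketch: compute a CRC32 over the used portion of a
    // log buffer just before it is written to disk. The resulting value
    // would be stored in the checksum log record that prefixes the buffer.
    public class LogBufferChecksum {

        public static long checksum(byte[] buffer, int offset, int length) {
            CRC32 crc = new CRC32();
            crc.update(buffer, offset, length); // covers the whole group of records
            return crc.getValue();
        }
    }

Whether CRC32 over a 32K buffer is measurably cheaper than a weaker
per-sector scheme is exactly the kind of number the performance tests
Mike asks for should produce.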
> Suresh Thalamati (JIRA) wrote:
>
>> [
>> http://nagoya.apache.org/jira/browse/DERBY-96?page=comments#action_56482 ]
>>
>> Suresh Thalamati commented on DERBY-96:
>> ---------------------------------------
>>
>> Some thoughts on how this problem could be solved:
>>
>> To identify partial writes, some form of checksum has to be added to the
>> log data written to the file. On recovery, partially written log records
>> can be identified using the checksum information and thrown away. The
>> checksum information has to be included with the log data before it is
>> written to the disk. The issue is when we calculate the checksum and
>> write it to the disk.
>>
>> The following are some logical points at which the checksum can be
>> calculated and written along with the log information:
>>
>> 1) Calculate the checksum for each log record and store the information
>> as part of the log record data structure. The disadvantage of this
>> approach is that storing a checksum with each log record could be
>> expensive in terms of both the space used and the time spent
>> calculating it.
>>
>> 2) Calculate a checksum for a group of log records in the log buffers
>> before writing the buffer to the disk, and also write an additional log
>> record that holds the checksum information and the length of the data.
>> This log record (LogCheckSum) will be prefixed to the log buffer. The
>> reason the checksum log record is written at the beginning is that it
>> makes it easier to find how much data has to be read during recovery to
>> verify the checksum.
>>
>> Log data is written only when the log buffer is full or when a write is
>> needed to make sure the WAL protocol is not violated. The size of the
>> data covered by the checksum can potentially be 32K, or whatever the
>> log buffer size is. The overhead of this approach is lower than that of
>> the first approach.
>>
>> 3) Block-based log I/O: the idea is to group log record data into 4K/8K
>> pages with a checksum on each page. During recovery the checksum will
>> be recalculated for each page and matched against the one on the disk;
>> if the checksum does not match, it is possibly a partial write.
>>
>> This approach is likely to have more overhead compared to the second
>> one, but it also has the benefit of making log writes aligned. It is
>> not yet clear whether there is any performance gain from doing so.
>> (Please see the aligned vs. non-aligned e-mail thread on the Derby
>> list.)
>>
>> I should also bring to your attention that this approach will likely
>> require more changes than #1 and #2. The reasons are:
>>
>> a) The current system assumes an LSN maps directly to a file offset. If
>> the data is written in page format, that will no longer be true.
>> b) To adhere strictly to the WAL protocol, it may be required that an
>> unfilled page be written. If this unfilled page happens to contain
>> COMMITTED log records, it cannot simply be rewritten; if the rewrite is
>> incomplete, log records with committed information will be thrown away.
>> To avoid this issue, either partially filled log pages are never
>> rewritten, which could lead to unused space in the log file, or a
>> safe-write mechanism (ping-pong algorithm) must be implemented.
>>
>> Upgrade:
>> Irrespective of which approach is used to solve this problem, I believe
>> a new type of information (the checksum) has to be written to the disk,
>> which will not be understood by old versions.
>>
>> Any comments/suggestions?
>>
>> -suresh
>>
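To make the recovery side of option #2 concrete: recovery would read the
checksum log record at the front of a flushed group, read the number of
bytes that record says follow, recompute the checksum, and compare. A
mismatch marks the end of the usable log. A rough sketch, again with
hypothetical names (only java.util.zip.CRC32 is a real API here):

    import java.util.zip.CRC32;

    // Hypothetical recovery-side check for option #2. The length and
    // expected checksum come from the LogCheckSum record that was written
    // in front of the group of log records.
    public class ChecksumVerifier {

        public static boolean groupIsComplete(byte[] data, int offset,
                                              int length, long expected) {
            if (offset + length > data.length) {
                return false; // fewer bytes on disk than recorded
            }
            CRC32 crc = new CRC32();
            crc.update(data, offset, length);
            // A mismatch means some sector in the group never reached the
            // disk, so recovery stops replaying at this point.
            return crc.getValue() == expected;
        }
    }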
>>> partial log record writes that occur because of out-of-order writes
>>> need to be handled by recovery.
>>> --------------------------------------------------------------------
>>>
>>>          Key: DERBY-96
>>>          URL: http://nagoya.apache.org/jira/browse/DERBY-96
>>>      Project: Derby
>>>         Type: New Feature
>>>   Components: Store
>>>     Versions: 10.0.2.1
>>>     Reporter: Suresh Thalamati
>>>
>>> An incomplete log record write that occurs because of an out-of-order
>>> partial write gets recognized as complete during recovery if the first
>>> sector and last sector happen to get written. The current system
>>> recognizes incompletely written log records by checking the length of
>>> the record, which is stored at both the beginning and the end. The
>>> format in which log records are written to disk is:
>>>
>>>   +----------+-------------+----------+
>>>   |  length  | LOG RECORD  |  length  |
>>>   +----------+-------------+----------+
>>>
>>> This mechanism works fine if sectors are written in sequential order
>>> or the log record size is less than 2 sectors. I believe that on
>>> SCSI-type disks the order is not necessarily sequential; SCSI disk
>>> drives may sometimes reorder the sectors to optimize performance. If a
>>> log record that spans multiple disk sectors is being written to a
>>> SCSI-type device, it is possible that the first and last sectors are
>>> written before a crash. If this occurs, the recovery system will
>>> incorrectly conclude that the log record was completely written and
>>> replay the record. This could lead to recovery errors or data
>>> corruption.
>>>
>>> This problem also will not occur if a disk drive has a write cache
>>> with a battery backup, which will make sure the I/O request completes.
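The failure mode in the description is easy to see against the existing
length-bracketed format: recovery compares only the leading and trailing
length fields, so if a record spans four sectors and the drive persists
sectors 1 and 4 but not 2 and 3 before a crash, the two lengths still
match and the stale bytes in the middle get replayed. A simplified version
of that check (hypothetical names, not the actual Derby recovery code):

    import java.nio.ByteBuffer;

    // Simplified illustration of the length-bracketing check described
    // in the issue. Layout: | int length | log record | int length |
    public class LengthBracketCheck {

        public static boolean looksComplete(ByteBuffer log, int offset) {
            int head = log.getInt(offset);
            int tail = log.getInt(offset + 4 + head);
            // If the first and last sectors of a multi-sector record were
            // written but the middle ones were not, head == tail and the
            // damaged record is wrongly accepted as complete.
            return head == tail;
        }
    }

A checksum over the whole record group (option #2) closes this hole,
because any unwritten middle sector changes the computed checksum.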
