[ http://nagoya.apache.org/jira/browse/DERBY-96?page=comments#action_56482 
]
     
Suresh Thalamati commented on DERBY-96:
---------------------------------------

Some thoughts on how this problem could be solved:

To identify partial writes, some form of checksum has to be added to the 
log data written to the file. During recovery, the checksum information can be 
used to identify partially written log records and throw them away. The 
checksum has to be included with the log data before it is written to disk. 
The question is when to calculate the checksum and write it to disk. 

The following are some logical points at which the checksum can be calculated 
and written along with the log information:

1) Calculate the checksum for each log record and store it as part of the log 
record data structure. Disadvantage of this approach: storing a checksum with 
each log record could be expensive in terms of space and the time spent 
calculating it. 

2) Calculate a checksum for the group of log records in the log buffer before 
writing the buffer to disk, and also write an additional log record that holds 
the checksum and the length of the data. This log record (LogCheckSum) will be 
prefixed to the log buffer. The reason for writing the checksum log record at 
the beginning is that it makes it easier to find out how much data has to be 
read during recovery to verify the checksum. 

    Log data is written only when a log buffer is full or when a write is 
needed to make sure the WAL protocol is not violated. The size of the data 
covered by one checksum can potentially be 32K, or whatever the log buffer 
size is. The overhead of this approach is lower than that of the first 
approach. 
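
To make approach 2 concrete, here is a minimal sketch. It assumes CRC32 as the 
checksum and a made-up 12-byte header layout (data length followed by the 
checksum value). The class and method names are hypothetical and are not 
existing Derby code.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.CRC32;

class LogChecksumSketch {
    // Hypothetical LogCheckSum header: 4-byte data length, then 8-byte CRC32 value.
    static final int HEADER_SIZE = 4 + 8;

    // Prefix the contents of a log buffer with its checksum record.
    static byte[] wrap(byte[] logData) {
        CRC32 crc = new CRC32();
        crc.update(logData, 0, logData.length);
        ByteBuffer out = ByteBuffer.allocate(HEADER_SIZE + logData.length);
        out.putInt(logData.length);
        out.putLong(crc.getValue());
        out.put(logData);
        return out.array();
    }

    // Recovery side: the header says how much data to read and what checksum to expect.
    // A false result means the block was not completely written and should be discarded.
    static boolean verify(byte[] onDisk) {
        if (onDisk.length < HEADER_SIZE) return false;
        ByteBuffer in = ByteBuffer.wrap(onDisk);
        int length = in.getInt();
        long stored = in.getLong();
        if (length < 0 || length > in.remaining()) return false;
        CRC32 crc = new CRC32();
        crc.update(onDisk, HEADER_SIZE, length);
        return crc.getValue() == stored;
    }

    public static void main(String[] args) {
        byte[] records = new byte[2048];
        Arrays.fill(records, (byte) 0x5A);      // stand-in for buffered log records
        byte[] block = wrap(records);
        System.out.println("complete write verifies: " + verify(block));

        // Simulate an out-of-order write: a middle 512-byte sector never reached disk.
        Arrays.fill(block, 512, 1024, (byte) 0);
        System.out.println("torn write verifies: " + verify(block));
    }
}
```

Because the header comes first, recovery knows how many bytes to read before 
it starts verifying, which is the reason given above for prefixing the 
checksum record rather than appending it.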


3) Block-based log I/O: The idea is to group log record data into 4K/8K pages 
with a checksum on each page. During recovery the checksum is recalculated for 
each page and compared with the one on disk; if the checksums do not match, 
the page was possibly a partial write. 

This approach is likely to have more overhead than the second one, but it also 
has the benefit of making log writes aligned. Not sure yet whether there is 
any performance gain from doing so. (Please see the aligned vs. non-aligned 
e-mail thread on the derby list.) 
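
A rough sketch of the page-based idea, assuming 4K pages, CRC32 as the 
checksum, and an 8-byte checksum header at the start of each page. The layout 
and all names here are hypothetical, chosen only for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.CRC32;

class LogPageSketch {
    static final int PAGE_SIZE = 4096;                      // assumed page size
    static final int HEADER_SIZE = 8;                       // CRC32 value stored as a long
    static final int PAYLOAD_SIZE = PAGE_SIZE - HEADER_SIZE;

    // Checksum over the payload area of a page (everything after the header).
    static long checksum(byte[] page) {
        CRC32 crc = new CRC32();
        crc.update(page, HEADER_SIZE, PAYLOAD_SIZE);
        return crc.getValue();
    }

    // Build one page: checksum header followed by the zero-padded payload.
    static byte[] buildPage(byte[] payload) {
        if (payload.length > PAYLOAD_SIZE) {
            throw new IllegalArgumentException("payload does not fit in one page");
        }
        byte[] page = new byte[PAGE_SIZE];
        System.arraycopy(payload, 0, page, HEADER_SIZE, payload.length);
        ByteBuffer.wrap(page).putLong(checksum(page));
        return page;
    }

    // Recalculate the checksum and compare it with the one stored in the page.
    static boolean isIntact(byte[] page) {
        return page.length == PAGE_SIZE
                && ByteBuffer.wrap(page).getLong() == checksum(page);
    }

    // Recovery scan: count leading pages that verify, stopping at the first mismatch.
    static int countValidPages(byte[] log) {
        int valid = 0;
        for (int off = 0; off + PAGE_SIZE <= log.length; off += PAGE_SIZE) {
            if (!isIntact(Arrays.copyOfRange(log, off, off + PAGE_SIZE))) break;
            valid++;
        }
        return valid;
    }
}
```

The scan stops at the first page whose checksum does not match, treating it 
as the possible partial write and the end of the usable log.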

I should also point out that this approach will likely require more changes 
than 1 and 2, for the following reasons:

a) The current system assumes that an LSN maps to a file offset. If the data 
is written in page format, that will no longer be true. 
b) To stick to the WAL protocol, it may be necessary to write an unfilled 
page. If this unfilled page happens to contain COMMITTED log records, it 
cannot simply be rewritten; if the rewrite is incomplete, log records with 
commit information will be thrown away. To avoid this issue, either partially 
filled log pages must not be rewritten, which could lead to unused space in 
the log file, or a safe-write mechanism (ping-pong algorithm) has to be 
implemented. 
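
To illustrate point (a): once each page carries a header, a logical position 
in the log record stream no longer equals the byte offset in the file. The 
sketch below shows one possible translation, assuming 4K pages with an 8-byte 
per-page header. Both sizes and the names are assumptions for illustration, 
not a description of how Derby encodes LSNs.

```java
class LsnMappingSketch {
    static final int PAGE_SIZE = 4096;                      // assumed page size
    static final int HEADER_SIZE = 8;                       // assumed per-page checksum header
    static final int PAYLOAD_SIZE = PAGE_SIZE - HEADER_SIZE;

    // Translate an offset in the logical log record stream into a file offset,
    // skipping over the header at the start of every page.
    static long toFileOffset(long logicalOffset) {
        long pageNumber = logicalOffset / PAYLOAD_SIZE;
        long withinPage = logicalOffset % PAYLOAD_SIZE;
        return pageNumber * PAGE_SIZE + HEADER_SIZE + withinPage;
    }

    public static void main(String[] args) {
        System.out.println(toFileOffset(0));     // 8: first byte after the first header
        System.out.println(toFileOffset(4088));  // 4104: first payload byte of the second page
    }
}
```

Every place that currently treats an LSN as a raw file offset would need to 
go through a translation of this kind, or LSNs would have to be redefined as 
physical offsets that skip header bytes.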


Upgrade: 
Irrespective of which approach is used to solve this problem, I believe a new 
type of information (checksum) has to be written to disk, which will not be 
understood by old versions.  


Any comments/suggestions?


-suresh


> partial log record writes that occur because of out-of order writes need to 
> be handled by recovery.
> ---------------------------------------------------------------------------------------------------
>
>          Key: DERBY-96
>          URL: http://nagoya.apache.org/jira/browse/DERBY-96
>      Project: Derby
>         Type: New Feature
>   Components: Store
>     Versions: 10.0.2.1
>     Reporter: Suresh Thalamati

>
> An incomplete log record write that occurs because of
> out-of-order partial writes gets recognized as complete during
> recovery if the first sector and the last sector happen to get written.
>  The current system recognizes incompletely written log records by checking
> the length of the record, which is stored at the beginning and at the end.
>  The format in which log records are written to disk is:
>   +----------+-------------+----------+
>   | length   | LOG RECORD  | length   |
>   +----------+-------------+----------+
> This mechanism works fine if sectors are written sequentially or
> the log record size is less than 2 sectors. I believe that on SCSI-type
> disks the order is not necessarily sequential; SCSI disk drives may
> sometimes reorder the sectors to optimize performance. If a log record
> that spans multiple disk sectors is being written to a SCSI-type
> device, it is possible that only the first and last sectors are written
> before the crash. If this occurs, the recovery system will incorrectly
> conclude that the log record was completely written and replay the record.
> This could lead to recovery errors or data corruption.
> -
> This problem also will not occur if a disk drive has a write cache with a
> battery backup, which makes sure the I/O request completes.
