[
https://issues.apache.org/jira/browse/CASSANDRA-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079900#comment-13079900
]
Sylvain Lebresne commented on CASSANDRA-1717:
---------------------------------------------
My 2 cents:
I see 3 options that seem to make sense:
# checksums at the column level:
** pros: easy to do, and easy to recover from bitrot efficiently (efficiently in
that in general we would only need to drop one column for a given bitrot; it's
more complicated if something in the row header (row key, row size, ...) is
bitrotten, though).
** cons: high overhead (mainly in disk space usage, but also in cpu usage
because we have many more checksums to check).
# checksums at the row level (or column index level, but I think this is
essentially the same, isn't it?):
** pros: easy to recover from bitrot (we drop the row), though potentially more
wasteful than "column level". Incurs only a small space overhead for big rows.
** cons: can't realistically check on every read, so we would need to do it only
on compaction/repair and on read digest mismatch (that last one is not optional
if we want checksums to guarantee that bitrot never propagates to other nodes);
this adds complexity, and checking checksums on read digest mismatch costs some
I/O that is usually unnecessary (a read digest mismatch won't in general be due
to bitrot). Also incurs a significant space overhead for tiny rows.
# checksums at the block level:
** pros: super easy in the compressed case (can be done "on every read", or
more precisely each time we read a block). Incurs minimal overhead.
** cons: super *not* easy in the non-compressed case, since we don't have blocks
there. While writing, we could use the buffer size as a block size and add a
checksum on flush (see the sketch after this list). The problems are on reads,
however. First, we would need to align buffers on reads (which we don't do in
the non-compressed case), as Pavel said, which likely involves more reBuffer
calls in general (aka more I/O). But perhaps more importantly, I have no clue
how you could make that work efficiently with mmap (we would potentially have a
checksum in the middle of a column value as far as mmap is concerned). It is
also slightly harder to recover from bitrot without dropping the whole sstable
(but doable as long as we have the index around).
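To make the "buffer size as block size" idea above a bit more concrete, here is a
rough sketch of what the write path could look like: treat each flushed buffer as a
block and append a CRC32 of its contents. This is only an illustration, not actual
Cassandra code; the class name and block size are made up.
{code:java}
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;

// Hypothetical sketch: buffer-sized blocks, each followed by its CRC32.
// Not a real Cassandra class; names and layout are for illustration only.
public class ChecksummedBlockWriter
{
    static final int BLOCK_SIZE = 64 * 1024; // assumed buffer/block size

    private final DataOutputStream out;
    private final byte[] buffer = new byte[BLOCK_SIZE];
    private final CRC32 crc = new CRC32();
    private int position = 0;

    public ChecksummedBlockWriter(DataOutputStream out)
    {
        this.out = out;
    }

    public void write(byte[] data, int offset, int length) throws IOException
    {
        while (length > 0)
        {
            int toCopy = Math.min(length, BLOCK_SIZE - position);
            System.arraycopy(data, offset, buffer, position, toCopy);
            position += toCopy;
            offset += toCopy;
            length -= toCopy;
            if (position == BLOCK_SIZE)
                flushBlock(); // block is full: write it out with its checksum
        }
    }

    // On flush, write the block content followed by its CRC32.
    private void flushBlock() throws IOException
    {
        crc.reset();
        crc.update(buffer, 0, position);
        out.write(buffer, 0, position);
        out.writeLong(crc.getValue());
        position = 0;
    }

    public void close() throws IOException
    {
        if (position > 0)
            flushBlock(); // last, possibly partial, block
        out.close();
    }
}
{code}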
There may be other solutions I don't see, and there may be some pros/cons for
the ones above that I have missed (please feel free to complete the list).
But based on those, my personal opinion is that "column level" has too big an
overhead, and "block level" is really problematic in the mmap non-compressed
case (but it sounds like the best option to me if we ignore mmap).
So my personal preference leans towards using "block level", but only having
checksums in the compressed case and maybe in an uncompressed mode for which
mmap would be deactivated.
If we really don't want to consider that, "row level" checksums would maybe be
the lesser evil. But I'm not fond of the overhead for tiny rows, and 'check
checksums on read digest mismatch', while I believe it is necessary in that
case, doesn't sound like the best idea ever.
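For what it's worth, the compressed-case read path I have in mind is roughly the
following: verify a per-chunk checksum each time a chunk is read, before
decompressing. Again just a sketch; the chunk layout (length, compressed bytes,
CRC32) is made up for illustration and doesn't match any particular on-disk format.
{code:java}
import java.io.DataInputStream;
import java.io.IOException;
import java.util.zip.CRC32;

// Hypothetical sketch: read one compressed chunk and verify its checksum
// before handing it to the decompressor. Layout is assumed, not real.
public class ChunkReader
{
    public static byte[] readVerifiedChunk(DataInputStream in) throws IOException
    {
        int compressedLength = in.readInt();
        byte[] compressed = new byte[compressedLength];
        in.readFully(compressed);
        long storedChecksum = in.readLong();

        CRC32 crc = new CRC32();
        crc.update(compressed, 0, compressedLength);
        if (crc.getValue() != storedChecksum)
            throw new IOException("Checksum mismatch: corrupted chunk detected");

        return compressed; // caller decompresses only after verification
    }
}
{code}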
> Cassandra cannot detect corrupt-but-readable column data
> --------------------------------------------------------
>
> Key: CASSANDRA-1717
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1717
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Jonathan Ellis
> Assignee: Pavel Yaskevich
> Fix For: 1.0
>
> Attachments: checksums.txt
>
>
> Most corruptions of on-disk data due to bitrot render the column (or row)
> unreadable, so the data can be replaced by read repair or anti-entropy. But
> if the corruption keeps column data readable we do not detect it, and if it
> corrupts to a higher timestamp value, it can even resist being overwritten by
> newer values.