[
https://issues.apache.org/jira/browse/CASSANDRA-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079900#comment-13079900
]
Sylvain Lebresne commented on CASSANDRA-1717:
---------------------------------------------
My 2 cents:
I see 3 options that seem to make sense:
# checksums at the column level:
** pros: easy to do, and easy to recover from bitrot efficiently (efficiently in
that in general we would only need to drop one column for a given bitrot; it's
more complicated if something in the row header (row key, row size, ...) is
bitrotten, though).
** cons: high overhead (mainly in disk space usage, but also in cpu usage
because we have many more checksums to check).
# checksums at the row level (or column index level, but I think this is
essentially the same, isn't it?):
** pros: easy to recover from bitrot (we drop the row), though potentially more
wasteful than "column level". Incurs only a small space overhead for big rows.
** cons: can't realistically check on every read, so we would need to do it only
on compaction/repair and on read digest mismatch (that last one is not optional
if we want checksums to guarantee that bitrot never propagates to other nodes);
this adds complexity, and checking checksums on read digest mismatch costs some
I/O that is usually unnecessary (a read digest mismatch won't in general be due
to bitrot). Also incurs a significant space overhead for tiny rows.
# checksums at the block level:
** pros: super easy in the compressed case (can be done "on every read", or
more precisely each time we read a block). Incurs minimal overhead.
** cons: super *not* easy in the non-compressed case, since we don't have blocks
there. While writing, we could use the buffer size as a block size and add a
checksum on flush (see the sketch after this list). The problems are on reads,
however. First, we would need to align buffers on reads (which we don't do in
the non-compressed case), as Pavel said, which likely involves more reBuffer
calls in general (aka more I/O). But perhaps more importantly, I have no clue
how you could make that work efficiently with mmap (we would potentially have a
checksum in the middle of a column value as far as mmap is concerned). It is
also slightly harder to recover from bitrot without dropping the whole sstable
(but doable as long as we have the index around).
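To make the "buffer size as block size" idea above a bit more concrete, here is a
rough sketch of what the write path could look like: treat each flushed buffer as a
block and append a CRC32 of its contents. This is only an illustration, not actual
Cassandra code; the class name and block size are made up.
{code:java}
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;

// Hypothetical sketch: buffer-sized blocks, each followed by its CRC32.
// Not a real Cassandra class; names and layout are for illustration only.
public class ChecksummedBlockWriter
{
    static final int BLOCK_SIZE = 64 * 1024; // assumed buffer/block size

    private final DataOutputStream out;
    private final byte[] buffer = new byte[BLOCK_SIZE];
    private final CRC32 crc = new CRC32();
    private int position = 0;

    public ChecksummedBlockWriter(DataOutputStream out)
    {
        this.out = out;
    }

    public void write(byte[] data, int offset, int length) throws IOException
    {
        while (length > 0)
        {
            int toCopy = Math.min(length, BLOCK_SIZE - position);
            System.arraycopy(data, offset, buffer, position, toCopy);
            position += toCopy;
            offset += toCopy;
            length -= toCopy;
            if (position == BLOCK_SIZE)
                flushBlock(); // block is full: write it out with its checksum
        }
    }

    // On flush, write the block content followed by its CRC32.
    private void flushBlock() throws IOException
    {
        crc.reset();
        crc.update(buffer, 0, position);
        out.write(buffer, 0, position);
        out.writeLong(crc.getValue());
        position = 0;
    }

    public void close() throws IOException
    {
        if (position > 0)
            flushBlock(); // last, possibly partial, block
        out.close();
    }
}
{code}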
There may be other solutions I don't see, and there may be some pros/cons for
the ones above that I have missed (please feel free to complete the list).
But based on those, my personal opinion is that "column level" has too big an
overhead, and "block level" is really problematic in the mmap non-compressed
case (but it sounds like the best option to me if we ignore mmap).
So my personal preference leans towards using "block level", but only having
checksums in the compressed case and maybe in an uncompressed mode for which
mmap would be deactivated.
If we really don't want to consider that, "row level" checksums would maybe be
the lesser evil. But I'm not fond of the overhead for tiny rows, and 'check
checksums on read digest mismatch', while I believe it is necessary in that
case, doesn't sound like the best idea ever.
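For what it's worth, the compressed-case read path I have in mind is roughly the
following: verify a per-chunk checksum each time a chunk is read, before
decompressing. Again just a sketch; the chunk layout (length, compressed bytes,
CRC32) is made up for illustration and doesn't match any particular on-disk format.
{code:java}
import java.io.DataInputStream;
import java.io.IOException;
import java.util.zip.CRC32;

// Hypothetical sketch: read one compressed chunk and verify its checksum
// before handing it to the decompressor. Layout is assumed, not real.
public class ChunkReader
{
    public static byte[] readVerifiedChunk(DataInputStream in) throws IOException
    {
        int compressedLength = in.readInt();
        byte[] compressed = new byte[compressedLength];
        in.readFully(compressed);
        long storedChecksum = in.readLong();

        CRC32 crc = new CRC32();
        crc.update(compressed, 0, compressedLength);
        if (crc.getValue() != storedChecksum)
            throw new IOException("Checksum mismatch: corrupted chunk detected");

        return compressed; // caller decompresses only after verification
    }
}
{code}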
> Cassandra cannot detect corrupt-but-readable column data
> --------------------------------------------------------
>
> Key: CASSANDRA-1717
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1717
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Jonathan Ellis
> Assignee: Pavel Yaskevich
> Fix For: 1.0
>
> Attachments: checksums.txt
>
>
> Most corruptions of on-disk data due to bitrot render the column (or row)
> unreadable, so the data can be replaced by read repair or anti-entropy. But
> if the corruption keeps column data readable we do not detect it, and if it
> corrupts to a higher timestamp value, it can even resist being overwritten by
> newer values.