[ https://issues.apache.org/jira/browse/CASSANDRA-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079632#comment-13079632 ]

Todd Lipcon commented on CASSANDRA-1717:
----------------------------------------

xedin asked me on IRC to comment on this issue. For reference, here's what other 
systems do: HDFS checksums every file in 512-byte chunks with a CRC32. The checksum 
is verified on write (by only the first DataNode in the pipeline) and on read (by 
the client). If the client gets a checksum error while reading, it reports it to 
the NameNode, which marks that block replica as corrupt, schedules another 
replication, etc.
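
To make the mechanics concrete, here is a minimal sketch of that per-chunk scheme 
in Java (class and method names are mine for illustration, not HDFS code): compute 
one CRC32 per 512-byte chunk on write, keep the checksums next to the data, and 
recompute and compare them on read.

{code}
import java.util.zip.CRC32;

public class ChunkChecksums {
    static final int CHUNK_SIZE = 512;  // HDFS checksums data in 512-byte chunks by default

    // On write: compute one CRC32 per 512-byte chunk, to be stored alongside the data.
    static long[] compute(byte[] data) {
        int chunks = (data.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            crc.reset();
            int off = i * CHUNK_SIZE;
            int len = Math.min(CHUNK_SIZE, data.length - off);
            crc.update(data, off, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // On read: recompute each chunk's CRC and compare against the stored values;
    // a mismatch means that chunk is corrupt and the replica needs repair.
    static boolean verify(byte[] data, long[] stored) {
        long[] actual = compute(data);
        if (actual.length != stored.length) return false;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] != stored[i]) return false;
        }
        return true;
    }
}
{code}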

This is all transparent to the HBase layer since it's done at the FS layer, so 
HBase itself doesn't do any extra checksumming. If you compress your tables, then 
you might get an extra layer of checksumming for free from gzip, as someone 
mentioned above.
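
That "for free" check comes from the CRC32 of the uncompressed data that gzip 
stores in its trailer: java.util.zip.GZIPInputStream recomputes it while 
decompressing and throws if the stream doesn't check out. A small illustrative 
sketch (not Cassandra or HBase code):

{code}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipBitrot {
    public static void main(String[] args) throws IOException {
        byte[] payload = "some column data".getBytes(StandardCharsets.UTF_8);

        // Compress; the gzip trailer records a CRC32 of the uncompressed bytes.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(payload);
        }
        byte[] compressed = buf.toByteArray();

        // Simulate bitrot by flipping one bit past the 10-byte gzip header.
        compressed[compressed.length / 2] ^= 0x01;

        // Decompressing the damaged stream fails (typically with a ZipException),
        // so the corruption is at least detected, even if not repaired.
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] out = new byte[64];
            while (in.read(out) != -1) { /* drain to EOF so the trailer CRC is checked */ }
        } catch (IOException e) {
            System.out.println("corruption detected: " + e.getMessage());
        }
    }
}
{code}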

For some interesting JIRAs on checksum performance, check out HADOOP-6148 and its 
various follow-ups, as well as the current work in progress in HDFS-2080.

> Cassandra cannot detect corrupt-but-readable column data
> --------------------------------------------------------
>
>                 Key: CASSANDRA-1717
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1717
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Pavel Yaskevich
>             Fix For: 1.0
>
>         Attachments: checksums.txt
>
>
> Most corruptions of on-disk data due to bitrot render the column (or row) 
> unreadable, so the data can be replaced by read repair or anti-entropy.  But 
> if the corruption keeps the column data readable, we do not detect it, and if 
> it corrupts the timestamp to a higher value, the bad data can even resist 
> being overwritten by newer values.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
