[
https://issues.apache.org/jira/browse/CASSANDRA-19369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814508#comment-17814508
]
Francisco Guerrero commented on CASSANDRA-19369:
------------------------------------------------
[~smiklosovic] Cassandra Analytics creates an SSTable during bulk writes. For
each SSTable component generated, we calculate the digest of the file (this
includes the crc32 file), which is then uploaded. The purpose of this checksum
is to protect the file integrity of each of the SSTable component files, rather
than the integrity of the data file.
For data integrity, the bulk writer does the following:
- Computes a checksum of each generated file
- Re-reads the generated SSTable file and ensures that what was written is the
same as what is read
- Transfers the file with a checksum header
- (On Sidecar) Validates that the checksum matches the uploaded file
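The checksum and validation steps above can be sketched roughly as follows.
This is only an illustration, not the Analytics or Sidecar code: it uses MD5
and the {{content-md5}} header described in the issue below, because XXHash32
is not in the Python standard library. The function names are hypothetical,
and the base64 encoding of the header value is an assumption borrowed from
the usual Content-MD5 convention.

```python
import base64
import hashlib
import tempfile
from pathlib import Path


def md5_digest_b64(path: Path, chunk_size: int = 64 * 1024) -> str:
    """Return the base64-encoded MD5 digest of a file, read in chunks."""
    md5 = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return base64.b64encode(md5.digest()).decode("ascii")


def build_upload_headers(path: Path) -> dict:
    # Writer side (hypothetical helper): attach the digest of the file
    # so the receiver can verify what it received.
    return {"content-md5": md5_digest_b64(path)}


def validate_upload(received: Path, headers: dict) -> bool:
    # Receiver side (hypothetical stand-in for the Sidecar check):
    # recompute the digest of the received bytes and compare.
    return md5_digest_b64(received) == headers.get("content-md5")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        component = Path(tmp) / "component.db"  # made-up file name
        component.write_bytes(b"example sstable component bytes")
        headers = build_upload_headers(component)
        print(validate_upload(component, headers))
```

Reading the file in chunks keeps memory use flat for large components.
Swapping in another digest only changes the hashing function and the header
name; the compare-on-receipt step stays the same.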
> [Analytics] Use XXHash32 for digest calculation of SSTables
> -----------------------------------------------------------
>
> Key: CASSANDRA-19369
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19369
> Project: Cassandra
> Issue Type: Improvement
> Components: Analytics Library
> Reporter: Francisco Guerrero
> Assignee: Francisco Guerrero
> Priority: Normal
> Time Spent: 10m
> Remaining Estimate: 0h
>
> During bulk writes, Cassandra Analytics calculates the MD5 checksum of every
> SSTable it produces. During SSTable upload to Cassandra Sidecar, Cassandra
> Analytics includes the {{content-md5}} header as part of the upload request.
> This information is used by Cassandra Sidecar to validate the integrity of
> the uploaded SSTable and prevent issues with bit flips and corrupted SSTables.
> Recently, Cassandra Sidecar introduced [support for additional checksum
> validations|https://issues.apache.org/jira/browse/CASSANDRASC-97] during
> SSTable upload. Notably, XXHash32 digest support was added, which offers
> more performant checksum calculations. This support now allows Cassandra
> Analytics to use a more efficient digest algorithm that is friendlier on the
> CPU usage of Sidecar and Spark resources.