>In the next iteration, I am going to address the point below.
>> I would like to see more abstraction of how the files get moved / put in
>> place, with the proposed solution being the default implementation. That
>> would allow others to plug in alternative means of data movement, like
>> pulling down backups from S3, rsync, etc.

I have added details to the "New or Changed Public Interfaces" section.

>Sidecar now has the ability to restore data from S3 (although the restores
>are for bulk write jobs coming from the Cassandra Analytics library).

Francisco, could you share a JIRA ticket or some references to it?

>There are some components which may be mutated and therefore their
>checksum may need to be recomputed.

Initially I thought just running sstableverify (or nodetool verify) would
be sufficient, but it looks like it has its own problems (CASSANDRA-9947
<https://issues.apache.org/jira/browse/CASSANDRA-9947>, CASSANDRA-17017
<https://issues.apache.org/jira/browse/CASSANDRA-17017> and CASSANDRA-12682
<https://issues.apache.org/jira/browse/CASSANDRA-12682>). Maybe I am
exaggerating the problem; I would wait for expert opinion here.

In this case, Sidecar needs to verify all content, not just SSTables: it
also needs to cover the commit log, hints, etc. I was thinking of offering
a flexible choice of digest algorithms, with some caching/optimisation in
Sidecar behind the file digest endpoint (which considers each and every
file). If verifying SSTables alone is good enough, then this can probably
be skipped.
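To make the caching idea concrete, here is a minimal sketch of what such a
digest helper could look like. This is purely illustrative and not
Sidecar's actual API: the class name FileDigestCache, the cache key of
(path, size, mtime), and the choice of JDK MessageDigest algorithms are all
assumptions on my part. The point is that immutable files (most SSTable
components) hit the cache, while mutated files are detected via size/mtime
and recomputed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileTime;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only; not the real Sidecar file digest endpoint.
public class FileDigestCache
{
    // Cache key includes size and mtime, so a mutated file (e.g. a
    // commit log segment) misses the cache and is recomputed.
    private record CacheKey(Path path, long size, FileTime mtime) {}

    private final Map<CacheKey, String> cache = new ConcurrentHashMap<>();
    private final String algorithm; // e.g. "MD5" or "SHA-256", configurable

    public FileDigestCache(String algorithm)
    {
        this.algorithm = algorithm;
    }

    public String digest(Path file) throws IOException, NoSuchAlgorithmException
    {
        BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
        CacheKey key = new CacheKey(file, attrs.size(), attrs.lastModifiedTime());
        String cached = cache.get(key);
        if (cached != null)
            return cached; // unchanged file: skip re-reading from disk

        MessageDigest md = MessageDigest.getInstance(algorithm);
        try (InputStream in = Files.newInputStream(file))
        {
            byte[] buf = new byte[1 << 16];
            int n;
            while ((n = in.read(buf)) > 0)
                md.update(buf, 0, n);
        }
        String hex = HexFormat.of().formatHex(md.digest());
        cache.put(key, hex);
        return hex;
    }
}
```

A scheme like this keeps the digest algorithm pluggable while bounding the
disk and CPU cost on the source node to files that actually changed.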


On Mon, Apr 29, 2024 at 10:37 PM Dinesh Joshi <djo...@apache.org> wrote:

> On Tue, Apr 23, 2024 at 11:37 AM Venkata Hari Krishna Nukala <
> n.v.harikrishna.apa...@gmail.com> wrote:
>
>> reason why I called out binary level verification out of initial scope is
>> because of these two reasons: 1) Calculating digest for each file may
>> increase CPU utilisation and 2) Disk would also be under pressure as
>> complete disk content will also be read to calculate digest. As called out
>> in the discussion, I think we can't
>>
>
> We should have a digest / checksum for each of the file components
> computed and stored on disk so this doesn't need to be recomputed each
> time. Most files / components are immutable and therefore their checksum
> won't change. There are some components which may be mutated and therefore
> their checksum may need to be recomputed. However, data integrity is not
> something we can compromise on. On the receiving node, CPU utilization is
> not a big issue as that node isn't servicing traffic.
>
> I was too lazy to dig into the code and someone who is more familiar with
> the SSTable components / file format can help shed light on checksums.
>