> In the next iteration, I am going to address the below point.

>> I would like to see more abstraction of how the files get moved / put in place, with the proposed solution being the default implementation. That would allow others to plug in alternative means of data movement, like pulling down backups from S3 or rsync, etc.
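For context, that pluggable data-movement abstraction could look something like the following sketch. The interface and class names here are hypothetical, not from the proposal:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

/** Hypothetical abstraction for how data files get moved / put in place. */
interface DataTransport {
    /** Moves the given source files into the target data directory. */
    void transfer(List<Path> sources, Path targetDir) throws Exception;
}

/** Default implementation: plain local filesystem copy.
 *  Alternative implementations could pull backups from S3, shell out to rsync, etc. */
class LocalCopyTransport implements DataTransport {
    @Override
    public void transfer(List<Path> sources, Path targetDir) throws Exception {
        Files.createDirectories(targetDir);
        for (Path src : sources) {
            Files.copy(src, targetDir.resolve(src.getFileName()),
                       StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

An S3-backed implementation would then implement the same interface and download objects into targetDir instead of copying locally, without the rest of the restore flow having to change.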
I have added details to the "New or Changed Public Interfaces" section.

> Sidecar now has the ability to restore data from S3 (although the restores are for bulk write jobs coming from the Cassandra Analytics library). Francisco, can you share a JIRA ticket or some references to it?

> There are some components which may be mutated and therefore their checksum may need to be recomputed.

Initially I thought just running sstableverify (or nodetool verify) would be sufficient, but it looks like that has its own problems (CASSANDRA-9947 <https://issues.apache.org/jira/browse/CASSANDRA-9947>, CASSANDRA-17017 <https://issues.apache.org/jira/browse/CASSANDRA-17017> & CASSANDRA-12682 <https://issues.apache.org/jira/browse/CASSANDRA-12682>). Maybe I am exaggerating the problem; I would wait for expert opinion here. In this case, Sidecar needs to verify all content (not just SSTables): it needs to cover the commit log, hints, etc. I was thinking of offering a flexible option to use different digest algorithms, with some caching/optimization at the Sidecar file digest endpoint (which considers each and every file). If verifying SSTables is good enough, then this step can probably be skipped.

On Mon, Apr 29, 2024 at 10:37 PM Dinesh Joshi <djo...@apache.org> wrote:

> On Tue, Apr 23, 2024 at 11:37 AM Venkata Hari Krishna Nukala <n.v.harikrishna.apa...@gmail.com> wrote:
>
>> The reason why I called out binary-level verification as out of the initial scope is because of these two reasons: 1) calculating a digest for each file may increase CPU utilisation, and 2) the disk would also be under pressure, as the complete disk content will also be read to calculate digests. As called out in the discussion, I think we can't
>
> We should have a digest / checksum for each of the file components computed and stored on disk so this doesn't need to be recomputed each time. Most files / components are immutable and therefore their checksum won't change.
> There are some components which may be mutated and therefore their checksum may need to be recomputed. However, data integrity is not something we can compromise on. On the receiving node, CPU utilization is not a big issue as that node isn't servicing traffic.
>
> I was too lazy to dig into the code, and someone who is more familiar with the SSTable components / file format can help shed light on checksums.
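To make the caching/optimization idea above concrete, here is a minimal sketch; the class and method names are hypothetical, not from the Sidecar codebase. The digest algorithm is configurable, and a computed digest is reused until the file's size or modification time changes, so immutable SSTable components are read from disk only once while mutable files (commit log, hints) are re-hashed when they change:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: per-file digests with a configurable algorithm,
 *  cached so unchanged (immutable) components are never re-read. */
class DigestCache {
    private record Entry(FileTime mtime, long size, String digest) {}

    private final String algorithm;  // e.g. "MD5" or "SHA-256"; faster non-JCA
                                     // hashes would need a provider or library
    private final Map<Path, Entry> cache = new ConcurrentHashMap<>();

    DigestCache(String algorithm) {
        this.algorithm = algorithm;
    }

    /** Returns the cached digest if the file looks unchanged; recomputes otherwise. */
    String digestOf(Path file) throws Exception {
        FileTime mtime = Files.getLastModifiedTime(file);
        long size = Files.size(file);
        Entry e = cache.get(file);
        if (e != null && e.mtime().equals(mtime) && e.size() == size) {
            return e.digest();  // unchanged component: no disk read needed
        }
        String digest = hash(file);
        cache.put(file, new Entry(mtime, size, digest));
        return digest;
    }

    private String hash(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                md.update(buf, 0, n);
            }
        }
        return HexFormat.of().formatHex(md.digest());
    }
}
```

This matches the "computed and stored on disk" suggestion only in spirit (the sketch keeps the cache in memory); persisting the entries next to the data files would let the digests survive a Sidecar restart.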