> In my experience running repair on some counter data, the size of
> streamed data is much bigger than what the cluster could possibly have
> lost in messages or than what would result from snapshotting at
> different times.
>
> I know the data will eventually be in sync after every repair, but I'm
> more interested in whether Cassandra transfers excess data and how to
> minimize this.
>
> Does anybody have insights on this?

The problem is the granularity of the Merkle tree. Cassandra streams the ranges whose hash values differ, and a single range can be much bigger than one row.
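The effect can be sketched with a toy model (this is an illustration of the idea, not Cassandra's actual implementation; all names and numbers here are made up, and real trees have far more leaves):

```python
import hashlib

# Toy Merkle-tree granularity demo: a fixed number of leaves over a
# 0..99 "token" space. Each leaf covers many rows, so a single
# differing row marks its whole leaf as mismatched, and every row in
# that leaf's range would be streamed during repair.

NUM_LEAVES = 4  # deliberately tiny for clarity


def leaf_index(key: int) -> int:
    # Assign a key to a leaf by its position in the 0..99 token space.
    return key * NUM_LEAVES // 100


def leaf_hashes(rows: dict) -> list:
    # Hash all rows falling into each leaf's range together.
    buckets = [hashlib.sha256() for _ in range(NUM_LEAVES)]
    for key in sorted(rows):
        buckets[leaf_index(key)].update(f"{key}:{rows[key]}".encode())
    return [b.hexdigest() for b in buckets]


# Two replicas that differ in exactly one row (key 10).
replica_a = {k: "v" for k in range(0, 100, 5)}
replica_b = dict(replica_a)
replica_b[10] = "stale"

mismatched = [i for i, (a, b) in
              enumerate(zip(leaf_hashes(replica_a), leaf_hashes(replica_b)))
              if a != b]
rows_streamed = sorted(k for k in replica_a if leaf_index(k) in mismatched)

print(mismatched)      # [0] -- only one leaf differs
print(rows_streamed)   # [0, 5, 10, 15, 20] -- five rows streamed for one stale row
```

One stale row causes every row in the same leaf's range to be streamed, which is why the amount of repair traffic can greatly exceed the amount of data actually out of sync.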
Andrey