Hi all,

Many people here have trouble with repair, so I would like to share my 
experience with backporting CASSANDRA-12580 "Fix merkle tree size 
calculation" (thanks Paulo!) to our C* 2.1.16. I was expecting some minor 
improvements, but the results are impressive on some tables.

Because of a slow VPN between our EU and US AWS DCs, the massive drop in 
overstreaming is a big win for us. On top of that, before the backport I used 
to see many RepairExceptions, and their number increased with each repair. 
With this fix the graph shows only one exception on one node, which is 
negligible. Such exceptions are not critical because Cassandra-reaper retries, 
but each retry is a waste of time.


I run repairs on tables set by set (some sets of tables being more critical, 
etc.).
The most impressive result so far for a set is:
* Before: 23 days (days, not hours)
* With CASSANDRA-12580: 16 hours (yes, hours!)

The improvement is not always as dramatic (e.g. 8 hours instead of 39 hours on 
another set) but it is still significant and valuable.

Moreover, considering that:
* repair is a mandatory operation in many use cases
* Paulo already made the patch for 2.1
* C* 2.1 is widely used (the most used?)
I think this bugfix is critical - from an Ops point of view - and should land 
in 2.1.17 to be available to people who don't deploy from source.

Best,

Romain
