Re: Is it safe to stop a read repair and any suggestion on speeding up repairs

aaron morton Thu, 21 Jul 2011 15:27:32 -0700

nit pick: nodetool repair is just called repair (or the Anti Entropy Service). 
Read Repair is something that happens during a read request.

Short answer: yes, it's safe to kill Cassandra during a repair. It's one of the 
nice things about never mutating data on disk (SSTables are immutable once 
written). 

Longer answer: If nodetool compactionstats says there are no Validation 
compactions running (and the compaction queue is empty) and netstats says 
there is nothing streaming, there is a good chance the repair is finished or 
dead. If a neighbour dies during a repair, the node it was started on will wait 
for 48 hours(?) until it times out. Check the logs on the machines for errors, 
particularly from the AntiEntropyService. And see what compactionstats is 
saying on all the nodes involved in the repair.
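
The checks above could be scripted roughly like this. It is only a sketch: the 
host names are placeholders, it assumes nodetool is on the PATH and takes 
-h <host>, and the "Validation" substring match is my assumption about how 
compactionstats labels those compactions, so read the raw output rather than 
trusting the flag.

```python
import subprocess

# Placeholder host names; substitute the nodes involved in the repair.
HOSTS = ["node1", "node2", "node3", "node4"]


def build_commands(hosts):
    """Return one nodetool argv per (host, subcommand) pair."""
    return [
        ["nodetool", "-h", host, sub]
        for host in hosts
        for sub in ("compactionstats", "netstats")
    ]


def mentions_validation(compactionstats_output):
    """Assumed heuristic: a running validation compaction is reported
    with the word 'Validation' somewhere in the output."""
    return "validation" in compactionstats_output.lower()


def main():
    """Run both commands against every host and print the raw output."""
    for argv in build_commands(HOSTS):
        result = subprocess.run(argv, capture_output=True, text=True)
        print("==", " ".join(argv))
        print(result.stdout or result.stderr)
        if argv[-1] == "compactionstats" and mentions_validation(result.stdout):
            print("-> a validation compaction may still be running on", argv[2])
```

Call main() from a machine that can reach every node. The netstats output is 
printed as-is and left for a human to read.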

Even Longer: um, 3 TB of data is *way* too much data per node; generally happy 
people have up to about 200 to 300GB per node. The reason for this 
recommendation is so that things like repair, compaction, node moves, etc. are 
manageable, and because the loss of a single node has less of an impact. I would 
not recommend running a live system with that much data per node. 
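
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes 
the 3 TB per node figure already includes replicas, takes 1 TB as 1000 GB, and 
ignores growth and compaction headroom; nodes_needed is just a helper name for 
illustration.

```python
import math


def nodes_needed(total_gb, target_gb_per_node):
    """Smallest node count that keeps each node at or under the target load."""
    return math.ceil(total_gb / target_gb_per_node)


# 4 nodes x 3 TB each, taking 1 TB = 1000 GB (an assumption).
total_gb = 4 * 3000

print(nodes_needed(total_gb, 300))  # cluster size at 300 GB per node
print(nodes_needed(total_gb, 200))  # cluster size at 200 GB per node
```

Under those assumptions the same data spread at 200 to 300GB per node works out 
to a cluster of roughly 40 to 60 nodes.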

Hope that helps. 

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 22 Jul 2011, at 03:51, Adi wrote:

> We have a 4 node 0.7.6 cluster. RF=2 , 3 TB data per node. 
> A read repair was kicked off on node 4 last week and is still in progress. 
> Later I kicked off read repair on node 2 a few days back.
> We were writing (read/write/updates/NO deletes) data while the repair was in 
> progress but no data has been written for the past 3-4 days. 
> I was hoping the repair should get done in that time-frame before proceeding 
> with further writes/deletes.
> 
> Would it be safe to stop it and kick it off per column family or do a full 
> scan of all keys as suggested in an earlier discussion? Any other suggestions 
> on hastening this repair?
> 
> On both nodes the repair Thread has been waiting at this stage for a long 
> time (~60+ hours):
>  java.lang.Thread.State: WAITING
>       at java.lang.Object.wait(Native Method)
>       - waiting on <580857f3> (a org.apache.cassandra.utils.SimpleCondition)
>       at java.lang.Object.wait(Object.java:485)
>       at org.apache.cassandra.utils.SimpleCondition.await(SimpleCondition.java:38)
>       at org.apache.cassandra.service.AntiEntropyService$RepairSession.run(AntiEntropyService.java:791)
>    Locked ownable synchronizers:
>       - None
> CPU sampling for a few minutes shows these methods as hot spots (mostly the 
> top two):
> org.apache.cassandra.db.ColumnFamilyStore.isKeyInRemainingSSTables( )
> org.apache.cassandra.utils.BloomFilter.getHashBuckets( ) 
> org.apache.cassandra.io.sstable.SSTableIdentityIterator.echoData()
> 
> netstats does not show anything streaming to/from any of the nodes.
> 
> -Adi Pandit
>
