Re: Repair taking a long, long time

aaron morton Wed, 20 Jul 2011 16:16:53 -0700

The first thing to do is understand what the server is doing. 

As Edward said, there are two phases to the repair: first the differences are 
calculated, and then they are shared between the neighbours. Let's add a third 
step: once the neighbour gets the data it has to rebuild the indexes and bloom 
filter. Not huge, but let's include it for completeness.


So...

0. Check for ERRORs in the log.
1. Check nodetool compactstats. If the Merkle tree build is going on, it will 
say "Validation Compaction". Run it twice and check for progress.
2. Check nodetool netstats. This will show which segments of the data are being 
streamed. Run it twice and check for progress. 
3. Check nodetool compactstats. If the data has completed streaming and the 
indexes are being built, it will say "SSTable build".
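The "run it twice and compare" checks in steps 1 to 3 can be scripted. Below is a minimal Python sketch that assumes `nodetool` is on the PATH and accepts `-h HOST`. The stage labels it matches ("Validation", "SSTable build") are taken from the steps above and may be worded differently in other Cassandra versions. Treating "output changed between two runs" as progress is only a rough heuristic.

```python
import subprocess
import time


def nodetool(command: str, host: str = "localhost") -> str:
    """Run `nodetool -h HOST COMMAND` and return its stdout as text."""
    result = subprocess.run(["nodetool", "-h", host, command],
                            capture_output=True, text=True, check=True)
    return result.stdout


def repair_stage(compactstats: str) -> str:
    """Guess the repair phase from compactstats output, using the labels in steps 1 and 3."""
    if "Validation" in compactstats:
        return "building Merkle trees (validation compaction)"
    if "SSTable build" in compactstats:
        return "rebuilding indexes/bloom filters for streamed data"
    return "not validating or building; check netstats for streaming"


def made_progress(first: str, second: str) -> bool:
    """Heuristic: two snapshots that differ suggest the operation is still moving."""
    return first != second


def check(host: str = "localhost", interval: float = 60.0) -> None:
    """Take two snapshots of compactstats and netstats, then report the guessed stage."""
    for command in ("compactstats", "netstats"):
        first = nodetool(command, host)
        time.sleep(interval)
        second = nodetool(command, host)
        print(f"{command}: changed between runs = {made_progress(first, second)}")
    print("stage:", repair_stage(nodetool("compactstats", host)))
```

Calling `check("10.0.0.1")` (hypothetical address) on the node being repaired prints one line per command and a guessed stage. If neither command's output changes between runs, step 0 (the log) is the next place to look.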

Once we know what stage of the repair your server is at it's possible to reason 
about what is going on.

If you want to dive deeper, look for log messages from the AntiEntropyService 
on the machine you started the repair on.
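Step 0 and the AntiEntropyService messages can be pulled out of the log together. Here is a minimal Python sketch. The log path is an assumption (a common default for packaged installs), so adjust it to match your log4j configuration. Matching on the word `ERROR` and on the substring `AntiEntropyService` assumes the default log pattern, which includes the level and the source file name.

```python
import re
from pathlib import Path
from typing import Iterable, List

# Assumed default location for packaged installs; adjust for your log4j config.
DEFAULT_LOG = Path("/var/log/cassandra/system.log")

_ERROR = re.compile(r"\bERROR\b")


def interesting_lines(lines: Iterable[str]) -> List[str]:
    """Keep ERROR lines (step 0) and anything logged by AntiEntropyService."""
    return [line.rstrip("\n") for line in lines
            if _ERROR.search(line) or "AntiEntropyService" in line]


def scan(path: Path = DEFAULT_LOG) -> List[str]:
    """Read the log file and return only the lines relevant to a stuck repair."""
    with path.open(encoding="utf-8", errors="replace") as f:
        return interesting_lines(f)
```

`scan()` reads the file once and returns the matching lines in log order. Running it on the coordinator node shows both errors and the repair session's own messages.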

Hope that helps. 

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 21 Jul 2011, at 02:31, David Boxenhorn wrote:

> As I indicated below (but didn't say specifically), another option is to set 
> read repair chance to 1.0 for all your CFs and loop over all your data, since 
> a read triggers a read repair. 
> 
> On Wed, Jul 20, 2011 at 4:58 PM, Maxim Potekhin <potek...@bnl.gov> wrote:
> I can re-load all the data I have in the cluster from a flat-file cache I 
> have on NFS many times faster than nodetool repair takes. And that's not 
> even accurate, because, as others noted, nodetool repair eats up disk space 
> for breakfast and takes more than 24 hrs on a 200GB data load, at which point 
> I have to cancel. That's not acceptable. I simply don't know what to do now.
> 
> 
> 
> On 7/20/2011 8:47 AM, David Boxenhorn wrote:
>> 
>> I have this problem too, and I don't understand why.
>> 
>> I can repair my nodes very quickly by looping through all my data (when you 
>> read your data it does read-repair), but nodetool repair takes forever. I 
>> understand that nodetool repair builds Merkle trees, etc., so it's a 
>> different algorithm, but why can't nodetool repair be smart enough to choose 
>> the best algorithm? Also, I don't understand what's special about my data 
>> that makes nodetool repair so much slower than looping through all my data.
>> 
>> 
>> On Wed, Jul 20, 2011 at 12:18 AM, Maxim Potekhin <potek...@bnl.gov> wrote:
>> Thanks Edward. I'm told by our IT that the switch connecting the nodes is 
>> pretty fast.
>> Seriously, in my house I copy complete DVD images from my bedroom to the 
>> living room downstairs via WiFi, and a dozen GB does not seem like a 
>> problem, on dirt cheap hardware (Patriot Box Office).
>> 
>> I also have just _one_ major column family, but caveat emptor: it has 8 
>> indexes attached to it (and there will be more). There is one accounting CF, 
>> which is small and can't possibly make a difference.
>> 
>> By contrast, compaction (as in nodetool) performs quite well on this 
>> cluster. I'm starting to suspect some sort of malfunction.
>> 
>> I looked at the system log during the "repair": there is some compaction 
>> agent doing work that I'm not sure makes sense (and I didn't call for it). 
>> Disk utilization all of a sudden goes up to 40% per Ganglia and stays there, 
>> which is pretty silly considering the cluster is IDLE and we have SSDs. No 
>> external writes, no reads. There are occasional GC stoppages, but these I can 
>> live with.
>> 
>> This repair debacle has now happened a 2nd time in a row. Cr@p. I need to go 
>> to production soon and that doesn't look good at all. If I can't manage a 
>> system that simple (and/or get help on this list) I may have to cut my 
>> losses, i.e. stay with Oracle.
>> 
>> Regards,
>> 
>> Maxim
>> 
>> 
>> 
>> 
>> On 7/19/2011 12:16 PM, Edward Capriolo wrote:
>> 
>> Well, most SSDs are pretty fast. There is one more thing to consider. If 
>> Cassandra determines nodes are out of sync, it has to transfer data across 
>> the network. If that is the case, you have to look at 'nodetool streams' and 
>> determine how much data is being transferred between nodes. There are some 
>> open tickets where, with larger tables, repair streams more than it needs 
>> to. But even if the transfers are only 10% of your 200GB, transferring 20 GB 
>> is not trivial.
>> 
>> If you have multiple keyspaces and column families, repairing them one at a 
>> time might make the process more manageable.
>> 
>> 
> 
>