Re: Question regarding the need to run nodetool repair

2012-11-15 Thread Edward Capriolo
On Thursday, November 15, 2012, Dwight Smith dwight.sm...@genesyslab.com
wrote:
 I have a 4 node cluster,  version 1.1.2, replication factor of 4,
read/write consistency of 3, level compaction. Several questions.



 1)  Should nodetool repair be run regularly to assure it has
completed before gc_grace?  If it is not run, what are the exposures?

Yes. Missed tombstones could cause deleted data to reappear.
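A minimal sketch of scheduling this (the cron entry and host are assumptions; 10 days is the gc_grace_seconds default):

```shell
# Hypothetical cron entry: weekly repair, comfortably inside the
# default gc_grace of 10 days:
#   0 2 * * 0  nodetool -h localhost repair

# Sanity check that the chosen interval beats gc_grace (both in seconds).
repair_interval=$((7 * 24 * 3600))   # weekly
gc_grace=$((10 * 24 * 3600))         # 10-day default
if [ "$repair_interval" -lt "$gc_grace" ]; then
  echo "repair interval fits inside gc_grace"
fi
```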

 2)  If a node goes down, and is brought back up prior to the 1 hour
hinted handoff expiration, should repair be run immediately?

If the node is brought up within the hour, you should let the hints replay.
 Repair is always safe to run.

 3)  If the hinted handoff has expired, the plan is to remove the node
and start a fresh node in its place.  Does this approach cause problems?

You only need to join a fresh node if the node was down longer than gc
grace. Default is 10 days.


 Thanks



If you read and write at quorum and run repair regularly, you can worry less
about the things above because they are essentially non-factors.


RE: Question regarding the need to run nodetool repair

2012-11-15 Thread Dwight Smith
Thanks

 



Re: Question regarding the need to run nodetool repair

2012-11-15 Thread Rob Coli
On Thu, Nov 15, 2012 at 4:12 PM, Dwight Smith
dwight.sm...@genesyslab.com wrote:
 I have a 4 node cluster,  version 1.1.2, replication factor of 4, read/write
 consistency of 3, level compaction. Several questions.

Hinted Handoff is broken in your version [1] (and all versions between
1.0.0 and 1.0.3 [2]). Upgrade to 1.1.6 ASAP so that the answers below
actually apply, because working Hinted Handoff is involved.

 1)  Should nodetool repair be run regularly to assure it has completed
 before gc_grace?  If it is not run, what are the exposures?

If you do DELETE logical operations, yes. If not, no. gc_grace_seconds
only applies to tombstones, and if you do not delete you have no
tombstones. If you only DELETE in one columnfamily, that is the only
one you have to repair within gc_grace.

Exposure is zombie data, where a node missed a DELETE (and associated
tombstone) but had a previous value for that column or row and this
zombie value is resurrected and propagated by read repair.
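As a hedged sketch of that last point, repair can be limited to the one columnfamily that sees DELETEs (the keyspace and columnfamily names below are placeholders):

```shell
# Repair only the columnfamily that receives DELETEs, on each node,
# at least once per gc_grace period. Names are hypothetical.
nodetool -h localhost repair MyKeyspace MyDeletedCF
```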

 2)  If a node goes down, and is brought back up prior to the 1 hour
 hinted handoff expiration, should repair be run immediately?

In theory, if hinted handoff is working, no. This is a good thing
because otherwise simply restarting a node would trigger the need for
repair. In practice I would be shocked if anyone has scientifically
tested it to the degree required to be certain all edge cases are
covered, so I'm not sure I would rely on this being true. Especially
as key components of this guarantee such as Hinted Handoff can be
broken for 3-5 point releases before anyone notices.

It is because of this uncertainty that I recommend periodic repair
even in clusters that don't do DELETE.

 3)  If the hinted handoff has expired, the plan is to remove the node
 and start a fresh node in its place.  Does this approach cause problems?

Yes.

1) You've lost any data that was only ever replicated to this node.
With RF=3, this should be relatively rare, even with CL.ONE, because
writes are much more likely to succeed-but-report-they-failed than
vice versa. If you run periodic repair, you cover the case where
something gets under-replicated and then even less replicated as nodes
are replaced.
2) When you replace the node in its place (presumably using
replace_token) you will only stream the relevant data from a single
other replica. This means that, given 3 nodes A B C where datum X is
on A and B, and B fails, it might be bootstrapped using C as a source,
decreasing your replica count of X by 1.
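A sketch of the 1.x-era replacement mechanism, assuming the dead node's token is known (the token value below is a placeholder):

```shell
# Find the dead node's token first (e.g. from `nodetool ring` on a
# live node), then start the replacement JVM with it:
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_token=85070591730234615865843651857942052864"
```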

In order to deal with these issues, you need to run a repair of the
affected node after bootstrapping/replace_tokening. Until this repair
completes, CL.ONE reads might be stale or missing. I think what
operators really want is a path by which they can bootstrap and then
repair, before returning the node to the cluster. Unfortunately there
are significant technical reasons which prevent this from being
trivial.

As such, I suggest increasing gc_grace_seconds and
max_hint_window_in_ms to reduce the amount of repair you need to run.
The negative to increasing gc_grace is that you store tombstones for
longer before purging them. The negative to increasing
max_hint_window_in_ms is that hints for a given token are stored in
one row, and very wide rows can exhibit pathological behavior.
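A sketch of those two knobs, with illustrative (not recommended) values:

```shell
# cassandra.yaml -- raise the hint window from the 1-hour default
# (value is in milliseconds; 43200000 ms = 12 hours):
#   max_hint_window_in_ms: 43200000

# gc_grace is set per columnfamily (in seconds); 1.1-era cassandra-cli
# syntax, with a placeholder columnfamily name (1728000 s = 20 days):
#   update column family MyCF with gc_grace = 1728000;
```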

Also if you set max_hint_window_in_ms too high, you could cause
cascading failure as nodes fill with hints, become less performant...
thereby increasing the cluster-wide hint rate. Unless you have a very
high write rate or really lazy ops people who leave nodes down for
very long times, the cascading failure case is relatively unlikely.

=Rob

[1] https://issues.apache.org/jira/browse/CASSANDRA-4772
[2] https://issues.apache.org/jira/browse/CASSANDRA-3466


-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb