(gurus, please check my logic here... I'm trying to validate my
understanding of this situation.)

Isn't the issue that while a server was disconnected, a delete could have
occurred, and thus the disconnected server never got the 'tombstone'?
(http://wiki.apache.org/cassandra/DistributedDeletes)  When it comes back,
only after it receives the delete (the tombstone) will the data be deleted
from the reconnected server.  I do not think this happens automatically when
the server rejoins the cluster; it requires running the repair command manually.
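To check my own picture of this, here is a toy model of the scenario. It is
a sketch under my own assumptions, not how Cassandra is implemented: each
replica maps key -> (timestamp, value), a value of None stands for a
tombstone, and `write` and `repair` are hypothetical helpers. It also assumes
nothing else delivers the missed delete when the node rejoins. The down
replica keeps the old value until a repair pushes the newest cell (the
tombstone) to it.

```python
from typing import Dict, List, Optional, Tuple

# Toy model, not Cassandra's implementation: each replica maps
# key -> (timestamp, value), and value None stands for a tombstone.
Cell = Tuple[int, Optional[str]]
Replica = Dict[str, Cell]

def write(replicas: List[Replica], up: List[int], key: str, ts: int,
          value: Optional[str]) -> None:
    """Apply a write, or a delete when value is None, to the live replicas only."""
    for i in up:
        old = replicas[i].get(key)
        if old is None or ts > old[0]:
            replicas[i][key] = (ts, value)

def repair(replicas: List[Replica], key: str) -> None:
    """Hypothetical anti-entropy step: copy the newest cell (tombstones included) everywhere."""
    cells = [r[key] for r in replicas if key in r]
    if not cells:
        return
    newest = max(cells, key=lambda c: c[0])
    for r in replicas:
        r[key] = newest

replicas: List[Replica] = [{}, {}, {}]               # ReplicationFactor=3
write(replicas, [0, 1, 2], "k", ts=1, value="v1")    # all replicas up
write(replicas, [0, 1], "k", ts=2, value=None)       # replica 2 is down and misses the tombstone

# Replica 2 rejoins; in this model nothing changes until repair runs.
before_repair = [r["k"] for r in replicas]
print(before_repair)                 # [(2, None), (2, None), (1, 'v1')]
repair(replicas, "k")
print([r["k"] for r in replicas])    # [(2, None), (2, None), (2, None)]
```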

From my understanding, if the consistency level (the number of replicas that
must respond) is greater than the number of servers missing that tombstone,
you'll get the correct data. If it's less than or equal, then you 'could' get
the right or the wrong answer. So the question is how often you need to run
repair. If you have ReplicationFactor=3 and you use ConsistencyLevel.QUORUM
(2 responses), then you need to run it after one server fails, just to be
sure. If you can tolerate some of this, you can wait a bit longer before
running the repair.
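The arithmetic behind that rule can be brute-forced. This is a sketch under
my own simplifying assumptions: newest timestamp wins on read, `stale`
replicas hold the old value (ts=1), the rest hold the tombstone (ts=2), and
`resolve` / `always_correct` are hypothetical names. It enumerates every set
of `cl` replicas a read could hit, so a read is guaranteed correct only when
every such set contains at least one tombstone, i.e. when cl > stale.

```python
from itertools import combinations
from typing import Optional, Sequence, Tuple

Cell = Tuple[int, Optional[str]]   # (timestamp, value); None marks a tombstone

def resolve(responses: Sequence[Cell]) -> Optional[str]:
    """Newest timestamp wins, so a tombstone beats the older live value."""
    return max(responses, key=lambda c: c[0])[1]

def always_correct(rf: int, cl: int, stale: int) -> bool:
    """True if every possible read of `cl` out of `rf` replicas returns the delete.

    `stale` replicas still hold the old value (ts=1); the rest hold the
    tombstone (ts=2).
    """
    replicas = [(1, "v1")] * stale + [(2, None)] * (rf - stale)
    return all(resolve(subset) is None for subset in combinations(replicas, cl))

print(always_correct(3, 2, 1))   # True:  QUORUM with one stale replica
print(always_correct(3, 2, 2))   # False: QUORUM can hit both stale replicas
print(always_correct(3, 1, 1))   # False: ONE can hit the stale replica
```

With RF=3 and QUORUM this says one missed tombstone is still masked on
every read, and a second server missing it is where wrong answers become
possible.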

On Tue, Aug 17, 2010 at 12:58 PM, Jeremy Dunck <jdu...@gmail.com> wrote:

> On Tue, Aug 17, 2010 at 2:49 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> > It doesn't have to be disconnected more than GC grace seconds to cause
> > what you are seeing, it just has to be disconnected at all (thus
> > missing delete commands).
> >
> > Thus you need to be running repair more often than gcgrace, or
> > confident that read repair will handle it for you (which clearly is
> > not the case for you :).  see
> > http://wiki.apache.org/cassandra/Operations
>
> FWIW, the docs there say:
> "Remember though that if a node is down longer than your configured
> GCGraceSeconds (default: 10 days), it could have missed remove
> operations permanently"
>
> So that's probably a source of misunderstanding.
>
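The wiki sentence quoted above is about the tombstone itself going away.
Here is a toy sketch of that, under loudly simplified assumptions: the cell
timestamp doubles as the deletion time (in seconds), "compaction" just drops
tombstones older than GCGraceSeconds (the 10-day default from the quoted
wiki text), and `compact`, `repair` and `scenario` are hypothetical names.
Repairing within the grace period propagates the delete; repairing after the
live replicas have purged the tombstone copies the stale value back.

```python
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, Optional[str]]   # (timestamp in seconds, value); None is a tombstone
Replica = Dict[str, Cell]

DAY = 24 * 3600
GC_GRACE = 10 * DAY                # default GCGraceSeconds per the quoted wiki text

def compact(replica: Replica, now: int, gc_grace: int = GC_GRACE) -> Replica:
    """Toy compaction: drop tombstones whose age exceeds gc_grace."""
    return {k: c for k, c in replica.items()
            if not (c[1] is None and now - c[0] > gc_grace)}

def repair(replicas: List[Replica], key: str) -> None:
    """Toy repair: copy the newest surviving cell to every replica."""
    cells = [r[key] for r in replicas if key in r]
    if not cells:
        return
    newest = max(cells, key=lambda c: c[0])
    for r in replicas:
        r[key] = newest

def scenario(repair_after_days: int) -> Optional[str]:
    # Replica 2 was down when the delete (a tombstone at t=DAY) happened.
    replicas = [{"k": (DAY, None)}, {"k": (DAY, None)}, {"k": (0, "v1")}]
    now = DAY + repair_after_days * DAY
    replicas = [compact(r, now) for r in replicas]   # compaction runs before repair
    repair(replicas, "k")
    return replicas[0]["k"][1]

print(scenario(5))    # None: repair inside the grace period keeps the delete
print(scenario(11))   # v1: tombstones purged first, so the deleted value returns
```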



-- 
Virtually, Ned Wolpert

"Settle thy studies, Faustus, and begin..."   --Marlowe
