@Joseph,
An incident we saw in production, and a speculation as to how it might
have occurred.

*A detailed description of the use case*

*Incident*
We have two DCs, each with three nodes.
Our keyspace has RF 3 per DC, and read_repair_chance is 0.0 for all the
tables.
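For reference, the setup looks roughly like this (keyspace, table and
DC names below are placeholders, not our real schema):

    CREATE KEYSPACE ks
      WITH replication = {'class': 'NetworkTopologyStrategy',
                          'DC1': 3, 'DC2': 3};

    CREATE TABLE ks.events (
      pk text,
      ck timeuuid,
      payload text,
      PRIMARY KEY (pk, ck)
    ) WITH read_repair_chance = 0.0;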
After a while (we run periodic full table scans to dump the data
somewhere else), we saw corrupted data in the dump.
We copied the SSTables from all nodes of one DC to a separate cluster
created for debugging.
     We shut down two nodes of the replica cluster, so that only one was
up, and queried the possibly corrupted data in cqlsh.
     What we saw: out of the three replica nodes, two had the same data,
and one had some extra data which shouldn't have been there for that
particular partition key.
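Concretely, the check on the debug cluster was along these lines
(keyspace, table and key are placeholders):

    -- connect cqlsh to the only node that is still up, then:
    CONSISTENCY ONE;  -- a single live replica is enough to answer
    SELECT * FROM ks.events WHERE pk = 'suspect-key';

Repeating this with a different single node up each time is how we
compared what each replica holds for that partition.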



*Speculation*
A possible cause we could come up with: on a particular day, one of the
nodes of the production DC might have gone down, and the downtime might
have exceeded the hinted handoff window (max_hint_window_in_ms).
Say the node went down at 12 PM.
Coordinator nodes stored hints from 12 PM to 3 PM.
The node was brought back up at 6 PM.
So all deletions/updates from 3 PM to 6 PM never reached that particular
node.
And repair wasn't run on that node. After 10 days, the tombstones were
dropped (gc_grace_seconds).
Now that particular node still has the data that missed those deletions,
while the data has been removed from the other two nodes.
So we can't simply run repair now; it would just push the resurrected
data back to the other replicas.
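The timings match the stock defaults, assuming nothing was overridden
(worth double-checking against the actual config; ks.events is again a
placeholder):

    -- cassandra.yaml default: max_hint_window_in_ms: 10800000  (3 hours)
    -- table default: gc_grace_seconds = 864000  (10 days)
    DESCRIBE TABLE ks.events;  -- check gc_grace_seconds in the output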

Again, this is only speculation; we are not sure. It is the only cause
we could come up with.


@User
Back to the requirement "*Read data from specific node in cassandra*"
I prematurely stated that the whitelist approach worked *perfectly*.
However, while scanning the data, that turned out not to be the case; it
produced an ambiguous data dump.
This option didn't work for debugging.
Could someone suggest other alternatives?
