Change your consistency level in the cqlsh shell while you query, from ONE
to QUORUM to ALL. If your results change, that's a consistency issue.
(This assumes simple inserts; if there are deletes, collection updates,
etc. in the mix, things get a bit more complex.)
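
In cqlsh this is just the CONSISTENCY command followed by re-running the
select. If you'd rather script the same check from python-driver, a rough
sketch could look like the following. The helper names are made up for
illustration, and the session, query and parameters are whatever your
application already uses; only ConsistencyLevel and SimpleStatement come
from the driver.

```python
def rows_by_consistency(fetch_rows, levels=("ONE", "QUORUM", "ALL")):
    """Run the same query once per level; fetch_rows(level) returns a list of rows."""
    # Sort by repr so row order (and None values) cannot cause false differences.
    return {level: sorted(fetch_rows(level), key=repr) for level in levels}


def levels_disagree(results):
    """True if at least two consistency levels returned different row sets."""
    distinct = {tuple(rows) for rows in results.values()}
    return len(distinct) > 1


def make_driver_fetch(session, cql, params=()):
    """Build fetch_rows for python-driver (hypothetical helper).

    The driver is imported lazily so the two helpers above can be used
    and tested without it installed.
    """
    from cassandra import ConsistencyLevel
    from cassandra.query import SimpleStatement

    def fetch_rows(level):
        # Look up e.g. ConsistencyLevel.QUORUM by name and attach it to the statement.
        stmt = SimpleStatement(cql, consistency_level=getattr(ConsistencyLevel, level))
        return [tuple(row) for row in session.execute(stmt, params)]

    return fetch_rows
```

With a connected session, levels_disagree(rows_by_consistency(
make_driver_fetch(session, cql, params))) returning True points to replicas
holding different data. Expect ALL to time out or fail if any replica is
struggling, which is itself a useful signal.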

To diagnose why the issue exists, a helpful place to look is the dropped
message counts from nodetool tpstats. Overloaded clusters will experience
consistency issues as a result of dropped mutations.
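
If you want to collect those counts across nodes, a small parser along
these lines might help. It assumes the tpstats output has a "Message type"
header followed by one row per message type with the dropped count in the
second column; the sample text and its numbers are invented purely to
show the shape, so check it against what your version actually prints.

```python
def parse_dropped(tpstats_output):
    """Return {message_type: dropped_count} from `nodetool tpstats` text."""
    dropped = {}
    in_section = False
    for line in tpstats_output.splitlines():
        if line.startswith("Message type"):
            # Everything after this header is the dropped-message table.
            in_section = True
            continue
        fields = line.split()
        if in_section and len(fields) >= 2 and fields[1].isdigit():
            dropped[fields[0]] = int(fields[1])
    return dropped


# Made-up sample, NOT real output from any cluster.
SAMPLE = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0         123456         0                 0

Message type           Dropped
READ                        12
MUTATION                   340
HINT                         0
"""

# Keep only the message types that were actually dropped.
nonzero = {name: count for name, count in parse_dropped(SAMPLE).items() if count}
print(nonzero)
```

A non-zero MUTATION count is the one to look at first, since those are
writes a replica never applied.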

It's helpful to think of things in terms of guarantees. If you write with
CL=ONE or LOCAL_ONE, you're getting exactly one guaranteed write. In a
healthy system with tons of excess capacity, you will likely see much
better consistency than that; the coordinator still sends the write to the
other replicas, which will perform it if they can, and the hint system
replays it later to replicas that missed it. Since it appears
you're seeing inconsistency at CL=ONE, plus timeouts at CL=QUORUM, it's
quite likely your cluster is not capable of keeping up with the consistency
level you require.
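
To make the guarantee concrete, here is a small sketch of the usual
overlap rule: a read is guaranteed to see an acknowledged write only when
write acks plus read acks exceed the replication factor. The function
names are hypothetical, it treats the cluster as a single datacenter, and
RF=3 in the usage note is an assumption since the thread doesn't state
your replication factor.

```python
def quorum(rf):
    """Number of replicas in a quorum for replication factor rf."""
    return rf // 2 + 1


def acks_required(level, rf):
    """Replica acknowledgements a request waits for at this level (single DC)."""
    return {"ONE": 1, "LOCAL_ONE": 1, "QUORUM": quorum(rf), "ALL": rf}[level]


def read_sees_write(write_level, read_level, rf):
    """True when the read replicas must overlap the replicas that acked the write."""
    return acks_required(write_level, rf) + acks_required(read_level, rf) > rf
```

Assuming RF=3, read_sees_write("ONE", "ONE", 3) is False (1 + 1 is not
greater than 3), while read_sees_write("QUORUM", "QUORUM", 3) is True
(2 + 2 > 3). At ONE/ONE, anything better than a single copy is best effort.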

Why your cluster is overloaded is another question entirely, but if you
discover that's the case, in my experience the most common causes are
excessive GC due to bad heap settings and data model issues that cause
massive partitions.

On Tue, Feb 14, 2017 at 2:03 PM, Josh England <j...@tgsmc.com> wrote:

> I suspect this is true, but it has proven to be significantly harder to
> track down.  Either cassandra is tickling some bug that nothing else does
> or something strange is going on internally.  On an otherwise quiet system,
> I'd see instant results most of the time intermixed with queries (reads)
> that would timeout and fail.  I agree this needs to be addressed but I'd
> like to understand what is currently going on with my queries.  If it is
> thought to be a consistency problem, how can that be verified?
>
> -JE
>
>
> On Tue, Feb 14, 2017 at 1:46 PM, Jonathan Haddad <j...@jonhaddad.com>
> wrote:
>
>> If you're getting a lot of timeouts you will almost certainly end up with
>> consistency issues. You're going to need to fix the root cause, your
>> cluster instability, or this sort of issue will be commonplace.
>>
>>
>> On Tue, Feb 14, 2017 at 1:43 PM Josh England <j...@tgsmc.com> wrote:
>>
>>> I'll try the repair.  Using quorum tends to lead to too many timeout
>>> problems though.  :(
>>>
>>> -JE
>>>
>>>
>>> On Tue, Feb 14, 2017 at 1:39 PM, Oskar Kjellin <oskar.kjel...@gmail.com>
>>> wrote:
>>>
>>> Repair might help. But you will end up in this situation again unless
>>> you read/write using quorum (may be local)
>>>
>>> Sent from my iPhone
>>>
>>> On 14 Feb 2017, at 22:37, Josh England <j...@tgsmc.com> wrote:
>>>
>>> All client interactions are from python (python-driver 3.7.1) using
>>> default consistency (LOCAL_ONE I think).  Should I try repairing all nodes
>>> to make sure all data is consistent?
>>>
>>> -JE
>>>
>>>
>>> On Tue, Feb 14, 2017 at 1:32 PM, Oskar Kjellin <oskar.kjel...@gmail.com>
>>> wrote:
>>>
>>> What consistency levels are you using for reads/writes?
>>>
>>> Sent from my iPhone
>>>
>>> > On 14 Feb 2017, at 22:27, Josh England <j...@tgsmc.com> wrote:
>>> >
>>> > I'm running Cassandra 3.9 on CentOS 6.7 in a 6-node cluster.  I've got
>>> > a situation where the same query sometimes returns 2 records (correct), and
>>> > sometimes only returns 1 record (incorrect).  I've ruled out the
>>> > application and the indexing since this is reproducible directly from a
>>> > cqlsh shell with a simple select statement.  What is the best way to debug
>>> > what is happening here?
>>> >
>>> > -JE
>>> >
>>>
>>>
>>>
>>>
>
