Just to clarify a little further, it's true that read repair queries are
performed at CL ALL, but this is slightly different to a regular,
user-initiated query at that CL.

Say you have RF=5 and you issue a read at CL ALL, the coordinator will send
requests to all 5 replicas and block until it receives a response from each
(or a timeout occurs) before replying to the client. This is the
straightforward and intuitive case.
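
For reference, the number of responses the coordinator blocks for falls out
of the CL and the RF alone. Here's a rough sketch of that mapping (a
simplification, ignoring DC-aware levels like LOCAL_QUORUM/EACH_QUORUM, and
not the actual ConsistencyLevel.blockFor code):

    class BlockForSketch
    {
        // How many responses the coordinator waits for, given CL and RF.
        static int blockFor(String cl, int rf)
        {
            switch (cl)
            {
                case "ONE":    return 1;
                case "QUORUM": return rf / 2 + 1;  // RF=5 -> 3
                case "ALL":    return rf;          // RF=5 -> 5
                default: throw new IllegalArgumentException(cl);
            }
        }

        public static void main(String[] args)
        {
            System.out.println(blockFor("ALL", 5));     // 5
            System.out.println(blockFor("QUORUM", 5));  // 3
        }
    }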

If instead you read at CL QUORUM, the number of replicas required for the CL is 3,
so the coordinator only contacts 3 nodes. In the case where a speculative
retry is activated, an additional replica is added to the initial set. The
coordinator will still only wait for 3 out of the 4 responses before
proceeding, but if a digest mismatch occurs the read repair queries are
sent to all 4. It's this follow-up query that the coordinator executes at
CL ALL, i.e. it requires all 4 replicas to respond to the read repair query
before merging their results to figure out the canonical, latest data.
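
To make that concrete, here's a minimal sketch of the two waiting rules (the
names are made up for illustration; this is not the actual coordinator code):

    import java.util.List;

    class ReadRepairWaitSketch
    {
        // Initial read: block for only as many responses as the CL requires,
        // even when speculative retry added an extra replica to the contacted set.
        static void initialRead(List<String> contacted, int blockForCL)
        {
            await(contacted, blockForCL);        // e.g. 4 contacted, wait for 3
        }

        // Read repair round after a digest mismatch: full data is requested from
        // every contacted replica and the coordinator waits for all of them, i.e.
        // CL ALL over the contacted set rather than over the whole replica set.
        static void readRepair(List<String> contacted)
        {
            await(contacted, contacted.size());  // e.g. 4 contacted, wait for 4
        }

        // Placeholder for sending the read commands and blocking on the callback.
        static void await(List<String> replicas, int blockFor)
        {
            System.out.printf("sent to %d replicas, blocking for %d%n",
                              replicas.size(), blockFor);
        }

        public static void main(String[] args)
        {
            List<String> contacted = List.of("r1", "r2", "r3", "r4");
            initialRead(contacted, 3);  // QUORUM with RF=5
            readRepair(contacted);      // the follow-up repair read waits for all 4
        }
    }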

You can see that the number of replicas queried/required for read repair is
different from what it would be if the client actually requested a read at
CL ALL (i.e. here it's 4, not 5); it's the behaviour of waiting for all
*contacted* replicas to respond that is significant here.

There are additional considerations when constructing that initial replica
set (which you can follow in
o.a.c.service.AbstractReadExecutor::getReadExecutor), involving the table's
read_repair_chance, dclocal_read_repair_chance and speculative_retry
options. The main gotcha is global read repair (via read_repair_chance),
which will trigger cross-DC repairs at CL ALL in the case of a digest
mismatch, even if the requested CL is DC-local.
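
To illustrate the chance-based part of that decision, it boils down to a
single random roll per read, roughly along these lines (a simplified sketch
of the behaviour described above, not the actual getReadExecutor code):

    import java.util.concurrent.ThreadLocalRandom;

    class ReadRepairChanceSketch
    {
        enum Decision { NONE, DC_LOCAL, GLOBAL }

        // read_repair_chance and dclocal_read_repair_chance are the table options
        // mentioned above. GLOBAL pulls replicas from all DCs into the read, so a
        // digest mismatch leads to a cross-DC repair round at CL ALL.
        static Decision decide(double readRepairChance, double dcLocalReadRepairChance)
        {
            double roll = ThreadLocalRandom.current().nextDouble();
            if (readRepairChance > roll)
                return Decision.GLOBAL;
            if (dcLocalReadRepairChance > roll)
                return Decision.DC_LOCAL;
            return Decision.NONE;
        }

        public static void main(String[] args)
        {
            // e.g. read_repair_chance = 0.1 -> roughly 10% of reads repair globally
            System.out.println(decide(0.1, 0.0));
        }
    }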


On Sun, Aug 28, 2016 at 11:55 AM, Ben Slater <ben.sla...@instaclustr.com>
wrote:

> In case anyone else is interested - we figured this out. When C* decides
> it needs to do a repair based on a digest mismatch from the initial reads
> for the consistency level it does actually try to do a read at CL=ALL in
> order to get the most up to date data to use to repair.
>
> This led to an interesting issue in our case where we had one node in an
> RF3 cluster down for maintenance (to correct data that became corrupted due
> to a severe write overload) and started getting occasional “timeout during
> read query at consistency LOCAL_QUORUM” failures. We believe this is due to
> the case where data for a read was only available on one of the two up
> replicas which then triggered an attempt to repair and a failed read at
> CL=ALL. It seems that CASSANDRA-7947 (a while ago) changed the behaviour so
> that C* reports a failure at the originally requested level even when it was
> actually the attempted repair read at CL=ALL which could not read
> sufficient replicas - a bit confusing (although I can also see how getting
> CL=ALL errors when you thought you were reading at QUORUM or ONE would be
> confusing).
>
> Cheers
> Ben
>
> On Sun, 28 Aug 2016 at 10:52 kurt Greaves <k...@instaclustr.com> wrote:
>
>> Looking at the wiki for the read path
>> (http://wiki.apache.org/cassandra/ReadPathForUsers), in the bottom diagram for reading with a
>> read repair, it states the following when "reading from all replica nodes"
>> after there is a hash mismatch:
>>
>> If hashes do not match, do conflict resolution. First step is to read all
>>> data from all replica nodes excluding the fastest replica (since CL=ALL)
>>>
>>
>>  In the bottom left of the diagram it also states:
>>
>>> In this example:
>>>
>> RF>=2
>>>
>> CL=ALL
>>>
>>
>> The (since CL=ALL) implies that the CL for the read during the read
>> repair is based off the CL of the query. However I don't think that makes
>> sense at other CLs. Anyway, I just want to clarify what CL the read for the
>> read repair occurs at for cases where the overall query CL is not ALL.
>>
>> Thanks,
>> Kurt.
>>
>> --
>> Kurt Greaves
>> k...@instaclustr.com
>> www.instaclustr.com
>>
> --
> ————————
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798
>
