Just to clarify a little further: it's true that read repair queries are performed at CL ALL, but this is slightly different to a regular, user-initiated query at that CL.

Say you have RF=5 and you issue a read at CL ALL. The coordinator will send requests to all 5 replicas and block until it receives a response from each (or a timeout occurs) before replying to the client. This is the straightforward, intuitive case.

If instead you read at CL QUORUM, the number of replicas required for the CL is 3, so the coordinator only contacts 3 nodes. If speculative retry is activated, an additional replica is added to the initial set. The coordinator will still only wait for 3 of the 4 responses before proceeding, but if a digest mismatch occurs the read repair queries are sent to all 4. It's this follow-up query that the coordinator executes at CL ALL, i.e. it requires all 4 replicas to respond to the read repair query before merging their results to work out the canonical, latest data. Note that the number of replicas queried/required for read repair is different than if the client had actually requested a read at CL ALL (here it's 4, not 5); it's the behaviour of waiting for all *contacted* replicas to respond that is significant.

There are additional considerations when constructing that initial replica set (which you can follow in o.a.c.service.AbstractReadExecutor::getReadExecutor), involving the table's read_repair_chance, dclocal_read_repair_chance and speculative_retry options. The main gotcha is global read repair (via read_repair_chance), which will trigger cross-DC repairs at CL ALL in the case of a digest mismatch, even if the requested CL is DC-local.
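To make the two waiting phases concrete, here is a minimal, self-contained Java sketch (Java 16+ for records). To be clear, this is not Cassandra's code: the Response/fetch names and the r1-r4 replicas are invented stand-ins. It just models "block for blockFor responses first; on a digest mismatch, block for every contacted replica":

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Toy model of the two waiting phases described above. Every name here
// (Response, fetch, r1..r4) is invented for illustration; none of this
// is Cassandra's internal API.
public class ReadPathSketch {

    record Response(String replica, long writeTimestamp, String value) {}

    // Simulated network read. r2 holds a stale value so a digest
    // mismatch is guaranteed and phase 2 always runs in this demo.
    static CompletableFuture<Response> fetch(String replica) {
        boolean stale = replica.equals("r2");
        return CompletableFuture.supplyAsync(() ->
                new Response(replica, stale ? 1L : 2L, stale ? "old" : "new"));
    }

    public static void main(String[] args) throws Exception {
        int blockFor = 3;            // CL QUORUM with RF=5
        boolean speculative = true;  // speculative retry adds one target

        // Initial contacted set: blockFor replicas, plus one speculative
        // extra. Note: 4 nodes, not the full RF of 5.
        List<String> contacted = new ArrayList<>(List.of("r1", "r2", "r3"));
        if (speculative) contacted.add("r4");

        List<CompletableFuture<Response>> inFlight =
                contacted.stream().map(ReadPathSketch::fetch).toList();

        // Phase 1: block only until blockFor responses are in.
        // (Simplified: a real coordinator takes the first blockFor
        // responses to arrive, in any order.)
        List<Response> first = new ArrayList<>();
        for (CompletableFuture<Response> f : inFlight) {
            first.add(f.get());
            if (first.size() == blockFor) break;
        }

        boolean digestMismatch =
                first.stream().map(Response::value).distinct().count() > 1;

        if (!digestMismatch) {
            System.out.println("fast path: " + first.get(0).value());
            return;
        }

        // Phase 2: the read repair round must hear back from EVERY
        // contacted replica (all 4 here, not all 5) before merging;
        // this is the "CL ALL over contacted replicas" behaviour.
        Response newest = inFlight.stream()
                .map(CompletableFuture::join)
                .max(Comparator.comparingLong(Response::writeTimestamp))
                .orElseThrow();
        System.out.println("repaired read: " + newest.value());
    }
}

Swap r4 for a node that never responds and phase 2 blocks until the timeout even though phase 1 succeeded, which is exactly the failure mode described in the thread below.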
On Sun, Aug 28, 2016 at 11:55 AM, Ben Slater <ben.sla...@instaclustr.com> wrote:

> In case anyone else is interested - we figured this out. When C* decides
> it needs to do a repair based on a digest mismatch from the initial reads
> for the consistency level, it does actually try to do a read at CL=ALL in
> order to get the most up-to-date data to use for the repair.
>
> This led to an interesting issue in our case where we had one node in an
> RF3 cluster down for maintenance (to correct data that became corrupted
> due to a severe write overload) and started getting occasional “timeout
> during read query at consistency LOCAL_QUORUM” failures. We believe this
> was due to the case where data for a read was only available on one of
> the two up replicas, which then triggered an attempt to repair and a
> failed read at CL=ALL. It seems that CASSANDRA-7947 (a while ago) changed
> the behaviour so that C* reports a failure at the originally requested
> level even when it was actually the attempted repair read at CL=ALL which
> could not read sufficient replicas - a bit confusing (although I can also
> see how getting CL=ALL errors when you thought you were reading at QUORUM
> or ONE would be confusing).
>
> Cheers
> Ben
>
> On Sun, 28 Aug 2016 at 10:52 kurt Greaves <k...@instaclustr.com> wrote:
>
>> Looking at the wiki for the read path
>> (http://wiki.apache.org/cassandra/ReadPathForUsers), in the bottom
>> diagram for reading with a read repair, it states the following when
>> "reading from all replica nodes" after there is a hash mismatch:
>>
>>> If hashes do not match, do conflict resolution. First step is to read
>>> all data from all replica nodes excluding the fastest replica (since
>>> CL=ALL)
>>
>> In the bottom left of the diagram it also states:
>>
>>> In this example:
>>> RF>=2
>>> CL=ALL
>>
>> The (since CL=ALL) implies that the CL for the read during the read
>> repair is based off the CL of the query. However I don't think that
>> makes sense at other CLs.
>> Anyway, I just want to clarify what CL the read for the read repair
>> occurs at for cases where the overall query CL is not ALL.
>>
>> Thanks,
>> Kurt.
>>
>> --
>> Kurt Greaves
>> k...@instaclustr.com
>> www.instaclustr.com
>
> --
> ————————
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798