[ https://issues.apache.org/jira/browse/CASSANDRA-19918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900423#comment-17900423 ]
Jaydeepkumar Chovatia commented on CASSANDRA-19918:
---------------------------------------------------

Here are the CircleCI runs on the _trunk_ PR ([https://github.com/apache/cassandra/pull/3598]):

||Java-11 Pre-commit||Java-11 Separate||Java-17 Pre-commit||Java-17 Separate||
|[link|https://app.circleci.com/pipelines/github/jaydeepkumar1984/cassandra/277/workflows/83353273-6972-4fcb-8c92-4bac4415ee9e]|[link|https://app.circleci.com/pipelines/github/jaydeepkumar1984/cassandra/277/workflows/bb6e98af-451f-474b-99d4-d42d8062a55d]|[link|https://app.circleci.com/pipelines/github/jaydeepkumar1984/cassandra/277/workflows/37b92790-494d-421e-962a-c2eaea3e5574]|[link|https://app.circleci.com/pipelines/github/jaydeepkumar1984/cassandra/277/workflows/48065942-bc62-4a7c-ab28-d72c93afad61]|

Many of the tests are failing. The failures fall into two categories:
* *category-1 dtest cqlsh:* These fail because of a newly added CQL table property. The dtest PR mentioned in the description fixes them and must be submitted at the same time as the trunk PR.
* *category-2 other tests/utests:* These fail with context timeouts, perhaps due to a scarcity of resources in CircleCI.

> Automated Repair Inside Cassandra
> ---------------------------------
>
>                 Key: CASSANDRA-19918
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19918
>             Project: Cassandra
>          Issue Type: Epic
>            Reporter: Jaydeepkumar Chovatia
>            Assignee: Jaydeepkumar Chovatia
>            Priority: Normal
>
> h1. Motivation
> Anti-entropy (Apache Cassandra repairs) is essential for every Apache
> Cassandra cluster to fix data inconsistencies. Frequent data deletions and
> downed nodes are common causes of data inconsistency. A few open-source
> orchestration solutions that trigger repair externally are available, as many
> large users have needed to figure out a scalable repair solution. However,
> multiple custom solutions have led to a lot of confusion in the community.
> Therefore, the repair activity, like Compaction, should be an integral part
> of Cassandra to call it a complete solution.
>
> The proposal is to align on one solution among the existing solutions and
> make it part of core Cassandra. Here is the design for one of the solutions:
>
> Inside Cassandra, there are multiple repairs we would have to schedule:
> 1) Full repair
> 2) Incremental Repair
> 3) Paxos repair
>
> The design of the scheduler should be extensible to multiple repair
> categories with a minimal code change, and all repair types should progress
> automatically with minimal manual intervention.
> Migrating[[1|https://stackoverflow.com/questions/42182984/how-do-i-enable-incremental-repair-on-cassandra-2-1-13]]
> (and rollback) to/from incremental repair has been extremely challenging,
> especially in a large fleet. One of the design principles is to make it
> almost touchless from the operator’s point of view.
> h1. The Scheduler
> Keeping the above motivation in mind, this design starts our journey toward
> having repair orchestration inside Cassandra itself, which will repair the
> entire ring.
> At a high level, a dedicated thread pool is assigned to the repair
> scheduler. The repair scheduler inside Cassandra maintains a new replicated
> table under the distributed _system_distributed_ keyspace. This table
> maintains the repair history for all the nodes, such as when each node was
> last repaired, etc. The scheduler will pick the node(s) that run the repair
> first and continue orchestration to ensure every table and all of its token
> ranges are repaired. The algorithm can also run repairs simultaneously on
> multiple nodes and split the token range into subranges with the necessary
> retry to handle transient failures. Over time, the automatic repair has
> become so reliable that it runs as soon as we start a Cassandra cluster, like
> Compaction, and does not require manual intervention.
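
The "split the token range into subranges, with retry" step described above can be illustrated with a minimal sketch. This is not taken from the PR: the class `RepairSubrangeSketch`, its `TokenRange` type, and the `split` and `repairAll` methods are all hypothetical names, and tokens are simplified to non-wrapping `long` values. The actual scheduler also consults the repair-history table and coordinates across nodes, which this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names come from the Cassandra code base.
class RepairSubrangeSketch {

    // A token range (start, end], simplified to non-wrapping long tokens.
    static final class TokenRange {
        final long start;
        final long end;

        TokenRange(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "(" + start + ", " + end + "]";
        }
    }

    // Splits a range into at most `parts` contiguous subranges of near-equal width.
    // Assumes start <= end, parts >= 1, and that end - start does not overflow a long.
    static List<TokenRange> split(TokenRange range, int parts) {
        List<TokenRange> out = new ArrayList<>();
        long step = (range.end - range.start) / parts;
        long prev = range.start;
        for (int i = 1; i <= parts; i++) {
            // The last subrange always ends exactly at range.end.
            long next = (i == parts) ? range.end : range.start + step * i;
            if (next > prev) { // skip empty subranges when the range is narrow
                out.add(new TokenRange(prev, next));
                prev = next;
            }
        }
        return out;
    }

    // Runs repairFn on each subrange, retrying up to maxAttempts times per subrange.
    // Returns the subranges that still failed after all attempts.
    static List<TokenRange> repairAll(List<TokenRange> subranges, int maxAttempts,
                                      Predicate<TokenRange> repairFn) {
        List<TokenRange> failed = new ArrayList<>();
        for (TokenRange r : subranges) {
            boolean ok = false;
            for (int attempt = 1; attempt <= maxAttempts && !ok; attempt++) {
                ok = repairFn.test(r);
            }
            if (!ok) {
                failed.add(r);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<TokenRange> subranges = split(new TokenRange(0, 100), 4);
        int[] calls = {0};
        // Stand-in for a real repair call: every odd-numbered call fails,
        // simulating one transient failure per subrange.
        List<TokenRange> failed = repairAll(subranges, 3, r -> ++calls[0] % 2 == 0);
        System.out.println("subranges=" + subranges + " calls=" + calls[0] + " failed=" + failed);
    }
}
```

Retrying per subrange rather than per node means a transient failure only repeats a small slice of the work, which is presumably why the design splits ranges before retrying.
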
> Due to this fully automated repair scheduler inside Cassandra, there is no
> dependency on the control plane, significantly reducing our operational
> overhead.
> *CEP:*
> [https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+%28DRAFT%29+Apache+Cassandra+Unified+Repair+Solution]
> h2. Detailed Design Doc
> [Automated Repair in Cassandra|https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0]
> h2. PR (on 4.1.6) (Last active: Sep 2024)
> Many folks are currently using 4.1.6 in production. Hence, the following PR
> on 4.1.6 will make it easier for everybody to review the code, test, etc. If
> the community decides to merge this CEP, then it will land on _trunk_ as
> opposed to {_}4.1{_}.
> [https://github.com/apache/cassandra/pull/3367/]
> h2. PR (on {_}trunk{_}) (Last active: Sep 2024)
> [https://github.com/apache/cassandra/pull/3598]
> h2. PR (dtest) (Last active: Oct 2024)
> [https://github.com/apache/cassandra-dtest/pull/270]
> h2. Discussion over Slack
> [[1]|https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619]
> [[2]|http://cassandra-repair-scheduling-cep37/]

--
This message was sent by Atlassian Jira
(v8.20.10#820010)