IR patch is up for review https://issues.apache.org/jira/browse/CASSANDRA-18962
> On Oct 24, 2023, at 3:15 PM, David Capwell <dcapw...@apple.com> wrote: > > I sat down to add IR messages to the mix… given how positive the feedback was > for other repair messages I assume people are still ok with this new IR work > going to 5.0 as well, if not please let me know here (will send a patch > tomorrow). > > Once I send out the patch 100% of repair messages have retry logic > >> On Sep 26, 2023, at 12:08 PM, David Capwell <dcapw...@apple.com> wrote: >> >> Thanks all for the feedback! The patch has 2 +1s on trunk and back ported >> to 5.0, making sure it’s stable now; I plan to merge early this week. >> >>> On Sep 21, 2023, at 2:07 PM, Ekaterina Dimitrova <e.dimitr...@gmail.com> >>> wrote: >>> >>> +1 from me too. Moreover, this work has started as part of the test efforts >>> and identifying weak points during the 4.0 testing, if I recall correctly. >>> 5.0 sounds like a good place to land. Thank you David and everyone else >>> involved for your efforts! >>> >>> On Thu, 21 Sep 2023 at 1:01, Berenguer Blasi <berenguerbl...@gmail.com >>> <mailto:berenguerbl...@gmail.com>> wrote: >>>> +1 I agree with Brandon. It's more like a bug imo. >>>> On 20/9/23 21:42, Caleb Rackliffe wrote: >>>>> +1 on a 5.0 backport >>>>> >>>>> On Wed, Sep 20, 2023 at 2:26 PM Brandon Williams <dri...@gmail.com >>>>> <mailto:dri...@gmail.com>> wrote: >>>>>> I think it could be argued that not retrying messages is a bug, I am >>>>>> +1 on including this in 5.0. >>>>>> >>>>>> Kind Regards, >>>>>> Brandon >>>>>> >>>>>> On Tue, Sep 19, 2023 at 1:16 PM David Capwell <dcapw...@apple.com >>>>>> <mailto:dcapw...@apple.com>> wrote: >>>>>> > >>>>>> > To try to get repair more stable, I added optional retry logic (patch >>>>>> > is still in review) to a handful of critical repair verbs. This patch >>>>>> > is disabled by default but allows you to opt-in to retries so >>>>>> > ephemeral issues don’t cause a repair to fail after running for a long >>>>>> > time (assuming they resolve within the retry window). There are 2 >>>>>> > protocol level changes to enable this: VALIDATION_RSP and SYNC_RSP now >>>>>> > send an ACK (if the sender doesn’t attach a callback, these ACKs get >>>>>> > ignored in all versions; see org.apache.cassandra.net >>>>>> > <http://org.apache.cassandra.net/>.ResponseVerbHandler#doVerb and >>>>>> > Verb.REPAIR_RSP). Given that we have already forked, I believe we >>>>>> > would need to give a waiver to allow this patch due to this change. >>>>>> > >>>>>> > The patch was written on trunk, but figured back porting 5.0 would be >>>>>> > rather trivial and this was brought up during the review, so floating >>>>>> > this to a wider audience. >>>>>> > >>>>>> > If you look at the patch you will see that it is very large, but this >>>>>> > is only to make testing of repair coordination easier and >>>>>> > deterministic, the biggest code changes are: >>>>>> > >>>>>> > 1) Moving from ActiveRepairService.instance to >>>>>> > ActiveRepairService.instance() (this is the main reason so many files >>>>>> > were touched; this was needed so unit tests don’t load the whole world) >>>>>> > 2) Repair no longer reaches into global space and instead is provided >>>>>> > the subsystems needed to perform repair; this change is local to >>>>>> > repair code >>>>>> > >>>>>> > Both of these changes were only for testing as they allow us to >>>>>> > simulate 1k repairs in around 15 seconds with 100% deterministic >>>>>> > execution. >> >