+1 on a 5.0 backport

On Wed, Sep 20, 2023 at 2:26 PM Brandon Williams <dri...@gmail.com> wrote:

> I think it could be argued that not retrying messages is a bug; I am
> +1 on including this in 5.0.
>
> Kind Regards,
> Brandon
>
> On Tue, Sep 19, 2023 at 1:16 PM David Capwell <dcapw...@apple.com> wrote:
> >
> > To try to get repair more stable, I added optional retry logic (patch is
> > still in review) to a handful of critical repair verbs.  This patch is
> > disabled by default but allows you to opt-in to retries so ephemeral issues
> > don’t cause a repair to fail after running for a long time (assuming they
> > resolve within the retry window). There are 2 protocol level changes to
> > enable this: VALIDATION_RSP and SYNC_RSP now send an ACK (if the sender
> > doesn’t attach a callback, these ACKs get ignored in all versions; see
> > org.apache.cassandra.net.ResponseVerbHandler#doVerb and
> > Verb.REPAIR_RSP).  Given that we have already forked, I believe we would
> > need to give a waiver to allow this patch due to this change.
> >
> > The patch was written on trunk, but I figured backporting to 5.0 would be
> > rather trivial, and this was brought up during the review, so I am floating
> > this to a wider audience.
> >
> > If you look at the patch you will see that it is very large, but this is
> > only to make testing of repair coordination easier and deterministic. The
> > biggest code changes are:
> >
> > 1) Moving from ActiveRepairService.instance to
> > ActiveRepairService.instance() (this is the main reason so many files were
> > touched; this was needed so unit tests don’t load the whole world)
> > 2) Repair no longer reaches into global space and instead is provided
> > the subsystems needed to perform repair; this change is local to repair code
> >
> > Both of these changes were only for testing as they allow us to simulate
> > 1k repairs in around 15 seconds with 100% deterministic execution.
>
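
For readers skimming the thread, the two structural changes listed above (a static field replaced by an instance() accessor, and repair being handed the subsystems it needs) can be pictured roughly as below. This is only a sketch of the general pattern, not the code in the patch: RepairServiceSketch, MessagingSketch, and sendValidationResponse are hypothetical names invented for illustration, and the real ActiveRepairService has a different shape.

```java
// Hypothetical illustration only; none of these types exist in Cassandra.
final class RepairServiceSketch {
    /** Stand-in for a subsystem that repair previously reached for globally. */
    interface MessagingSketch {
        void send(String verb);
    }

    private final MessagingSketch messaging;

    /** Tests can construct the service directly with a fake subsystem. */
    RepairServiceSketch(MessagingSketch messaging) {
        this.messaging = messaging;
    }

    /** Holder idiom: the default instance is built only on first instance() call. */
    private static final class Holder {
        static final RepairServiceSketch INSTANCE =
            new RepairServiceSketch(verb -> { /* no-op default for the sketch */ });
    }

    /** Accessor method instead of a public static field. */
    static RepairServiceSketch instance() {
        return Holder.INSTANCE;
    }

    /** Uses the injected subsystem rather than a global singleton. */
    void sendValidationResponse() {
        messaging.send("VALIDATION_RSP");
    }
}
```

Because the default instance is created lazily behind instance(), a unit test that builds its own RepairServiceSketch with a fake MessagingSketch never triggers initialization of the global one.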

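As a rough picture of what opt-in retries for a repair message could look like, here is a minimal sketch of a bounded retry loop with exponential backoff. It is not the logic from the patch under review: RetrySketch, callWithRetry, and both parameters are hypothetical, and the sketch bounds retries by attempt count rather than by the retry window mentioned in the thread. Passing maxAttempts = 1 corresponds to retries being disabled.

```java
import java.util.concurrent.Callable;

// Hypothetical illustration only; not the retry implementation from the patch.
final class RetrySketch {
    /**
     * Runs the call, retrying on any exception until it succeeds or
     * maxAttempts attempts have been made. maxAttempts = 1 means no retries.
     */
    static <T> T callWithRetry(Callable<T> call, int maxAttempts, long initialBackoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoff = initialBackoffMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // out of attempts; surface the last failure
                }
                Thread.sleep(backoff); // wait before the next attempt
                backoff *= 2;          // exponential backoff
            }
        }
        throw last;
    }
}
```

With a loop of this kind, an ephemeral failure that clears within the allowed attempts no longer fails the whole operation, which is the behavior the original message describes wanting for long-running repairs.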