Re: ShardSplitTest flakiness investigation

Alex Deparvu Mon, 03 Apr 2023 09:49:34 -0700

An update on the failing test.

The failure comes from documents having null version field from the add
operations running in parallel with the split shard operation.

Digging into this a bit more (and running the test a very large number of
times) I think I identified 3 problematic cases that can cause failures.
will list them in order of importance.

First, documents persisted with null version. I think this is by far the
most important one to fix,and my PR is focused on closing this gap.
The race condition seems to be when the new shard slice will receive
pending updates from the old shard leader and persist them 'as is', but the
trouble is some of these have zero version. so the docs will get persisted
without version, breaking the test's consistency checks. My tentative fix
is to allow the new slice to act as a master and attach a version.
I also added a version integrity check that will not allow leader-received
documents to be persisted without a version. I am considering moving this
behind some sort of system property/flag to disable on demand if it turns
out this is too strict.
As far as testing is concerned: there was no need for more tests, the
current one reproduces this issue easily (unfortunately it did not apply
stricter assertions regarding errors on the parallel operations), so I made
it a tiny bit stricter, in the hope that the test fails early and with a
clearer message.
I am looking for some thoughts and feedback on the PR itself but also on
understanding the impact. This smells like a data corruption situation, but
being reduced to the split shard operation, so maybe the window is not very
big. please correct me if I am overstating the problem.

Second. Another race condition happens on the deletes as well. but due to
the first exception it is invisible. this one is milder, in the sense where
deletes will be rejected, so less data corruption opportunity. here as well
there is some opportunity to close this gap. I tried to do it as a part of
this PR but I hit some weirdness, so instead of sitting on this more, I
would prefer to have it reviewed and approach deletes in a subsequent PR.

Third, I keep seeing yet another exception "Request says it is coming from
parent shard leader but we are in active state", which seems to be more
buffered requests on the old leader being sent over. I did not get any
chance to look at this, so cannot comment on the impact yet.

If anyone has interest (or some knowledge) in this area, please take a look
at the PR and let me know your thoughts.
PR https://github.com/apache/solr/pull/1504

best
alex

On Thu, Mar 30, 2023 at 8:06 AM Kevin Risden <kris...@apache.org> wrote:

> I don't have any helpful pointers as to why this might be happening - but I
> do want to say thanks for digging in and finding out the cause as well as
> finding some related old jiras. Its helpful to even just make small steps
> forward.
>
> Kevin Risden
>
>
> On Wed, Mar 29, 2023 at 1:21 PM Alex Deparvu <stilla...@apache.org> wrote:
>
> > Hi,
> >
> > Following David's recent email about the failing tests (and obviously not
> > knowing any better) I started to look at the ShardSplitTest failures.
> >
> > The easy part was turning the current NPEs into an actual assertion. The
> > hard part is that the documents seem to sometimes be missing the
> _version_
> > field which causes the test to fail.
> > If anyone has more pointers into where the cause might be, please let me
> > know I would be happy to dig further down this rabbit hole.
> >
> > This is what I have so far https://github.com/apache/solr/pull/1504
> >
> > best,
> > alex
> >
>

Re: ShardSplitTest flakiness investigation

Reply via email to