An update on the failing test. The failure comes from documents having null version field from the add operations running in parallel with the split shard operation.
Digging into this a bit more (and running the test a very large number of times) I think I identified 3 problematic cases that can cause failures. will list them in order of importance. First, documents persisted with null version. I think this is by far the most important one to fix,and my PR is focused on closing this gap. The race condition seems to be when the new shard slice will receive pending updates from the old shard leader and persist them 'as is', but the trouble is some of these have zero version. so the docs will get persisted without version, breaking the test's consistency checks. My tentative fix is to allow the new slice to act as a master and attach a version. I also added a version integrity check that will not allow leader-received documents to be persisted without a version. I am considering moving this behind some sort of system property/flag to disable on demand if it turns out this is too strict. As far as testing is concerned: there was no need for more tests, the current one reproduces this issue easily (unfortunately it did not apply stricter assertions regarding errors on the parallel operations), so I made it a tiny bit stricter, in the hope that the test fails early and with a clearer message. I am looking for some thoughts and feedback on the PR itself but also on understanding the impact. This smells like a data corruption situation, but being reduced to the split shard operation, so maybe the window is not very big. please correct me if I am overstating the problem. Second. Another race condition happens on the deletes as well. but due to the first exception it is invisible. this one is milder, in the sense where deletes will be rejected, so less data corruption opportunity. here as well there is some opportunity to close this gap. I tried to do it as a part of this PR but I hit some weirdness, so instead of sitting on this more, I would prefer to have it reviewed and approach deletes in a subsequent PR. Third, I keep seeing yet another exception "Request says it is coming from parent shard leader but we are in active state", which seems to be more buffered requests on the old leader being sent over. I did not get any chance to look at this, so cannot comment on the impact yet. If anyone has interest (or some knowledge) in this area, please take a look at the PR and let me know your thoughts. PR https://github.com/apache/solr/pull/1504 best alex On Thu, Mar 30, 2023 at 8:06 AM Kevin Risden <kris...@apache.org> wrote: > I don't have any helpful pointers as to why this might be happening - but I > do want to say thanks for digging in and finding out the cause as well as > finding some related old jiras. Its helpful to even just make small steps > forward. > > Kevin Risden > > > On Wed, Mar 29, 2023 at 1:21 PM Alex Deparvu <stilla...@apache.org> wrote: > > > Hi, > > > > Following David's recent email about the failing tests (and obviously not > > knowing any better) I started to look at the ShardSplitTest failures. > > > > The easy part was turning the current NPEs into an actual assertion. The > > hard part is that the documents seem to sometimes be missing the > _version_ > > field which causes the test to fail. > > If anyone has more pointers into where the cause might be, please let me > > know I would be happy to dig further down this rabbit hole. > > > > This is what I have so far https://github.com/apache/solr/pull/1504 > > > > best, > > alex > > >