[
https://issues.apache.org/jira/browse/SOLR-7609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707396#comment-17707396
]
Alex Deparvu commented on SOLR-7609:
------------------------------------
(pasting my PR comment here as well, for future reference
https://github.com/apache/solr/pull/1504#issuecomment-1492441639)
I have pushed a tentative patch for review and more importantly for discussion.
I have ran this test extensively (hundreds of beast jobs) and it looks like
this is the way to go, all 'add' exceptions are gone once the isLeader flag is
updated. BUT I am not completely convinced this is the correct patch. possibly
something could be made better in the DistributedZkUpdateProcessor area instead
(where a lot of leader calculations are being made but the result is not
correct in this edge case).
While I am still evaluating a patch for DistributedZkUpdateProcessor, I would
love some feedback on the observations so far.
More on the isLeader update: I have decided to only do this in the case where
it can be used to avoid throwing an exception. this was to add a best-effort
way correct an inconsistent state before bailing out, possibly a cleaner fix
would compute the 'isLeader' flag in a way that this approach is no longer
needed.
Together with the patch I have added a check which will not allow additions
with no version. this is now consistent with the delete flows. although I must
admit I am confused by the fact that this check was missing on this execution
path, while the delete path had this from at least 8 years ago.
Other points
* in the current test, there are a few operations (add and deletes) that run
parallel to the split operation. these can fail and this is where the bug is.
because the exceptions are not correctly accounted for, the NPE popped up at a
weird place. I added tracking and assertions on this part too (except for
deletes, see below)
* deletes can also cause exceptions, I don't have a good handle on how to fix
this, so I removed the strict error check. this can be added once the
confidence level is higher and this flow is accounted for. until then I will
focus on the addition parts.
* DistributedZkUpdateProcessor will call setupRequest twice. this is probably a
minor nuisance for some future improvement PR.
* finding the code very hard to read I took the decision to remove one variable
that is not really needed (the updateCommand), this change is purely cosmetic
and I can revert it if it adds too much noise to the current patch.
> ShardSplitTest NPE
> ------------------
>
> Key: SOLR-7609
> URL: https://issues.apache.org/jira/browse/SOLR-7609
> Project: Solr
> Issue Type: Bug
> Reporter: Steven Rowe
> Priority: Minor
> Attachments: ShardSplitTest.NPE.log
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> I'm guessing this is a test bug, but the seed doesn't reproduce for me (tried
> on the same Linux machine it occurred on and on OS X):
> {noformat}
> [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=ShardSplitTest
> -Dtests.method=test -Dtests.seed=9318DDA46578ECF9 -Dtests.slow=true
> -Dtests.locale=is -Dtests.timezone=America/St_Vincent -Dtests.asserts=true
> -Dtests.file.encoding=US-ASCII
> [junit4] ERROR 55.8s J6 | ShardSplitTest.test <<<
> [junit4] > Throwable #1: java.lang.NullPointerException
> [junit4] > at
> __randomizedtesting.SeedInfo.seed([9318DDA46578ECF9:1B4CE27ECB848101]:0)
> [junit4] > at
> org.apache.solr.cloud.ShardSplitTest.logDebugHelp(ShardSplitTest.java:547)
> [junit4] > at
> org.apache.solr.cloud.ShardSplitTest.checkDocCountsAndShardStates(ShardSplitTest.java:438)
> [junit4] > at
> org.apache.solr.cloud.ShardSplitTest.splitByUniqueKeyTest(ShardSplitTest.java:222)
> [junit4] > at
> org.apache.solr.cloud.ShardSplitTest.test(ShardSplitTest.java:84)
> [junit4] > at
> org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsFixedStatement.callStatement(BaseDistributedSearchTestCase.java:960)
> [junit4] > at
> org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsStatement.evaluate(BaseDistributedSearchTestCase.java:935)
> [junit4] > at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Line 547 of {{ShardSplitTest.java}} is:
> {code:java}
> idVsVersion.put(document.getFieldValue("id").toString(),
> document.getFieldValue("_version_").toString());
> {code}
> Skimming the code, it's not obvious what could be null.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]