[ https://issues.apache.org/jira/browse/SOLR-13781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933874#comment-16933874 ]
ASF subversion and git services commented on SOLR-13781:
--------------------------------------------------------

Commit 5a01a8b3622cf7547e71fa43d88235aeb18defa4 in lucene-solr's branch refs/heads/master from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=5a01a8b ]

SOLR-13781: AwaitsFix TestContainerReqHandler.testPackageAPI

> TestContainerReqHandler.testPackageAPI failures imply race condition between update-package and delete-requesthandler
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13781
>                 URL: https://issues.apache.org/jira/browse/SOLR-13781
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Hoss Man
>            Priority: Major
>         Attachments: apache_Lucene-Solr-Tests-8.x_587.testPackageAPI.txt,
>                      egrep-out.apache_Lucene-Solr-Tests-8.x_587.testPackageAPI.txt,
>                      egrep-out.local.log.txt, local.log.txt
>
>
> We're seeing roughly an 8% failure rate from {{TestContainerReqHandler.testPackageAPI}}, with failures occurring on both master and branch_8x, and on various Jenkins servers and various OSes.
>
> All of the failures occur at the same place: a V2 request to {{/node/ext}} to verify that the {{requestHandler}} List is empty after issuing a {{delete-requesthandler: 'bar'}} payload to the {{/cluster}} API. The logs and failure message indicate that the {{'bar'}} request handler still exists even though the assertion does a "sleep/retry" of the verification query 10 times.
>
> While I don't fully understand this test, or the underlying code being tested, I spent a little time digging into the logs from some of these Jenkins failures and comparing them to the logs I see generated when I get a successful test run locally. I think what's happening here - and the reason that {{delete-requesthandler}} seems to "fail" frequently in this test method, but not in {{testSetClusterReqHandler}} - is that the prior {{update-package}} command is still in progress.
>
> After the test code runs an {{update-package}} command, the test executes requests against {{/node/ext/bar}} to verify that the {{version}} has changed as a result of updating the package. I suspect this only looks at the _metadata_ that has changed as a result of the {{update-package}} command and does not actually ensure that the request handler has fully loaded: the logs when this test fails seem to show that the zkCallback threads kicked off by the {{update-package}} command are still running when the zkCallback threads kicked off by the subsequent {{delete-requesthandler}} command are running, and finish *after* them, "re-registering" the handler that was just deleted.
>
> ----
>
> It's not 100% clear to me if this is _just_ a test bug - and it should be monitoring something else to know when the request handler has finished loading - or if this indicates a broader flaw in the design of how commands like {{add-package}} / {{update-package}} / {{add-requesthandler}} / {{delete-requesthandler}} should interact if/when they occur in close temporal proximity.
>
> (i.e., if there are zkCallback watchers loading classes and initializing objects as a result of cluster property changes, shouldn't there be some sort of linearization/synchronization logic to ensure that they get executed in the same order on all the nodes in the cluster?)
>
> ----
>
> More detail and log file attachments to follow...
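
One way to picture the linearization question raised in the description is the minimal Java sketch below. It is not Solr code: the class {{VersionedHandlerRegistry}} and its methods are hypothetical names, and it assumes each callback knows the version of the cluster state it read (for example a ZooKeeper znode version) and carries a full snapshot of the handlers at that version. Under those assumptions, a slow callback from an earlier {{update-package}} that finishes after the {{delete-requesthandler}} callback is rejected as stale instead of re-registering the handler.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration only; none of these names exist in Solr. */
class VersionedHandlerRegistry {
    // Handler name -> package version, as of the last snapshot applied.
    private final Map<String, String> handlers = new HashMap<>();
    // Version of the newest snapshot applied so far (-1 means none yet).
    private int appliedVersion = -1;

    /**
     * Applies a full snapshot of the handler config read at the given version.
     * Returns false, changing nothing, if a newer or equal snapshot was already
     * applied, so callbacks that finish out of order cannot undo later changes.
     */
    synchronized boolean apply(int version, Map<String, String> snapshot) {
        if (version <= appliedVersion) {
            return false; // stale callback: drop it
        }
        handlers.clear();
        handlers.putAll(snapshot);
        appliedVersion = version;
        return true;
    }

    /** Returns a sorted copy of the currently registered handler names. */
    synchronized Set<String> names() {
        return new TreeSet<>(handlers.keySet());
    }
}
```

In this sketch, if the delete is observed at version 3 and applied first, a late call such as {{apply(2, ...)}} from the earlier update returns false and {{names()}} stays empty. Whether the real watchers have a usable version to compare against is exactly the open question in this issue.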