[ https://issues.apache.org/jira/browse/SOLR-13909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris M. Hostetter updated SOLR-13909: -------------------------------------- Attachment: SOLR-13909.patch Status: Open (was: Open) Ok, well – this was a much deeper hole then i originally intended on going down, but I'm attaching a patch that i'm still beasting ... but I think it's pretty solid. For the sake of my sanity, I replaced {{CheckBackupStatus}} with a new {{BackupStatusChecker}} so that i could gradually phase out the old API/usage one instance at a time. I started with {{TestReplicationHandlerBackup}} since it was the biggest consumer of this API, and quickly determined that the only way to make it function sanely was to make some (backwards compatible) additions/tweaks to {{SnapShooter}} (below). But as i feared, fixing these broken "always sleep 1000ms multiple times" loops around {{CheckBackupStatus}} exposed other race conditions in the test that also needed fixed. ---- Tweaks to {{SnapShooter}}: * the {{"details"}} of a successful backup now include the {{"directoryName"}} used (which is automatically generated for the un-named backup situation ** This allows {{BackupStatusChecker}} to offer a {{waitForDifferentBackupDir}} when dealing with unnamed backups * when using the {{numberToKeep}} option, the {{"details"}} of a backup aren't populated until after the "old" backups are deleted ** this fixed a test race condition trying to confirm that {{numberToKeep}} was respected when taking a bakup * the {{"details"}} of _deleting_ a named backup (which is aparently an un-documented ReplicationHandler {{"command"}} that we have tests for) now include the {{"snapshotName"}} of that backup ** which let me also replace the similarly broken {{CheckDeleteBackupStatus}} class i found hidden inside {{TestReplicationHandlerBackup}} Improvements to {{TestReplicationHandlerBackup}}: * Eliminated the need for a lot of {{Files.newDirectoryStream(Paths.get(master.getDataDir()), "snapshot*")}} code patterns ** This was a happy side effect of making the backup {{"details"}} include the {{"directoryName"}} ** we can now assert explicitly that directory name returned exists and is a "valid" backup, instead of searching for a glob and iterating over a DirectoryStream and hoping we find hte right now. ** This also fixed a bug in the test where it assumed {{Files.newDirectoryStream(...)}} was going to return files in timestamp order when there were multiple backups. ---- The fixes to the other test classes were generally much simpler and more straight forward.... ...with the notable exception of {{TestRestoreCore.testFailedRestore}} which had a "loop" where it was expecting a helper method to throw an {{AssertionError}} inside of a {{try/catch(AssertionError)}} block, but if it didn't then it called {{fail()}} inside of that same try block ... so no matter what the helper method did the test was gong to "pass" ... i fixed it to use {{expectThrows()}} > Everything about CheckBackupStatus is broken > -------------------------------------------- > > Key: SOLR-13909 > URL: https://issues.apache.org/jira/browse/SOLR-13909 > Project: Solr > Issue Type: Test > Security Level: Public(Default Security Level. Issues are Public) > Reporter: Chris M. Hostetter > Assignee: Chris M. Hostetter > Priority: Major > Attachments: SOLR-13909.patch > > > While working on SOLR-13872 I tried to take advantage of the existing > {{CheckBackupStatus}} helper class and discovered that just about every > aspect of this class is broken and needs fixed: > * doesn't use SolrClient, pulls out it's URL to do a bare HTTP request > * hardcoded assumption of xml - but doesn't parse it just tries to match > regexes against it > * almost every usage of this class follows the same broken "loop" pattern > that garuntees the test will sleep more then it needs to even after > {{CheckBackupStatus}} thinks the backup is a success... > {code:java} > CheckBackupStatus checkBackupStatus = new CheckBackupStatus(...); > while (!checkBackupStatus.success) { > checkBackupStatus.fetchStatus(); > Thread.sleep(1000); > } > {code} > * the 3 arg constructor is broken both in design and in implementation: > ** it appears to be useful for checking that a _new_ backup has succeeded > after a {{lastBackupTimestamp}} from some previously successful check > ** in reality it only ever reports {{success}} if it's status check > indicates the most recent backup has the exact {{.equals()}} time stamp as > {{lastBackupTimestamp}} > ** *AND THESE TIMESTAMPS ONLY HAVE MINUTE PRECISION* > ** As far as i can tell, the only the tests using the 3 arg version ever > pass is because of the broken loop pattern: > *** they ask for the status so quick, it's either already done (during the > same wall clock minute) or it's not done yet and they re-read the "old" > status (with the old matching timestamp) > *** either way, the test then sleeps for a second giving the "new" backup > enough time to actually finish > ** AFAICT if the System clock ticks over to a new minute in between these > sleep calls, the test is garunteed to loop forever! > ---- > Everything about this class needs to die and be replaced with something > better. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org