[ 
https://issues.apache.org/jira/browse/SOLR-13909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter updated SOLR-13909:
--------------------------------------
    Attachment: SOLR-13909.patch
        Status: Open  (was: Open)

Ok, well – this was a much deeper hole then i originally intended on going 
down, but I'm attaching a patch that i'm still beasting ... but I think it's 
pretty solid.

For the sake of my sanity, I replaced {{CheckBackupStatus}} with a new 
{{BackupStatusChecker}} so that i could gradually phase out the old API/usage 
one instance at a time.

I started with {{TestReplicationHandlerBackup}} since it was the biggest 
consumer of this API, and quickly determined that the only way to make it 
function sanely was to make some (backwards compatible) additions/tweaks to 
{{SnapShooter}} (below).

But as i feared, fixing these broken "always sleep 1000ms multiple times" loops 
around {{CheckBackupStatus}} exposed other race conditions in the test that 
also needed fixed.
----
Tweaks to {{SnapShooter}}:
 * the {{"details"}} of a successful backup now include the {{"directoryName"}} 
used (which is automatically generated for the un-named backup situation
 ** This allows {{BackupStatusChecker}} to offer a 
{{waitForDifferentBackupDir}} when dealing with unnamed backups
 * when using the {{numberToKeep}} option, the {{"details"}} of a backup aren't 
populated until after the "old" backups are deleted
 ** this fixed a test race condition trying to confirm that {{numberToKeep}} 
was respected when taking a bakup
 * the {{"details"}} of _deleting_ a named backup (which is aparently an 
un-documented ReplicationHandler {{"command"}} that we have tests for) now 
include the {{"snapshotName"}} of that backup
 ** which let me also replace the similarly broken {{CheckDeleteBackupStatus}} 
class i found hidden inside {{TestReplicationHandlerBackup}}

Improvements to {{TestReplicationHandlerBackup}}:
 * Eliminated the need for a lot of 
{{Files.newDirectoryStream(Paths.get(master.getDataDir()), "snapshot*")}} code 
patterns
 ** This was a happy side effect of making the backup {{"details"}} include the 
{{"directoryName"}}
 ** we can now assert explicitly that directory name returned exists and is a 
"valid" backup, instead of searching for a glob and iterating over a 
DirectoryStream and hoping we find hte right now.
 ** This also fixed a bug in the test where it assumed 
{{Files.newDirectoryStream(...)}} was going to return files in timestamp order 
when there were multiple backups.

----
The fixes to the other test classes were generally much simpler and more 
straight forward....

...with the notable exception of {{TestRestoreCore.testFailedRestore}} which 
had a "loop" where it was expecting a helper method to throw an 
{{AssertionError}} inside of a {{try/catch(AssertionError)}} block, but if it 
didn't then it called {{fail()}} inside of that same try block ... so no matter 
what the helper method did the test was gong to "pass" ... i fixed it to use 
{{expectThrows()}}

> Everything about CheckBackupStatus is broken
> --------------------------------------------
>
>                 Key: SOLR-13909
>                 URL: https://issues.apache.org/jira/browse/SOLR-13909
>             Project: Solr
>          Issue Type: Test
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Chris M. Hostetter
>            Assignee: Chris M. Hostetter
>            Priority: Major
>         Attachments: SOLR-13909.patch
>
>
> While working on SOLR-13872 I tried to take advantage of the existing 
> {{CheckBackupStatus}} helper class and discovered that just about every 
> aspect of this class is broken and needs fixed:
>  * doesn't use SolrClient, pulls out it's URL to do a bare HTTP request
>  * hardcoded assumption of xml - but doesn't parse it just tries to match 
> regexes against it
>  * almost every usage of this class follows the same broken "loop" pattern 
> that garuntees the test will sleep more then it needs to even after 
> {{CheckBackupStatus}} thinks the backup is a success...
> {code:java}
>     CheckBackupStatus checkBackupStatus = new CheckBackupStatus(...);
>     while (!checkBackupStatus.success) {
>       checkBackupStatus.fetchStatus();
>       Thread.sleep(1000);
>     }
> {code}
>  * the 3 arg constructor is broken both in design and in implementation:
>  ** it appears to be useful for checking that a _new_ backup has succeeded 
> after a {{lastBackupTimestamp}} from some previously successful check
>  ** in reality it only ever reports {{success}} if it's status check 
> indicates the most recent backup has the exact {{.equals()}} time stamp as 
> {{lastBackupTimestamp}}
>  ** *AND THESE TIMESTAMPS ONLY HAVE MINUTE PRECISION*
>  ** As far as i can tell, the only the tests using the 3 arg version ever 
> pass is because of the broken loop pattern:
>  *** they ask for the status so quick, it's either already done (during the 
> same wall clock minute) or it's not done yet and they re-read the "old" 
> status (with the old matching timestamp)
>  *** either way, the test then sleeps for a second giving the "new" backup 
> enough time to actually finish
>  ** AFAICT if the System clock ticks over to a new minute in between these 
> sleep calls, the test is garunteed to loop forever!
> ----
> Everything about this class needs to die and be replaced with something 
> better.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to