[ 
https://issues.apache.org/jira/browse/SOLR-10109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-10109:
----------------------------
    Attachment: SoftAutoCommitTest.2017-02-25.2938.log.txt
                SoftAutoCommitTest.2017-05-02.3983.log.txt


[~markrmil...@gmail.com]: Since your commit in feb, the number of jenkins 
failures for SoftAutoCommitTest has dropped noticeably -- well done!

There have only been 2 jenkins failures, since then -- I don't have the full 
logs, but i'm attaching the bodies of the jenkins emails.  In both cases, the 
failure was the same assertion in {{testSoftAndHardCommitMaxTimeRapidAdds}}...

{noformat}
java.lang.AssertionError: 2: hard wasn't fast enough
        at 
__randomizedtesting.SeedInfo.seed([FF159C05808D08C6:A300323C6B0F49BE]:0)
        at org.junit.Assert.fail(Assert.java:93)
        at org.junit.Assert.assertTrue(Assert.java:43)
        at org.junit.Assert.assertNotNull(Assert.java:526)
        at 
org.apache.solr.update.SoftAutoCommitTest.testSoftAndHardCommitMaxTimeRapidAdds(SoftAutoCommitTest.java:382)
{noformat}

If you look at the timing info from the logs, and do some simple math, the 
problem seems clear...

{noformat}
   [junit4]   2> 2824567 INFO  
(TEST-SoftAutoCommitTest.testSoftAndHardCommitMaxTimeRapidAdds-seed#[FF159C05808D08C6])
 [    ] o.a.s.SolrTestCaseJ4 ###Starting testSoftAndHardCommitMaxTimeRapidAdds
   [junit4]   2> 2832001 INFO  
(TEST-SoftAutoCommitTest.testSoftAndHardCommitMaxTimeRapidAdds-seed#[FF159C05808D08C6])
 [    ] o.a.s.u.p.LogUpdateProcessorFactory [collection1]  webapp=null 
path=null params={}{add=[5000 (1566326219147837440)]} 0 2424
   [junit4]   2> 2832002 INFO  
(TEST-SoftAutoCommitTest.testSoftAndHardCommitMaxTimeRapidAdds-seed#[FF159C05808D08C6])
 [    ] o.a.s.u.p.LogUpdateProcessorFactory [collection1]  webapp=null 
path=null params={}{add=[5001 (1566326221690634240)]} 0 0
   [junit4]   2> 2832003 INFO  
(TEST-SoftAutoCommitTest.testSoftAndHardCommitMaxTimeRapidAdds-seed#[FF159C05808D08C6])
 [    ] o.a.s.u.p.LogUpdateProcessorFactory [collection1]  webapp=null 
path=null params={}{add=[5002 (1566326221690634241)]} 0 0
   [junit4]   2> 2832003 INFO  
(TEST-SoftAutoCommitTest.testSoftAndHardCommitMaxTimeRapidAdds-seed#[FF159C05808D08C6])
 [    ] o.a.s.u.p.LogUpdateProcessorFactory [collection1]  webapp=null 
path=null params={}{add=[5003 (1566326221691682816)]} 0 0
   [junit4]   2> 2832004 INFO  
(TEST-SoftAutoCommitTest.testSoftAndHardCommitMaxTimeRapidAdds-seed#[FF159C05808D08C6])
 [    ] o.a.s.u.p.LogUpdateProcessorFactory [collection1]  webapp=null 
path=null params={}{add=[5004 (1566326221691682817)]} 0 0
   [junit4]   2> 2832506 INFO  (commitScheduler-7474-thread-1) [    ] 
o.a.s.u.DirectUpdateHandler2 start 
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false}
   [junit4]   2> 2832509 INFO  (commitScheduler-7474-thread-1) [    ] 
o.a.s.s.SolrIndexSearcher Opening [Searcher@1d85065c[collection1] main]
   [junit4]   2> 2832509 INFO  (commitScheduler-7474-thread-1) [    ] 
o.a.s.u.DirectUpdateHandler2 end_commit_flush
   [junit4]   2> 2832511 INFO  (searcherExecutor-7470-thread-1) [    ] 
o.a.s.c.SolrCore [collection1] Registered new searcher 
Searcher@1d85065c[collection1] 
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_1(7.0.0):c1)
 Uninverting(_2(7.0.0):c1) Uninverting(_3(7.0.0):c1) 
Uninverting(_4(7.0.0):c5)))}
   [junit4]   2> 2833207 INFO  (commitScheduler-7473-thread-1) [    ] 
o.a.s.u.DirectUpdateHandler2 start 
commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
{noformat}

If you compare the timstamp when the first doc *finished* being added (2832001) 
with the timstamp of when the hard commit *started* (2833207) then that's 
almost exactly the 1200ms hardCommitWaitMillis ({{2833207 - 2832001 = 1206}}) 
and it's understandable that only 1 hard commit happened since all of the docs 
were added in under 12 seconds.  But the test requires the {{fast5start}} 
timestamp _before_ adding the first document, and as we can see from the 
{{add=[5000 (1566326219147837440)]} 0 2424}} log line that request was stalled 
for 2.4seconds -- which probably explains why the test thread thinks the 5 adds 
took longer then 12seconds and expects more then 1 hardCommit.  (it's not clear 
since the test does actaully log fast5start or fast5end)


I think the right fix is to do something like this...

{code}
// record pre/post times for start/stop to account for flux in when add
// recorded by commit tracker for first/last doc in the set
final long preStart = System.nanoTime();
assertU(adoc("id",5000));
final long postStart = System.nanoTime();
assertU(adoc("id",5001));
assertU(adoc("id",5002));
assertU(adoc("id",5003));
final long preEnd = System.nanoTime();
assertU(adoc("id",5004));
final long postEnd = System.nanoTime();

final long minTime = preEnd - postStart;
final long maxTime = postEnd - preStart;

// NOTE: explicitly useing truncated division of longs to round down for "min" 
values
final long minExpectedSoft = Math.max(1, minTime / softCommitWaitMillis);
final long minExpectedHard = Math.max(1, minTime / hardCommitWaitMillis);

// round up for "max" values
final long maxExpectedSoft = Math.max(1, Math.ceil(((double)maxTime) / 
softCommitWaitMillis);
final long maxExpectedHard = Math.max(1, Math.ceil(((double)maxTime) / 
hardCommitWaitMillis);

// TODO....
// now assert we can poll each type for the minExpectedFoo and not get nulls
// then accept that we *might* have up to 'maxExpectedFoo-minExpectedFoo' 
non-null values as well
{code}


Unless I'm missing something, that would preserve the spirit of the test, and 
account for possible slow machine lag as well.  I'll start working on a patch 
(and include some nocommit random sleeps to help test for the possibility of 
this lag) but in the meantime if you see any flaws in this approach please ping 
me ASAP.




> SoftAutoCommitTest is too fragile.
> ----------------------------------
>
>                 Key: SOLR-10109
>                 URL: https://issues.apache.org/jira/browse/SOLR-10109
>             Project: Solr
>          Issue Type: Test
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>         Attachments: SoftAutoCommitTest.2017-02-25.2938.log.txt, 
> SoftAutoCommitTest.2017-05-02.3983.log.txt
>
>
> In my beast test runs it has failed 7% of the time, then 3% of the time, then 
> when I started using hardware with 8 cores instead of 16, 40% of the time 
> (unless something else also shifted). That gives it a test rating of 'screwy'.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to