Re: new candidate list for BadApple-ing on Saturday, 17-Mar

2018-03-15 Thread Erick Erickson
Sure, the Lucene tests will _not_ be BadApple-d on Saturday.

Any of those tests that do _not_ fail according to Hoss' and Mark's
projects on Friday/Saturday will NOT be annotated anyway.

On Thu, Mar 15, 2018 at 2:24 AM, Alan Woodward wrote:
> The lucene failures might well be linked to
> https://issues.apache.org/jira/browse/LUCENE-8203 - let’s see if they drop
> off the list.
>
> On 14 Mar 2018, at 21:52, Erick Erickson  wrote:
>
> We had a drop off in the number of failing tests over the last couple
> of days, so I'm
> going to ignore the fails 11-12 Mar. Or it was a temporary increase
> for those days, take your
> pick ;)
>
>
> I'll check against Hoss' and Mark's lists and only BadApple tests
> that've failed Friday
> or later.
>
> In particular what do the Lucene folks think about the two Lucene tests?
>
>
> junit.framework.TestSuite.org.apache.solr.cloud.TestLeaderElectionZkExpiry
>
> org.apache.lucene.index.TestIndexSorting.testRandom3
> org.apache.lucene.index.TestIndexWriterWithThreads.testCloseWithThreads
>
> org.apache.solr.cloud.api.collections.CollectionsAPIAsyncDistributedZkTest.testAsyncIdBackCompat
> org.apache.solr.cloud.autoscaling.ComputePlanActionTest.testNodeWithMultipleReplicasLost
> org.apache.solr.cloud.autoscaling.ScheduledMaintenanceTriggerTest.testInactiveShardCleanup
> org.apache.solr.cloud.ConcurrentCreateRoutedAliasTest.testConcurrentCreateRoutedAliasComplex
> org.apache.solr.cloud.DocValuesNotIndexedTest.testGroupingDVOnly
> org.apache.solr.cloud.LeaderVoteWaitTimeoutTest.basicTest
> org.apache.solr.cloud.LeaderVoteWaitTimeoutTest.testMostInSyncReplicasCanWinElection
> org.apache.solr.cloud.MoveReplicaHDFSTest.testFailedMove
> org.apache.solr.cloud.SSLMigrationTest.test
> org.apache.solr.cloud.TestTlogReplica.testCreateDelete
> org.apache.solr.handler.admin.SegmentsInfoRequestHandlerTest.testSegmentInfosVersion
> org.apache.solr.handler.TestSolrConfigHandlerCloud.test
> org.apache.solr.logging.TestLogWatcher.testLog4jWatcher
> org.apache.solr.spelling.SpellCheckCollatorTest.testEstimatedHitCounts
>
>
> Fails by day are below for reference. If they're _not_ listed above, they'll
> be
> left to run for another week:
>
> 11-Mar fails:
>
> junit.framework.TestSuite.org.apache.solr.cloud.BasicDistributedZkTest
> junit.framework.TestSuite.org.apache.solr.cloud.TestCloudPivotFacet
> junit.framework.TestSuite.org.apache.solr.cloud.TriLevelCompositeIdRoutingTest
> junit.framework.TestSuite.org.apache.solr.cloud.ZkControllerTest
> junit.framework.TestSuite.org.apache.solr.search.join.BlockJoinFacetDistribTest
> org.apache.solr.client.solrj.io.stream.StreamExpressionTest.testGammaDistribution
> org.apache.solr.cloud.autoscaling.ScheduledMaintenanceTriggerTest.testInactiveShardCleanup
> org.apache.solr.cloud.BasicDistributedZkTest.test
> org.apache.solr.cloud.FullSolrCloudDistribCmdsTest
> org.apache.solr.cloud.FullSolrCloudDistribCmdsTest.test
> org.apache.solr.cloud.hdfs.HdfsUnloadDistributedZkTest
> org.apache.solr.cloud.hdfs.HdfsUnloadDistributedZkTest.test
> org.apache.solr.cloud.hdfs.StressHdfsTest.test
> org.apache.solr.cloud.MoveReplicaHDFSTest.testFailedMove
> org.apache.solr.cloud.TestCloudPivotFacet.test
> org.apache.solr.cloud.TriLevelCompositeIdRoutingTest.test
> org.apache.solr.logging.TestLogWatcher.testLog4jWatcher
>
> 12-Mar fails:
> junit.framework.TestSuite.org.apache.lucene.search.spans.TestSpanSearchEquivalence
> junit.framework.TestSuite.org.apache.solr.cloud.autoscaling.TriggerIntegrationTest
> junit.framework.TestSuite.org.apache.solr.cloud.BasicDistributedZkTest
> junit.framework.TestSuite.org.apache.solr.cloud.hdfs.HdfsUnloadDistributedZkTest
> junit.framework.TestSuite.org.apache.solr.cloud.TestLeaderElectionZkExpiry
> org.apache.lucene.index.TestDuelingCodecsAtNight.testBigEquals
> org.apache.lucene.search.spans.TestSpanSearchEquivalence.testSpanNearIncreasingSloppiness
> org.apache.solr.cloud.autoscaling.AutoAddReplicasIntegrationTest
> org.apache.solr.cloud.autoscaling.HdfsAutoAddReplicasIntegrationTest
> org.apache.solr.cloud.autoscaling.ScheduledMaintenanceTriggerTest
> org.apache.solr.cloud.autoscaling.ScheduledMaintenanceTriggerTest.testInactiveShardCleanup
> org.apache.solr.cloud.autoscaling.TriggerIntegrationTest.testEventQueue
> org.apache.solr.cloud.autoscaling.TriggerIntegrationTest.testMetricTrigger
> org.apache.solr.cloud.BasicDistributedZkTest.test
> org.apache.solr.cloud.CollectionsAPISolrJTest.testSplitShard
> org.apache.solr.cloud.LeaderVoteWaitTimeoutTest.testMostInSyncReplicasCanWinElection
> org.apache.solr.cloud.MoveReplicaHDFSTest.testFailedMove
> org.apache.solr.cloud.SSLMigrationTest.test
> org.apache.solr.cloud.TestRandomFlRTGCloud.testRandomizedUpdatesAndRTGs
> org.apache.solr.core.TestJmxIntegration
> org.apache.solr.handler.admin.AutoscalingHistoryHandlerTest
> org.apache.solr.handler.TestReplicationHandler
> org.apache.solr.handler.TestReplicationHandler.doTestReplicateAfterCoreRel

Re: new candidate list for BadApple-ing on Saturday, 17-Mar

2018-03-15 Thread Alan Woodward
The lucene failures might well be linked to
https://issues.apache.org/jira/browse/LUCENE-8203 - let’s see if they drop
off the list.

> On 14 Mar 2018, at 21:52, Erick Erickson wrote:
> 
> We had a drop off in the number of failing tests over the last couple
> of days, so I'm
> going to ignore the fails 11-12 Mar. Or it was a temporary increase
> for those days, take your
> pick ;)
> 
> 
> I'll check against Hoss' and Mark's lists and only BadApple tests
> that've failed Friday
> or later.
> 
> In particular what do the Lucene folks think about the two Lucene tests?
> 
> 
> junit.framework.TestSuite.org.apache.solr.cloud.TestLeaderElectionZkExpiry
> 
> org.apache.lucene.index.TestIndexSorting.testRandom3
> org.apache.lucene.index.TestIndexWriterWithThreads.testCloseWithThreads
> 
> org.apache.solr.cloud.api.collections.CollectionsAPIAsyncDistributedZkTest.testAsyncIdBackCompat
> org.apache.solr.cloud.autoscaling.ComputePlanActionTest.testNodeWithMultipleReplicasLost
> org.apache.solr.cloud.autoscaling.ScheduledMaintenanceTriggerTest.testInactiveShardCleanup
> org.apache.solr.cloud.ConcurrentCreateRoutedAliasTest.testConcurrentCreateRoutedAliasComplex
> org.apache.solr.cloud.DocValuesNotIndexedTest.testGroupingDVOnly
> org.apache.solr.cloud.LeaderVoteWaitTimeoutTest.basicTest
> org.apache.solr.cloud.LeaderVoteWaitTimeoutTest.testMostInSyncReplicasCanWinElection
> org.apache.solr.cloud.MoveReplicaHDFSTest.testFailedMove
> org.apache.solr.cloud.SSLMigrationTest.test
> org.apache.solr.cloud.TestTlogReplica.testCreateDelete
> org.apache.solr.handler.admin.SegmentsInfoRequestHandlerTest.testSegmentInfosVersion
> org.apache.solr.handler.TestSolrConfigHandlerCloud.test
> org.apache.solr.logging.TestLogWatcher.testLog4jWatcher
> org.apache.solr.spelling.SpellCheckCollatorTest.testEstimatedHitCounts
> 
> 
> Fails by day are below for reference. If they're _not_ listed above, they'll 
> be
> left to run for another week:
> 
> 11-Mar fails:
> 
> junit.framework.TestSuite.org.apache.solr.cloud.BasicDistributedZkTest
> junit.framework.TestSuite.org.apache.solr.cloud.TestCloudPivotFacet
> junit.framework.TestSuite.org.apache.solr.cloud.TriLevelCompositeIdRoutingTest
> junit.framework.TestSuite.org.apache.solr.cloud.ZkControllerTest
> junit.framework.TestSuite.org.apache.solr.search.join.BlockJoinFacetDistribTest
> org.apache.solr.client.solrj.io.stream.StreamExpressionTest.testGammaDistribution
> org.apache.solr.cloud.autoscaling.ScheduledMaintenanceTriggerTest.testInactiveShardCleanup
> org.apache.solr.cloud.BasicDistributedZkTest.test
> org.apache.solr.cloud.FullSolrCloudDistribCmdsTest
> org.apache.solr.cloud.FullSolrCloudDistribCmdsTest.test
> org.apache.solr.cloud.hdfs.HdfsUnloadDistributedZkTest
> org.apache.solr.cloud.hdfs.HdfsUnloadDistributedZkTest.test
> org.apache.solr.cloud.hdfs.StressHdfsTest.test
> org.apache.solr.cloud.MoveReplicaHDFSTest.testFailedMove
> org.apache.solr.cloud.TestCloudPivotFacet.test
> org.apache.solr.cloud.TriLevelCompositeIdRoutingTest.test
> org.apache.solr.logging.TestLogWatcher.testLog4jWatcher
> 
> 12-Mar fails:
> junit.framework.TestSuite.org.apache.lucene.search.spans.TestSpanSearchEquivalence
> junit.framework.TestSuite.org.apache.solr.cloud.autoscaling.TriggerIntegrationTest
> junit.framework.TestSuite.org.apache.solr.cloud.BasicDistributedZkTest
> junit.framework.TestSuite.org.apache.solr.cloud.hdfs.HdfsUnloadDistributedZkTest
> junit.framework.TestSuite.org.apache.solr.cloud.TestLeaderElectionZkExpiry
> org.apache.lucene.index.TestDuelingCodecsAtNight.testBigEquals
> org.apache.lucene.search.spans.TestSpanSearchEquivalence.testSpanNearIncreasingSloppiness
> org.apache.solr.cloud.autoscaling.AutoAddReplicasIntegrationTest
> org.apache.solr.cloud.autoscaling.HdfsAutoAddReplicasIntegrationTest
> org.apache.solr.cloud.autoscaling.ScheduledMaintenanceTriggerTest
> org.apache.solr.cloud.autoscaling.ScheduledMaintenanceTriggerTest.testInactiveShardCleanup
> org.apache.solr.cloud.autoscaling.TriggerIntegrationTest.testEventQueue
> org.apache.solr.cloud.autoscaling.TriggerIntegrationTest.testMetricTrigger
> org.apache.solr.cloud.BasicDistributedZkTest.test
> org.apache.solr.cloud.CollectionsAPISolrJTest.testSplitShard
> org.apache.solr.cloud.LeaderVoteWaitTimeoutTest.testMostInSyncReplicasCanWinElection
> org.apache.solr.cloud.MoveReplicaHDFSTest.testFailedMove
> org.apache.solr.cloud.SSLMigrationTest.test
> org.apache.solr.cloud.TestRandomFlRTGCloud.testRandomizedUpdatesAndRTGs
> org.apache.solr.core.TestJmxIntegration
> org.apache.solr.handler.admin.AutoscalingHistoryHandlerTest
> org.apache.solr.handler.TestReplicationHandler
> org.apache.solr.handler.TestReplicationHandler.doTestReplicateAfterCoreReload
> org.apache.solr.logging.TestLogWatcher.testLog4jWatcher
> org.apache.solr.ltr.TestLTRReRankingPipeline
> org.apache.solr.TestDistributedSearch

Re: new candidate list for BadApple-ing on Saturday, 17-Mar

2018-03-15 Thread Dawid Weiss
> org.apache.lucene.index.TestIndexWriterWithThreads.testCloseWithThreads

I know what the problem here is (wall-time).  I'll look into fixing it.
https://issues.apache.org/jira/browse/LUCENE-8206

D.


Re: new candidate list for BadApple-ing on Saturday, 17-Mar

2018-03-14 Thread Erick Erickson
Gus:

Nice sleuthing!

The problem has been that there has been so much noise that
it's been almost impossible, for me at least, to get motivated to dig
into them. I certainly suspect systemic issues like this are behind
some of our intermittent failures, so I'm very glad you took the
initiative to dig.

If you see some test on the list that you want to keep running, please
just let me know. It's perfectly OK to have a test fail intermittently
if someone's actively looking at why; that means it isn't lost in the
noise. And often I _can't_ get tests to fail locally no matter how
many times I run them, so having them run on Jenkins is the only way to
pursue them.

My long-term goal here is to get the noise level down to tolerable
levels. Once that happens, an ongoing task will be to un-BadApple tests
that no longer fail as demonstrated by Hoss' roll-up and Mark's
beasting projects and see if we can get that test coverage back. This
whole BadApple thing is just a stop-gap to get to a stable point and
then start getting the coverage back.

Best,
Erick



On Wed, Mar 14, 2018 at 8:36 PM, Mark Miller wrote:
> Can you file a JIRA issue?
>
> In general, we want to be super forgiving by default and only harsh for a
> specific test that might demand it.
>
> - Mark
>
> On Wed, Mar 14, 2018 at 9:45 PM Gus Heck  wrote:
>>
>> Beginning to answer my own question...
>>
>> public abstract class AbstractZkTestCase extends SolrTestCaseJ4 {
>>   private static final String ZOOKEEPER_FORCE_SYNC =
>> "zookeeper.forceSync";
>>
>>   public static final int TIMEOUT = 45000;
>>
>> seems to be at least one place we are probably not getting what we
>> expect... the server's going to cut that back to the 20 second max...
>>
>> On Wed, Mar 14, 2018 at 10:36 PM, Gus Heck  wrote:
>>>
>>> Being slightly irritated by the fact one of my tests shows up in this, I
>>> did some digging and I found
>>>
>>>[junit4]   2> 485679 ERROR
>>> (OverseerThreadFactory-788-thread-1-processing-n:127.0.0.1:35771_solr)
>>> [n:127.0.0.1:35771_solr] o.a.s.c.a.c.OverseerCollectionMessageHandler
>>> Collection: testAliasCplx0 operation: createalias
>>> failed:org.apache.zookeeper.KeeperException$SessionExpiredException:
>>> KeeperErrorCode = Session expired for /configs/_default
>>>[junit4]   2>at
>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
>>>[junit4]   2>at
>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>>>[junit4]   2>at
>>> org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1105)
>>>[junit4]   2>at
>>> org.apache.solr.common.cloud.SolrZkClient.lambda$exists$3(SolrZkClient.java:316)
>>>[junit4]   2>at
>>> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
>>>[junit4]   2>at
>>> org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:316)
>>>[junit4]   2>at
>>> org.apache.solr.client.solrj.impl.ZkDistribStateManager.hasData(ZkDistribStateManager.java:58)
>>>[junit4]   2>at
>>> org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.validateConfigOrThrowSolrException(OverseerCollectionMessageHandler.java:737)
>>>[junit4]   2>at
>>> org.apache.solr.cloud.api.collections.CreateCollectionCmd.call(CreateCollectionCmd.java:114)
>>>[junit4]   2>at
>>> org.apache.solr.cloud.api.collections.MaintainRoutedAliasCmd.createCollectionAndWait(MaintainRoutedAliasCmd.java:282)
>>>[junit4]   2>at
>>> org.apache.solr.cloud.api.collections.CreateAliasCmd.callCreateRoutedAlias(CreateAliasCmd.java:124)
>>>[junit4]   2>at
>>> org.apache.solr.cloud.api.collections.CreateAliasCmd.call(CreateAliasCmd.java:65)
>>>[junit4]   2>at
>>> org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:252)
>>>[junit4]   2>at
>>> org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:469)
>>>[junit4]   2>at
>>> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
>>>[junit4]   2>at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>[junit4]   2>at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>[junit4]   2>at java.lang.Thread.run(Thread.java:748)
>>>[junit4]   2>
>>>
>>> in the full log for
>>> https://jenkins.thetaphi.de/view/Lucene-Solr/job/Lucene-Solr-master-Solaris/1719/console
>>>
>>> After some digging I found this code in ZkTestServer:
>>>
>>> public void run() throws InterruptedException {
>>>   log.info("STARTING ZK TEST SERVER");
>>>   // we don't call super.distribSetUp
>>>   zooThread = new Thread() {
>>>
>>> @Override
>>> public void run() {
>>>   ServerConfig config = new ServerConfig() {
>>>
>>> {
>>>   setClientPort(ZkTestServer.this.clientPort);
>>>   this.da

Re: new candidate list for BadApple-ing on Saturday, 17-Mar

2018-03-14 Thread Mark Miller
Can you file a JIRA issue?

In general, we want to be super forgiving by default and only harsh for a
specific test that might demand it.

- Mark

On Wed, Mar 14, 2018 at 9:45 PM Gus Heck wrote:

> Beginning to answer my own question...
>
> public abstract class AbstractZkTestCase extends SolrTestCaseJ4 {
>   private static final String ZOOKEEPER_FORCE_SYNC = "zookeeper.forceSync";
>
>   public static final int TIMEOUT = 45000;
>
> seems to be at least one place we are probably not getting what we
> expect... the server's going to cut that back to the 20 second max...
>
> On Wed, Mar 14, 2018 at 10:36 PM, Gus Heck  wrote:
>
>> Being slightly irritated by the fact one of my tests shows up in this, I
>> did some digging and I found
>>
>>[junit4]   2> 485679 ERROR 
>> (OverseerThreadFactory-788-thread-1-processing-n:127.0.0.1:35771_solr) 
>> [n:127.0.0.1:35771_solr] o.a.s.c.a.c.OverseerCollectionMessageHandler 
>> Collection: testAliasCplx0 operation: createalias 
>> failed:org.apache.zookeeper.KeeperException$SessionExpiredException: 
>> KeeperErrorCode = Session expired for /configs/_default
>>[junit4]   2> at 
>> org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
>>[junit4]   2> at 
>> org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>>[junit4]   2> at 
>> org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1105)
>>[junit4]   2> at 
>> org.apache.solr.common.cloud.SolrZkClient.lambda$exists$3(SolrZkClient.java:316)
>>[junit4]   2> at 
>> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
>>[junit4]   2> at 
>> org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:316)
>>[junit4]   2> at 
>> org.apache.solr.client.solrj.impl.ZkDistribStateManager.hasData(ZkDistribStateManager.java:58)
>>[junit4]   2> at 
>> org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.validateConfigOrThrowSolrException(OverseerCollectionMessageHandler.java:737)
>>[junit4]   2> at 
>> org.apache.solr.cloud.api.collections.CreateCollectionCmd.call(CreateCollectionCmd.java:114)
>>[junit4]   2> at 
>> org.apache.solr.cloud.api.collections.MaintainRoutedAliasCmd.createCollectionAndWait(MaintainRoutedAliasCmd.java:282)
>>[junit4]   2> at 
>> org.apache.solr.cloud.api.collections.CreateAliasCmd.callCreateRoutedAlias(CreateAliasCmd.java:124)
>>[junit4]   2> at 
>> org.apache.solr.cloud.api.collections.CreateAliasCmd.call(CreateAliasCmd.java:65)
>>[junit4]   2> at 
>> org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:252)
>>[junit4]   2> at 
>> org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:469)
>>[junit4]   2> at 
>> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
>>[junit4]   2> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>[junit4]   2> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>[junit4]   2> at java.lang.Thread.run(Thread.java:748)
>>[junit4]   2>
>>
>> in the full log for
>> https://jenkins.thetaphi.de/view/Lucene-Solr/job/Lucene-Solr-master-Solaris/1719/console
>>
>> After some digging I found this code in ZkTestServer:
>>
>> public void run() throws InterruptedException {
>>   log.info("STARTING ZK TEST SERVER");
>>   // we don't call super.distribSetUp
>>   zooThread = new Thread() {
>>
>> @Override
>> public void run() {
>>   ServerConfig config = new ServerConfig() {
>>
>> {
>>   setClientPort(ZkTestServer.this.clientPort);
>>   this.dataDir = zkDir;
>>   this.dataLogDir = zkDir;
>>   this.tickTime = theTickTime;
>> }
>>
>> public void setClientPort(int clientPort) {
>>   if (clientPortAddress != null) {
>> try {
>>   this.clientPortAddress = new InetSocketAddress(
>>   
>> InetAddress.getByName(clientPortAddress.getHostName()), clientPort);
>> } catch (UnknownHostException e) {
>>   throw new RuntimeException(e);
>> }
>>   } else {
>> this.clientPortAddress = new InetSocketAddress(clientPort);
>>   }
>>   log.info("client port:" + this.clientPortAddress);
>> }
>>   };
>>
>>   try {
>> zkServer.runFromConfig(config);
>>   } catch (Throwable e) {
>> throw new RuntimeException(e);
>>   }
>> }
>>   };
>>
>> And what I noticed is that min/max timeouts are unset and theTickTime is
>> only ever set to a big blue 1000 leading to default min/max timeout
>> values of 2/20 seconds (
>> https://discuss.pivotal.io/hc/en-us/articles/205187157-Pivotal-HD-About-how-to-correctly-config-zookeeper-session-timeo

Re: new candidate list for BadApple-ing on Saturday, 17-Mar

2018-03-14 Thread Gus Heck
Beginning to answer my own question...

public abstract class AbstractZkTestCase extends SolrTestCaseJ4 {
  private static final String ZOOKEEPER_FORCE_SYNC = "zookeeper.forceSync";

  public static final int TIMEOUT = 45000;

seems to be at least one place we are probably not getting what we
expect... the server's going to cut that back to the 20 second max...
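
The cut-back described above can be sketched in plain Java. This is only a model, assuming the server leaves its min/max session timeouts unset so they default to 2x and 20x the tick time (1000 ms in this thread); the class and method names are hypothetical and are not ZooKeeper or Solr APIs.

```java
// Hypothetical model of how a ZooKeeper server bounds a requested session timeout.
// Assumption: min/max session timeouts are unset, so they default to 2x and 20x tickTime.
final class SessionTimeoutSketch {

  /** Returns the timeout (ms) the server would grant for the requested one. */
  static int negotiate(int requestedMs, int tickTimeMs) {
    int minMs = 2 * tickTimeMs;
    int maxMs = 20 * tickTimeMs;
    // Clamp the request into [minMs, maxMs].
    return Math.max(minMs, Math.min(maxMs, requestedMs));
  }

  public static void main(String[] args) {
    int requested = 45000; // value of AbstractZkTestCase.TIMEOUT quoted above
    int tick = 1000;       // tick time reported for ZkTestServer in this thread
    System.out.println("granted ms: " + negotiate(requested, tick)); // prints granted ms: 20000
  }
}
```

Under these assumptions a test that asks for 45 seconds is actually working with a 20 second session.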

On Wed, Mar 14, 2018 at 10:36 PM, Gus Heck wrote:

> Being slightly irritated by the fact one of my tests shows up in this, I
> did some digging and I found
>
>[junit4]   2> 485679 ERROR 
> (OverseerThreadFactory-788-thread-1-processing-n:127.0.0.1:35771_solr) 
> [n:127.0.0.1:35771_solr] o.a.s.c.a.c.OverseerCollectionMessageHandler 
> Collection: testAliasCplx0 operation: createalias 
> failed:org.apache.zookeeper.KeeperException$SessionExpiredException: 
> KeeperErrorCode = Session expired for /configs/_default
>[junit4]   2>  at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
>[junit4]   2>  at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>[junit4]   2>  at 
> org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1105)
>[junit4]   2>  at 
> org.apache.solr.common.cloud.SolrZkClient.lambda$exists$3(SolrZkClient.java:316)
>[junit4]   2>  at 
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
>[junit4]   2>  at 
> org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:316)
>[junit4]   2>  at 
> org.apache.solr.client.solrj.impl.ZkDistribStateManager.hasData(ZkDistribStateManager.java:58)
>[junit4]   2>  at 
> org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.validateConfigOrThrowSolrException(OverseerCollectionMessageHandler.java:737)
>[junit4]   2>  at 
> org.apache.solr.cloud.api.collections.CreateCollectionCmd.call(CreateCollectionCmd.java:114)
>[junit4]   2>  at 
> org.apache.solr.cloud.api.collections.MaintainRoutedAliasCmd.createCollectionAndWait(MaintainRoutedAliasCmd.java:282)
>[junit4]   2>  at 
> org.apache.solr.cloud.api.collections.CreateAliasCmd.callCreateRoutedAlias(CreateAliasCmd.java:124)
>[junit4]   2>  at 
> org.apache.solr.cloud.api.collections.CreateAliasCmd.call(CreateAliasCmd.java:65)
>[junit4]   2>  at 
> org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:252)
>[junit4]   2>  at 
> org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:469)
>[junit4]   2>  at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
>[junit4]   2>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>[junit4]   2>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>[junit4]   2>  at java.lang.Thread.run(Thread.java:748)
>[junit4]   2>
>
> in the full log for
> https://jenkins.thetaphi.de/view/Lucene-Solr/job/Lucene-Solr-master-Solaris/1719/console
>
> After some digging I found this code in ZkTestServer:
>
> public void run() throws InterruptedException {
>   log.info("STARTING ZK TEST SERVER");
>   // we don't call super.distribSetUp
>   zooThread = new Thread() {
>
> @Override
> public void run() {
>   ServerConfig config = new ServerConfig() {
>
> {
>   setClientPort(ZkTestServer.this.clientPort);
>   this.dataDir = zkDir;
>   this.dataLogDir = zkDir;
>   this.tickTime = theTickTime;
> }
>
> public void setClientPort(int clientPort) {
>   if (clientPortAddress != null) {
> try {
>   this.clientPortAddress = new InetSocketAddress(
>   InetAddress.getByName(clientPortAddress.getHostName()), 
> clientPort);
> } catch (UnknownHostException e) {
>   throw new RuntimeException(e);
> }
>   } else {
> this.clientPortAddress = new InetSocketAddress(clientPort);
>   }
>   log.info("client port:" + this.clientPortAddress);
> }
>   };
>
>   try {
> zkServer.runFromConfig(config);
>   } catch (Throwable e) {
> throw new RuntimeException(e);
>   }
> }
>   };
>
> And what I noticed is that min/max timeouts are unset and theTickTime is
> only ever set to a big blue 1000 leading to default min/max timeout
> values of 2/20 seconds (
> https://discuss.pivotal.io/hc/en-us/articles/205187157-Pivotal-HD-About-how-to-correctly-config-zookeeper-session-timeout-parameter-minSessionTimeout-and-maxSessionTimeout
> --> jibes with the zk code I see in my editor). So my question is whether
> or not our big blue friend can be given more time, or is this by design and
> a config level we wish to explicitly support?
>
> Note that this is not the d

Re: new candidate list for BadApple-ing on Saturday, 17-Mar

2018-03-14 Thread Gus Heck
Being slightly irritated by the fact one of my tests shows up in this, I
did some digging and I found

   [junit4]   2> 485679 ERROR (OverseerThreadFactory-788-thread-1-processing-n:127.0.0.1:35771_solr) [n:127.0.0.1:35771_solr] o.a.s.c.a.c.OverseerCollectionMessageHandler Collection: testAliasCplx0 operation: createalias failed:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /configs/_default
   [junit4]   2>    at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
   [junit4]   2>    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
   [junit4]   2>    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1105)
   [junit4]   2>    at org.apache.solr.common.cloud.SolrZkClient.lambda$exists$3(SolrZkClient.java:316)
   [junit4]   2>    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
   [junit4]   2>    at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:316)
   [junit4]   2>    at org.apache.solr.client.solrj.impl.ZkDistribStateManager.hasData(ZkDistribStateManager.java:58)
   [junit4]   2>    at org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.validateConfigOrThrowSolrException(OverseerCollectionMessageHandler.java:737)
   [junit4]   2>    at org.apache.solr.cloud.api.collections.CreateCollectionCmd.call(CreateCollectionCmd.java:114)
   [junit4]   2>    at org.apache.solr.cloud.api.collections.MaintainRoutedAliasCmd.createCollectionAndWait(MaintainRoutedAliasCmd.java:282)
   [junit4]   2>    at org.apache.solr.cloud.api.collections.CreateAliasCmd.callCreateRoutedAlias(CreateAliasCmd.java:124)
   [junit4]   2>    at org.apache.solr.cloud.api.collections.CreateAliasCmd.call(CreateAliasCmd.java:65)
   [junit4]   2>    at org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:252)
   [junit4]   2>    at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:469)
   [junit4]   2>    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
   [junit4]   2>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   [junit4]   2>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   [junit4]   2>    at java.lang.Thread.run(Thread.java:748)
   [junit4]   2>

in the full log for
https://jenkins.thetaphi.de/view/Lucene-Solr/job/Lucene-Solr-master-Solaris/1719/console

After some digging I found this code in ZkTestServer:

public void run() throws InterruptedException {
  log.info("STARTING ZK TEST SERVER");
  // we don't call super.distribSetUp
  zooThread = new Thread() {

    @Override
    public void run() {
      ServerConfig config = new ServerConfig() {

        {
          setClientPort(ZkTestServer.this.clientPort);
          this.dataDir = zkDir;
          this.dataLogDir = zkDir;
          this.tickTime = theTickTime;
        }

        public void setClientPort(int clientPort) {
          if (clientPortAddress != null) {
            try {
              this.clientPortAddress = new InetSocketAddress(
                  InetAddress.getByName(clientPortAddress.getHostName()), clientPort);
            } catch (UnknownHostException e) {
              throw new RuntimeException(e);
            }
          } else {
            this.clientPortAddress = new InetSocketAddress(clientPort);
          }
          log.info("client port:" + this.clientPortAddress);
        }
      };

      try {
        zkServer.runFromConfig(config);
      } catch (Throwable e) {
        throw new RuntimeException(e);
      }
    }
  };

And what I noticed is that min/max timeouts are unset and theTickTime is
only ever set to a big blue 1000, leading to default min/max timeout values
of 2/20 seconds (
https://discuss.pivotal.io/hc/en-us/articles/205187157-Pivotal-HD-About-how-to-correctly-config-zookeeper-session-timeout-parameter-minSessionTimeout-and-maxSessionTimeout
--> jibes with the zk code I see in my editor). So my question is whether
or not our big blue friend can be given more time, or is this by design and
a config level we wish to explicitly support?

Note that this is not the default zk config, which would be a 3 sec tick
time. Maybe we're unintentionally being harsh on ourselves? I have noticed
that a lot of the tests that fail in my local builds are zk related.
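
To make the arithmetic concrete, here is a small sketch in plain Java comparing the session timeout window for the 1000 ms tick used by the test server with the window for a 3000 ms tick. It assumes the default bounds of 2 and 20 ticks when min/max are unset; the class and helper names are hypothetical and not taken from ZooKeeper or Solr.

```java
// Hypothetical helper: derives the default session timeout window from a tick time.
// Assumption: unset min/max session timeouts default to 2 ticks and 20 ticks.
final class TickTimeWindowSketch {

  /** Default lower bound: 2 ticks. */
  static int minSessionTimeout(int tickTimeMs) {
    return 2 * tickTimeMs;
  }

  /** Default upper bound: 20 ticks. */
  static int maxSessionTimeout(int tickTimeMs) {
    return 20 * tickTimeMs;
  }

  /** Smallest tick time (ms) whose default upper bound covers the requested timeout. */
  static int smallestTickFor(int requestedMs) {
    return (requestedMs + 19) / 20; // integer ceiling of requestedMs / 20
  }

  public static void main(String[] args) {
    for (int tick : new int[] {1000, 3000}) {
      System.out.printf("tick=%d ms -> window [%d, %d] ms%n",
          tick, minSessionTimeout(tick), maxSessionTimeout(tick));
    }
    // A 45000 ms request only fits if the tick is at least this large.
    System.out.println("smallest tick for 45000 ms: " + smallestTickFor(45000));
  }
}
```

With a 3000 ms tick the window would be 6 to 60 seconds, so a 45 second request would fit; with a 1000 ms tick it would not.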



On Wed, Mar 14, 2018 at 5:52 PM, Erick Erickson wrote:

> We had a drop off in the number of failing tests over the last couple
> of days, so I'm
> going to ignore the fails 11-12 Mar. Or it was a temporary increase
> for those days, take your
> pick ;)
>
>
> I'll check against Hoss' and Mark's lists and only BadApple tests
> that've failed Friday
> or later.
>
> In particular what do the Lucene folks think about the tw

new candidate list for BadApple-ing on Saturday, 17-Mar

2018-03-14 Thread Erick Erickson
We had a drop off in the number of failing tests over the last couple
of days, so I'm going to ignore the fails 11-12 Mar. Or it was a
temporary increase for those days, take your pick ;)

I'll check against Hoss' and Mark's lists and only BadApple tests
that've failed Friday or later.
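
The selection rule above could be sketched roughly as follows. This assumes the failure reports are available as a map from day to the set of test names that failed that day; the class name, the data layout, and the test names are all made up for illustration and do not come from Hoss' or Mark's tooling.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch of the selection rule; not part of any Lucene/Solr tooling.
final class BadAppleCandidateSketch {

  /** Keeps only the candidates that also failed on or after the cutoff day. */
  static SortedSet<String> select(Set<String> candidates,
                                  Map<LocalDate, Set<String>> failsByDay,
                                  LocalDate cutoff) {
    SortedSet<String> selected = new TreeSet<>();
    for (Map.Entry<LocalDate, Set<String>> day : failsByDay.entrySet()) {
      if (day.getKey().isBefore(cutoff)) {
        continue; // failures before the cutoff are ignored
      }
      for (String test : day.getValue()) {
        if (candidates.contains(test)) {
          selected.add(test);
        }
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    // Made-up test names and failure days, for illustration only.
    Set<String> candidates = new HashSet<>(Arrays.asList("TestA.test", "TestB.test"));
    Map<LocalDate, Set<String>> failsByDay = new LinkedHashMap<>();
    failsByDay.put(LocalDate.of(2018, 3, 12),
        new HashSet<>(Arrays.asList("TestA.test", "TestB.test")));
    failsByDay.put(LocalDate.of(2018, 3, 16),
        new HashSet<>(Arrays.asList("TestB.test", "TestC.test")));

    LocalDate friday = LocalDate.of(2018, 3, 16);
    System.out.println(select(candidates, failsByDay, friday)); // prints [TestB.test]
  }
}
```

A candidate that stopped failing before the cutoff (TestA.test in the made-up data) is left to run for another week.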

In particular what do the Lucene folks think about the two Lucene tests?


junit.framework.TestSuite.org.apache.solr.cloud.TestLeaderElectionZkExpiry

org.apache.lucene.index.TestIndexSorting.testRandom3
org.apache.lucene.index.TestIndexWriterWithThreads.testCloseWithThreads

org.apache.solr.cloud.api.collections.CollectionsAPIAsyncDistributedZkTest.testAsyncIdBackCompat
org.apache.solr.cloud.autoscaling.ComputePlanActionTest.testNodeWithMultipleReplicasLost
org.apache.solr.cloud.autoscaling.ScheduledMaintenanceTriggerTest.testInactiveShardCleanup
org.apache.solr.cloud.ConcurrentCreateRoutedAliasTest.testConcurrentCreateRoutedAliasComplex
org.apache.solr.cloud.DocValuesNotIndexedTest.testGroupingDVOnly
org.apache.solr.cloud.LeaderVoteWaitTimeoutTest.basicTest
org.apache.solr.cloud.LeaderVoteWaitTimeoutTest.testMostInSyncReplicasCanWinElection
org.apache.solr.cloud.MoveReplicaHDFSTest.testFailedMove
org.apache.solr.cloud.SSLMigrationTest.test
org.apache.solr.cloud.TestTlogReplica.testCreateDelete
org.apache.solr.handler.admin.SegmentsInfoRequestHandlerTest.testSegmentInfosVersion
org.apache.solr.handler.TestSolrConfigHandlerCloud.test
org.apache.solr.logging.TestLogWatcher.testLog4jWatcher
org.apache.solr.spelling.SpellCheckCollatorTest.testEstimatedHitCounts


Fails by day are below for reference. If they're _not_ listed above, they'll be
left to run for another week:

11-Mar fails:

junit.framework.TestSuite.org.apache.solr.cloud.BasicDistributedZkTest
junit.framework.TestSuite.org.apache.solr.cloud.TestCloudPivotFacet
junit.framework.TestSuite.org.apache.solr.cloud.TriLevelCompositeIdRoutingTest
junit.framework.TestSuite.org.apache.solr.cloud.ZkControllerTest
junit.framework.TestSuite.org.apache.solr.search.join.BlockJoinFacetDistribTest
org.apache.solr.client.solrj.io.stream.StreamExpressionTest.testGammaDistribution
org.apache.solr.cloud.autoscaling.ScheduledMaintenanceTriggerTest.testInactiveShardCleanup
org.apache.solr.cloud.BasicDistributedZkTest.test
org.apache.solr.cloud.FullSolrCloudDistribCmdsTest
org.apache.solr.cloud.FullSolrCloudDistribCmdsTest.test
org.apache.solr.cloud.hdfs.HdfsUnloadDistributedZkTest
org.apache.solr.cloud.hdfs.HdfsUnloadDistributedZkTest.test
org.apache.solr.cloud.hdfs.StressHdfsTest.test
org.apache.solr.cloud.MoveReplicaHDFSTest.testFailedMove
org.apache.solr.cloud.TestCloudPivotFacet.test
org.apache.solr.cloud.TriLevelCompositeIdRoutingTest.test
org.apache.solr.logging.TestLogWatcher.testLog4jWatcher

12-Mar fails:
junit.framework.TestSuite.org.apache.lucene.search.spans.TestSpanSearchEquivalence
junit.framework.TestSuite.org.apache.solr.cloud.autoscaling.TriggerIntegrationTest
junit.framework.TestSuite.org.apache.solr.cloud.BasicDistributedZkTest
junit.framework.TestSuite.org.apache.solr.cloud.hdfs.HdfsUnloadDistributedZkTest
junit.framework.TestSuite.org.apache.solr.cloud.TestLeaderElectionZkExpiry
org.apache.lucene.index.TestDuelingCodecsAtNight.testBigEquals
org.apache.lucene.search.spans.TestSpanSearchEquivalence.testSpanNearIncreasingSloppiness
org.apache.solr.cloud.autoscaling.AutoAddReplicasIntegrationTest
org.apache.solr.cloud.autoscaling.HdfsAutoAddReplicasIntegrationTest
org.apache.solr.cloud.autoscaling.ScheduledMaintenanceTriggerTest
org.apache.solr.cloud.autoscaling.ScheduledMaintenanceTriggerTest.testInactiveShardCleanup
org.apache.solr.cloud.autoscaling.TriggerIntegrationTest.testEventQueue
org.apache.solr.cloud.autoscaling.TriggerIntegrationTest.testMetricTrigger
org.apache.solr.cloud.BasicDistributedZkTest.test
org.apache.solr.cloud.CollectionsAPISolrJTest.testSplitShard
org.apache.solr.cloud.LeaderVoteWaitTimeoutTest.testMostInSyncReplicasCanWinElection
org.apache.solr.cloud.MoveReplicaHDFSTest.testFailedMove
org.apache.solr.cloud.SSLMigrationTest.test
org.apache.solr.cloud.TestRandomFlRTGCloud.testRandomizedUpdatesAndRTGs
org.apache.solr.core.TestJmxIntegration
org.apache.solr.handler.admin.AutoscalingHistoryHandlerTest
org.apache.solr.handler.TestReplicationHandler
org.apache.solr.handler.TestReplicationHandler.doTestReplicateAfterCoreReload
org.apache.solr.logging.TestLogWatcher.testLog4jWatcher
org.apache.solr.ltr.TestLTRReRankingPipeline
org.apache.solr.TestDistributedSearch.test

13-Mar fails:
junit.framework.TestSuite.org.apache.solr.cloud.TestLeaderElectionZkExpiry
org.apache.lucene.index.TestIndexSorting.testRandom3
org.apache.lucene.index.TestIndexWriterWithThreads.testCloseWithThreads
org.apache.solr.cloud.autoscaling.ScheduledMaintenanceTriggerTest.testInactiveShardCleanup
org.apache.solr.cloud.ConcurrentCreateRoutedAliasTest.testConcurrentCreateRoutedAliasComplex
org.apache.solr.cloud.LeaderVoteWaitTimeoutTe