[ 
https://issues.apache.org/jira/browse/SOLR-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895044#comment-15895044
 ] 

Steve Rowe commented on SOLR-9836:
----------------------------------

{{MissingSegmentRecoveryTest.testLeaderRecovery()}} has been failing pretty 
regularly on Jenkins.  Something happened on or about February 10th, when the 
probability of failure went up considerably (and has since remained at this 
elevated level).

I got 3 failures beasting 100 iterations of the test suite using Miller's 
beasting script on my box.  However, for the past three weeks I've seen this 
several times a day on my Jenkins, and roughly once a day on either ASF or 
Policeman Jenkins.

Here's a recent failure 
[https://builds.apache.org/job/Lucene-Solr-Tests-master/1699/]:

{noformat}
  [junit4]   2> 599977 ERROR 
(coreLoadExecutor-3254-thread-1-processing-n:127.0.0.1:41308_solr) 
[n:127.0.0.1:41308_solr c:MissingSegmentRecoveryTest s:shard1 r:core_node1 
x:MissingSegmentRecoveryTest_shard1_replica2] o.a.s.u.SolrIndexWriter Error 
closing IndexWriter
  [junit4]   2> java.nio.file.NoSuchFileException: 
/x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Tests-master/solr/build/solr-core/test/J2/temp/solr.cloud.MissingSegmentRecoveryTest_B800C15EC6F11C02-001/tempDir-001/node2/MissingSegmentRecoveryTest_shard1_replica2/data/index.20170228030909468/write.lock
  [junit4]   2>         at 
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
  [junit4]   2>         at 
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
  [junit4]   2>         at 
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
  [junit4]   2>         at 
sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
  [junit4]   2>         at 
sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
  [junit4]   2>         at 
sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
  [junit4]   2>         at java.nio.file.Files.readAttributes(Files.java:1737)
  [junit4]   2>         at 
org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:177)
  [junit4]   2>         at 
org.apache.lucene.store.LockValidatingDirectoryWrapper.sync(LockValidatingDirectoryWrapper.java:67)
  [junit4]   2>         at 
org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4698)
  [junit4]   2>         at 
org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3093)
  [junit4]   2>         at 
org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3227)
  [junit4]   2>         at 
org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1136)
  [junit4]   2>         at 
org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1179)
  [junit4]   2>         at 
org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:291)
  [junit4]   2>         at 
org.apache.solr.core.SolrCore.initIndex(SolrCore.java:728)
  [junit4]   2>         at 
org.apache.solr.core.SolrCore.<init>(SolrCore.java:911)
  [junit4]   2>         at 
org.apache.solr.core.SolrCore.<init>(SolrCore.java:828)
  [junit4]   2>         at 
org.apache.solr.core.CoreContainer.processCoreCreateException(CoreContainer.java:1011)
  [junit4]   2>         at 
org.apache.solr.core.CoreContainer.create(CoreContainer.java:939)
  [junit4]   2>         at 
org.apache.solr.core.CoreContainer.lambda$load$3(CoreContainer.java:572)
  [junit4]   2>         at 
com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:197)
  [junit4]   2>         at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)
  [junit4]   2>         at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
  [junit4]   2>         at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  [junit4]   2>         at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[...]
  [junit4]   2> 600005 ERROR 
(coreContainerWorkExecutor-3250-thread-1-processing-n:127.0.0.1:41308_solr) 
[n:127.0.0.1:41308_solr    ] o.a.s.c.CoreContainer Error waiting for SolrCore 
to be created
  [junit4]   2> java.util.concurrent.ExecutionException: 
org.apache.solr.common.SolrException: Unable to create core 
[MissingSegmentRecoveryTest_shard1_replica2]
  [junit4]   2>         at 
java.util.concurrent.FutureTask.report(FutureTask.java:122)
  [junit4]   2>         at 
java.util.concurrent.FutureTask.get(FutureTask.java:192)
  [junit4]   2>         at 
org.apache.solr.core.CoreContainer.lambda$load$4(CoreContainer.java:600)
  [junit4]   2>         at 
com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
  [junit4]   2>         at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  [junit4]   2>         at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)
  [junit4]   2>         at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
  [junit4]   2>         at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  [junit4]   2>         at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  [junit4]   2>         at java.lang.Thread.run(Thread.java:745)
  [junit4]   2> Caused by: org.apache.solr.common.SolrException: Unable to 
create core [MissingSegmentRecoveryTest_shard1_replica2]
  [junit4]   2>         at 
org.apache.solr.core.CoreContainer.create(CoreContainer.java:952)
  [junit4]   2>         at 
org.apache.solr.core.CoreContainer.lambda$load$3(CoreContainer.java:572)
  [junit4]   2>         at 
com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:197)
  [junit4]   2>         ... 5 more
  [junit4]   2> Caused by: org.apache.solr.common.SolrException: Error opening 
new searcher
  [junit4]   2>         at 
org.apache.solr.core.SolrCore.<init>(SolrCore.java:964)
  [junit4]   2>         at 
org.apache.solr.core.SolrCore.<init>(SolrCore.java:828)
  [junit4]   2>         at 
org.apache.solr.core.CoreContainer.processCoreCreateException(CoreContainer.java:1011)
  [junit4]   2>         at 
org.apache.solr.core.CoreContainer.create(CoreContainer.java:939)
  [junit4]   2>         ... 7 more
  [junit4]   2>         Suppressed: org.apache.solr.common.SolrException: Error 
opening new searcher
  [junit4]   2>                 at 
org.apache.solr.core.SolrCore.<init>(SolrCore.java:964)
  [junit4]   2>                 at 
org.apache.solr.core.SolrCore.<init>(SolrCore.java:828)
  [junit4]   2>                 at 
org.apache.solr.core.CoreContainer.create(CoreContainer.java:937)
  [junit4]   2>                 ... 7 more
  [junit4]   2>         Caused by: org.apache.solr.common.SolrException: Error 
opening new searcher
  [junit4]   2>                 at 
org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2005)
  [junit4]   2>                 at 
org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2125)
  [junit4]   2>                 at 
org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1053)
  [junit4]   2>                 at 
org.apache.solr.core.SolrCore.<init>(SolrCore.java:937)
  [junit4]   2>                 ... 9 more
  [junit4]   2>         Caused by: 
org.apache.lucene.index.CorruptIndexException: Unexpected file read error while 
reading index. 
(resource=BufferedChecksumIndexInput(MMapIndexInput(path="/x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Tests-master/solr/build/solr-core/test/J2/temp/solr.cloud.MissingSegmentRecoveryTest_B800C15EC6F11C02-001/tempDir-001/node2/MissingSegmentRecoveryTest_shard1_replica2/data/index/segments_2")))
  [junit4]   2>                 at 
org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:286)
  [junit4]   2>                 at 
org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:938)
  [junit4]   2>                 at 
org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:125)
  [junit4]   2>                 at 
org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:100)
  [junit4]   2>                 at 
org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:240)
  [junit4]   2>                 at 
org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:114)
  [junit4]   2>                 at 
org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1966)
  [junit4]   2>                 ... 12 more
  [junit4]   2>         Caused by: java.io.EOFException: read past EOF: 
MMapIndexInput(path="/x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Tests-master/solr/build/solr-core/test/J2/temp/solr.cloud.MissingSegmentRecoveryTest_B800C15EC6F11C02-001/tempDir-001/node2/MissingSegmentRecoveryTest_shard1_replica2/data/index/segments_2")
  [junit4]   2>                 at 
org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:75)
  [junit4]   2>                 at 
org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
  [junit4]   2>                 at 
org.apache.lucene.store.DataInput.readInt(DataInput.java:101)
  [junit4]   2>                 at 
org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:296)
  [junit4]   2>                 at 
org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:284)
  [junit4]   2>                 ... 18 more
  [junit4]   2> Caused by: org.apache.solr.common.SolrException: Error opening 
new searcher
  [junit4]   2>         at 
org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2005)
  [junit4]   2>         at 
org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2125)
  [junit4]   2>         at 
org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1053)
  [junit4]   2>         at 
org.apache.solr.core.SolrCore.<init>(SolrCore.java:937)
  [junit4]   2>         ... 10 more
  [junit4]   2> Caused by: org.apache.lucene.index.IndexNotFoundException: no 
segments* file found in 
LockValidatingDirectoryWrapper(MMapDirectory@/x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Tests-master/solr/build/solr-core/test/J2/temp/solr.cloud.MissingSegmentRecoveryTest_B800C15EC6F11C02-001/tempDir-001/node2/MissingSegmentRecoveryTest_shard1_replica2/data/index.20170228030909468
 lockFactory=org.apache.lucene.store.NativeFSLockFactory@74782755): files: 
[write.lock]
  [junit4]   2>         at 
org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:933)
  [junit4]   2>         at 
org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:125)
  [junit4]   2>         at 
org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:100)
  [junit4]   2>         at 
org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:240)
  [junit4]   2>         at 
org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:114)
  [junit4]   2>         at 
org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1966)
[...]
  [junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=MissingSegmentRecoveryTest -Dtests.method=testLeaderRecovery 
-Dtests.seed=B800C15EC6F11C02 -Dtests.multiplier=2 -Dtests.slow=true 
-Dtests.locale=fi-FI -Dtests.timezone=Asia/Famagusta -Dtests.asserts=true 
-Dtests.file.encoding=ISO-8859-1
  [junit4] FAILURE 94.6s J2 | MissingSegmentRecoveryTest.testLeaderRecovery <<<
  [junit4]    > Throwable #1: java.lang.AssertionError: Expected a collection 
with one shard and two replicas
  [junit4]    > null
  [junit4]    > Last available state: 
DocCollection(MissingSegmentRecoveryTest//collections/MissingSegmentRecoveryTest/state.json/6)={
  [junit4]    >   "replicationFactor":"2",
  [junit4]    >   "shards":{"shard1":{
  [junit4]    >       "range":"80000000-7fffffff",
  [junit4]    >       "state":"active",
  [junit4]    >       "replicas":{
  [junit4]    >         "core_node1":{
  [junit4]    >           "core":"MissingSegmentRecoveryTest_shard1_replica2",
  [junit4]    >           "base_url":"https://127.0.0.1:41308/solr",
  [junit4]    >           "node_name":"127.0.0.1:41308_solr",
  [junit4]    >           "state":"down"},
  [junit4]    >         "core_node2":{
  [junit4]    >           "core":"MissingSegmentRecoveryTest_shard1_replica1",
  [junit4]    >           "base_url":"https://127.0.0.1:60247/solr",
  [junit4]    >           "node_name":"127.0.0.1:60247_solr",
  [junit4]    >           "state":"active",
  [junit4]    >           "leader":"true"}}}},
  [junit4]    >   "router":{"name":"compositeId"},
  [junit4]    >   "maxShardsPerNode":"1",
  [junit4]    >   "autoAddReplicas":"false"}
  [junit4]    >         at 
__randomizedtesting.SeedInfo.seed([B800C15EC6F11C02:E855595D9FD0AA1F]:0)
  [junit4]    >         at 
org.apache.solr.cloud.SolrCloudTestCase.waitForState(SolrCloudTestCase.java:265)
  [junit4]    >         at 
org.apache.solr.cloud.MissingSegmentRecoveryTest.testLeaderRecovery(MissingSegmentRecoveryTest.java:105)
[...]
  [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): 
{_version_=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128))),
 id=FST50}, docValues:{}, maxPointsInLeafNode=1106, 
maxMBSortInHeap=6.191537660994534, sim=RandomSimilarity(queryNorm=true): {}, 
locale=fi-FI, timezone=Asia/Famagusta
  [junit4]   2> NOTE: Linux 3.13.0-85-generic amd64/Oracle Corporation 
1.8.0_121 (64-bit)/cpus=4,threads=1,free=138683768,total=527433728
{noformat}


> Add more graceful recovery steps when failing to create SolrCore
> ----------------------------------------------------------------
>
>                 Key: SOLR-9836
>                 URL: https://issues.apache.org/jira/browse/SOLR-9836
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Mike Drob
>            Assignee: Mark Miller
>             Fix For: 6.5, master (7.0)
>
>         Attachments: SOLR-9836.patch, SOLR-9836.patch, SOLR-9836.patch, 
> SOLR-9836.patch, SOLR-9836.patch, SOLR-9836.patch, SOLR-9836.patch
>
>
> I have seen several cases where there is a zero-length segments_n file. We 
> haven't identified the root cause of these issues (possibly a poorly timed 
> crash during replication?) but if there is another node available then Solr 
> should be able to recover from this situation. Currently, we log and give up 
> on loading that core, leaving the user to manually intervene.
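
For anyone wanting to check an index directory for the zero-length segments_n condition described above, here is a minimal sketch using only java.nio. It is not part of any SOLR-9836 patch; the class name {{EmptySegmentsCheck}}, the method name, and the default {{data/index}} path are hypothetical and chosen only for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; not Solr or Lucene code.
public class EmptySegmentsCheck {

    /** Returns every regular file matching segments_* in indexDir that is zero bytes long. */
    static List<Path> findEmptySegmentsFiles(Path indexDir) throws IOException {
        List<Path> empty = new ArrayList<>();
        // The glob matches commit files such as segments_2.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(indexDir, "segments_*")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p) && Files.size(p) == 0L) {
                    empty.add(p);
                }
            }
        }
        return empty;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; pass the real index directory as the first argument.
        Path indexDir = Paths.get(args.length > 0 ? args[0] : "data/index");
        if (!Files.isDirectory(indexDir)) {
            System.out.println("not a directory: " + indexDir);
            return;
        }
        for (Path p : findEmptySegmentsFiles(indexDir)) {
            System.out.println("zero-length commit file: " + p);
        }
    }
}
```

This only detects the symptom. It does not attempt the recovery from another replica that the issue asks for.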



