[
https://issues.apache.org/jira/browse/SOLR-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895044#comment-15895044
]
Steve Rowe commented on SOLR-9836:
----------------------------------
{{MissingSegmentRecoveryTest.testLeaderRecovery()}} has been failing pretty
regularly on Jenkins. Something happened on or about February 10th, when the
probability of failure went up considerably (and has since remained at this
elevated level).
I got 3 failures in 100 iterations when beasting the test suite on my box
with Miller's beasting script. Over the past three weeks, however, I've seen
this failure several times a day on my Jenkins, and roughly once a day on
either ASF or Policeman Jenkins.
Here's a recent failure
[https://builds.apache.org/job/Lucene-Solr-Tests-master/1699/]:
{noformat}
[junit4] 2> 599977 ERROR
(coreLoadExecutor-3254-thread-1-processing-n:127.0.0.1:41308_solr)
[n:127.0.0.1:41308_solr c:MissingSegmentRecoveryTest s:shard1 r:core_node1
x:MissingSegmentRecoveryTest_shard1_replica2] o.a.s.u.SolrIndexWriter Error
closing IndexWriter
[junit4] 2> java.nio.file.NoSuchFileException:
/x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Tests-master/solr/build/solr-core/test/J2/temp/solr.cloud.MissingSegmentRecoveryTest_B800C15EC6F11C02-001/tempDir-001/node2/MissingSegmentRecoveryTest_shard1_replica2/data/index.20170228030909468/write.lock
[junit4] 2> at
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
[junit4] 2> at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
[junit4] 2> at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
[junit4] 2> at
sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
[junit4] 2> at
sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
[junit4] 2> at
sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
[junit4] 2> at java.nio.file.Files.readAttributes(Files.java:1737)
[junit4] 2> at
org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:177)
[junit4] 2> at
org.apache.lucene.store.LockValidatingDirectoryWrapper.sync(LockValidatingDirectoryWrapper.java:67)
[junit4] 2> at
org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4698)
[junit4] 2> at
org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3093)
[junit4] 2> at
org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3227)
[junit4] 2> at
org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1136)
[junit4] 2> at
org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1179)
[junit4] 2> at
org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:291)
[junit4] 2> at
org.apache.solr.core.SolrCore.initIndex(SolrCore.java:728)
[junit4] 2> at
org.apache.solr.core.SolrCore.<init>(SolrCore.java:911)
[junit4] 2> at
org.apache.solr.core.SolrCore.<init>(SolrCore.java:828)
[junit4] 2> at
org.apache.solr.core.CoreContainer.processCoreCreateException(CoreContainer.java:1011)
[junit4] 2> at
org.apache.solr.core.CoreContainer.create(CoreContainer.java:939)
[junit4] 2> at
org.apache.solr.core.CoreContainer.lambda$load$3(CoreContainer.java:572)
[junit4] 2> at
com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:197)
[junit4] 2> at
java.util.concurrent.FutureTask.run(FutureTask.java:266)
[junit4] 2> at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
[junit4] 2> at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[junit4] 2> at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[...]
[junit4] 2> 600005 ERROR
(coreContainerWorkExecutor-3250-thread-1-processing-n:127.0.0.1:41308_solr)
[n:127.0.0.1:41308_solr ] o.a.s.c.CoreContainer Error waiting for SolrCore
to be created
[junit4] 2> java.util.concurrent.ExecutionException:
org.apache.solr.common.SolrException: Unable to create core
[MissingSegmentRecoveryTest_shard1_replica2]
[junit4] 2> at
java.util.concurrent.FutureTask.report(FutureTask.java:122)
[junit4] 2> at
java.util.concurrent.FutureTask.get(FutureTask.java:192)
[junit4] 2> at
org.apache.solr.core.CoreContainer.lambda$load$4(CoreContainer.java:600)
[junit4] 2> at
com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
[junit4] 2> at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[junit4] 2> at
java.util.concurrent.FutureTask.run(FutureTask.java:266)
[junit4] 2> at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
[junit4] 2> at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[junit4] 2> at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[junit4] 2> at java.lang.Thread.run(Thread.java:745)
[junit4] 2> Caused by: org.apache.solr.common.SolrException: Unable to
create core [MissingSegmentRecoveryTest_shard1_replica2]
[junit4] 2> at
org.apache.solr.core.CoreContainer.create(CoreContainer.java:952)
[junit4] 2> at
org.apache.solr.core.CoreContainer.lambda$load$3(CoreContainer.java:572)
[junit4] 2> at
com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:197)
[junit4] 2> ... 5 more
[junit4] 2> Caused by: org.apache.solr.common.SolrException: Error opening
new searcher
[junit4] 2> at
org.apache.solr.core.SolrCore.<init>(SolrCore.java:964)
[junit4] 2> at
org.apache.solr.core.SolrCore.<init>(SolrCore.java:828)
[junit4] 2> at
org.apache.solr.core.CoreContainer.processCoreCreateException(CoreContainer.java:1011)
[junit4] 2> at
org.apache.solr.core.CoreContainer.create(CoreContainer.java:939)
[junit4] 2> ... 7 more
[junit4] 2> Suppressed: org.apache.solr.common.SolrException: Error
opening new searcher
[junit4] 2> at
org.apache.solr.core.SolrCore.<init>(SolrCore.java:964)
[junit4] 2> at
org.apache.solr.core.SolrCore.<init>(SolrCore.java:828)
[junit4] 2> at
org.apache.solr.core.CoreContainer.create(CoreContainer.java:937)
[junit4] 2> ... 7 more
[junit4] 2> Caused by: org.apache.solr.common.SolrException: Error
opening new searcher
[junit4] 2> at
org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2005)
[junit4] 2> at
org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2125)
[junit4] 2> at
org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1053)
[junit4] 2> at
org.apache.solr.core.SolrCore.<init>(SolrCore.java:937)
[junit4] 2> ... 9 more
[junit4] 2> Caused by:
org.apache.lucene.index.CorruptIndexException: Unexpected file read error while
reading index.
(resource=BufferedChecksumIndexInput(MMapIndexInput(path="/x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Tests-master/solr/build/solr-core/test/J2/temp/solr.cloud.MissingSegmentRecoveryTest_B800C15EC6F11C02-001/tempDir-001/node2/MissingSegmentRecoveryTest_shard1_replica2/data/index/segments_2")))
[junit4] 2> at
org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:286)
[junit4] 2> at
org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:938)
[junit4] 2> at
org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:125)
[junit4] 2> at
org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:100)
[junit4] 2> at
org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:240)
[junit4] 2> at
org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:114)
[junit4] 2> at
org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1966)
[junit4] 2> ... 12 more
[junit4] 2> Caused by: java.io.EOFException: read past EOF:
MMapIndexInput(path="/x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Tests-master/solr/build/solr-core/test/J2/temp/solr.cloud.MissingSegmentRecoveryTest_B800C15EC6F11C02-001/tempDir-001/node2/MissingSegmentRecoveryTest_shard1_replica2/data/index/segments_2")
[junit4] 2> at
org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:75)
[junit4] 2> at
org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
[junit4] 2> at
org.apache.lucene.store.DataInput.readInt(DataInput.java:101)
[junit4] 2> at
org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:296)
[junit4] 2> at
org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:284)
[junit4] 2> ... 18 more
[junit4] 2> Caused by: org.apache.solr.common.SolrException: Error opening
new searcher
[junit4] 2> at
org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2005)
[junit4] 2> at
org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2125)
[junit4] 2> at
org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1053)
[junit4] 2> at
org.apache.solr.core.SolrCore.<init>(SolrCore.java:937)
[junit4] 2> ... 10 more
[junit4] 2> Caused by: org.apache.lucene.index.IndexNotFoundException: no
segments* file found in
LockValidatingDirectoryWrapper(MMapDirectory@/x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Tests-master/solr/build/solr-core/test/J2/temp/solr.cloud.MissingSegmentRecoveryTest_B800C15EC6F11C02-001/tempDir-001/node2/MissingSegmentRecoveryTest_shard1_replica2/data/index.20170228030909468
lockFactory=org.apache.lucene.store.NativeFSLockFactory@74782755): files:
[write.lock]
[junit4] 2> at
org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:933)
[junit4] 2> at
org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:125)
[junit4] 2> at
org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:100)
[junit4] 2> at
org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:240)
[junit4] 2> at
org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:114)
[junit4] 2> at
org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1966)
[...]
[junit4] 2> NOTE: reproduce with: ant test
-Dtestcase=MissingSegmentRecoveryTest -Dtests.method=testLeaderRecovery
-Dtests.seed=B800C15EC6F11C02 -Dtests.multiplier=2 -Dtests.slow=true
-Dtests.locale=fi-FI -Dtests.timezone=Asia/Famagusta -Dtests.asserts=true
-Dtests.file.encoding=ISO-8859-1
[junit4] FAILURE 94.6s J2 | MissingSegmentRecoveryTest.testLeaderRecovery <<<
[junit4] > Throwable #1: java.lang.AssertionError: Expected a collection
with one shard and two replicas
[junit4] > null
[junit4] > Last available state:
DocCollection(MissingSegmentRecoveryTest//collections/MissingSegmentRecoveryTest/state.json/6)={
[junit4] > "replicationFactor":"2",
[junit4] > "shards":{"shard1":{
[junit4] > "range":"80000000-7fffffff",
[junit4] > "state":"active",
[junit4] > "replicas":{
[junit4] > "core_node1":{
[junit4] > "core":"MissingSegmentRecoveryTest_shard1_replica2",
[junit4] > "base_url":"https://127.0.0.1:41308/solr",
[junit4] > "node_name":"127.0.0.1:41308_solr",
[junit4] > "state":"down"},
[junit4] > "core_node2":{
[junit4] > "core":"MissingSegmentRecoveryTest_shard1_replica1",
[junit4] > "base_url":"https://127.0.0.1:60247/solr",
[junit4] > "node_name":"127.0.0.1:60247_solr",
[junit4] > "state":"active",
[junit4] > "leader":"true"}}}},
[junit4] > "router":{"name":"compositeId"},
[junit4] > "maxShardsPerNode":"1",
[junit4] > "autoAddReplicas":"false"}
[junit4] > at
__randomizedtesting.SeedInfo.seed([B800C15EC6F11C02:E855595D9FD0AA1F]:0)
[junit4] > at
org.apache.solr.cloud.SolrCloudTestCase.waitForState(SolrCloudTestCase.java:265)
[junit4] > at
org.apache.solr.cloud.MissingSegmentRecoveryTest.testLeaderRecovery(MissingSegmentRecoveryTest.java:105)
[...]
[junit4] 2> NOTE: test params are: codec=Asserting(Lucene70):
{_version_=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128))),
id=FST50}, docValues:{}, maxPointsInLeafNode=1106,
maxMBSortInHeap=6.191537660994534, sim=RandomSimilarity(queryNorm=true): {},
locale=fi-FI, timezone=Asia/Famagusta
[junit4] 2> NOTE: Linux 3.13.0-85-generic amd64/Oracle Corporation
1.8.0_121 (64-bit)/cpus=4,threads=1,free=138683768,total=527433728
{noformat}
> Add more graceful recovery steps when failing to create SolrCore
> ----------------------------------------------------------------
>
> Key: SOLR-9836
> URL: https://issues.apache.org/jira/browse/SOLR-9836
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: SolrCloud
> Reporter: Mike Drob
> Assignee: Mark Miller
> Fix For: 6.5, master (7.0)
>
> Attachments: SOLR-9836.patch, SOLR-9836.patch, SOLR-9836.patch,
> SOLR-9836.patch, SOLR-9836.patch, SOLR-9836.patch, SOLR-9836.patch
>
>
> I have seen several cases where there is a zero-length segments_n file. We
> haven't identified the root cause of these issues (possibly a poorly timed
> crash during replication?) but if there is another node available then Solr
> should be able to recover from this situation. Currently, we log and give up
> on loading that core, leaving the user to manually intervene.
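
The zero-length segments_n condition described above can be checked for
directly on disk. The following is only a minimal sketch using plain
java.nio, not the recovery logic from the attached patches: the class name
ZeroLengthSegmentsCheck and the method findEmptySegmentsFiles are
hypothetical, and the sketch assumes the index directory is a local
filesystem path.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Solr or Lucene.
final class ZeroLengthSegmentsCheck {

    /** Returns every regular file named segments_* in indexDir whose size is zero bytes. */
    static List<Path> findEmptySegmentsFiles(Path indexDir) throws IOException {
        List<Path> empty = new ArrayList<>();
        // The glob is matched against the file name only, so this picks up
        // segments_1, segments_2, ... but not write.lock or segment data files.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(indexDir, "segments_*")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p) && Files.size(p) == 0L) {
                    empty.add(p);
                }
            }
        }
        return empty;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory with one empty and one non-empty commit file.
        Path indexDir = Files.createTempDirectory("index");
        Files.createFile(indexDir.resolve("segments_2"));
        Files.write(indexDir.resolve("segments_1"), new byte[] {1, 2, 3});
        System.out.println(findEmptySegmentsFiles(indexDir));
    }
}
```

A check like this only detects the symptom. Whether the core can then
recover still depends on another replica being available, as the
description notes.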