[
https://issues.apache.org/jira/browse/SOLR-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502084#comment-13502084
]
Markus Jelsma commented on SOLR-4032:
-------------------------------------
I'm not exactly sure on how to precisely reproduce this condition but i've been
three times successful in doing so by randomly shutting down some nodes cleanly
and make sure at least one node still has the shard while indexing a few
million documents from a few reducers in Hadoop.
Here's one snippet where a node cannot download:
{code}
2012-11-20 14:44:17,365 INFO [solr.handler.ReplicationHandler] -
[RecoveryThread] - : Commits will be reserved for 10000
2012-11-20 14:44:17,365 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - :
Creating new http client,
config:connTimeout=5000&socketTimeout=20000&allowCompression=false&maxConnections=10000&maxConnect
ionsPerHost=10000
2012-11-20 14:44:17,367 INFO [solr.core.SolrCore] - [http-8080-exec-4] - :
[shard_a] webapp=/solr path=/admin/ping params={} status=503 QTime=0
2012-11-20 14:44:17,377 INFO [solr.handler.SnapPuller] - [RecoveryThread] - :
No value set for 'pollInterval'. Timer Task not started.
2012-11-20 14:44:17,440 INFO [solr.handler.SnapPuller] - [RecoveryThread] - :
Master's generation: 62
2012-11-20 14:44:17,440 INFO [solr.handler.SnapPuller] - [RecoveryThread] - :
Slave's generation: 62
2012-11-20 14:44:17,440 INFO [solr.handler.SnapPuller] - [RecoveryThread] - :
Starting replication process
2012-11-20 14:44:17,449 INFO [solr.handler.SnapPuller] - [RecoveryThread] - :
Number of files in latest index in master: 287
2012-11-20 14:44:17,450 WARN [solr.core.CachingDirectoryFactory] -
[RecoveryThread] - : No lockType configured for
NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_j/data/index.20121120144417450
lockFactory=org.apache.lucene.store.NativeFSLockFactory@5166b0df;
maxCacheMB=48.0 maxMergeSizeMB=4.0) assuming 'simple'
2012-11-20 14:44:17,450 INFO [solr.core.CachingDirectoryFactory] -
[RecoveryThread] - : return new directory for
/opt/solr/cores/shard_j/data/index.20121120144417450 forceNew:false
2012-11-20 14:44:17,450 INFO [solr.core.CachingDirectoryFactory] -
[RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_j/data
2012-11-20 14:44:17,451 INFO [solr.handler.SnapPuller] - [RecoveryThread] - :
Starting download to
NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_j/data/index.20121120144417450
lockFactory=org.apache.lucene.store.SimpleFSLockFactory@3ade1b36;
maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true
2012-11-20 14:44:18,298 WARN [solr.handler.SnapPuller] - [RecoveryThread] - :
Error in fetching packets
java.io.EOFException
at
org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:151)
at
org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:144)
at
org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchPackets(SnapPuller.java:1143)
at
org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1107)
at
org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:716)
at
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387)
at
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273)
at
org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152)
at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
2012-11-20 14:44:18,314 INFO [solr.core.CachingDirectoryFactory] -
[RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_j/data/index
2012-11-20 14:44:18,314 INFO [solr.handler.SnapPuller] - [RecoveryThread] - :
removing temporary index download directory files
NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_j/data/index.20121120144417450
lockFactory=org.apache.lucene.store.SimpleFSLockFactory@3ade1b36;
maxCacheMB=48.0 maxMergeSizeMB=4.0)
2012-11-20 14:44:18,372 INFO [solr.core.CachingDirectoryFactory] -
[RecoveryThread] - : Releasing
directory:/opt/solr/cores/shard_j/data/index.20121120144417450
2012-11-20 14:44:18,374 INFO [solr.core.CachingDirectoryFactory] -
[RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_j/data/index
2012-11-20 14:44:18,374 ERROR [solr.handler.ReplicationHandler] -
[RecoveryThread] - : SnapPull failed :org.apache.solr.common.SolrException:
Unable to download _1t6.fdt completely. Downloaded 9437184!=9501680
at
org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.cleanup(SnapPuller.java:1237)
at
org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1118)
at
org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:716)
at
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387)
at
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273)
at
org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152)
at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
2012-11-20 14:44:18,374 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread]
- : Error while trying to recover:org.apache.solr.common.SolrException:
Replication for recovery failed.
at
org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:155)
at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
2012-11-20 14:44:18,374 INFO [solr.update.UpdateLog] - [RecoveryThread] - :
Dropping buffered updates FSUpdateLog{state=BUFFERING,
tlog=tlog{file=/opt/solr/cores/shard_j/data/tlog/tlog.0000000000000000059
refcount=1}}
2012-11-20 14:44:18,374 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread]
- : Recovery failed - trying again... (0) core=shard_j
2012-11-20 14:44:18,374 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] -
: Wait 2.0 seconds before trying to recover again (1)
{code}
And here's a snippet where a node has trouble with it's core instance directory:
{code}
2012-11-20 14:45:54,863 ERROR [solr.core.DirectoryFactory] - [RecoveryThread] -
: Error deleting files from
Directory:org.apache.lucene.store.NoSuchDirectoryException: directory
'/opt/solr/cores/shard_b/data/index.20121120144554468' does not exist
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:217)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:240)
at
org.apache.lucene.store.NRTCachingDirectory.listAll(NRTCachingDirectory.java:132)
at
org.apache.solr.core.DirectoryFactory.empty(DirectoryFactory.java:174)
at
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:469)
at
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273)
at
org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152)
at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
2012-11-20 14:45:54,866 INFO [solr.core.CachingDirectoryFactory] -
[RecoveryThread] - : Releasing
directory:/opt/solr/cores/shard_b/data/index.20121120144554468
2012-11-20 14:45:54,866 INFO [solr.core.CachingDirectoryFactory] -
[RecoveryThread] - : Releasing
directory:/opt/solr/cores/shard_b/data/index.20121120122241377
2012-11-20 14:45:54,866 ERROR [solr.handler.ReplicationHandler] -
[RecoveryThread] - : SnapPull failed :org.apache.solr.common.SolrException:
Unable to download _22j_Lucene41_0.tim completely. Downloaded 4194304!=4498323
at
org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.cleanup(SnapPuller.java:1237)
at
org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1118)
at
org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:716)
at
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387)
at
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273)
at
org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152)
at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
2012-11-20 14:45:54,866 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread]
- : Error while trying to recover:org.apache.solr.common.SolrException:
Replication for recovery failed.
at
org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:155)
at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
2012-11-20 14:45:54,866 INFO [solr.update.UpdateLog] - [RecoveryThread] - :
Dropping buffered updates FSUpdateLog{state=BUFFERING,
tlog=tlog{file=/opt/solr/cores/shard_b/data/tlog/tlog.0000000000000000059
refcount=1}}
2012-11-20 14:45:54,867 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread]
- : Recovery failed - trying again... (0) core=shard_b
2012-11-20 14:45:54,867 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] -
: Wait 2.0 seconds before trying to recover again (1)
2012-11-20 14:45:54,979 WARN [solr.handler.SnapPuller] - [RecoveryThread] - :
Error in fetching packets
java.io.EOFException
at
org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:151)
at
org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:144)
at
org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchPackets(SnapPuller.java:1143)
at
org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1107)
at
org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:716)
at
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387)
at
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273)
at
org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152)
at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
2012-11-20 14:45:54,985 INFO [solr.core.CachingDirectoryFactory] -
[RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_a/data/index
{code}
There AIOOBException seems to sometimes accompany when other exceptions occur.
{code}
2012-11-20 14:46:06,963 INFO [solr.handler.ReplicationHandler] -
[http-8080-exec-6] - : Commits will be reserved for 10000
2012-11-20 14:46:06,963 INFO [solr.core.SolrCore] - [http-8080-exec-6] - :
[shard_a] webapp=/solr path=/replication
params={command=indexversion&wt=javabin&qt=/replication&version=2} status=0
QTime=1
2012-11-20 14:46:06,967 INFO [solr.core.CachingDirectoryFactory] -
[http-8080-exec-1] - : Releasing directory:/opt/solr/cores/shard_a/data
2012-11-20 14:46:06,971 INFO [solr.core.CachingDirectoryFactory] -
[http-8080-exec-1] - : Releasing directory:/opt/solr/cores/shard_a/data/index
2012-11-20 14:46:06,971 INFO [solr.core.SolrCore] - [http-8080-exec-1] - :
[shard_a] webapp=/solr path=/replication
params={command=filelist&generation=63&wt=javabin&qt=/replication&version=2}
status=0 QTime=4
2012-11-20 14:46:06,982 INFO [solr.core.SolrCore] - [http-8080-exec-2] - :
[shard_a] webapp=/solr path=/replication
params={file=_216_Lucene41_0.doc&command=filecontent&checksum=true&generation=63&qt=/replication&wt=filestream}
status=0 QTime=2
2012-11-20 14:46:06,989 INFO [solr.core.CachingDirectoryFactory] -
[http-8080-exec-5] - : Releasing directory:/opt/solr/cores/shard_a/data/index
2012-11-20 14:46:06,997 INFO [solr.core.CachingDirectoryFactory] -
[http-8080-exec-5] - : Releasing directory:/opt/solr/cores/shard_j/data/index
2012-11-20 14:46:07,018 INFO [solr.core.SolrCore] - [http-8080-exec-6] - :
[shard_a] webapp=/solr path=/admin/system params={wt=json} status=0 QTime=5
2012-11-20 14:46:07,031 INFO [solr.core.SolrCore] - [http-8080-exec-4] - :
[shard_a] webapp=/solr path=/replication
params={file=_216_Lucene41_0.pos&command=filecontent&checksum=true&generation=63&qt=/replication&wt=filestream}
status=0 QTime=0
2012-11-20 14:46:07,067 INFO [solr.core.SolrCore] - [http-8080-exec-2] - :
[shard_a] webapp=/solr path=/replication
params={file=_22q_Lucene41_0.tip&command=filecontent&checksum=true&generation=63&qt=/replication&wt=filestream}
status=0 QTime=0
2012-11-20 14:46:07,070 INFO [solr.core.SolrCore] - [http-8080-exec-1] - :
[shard_a] webapp=/solr path=/replication
params={file=_1tu_Lucene41_0.doc&command=filecontent&checksum=true&generation=63&qt=/replication&wt=filestream}
status=0 QTime=0
2012-11-20 14:46:07,132 INFO [solr.core.SolrCore] - [http-8080-exec-6] - :
[shard_a] webapp=/solr path=/replication
params={file=_1t6.fdt&command=filecontent&checksum=true&generation=63&qt=/replication&wt=filestream}
status=0 QTime=0
2012-11-20 14:46:07,334 INFO [solr.core.SolrCore] - [http-8080-exec-2] - :
[shard_a] webapp=/solr path=/admin/ping params={} hits=29 status=0 QTime=15
2012-11-20 14:46:07,334 INFO [solr.core.SolrCore] - [http-8080-exec-2] - :
[shard_a] webapp=/solr path=/admin/ping params={} status=0 QTime=15
2012-11-20 14:46:07,353 INFO [solr.core.SolrCore] - [http-8080-exec-1] - :
[shard_a] webapp=/solr path=/admin/ping params={} hits=29 status=0 QTime=17
2012-11-20 14:46:07,353 INFO [solr.core.SolrCore] - [http-8080-exec-1] - :
[shard_a] webapp=/solr path=/admin/ping params={} status=0 QTime=17
2012-11-20 14:46:07,483 WARN [solr.handler.ReplicationHandler] -
[http-8080-exec-6] - : Exception while writing response for params:
file=_1t6.fdt&command=filecontent&checksum=true&generation=63&qt=/replication&wt=filestream
java.io.EOFException: read past EOF:
MMapIndexInput(path="/opt/solr/cores/shard_a/data/index/_1t6.fdt")
at
org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:100)
at
org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065)
at
org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932)
at
org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:425)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at
org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
at
org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
at
org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
2012-11-20 14:46:07,494 INFO [solr.core.SolrCore] - [http-8080-exec-2] - :
[shard_a] webapp=/solr path=/replication
params={file=_1t6.fdt&command=filecontent&checksum=true&offset=4194304&generation=63&qt=/replication&wt=filestream}
status=0 QTime=1
2012-11-20 14:46:07,495 ERROR [solr.servlet.SolrDispatchFilter] -
[http-8080-exec-2] - : null:java.lang.IndexOutOfBoundsException
at java.nio.Buffer.checkBounds(Buffer.java:530)
at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:218)
at
org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:91)
at
org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065)
at
org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932)
at
org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:425)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at
org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
at
org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
at
org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
{code}
I'm positive that the above exceptions do not occur when i index a few million
docs without taking down nodes.
You're right, it's on trunk indeed, not 4.x.
> Unable to replicate between nodes ( read past EOF)
> --------------------------------------------------
>
> Key: SOLR-4032
> URL: https://issues.apache.org/jira/browse/SOLR-4032
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.0
> Environment: 5.0-SNAPSHOT 1366361:1404534M - markus - 2012-11-01
> 12:37:38
> Debian Squeeze, Tomcat 6, Sun Java 6, 10 nodes, 10 shards, rep. factor 2.
> Reporter: Markus Jelsma
> Assignee: Mark Miller
> Fix For: 4.1, 5.0
>
>
> Please see:
> http://lucene.472066.n3.nabble.com/trunk-is-unable-to-replicate-between-nodes-Unable-to-download-completely-td4017049.html
> and
> http://lucene.472066.n3.nabble.com/Possible-memory-leak-in-recovery-td4017833.html
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]