[ https://issues.apache.org/jira/browse/SOLR-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505530#comment-13505530 ]
Eks Dev edited comment on SOLR-4117 at 11/28/12 3:27 PM:
---------------------------------------------------------
FWIW, we *think* we observed the following problem in a simple master/slave setup with NRTCachingDirectory. I am not sure it is related to this issue, because we did not see this exception. Anyhow: on replication the slave gets the index from the master and works fine; then on:

1. graceful restart, the world looks fine
2. kill -9 or the like, Solr does not start because the index is corrupt (which should actually not happen)

We speculate that Solr now replicates directly into the Directory implementation and does not ensure that the replicated files are completely fsync'ed after replication. As far as I remember, replication used to go to a temp directory on disk and then move the files into place if all went ok, working under the assumption that everything was already persisted. Maybe this invariant no longer holds and an explicit fsync is needed for caching directories?

I might be completely wrong; we only observed the symptoms in a not very debug-friendly environment. Here is the exception after a "hard" restart:

Caused by: org.apache.solr.common.SolrException: Error opening new searcher
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:804)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:618)
    at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:973)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1003)
    ... 10 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1441)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1553)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:779)
    ... 13 more
Caused by: java.io.FileNotFoundException: ...\core0\data\index\segments_1 (The system cannot find the file specified)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
    at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:222)
    at org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:232)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:281)
    at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:56)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:668)
    at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:87)
    at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:34)
    at org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:120)
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1417)
    ....
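The "explicit fsync" point above can be illustrated with plain JDK I/O. This is not Lucene's or Solr's replication code, just a sketch of the durability guarantee in question (in Lucene itself, Directory.sync() plays this role for index files): without forcing the written bytes to stable storage, a kill -9 shortly after replication can leave files such as segments_1 empty or missing, which matches the FileNotFoundException above.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FsyncSketch {
    // Write bytes to 'path' and force them to stable storage before returning.
    // Without the force() call, the data may still sit in the OS page cache
    // when the process is killed, and the file can come back empty or absent.
    static void writeDurably(Path path, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ch.write(ByteBuffer.wrap(data));
            ch.force(true); // fsync: flush file data and metadata to disk
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("repl");
        Path seg = dir.resolve("segments_1"); // stand-in for a replicated file
        writeDurably(seg, "dummy segment data".getBytes(StandardCharsets.UTF_8));
        System.out.println("exists=" + Files.exists(seg));
    }
}
```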
> IO error while trying to get the size of the Directory
> ------------------------------------------------------
>
>                 Key: SOLR-4117
>                 URL: https://issues.apache.org/jira/browse/SOLR-4117
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.0
>         Environment: 5.0.0.2012.11.28.10.42.06
>                      Debian Squeeze, Tomcat 6, Sun Java 6, 10 nodes, 10 shards, rep. factor 2.
>            Reporter: Markus Jelsma
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 5.0
>
> With SOLR-4032 fixed we see other issues when randomly taking down nodes (nicely via tomcat restart) while indexing a few million web pages from Hadoop. We do make sure that at least one node is up for a shard, but due to recovery issues it may not be live.
> One node seems to work but generates IO errors in the log and a ZooKeeperException in the GUI. In the GUI we only see:
> {code}
> SolrCore Initialization Failures
> openindex_f: org.apache.solr.common.cloud.ZooKeeperException:org.apache.solr.common.cloud.ZooKeeperException:
> Please check your logs for more information
> {code}
> and in the log we only see the following exception:
> {code}
> 2012-11-28 11:47:26,652 ERROR [solr.handler.ReplicationHandler] - [http-8080-exec-28] - : IO error while trying to get the size of the Directory:org.apache.lucene.store.NoSuchDirectoryException: directory '/opt/solr/cores/shard_f/data/index' does not exist
>     at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:217)
>     at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:240)
>     at org.apache.lucene.store.NRTCachingDirectory.listAll(NRTCachingDirectory.java:132)
>     at org.apache.solr.core.DirectoryFactory.sizeOfDirectory(DirectoryFactory.java:146)
>     at org.apache.solr.handler.ReplicationHandler.getIndexSize(ReplicationHandler.java:472)
>     at org.apache.solr.handler.ReplicationHandler.getReplicationDetails(ReplicationHandler.java:568)
>     at org.apache.solr.handler.ReplicationHandler.handleRequestBody(ReplicationHandler.java:213)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
>     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1830)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:476)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
>     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
>     at org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
>     at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
>     at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:662)
> {code}
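In the trace above, DirectoryFactory.sizeOfDirectory calls listAll(), which throws NoSuchDirectoryException when the index directory is missing (e.g. mid-replication or after the old index was removed). A minimal JDK-only sketch of the kind of existence guard that would report a size of 0 instead of logging an IO error (the helper name is made up; this is not Solr's actual code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class IndexSizeSketch {
    // Sum the sizes of regular files in 'indexDir'. If the directory does
    // not exist (yet), treat it as an empty index rather than throwing.
    static long sizeOfIndexDir(Path indexDir) throws IOException {
        if (!Files.isDirectory(indexDir)) {
            return 0L; // directory missing: report 0 instead of erroring
        }
        try (Stream<Path> files = Files.list(indexDir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        // A path that does not exist: the guard returns 0 instead of throwing.
        Path missing = Paths.get("no-such-dir").resolve("index");
        System.out.println("size=" + sizeOfIndexDir(missing));
    }
}
```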