[ https://issues.apache.org/jira/browse/SOLR-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505530#comment-13505530 ]
Eks Dev edited comment on SOLR-4117 at 11/28/12 3:27 PM:
---------------------------------------------------------
FWIW, we *think* we observed the following problem in a simple master/slave setup with NRTCachingDirectory. I am not sure it is related to this issue, because we did not see this exception. Anyhow: on replication the slave gets the index from the master and works fine; then on:

1. graceful restart, the world looks fine
2. kill -9 or the like, Solr does not start because the index is corrupt (which should actually not happen)

We speculate that Solr now replicates directly into the Directory implementation and does not ensure that the replicated files are completely fsync'ed after replication. As far as I remember, replication used to go to a temp directory on disk and then move the files into place if all went ok, working under the assumption that everything was already persisted. Maybe this invariant no longer holds and an explicit fsync is needed for caching directories?

I might be completely wrong; we only observed the symptoms in a not very debug-friendly environment. Here is the exception after a "hard" restart:

Caused by: org.apache.solr.common.SolrException: Error opening new searcher
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:804)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:618)
    at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:973)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1003)
    ... 10 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1441)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1553)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:779)
    ... 13 more
Caused by: java.io.FileNotFoundException: ...\core0\data\index\segments_1 (The system cannot find the file specified)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
    at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:222)
    at org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:232)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:281)
    at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:56)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:668)
    at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:87)
    at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:34)
    at org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:120)
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1417)
    ....
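The "explicit fsync" point above can be illustrated with plain JDK I/O. This is not Lucene's or Solr's replication code, just a sketch of the durability guarantee in question (in Lucene itself, Directory.sync() plays this role for index files): without forcing the written bytes to stable storage, a kill -9 shortly after replication can leave files such as segments_1 empty or missing, which matches the FileNotFoundException above.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FsyncSketch {
    // Write bytes to 'path' and force them to stable storage before returning.
    // Without the force() call, the data may still sit in the OS page cache
    // when the process is killed, and the file can come back empty or absent.
    static void writeDurably(Path path, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ch.write(ByteBuffer.wrap(data));
            ch.force(true); // fsync: flush file data and metadata to disk
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("repl");
        Path seg = dir.resolve("segments_1"); // stand-in for a replicated file
        writeDurably(seg, "dummy segment data".getBytes(StandardCharsets.UTF_8));
        System.out.println("exists=" + Files.exists(seg));
    }
}
```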
> IO error while trying to get the size of the Directory
> ------------------------------------------------------
>
>                 Key: SOLR-4117
>                 URL: https://issues.apache.org/jira/browse/SOLR-4117
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.0
>         Environment: 5.0.0.2012.11.28.10.42.06
>                      Debian Squeeze, Tomcat 6, Sun Java 6, 10 nodes, 10 shards, rep. factor 2.
>            Reporter: Markus Jelsma
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 5.0
>
> With SOLR-4032 fixed we see other issues when randomly taking down nodes (nicely via tomcat restart) while indexing a few million web pages from Hadoop. We do make sure that at least one node is up for a shard, but due to recovery issues it may not be live.
> One node seems to work but generates IO errors in the log and a ZooKeeperException in the GUI. In the GUI we only see:
> {code}
> SolrCore Initialization Failures
> openindex_f: org.apache.solr.common.cloud.ZooKeeperException:org.apache.solr.common.cloud.ZooKeeperException:
> Please check your logs for more information
> {code}
> and in the log we only see the following exception:
> {code}
> 2012-11-28 11:47:26,652 ERROR [solr.handler.ReplicationHandler] - [http-8080-exec-28] - : IO error while trying to get the size of the Directory:org.apache.lucene.store.NoSuchDirectoryException: directory '/opt/solr/cores/shard_f/data/index' does not exist
>     at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:217)
>     at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:240)
>     at org.apache.lucene.store.NRTCachingDirectory.listAll(NRTCachingDirectory.java:132)
>     at org.apache.solr.core.DirectoryFactory.sizeOfDirectory(DirectoryFactory.java:146)
>     at org.apache.solr.handler.ReplicationHandler.getIndexSize(ReplicationHandler.java:472)
>     at org.apache.solr.handler.ReplicationHandler.getReplicationDetails(ReplicationHandler.java:568)
>     at org.apache.solr.handler.ReplicationHandler.handleRequestBody(ReplicationHandler.java:213)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
>     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1830)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:476)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
>     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
>     at org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
>     at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
>     at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:662)
> {code}
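In the trace above, DirectoryFactory.sizeOfDirectory calls listAll(), which throws NoSuchDirectoryException when the index directory is missing (e.g. mid-replication or after the old index was removed). A minimal JDK-only sketch of the kind of existence guard that would report a size of 0 instead of logging an IO error (the helper name is made up; this is not Solr's actual code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class IndexSizeSketch {
    // Sum the sizes of regular files in 'indexDir'. If the directory does
    // not exist (yet), treat it as an empty index rather than throwing.
    static long sizeOfIndexDir(Path indexDir) throws IOException {
        if (!Files.isDirectory(indexDir)) {
            return 0L; // directory missing: report 0 instead of erroring
        }
        try (Stream<Path> files = Files.list(indexDir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        // A path that does not exist: the guard returns 0 instead of throwing.
        Path missing = Paths.get("no-such-dir").resolve("index");
        System.out.println("size=" + sizeOfIndexDir(missing));
    }
}
```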