[
https://issues.apache.org/jira/browse/SOLR-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545768#comment-14545768
]
Steven Bower commented on SOLR-7550:
------------------------------------
Also curious how a core that is not "active" would be used for this PeerSync.
> PeerSync fails if a replica returns 500 error
> ---------------------------------------------
>
> Key: SOLR-7550
> URL: https://issues.apache.org/jira/browse/SOLR-7550
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.8.1, 4.10.2
> Environment: linux
> Reporter: Steven Bower
> Priority: Critical
>
> In a 4-node cluster we stopped a node and then started it back up. Before the
> node started up, an invalid schema change was made, so when the node came
> back the core could not load. While it was in this state the leader was
> restarted as well (so two nodes were now in this bad state). When the
> remaining two nodes attempted to become leader and PeerSync, they got a 500
> error back from these failed-to-start cores and were not able to become
> leaders, which eventually led to the remaining two nodes ending up in the
> "recovery_failed" state and the cluster being offline.
> Some logs:
> {noformat}
> 2015-05-14 17:03:20.712 INFO ShardLeaderElectionContext [main-EventThread] -
> Running the leader process for shard shard1
> 2015-05-14 17:03:20.720 INFO ShardLeaderElectionContext [main-EventThread] -
> Checking if I should try and be the leader.
> 2015-05-14 17:03:20.720 INFO ShardLeaderElectionContext [main-EventThread] -
> My last published State was Active, it's okay to be the leader.
> 2015-05-14 17:03:20.720 INFO ShardLeaderElectionContext [main-EventThread] -
> I may be the new leader - try and sync
> 2015-05-14 17:03:20.720 WARN RecoveryStrategy [main-EventThread] - Stopping
> recovery for zkNodeName=host-a2:12345_solr_xxxxcore=xxxx
> 2015-05-14 17:03:23.220 INFO SyncStrategy [main-EventThread] - Sync replicas
> to http://host-a2:12345/solr/xxxx/
> 2015-05-14 17:03:23.221 INFO PeerSync [main-EventThread] - PeerSync:
> core=xxxx url=http://host-a2:12345/solr START
> replicas=[http://host-b1:12345/solr/xxxx/,
> http://host-a1:12345/solr/xxxx_shard1/] nUpdates=100
> 2015-05-14 17:03:23.238 INFO PeerSync [main-EventThread] - PeerSync:
> core=xxxx url=http://host-a2:12345/solr Received 96 versions from
> http://host-b1:12345/solr/xxxx/
> 2015-05-14 17:03:23.239 INFO PeerSync [main-EventThread] - PeerSync:
> core=xxxx url=http://host-a2:12345/solr Our versions are newer.
> ourLowThreshold=1501178223728263172 otherHigh=1501178223745040385
> 2015-05-14 17:03:23.385 WARN PeerSync [main-EventThread] - PeerSync:
> core=xxxx url=http://host-a2:12345/solr exception talking to
> http://host-a1:12345/solr/xxxx_shard1/, failed
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> Expected mime type application/octet-stream but got text/html. <html>
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
> <title>Error 500 {msg=SolrCore 'xxxx_shard1' is not available due to init
> failure: Could not load conf for core xxxx_shard1: Plugin init failure for
> [schema.xml] fieldType "text_split_colon": Plugin init failure for
> [schema.xml] analyzer/filter: Error loading class 'XXXXXXXXXXXXXX'. Schema
> file is /configs/xxxx/schema.xml,trace=org.apache.solr.common.SolrException:
> SolrCore 'xxxx_shard1' is not available due to init failure: Could not load
> conf for core xxxx_shard1: Plugin init failure for [schema.xml] fieldType
> "some_field_type": Plugin init failure for [schema.xml] analyzer/filter:
> Error loading class 'XXXXXXXXXXXXXXX'. Schema file is /configs/xxxx/schema.xml
> at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:745)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:299)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
> ...
> ...
> ...
> {noformat}
> It looks as though the error handling is a bit brittle: it tolerates
> connection issues and 503 and 404 errors, but any other error causes a
> cluster that needs to elect a leader while a node is in a bad state to fail.
> If simply adding support for 500 errors is seen as the best approach, that is
> a simple fix and I can put a patch up quickly.
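
To make the proposal in the description concrete, here is a rough sketch of the kind of check being discussed. This is not the actual PeerSync code: the class and method names (PeerFailurePolicy, isTolerableStatus, isTolerableError) are hypothetical, and treating "connection issues" as a java.net.ConnectException somewhere in the cause chain is an assumption made only for illustration. The only change relative to the behaviour described in the issue is that 500 is added to the tolerated statuses.

```java
import java.net.ConnectException;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Solr. */
final class PeerFailurePolicy {

    // 503 and 404 are the statuses the issue says are already tolerated;
    // 500 is the addition proposed in the issue.
    private static final Set<Integer> TOLERATED_STATUSES = Set.of(404, 500, 503);

    private PeerFailurePolicy() {}

    /** Returns true if a replica answering with this HTTP status can be skipped. */
    static boolean isTolerableStatus(int httpStatus) {
        return TOLERATED_STATUSES.contains(httpStatus);
    }

    /**
     * Returns true if the failure looks like a connection problem.
     * Assumption: a connection problem shows up as a ConnectException
     * anywhere in the exception's cause chain.
     */
    static boolean isTolerableError(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof ConnectException) {
                return true;
            }
        }
        return false;
    }
}
```

With a check like this, a sync attempt would skip a replica whose core failed to load (the 500 in the logs above) instead of failing the whole sync. Whether a 500 should always be ignored, or only when the replica is known not to be "active", is the open question.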