[jira] [Commented] (SOLR-7550) PeerSync fails if a replica returns 500 error

Steven Bower (JIRA) Fri, 15 May 2015 09:47:45 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545768#comment-14545768
 ]


Steven Bower commented on SOLR-7550:
------------------------------------

I'm also curious how a core that is not "active" would be used for this PeerSync.

> PeerSync fails if a replica returns 500 error
> ---------------------------------------------
>
>                 Key: SOLR-7550
>                 URL: https://issues.apache.org/jira/browse/SOLR-7550
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.8.1, 4.10.2
>         Environment: linux
>            Reporter: Steven Bower
>            Priority: Critical
>
> In a 4-node cluster, we stopped a node and then started it back up. Before 
> the node started, an invalid schema change was made, so when the node came 
> back up its core could not load. While it was in this state the leader was 
> restarted as well, leaving two nodes in this bad state. When the remaining 
> two nodes attempted to become leader and run PeerSync, they got a 500 error 
> back from the failed-to-start cores and could not become leaders, which 
> eventually led to the remaining two nodes ending up in the "recovery_failed" 
> state and the cluster going offline.
> Some logs:
> {noformat}
> 2015-05-14 17:03:20.712 INFO  ShardLeaderElectionContext [main-EventThread] - 
> Running the leader process for shard shard1
> 2015-05-14 17:03:20.720 INFO  ShardLeaderElectionContext [main-EventThread] - 
> Checking if I should try and be the leader.
> 2015-05-14 17:03:20.720 INFO  ShardLeaderElectionContext [main-EventThread] - 
> My last published State was Active, it's okay to be the leader.
> 2015-05-14 17:03:20.720 INFO  ShardLeaderElectionContext [main-EventThread] - 
> I may be the new leader - try and sync
> 2015-05-14 17:03:20.720 WARN  RecoveryStrategy [main-EventThread] - Stopping 
> recovery for zkNodeName=host-a2:12345_solr_xxxxcore=xxxx
> 2015-05-14 17:03:23.220 INFO  SyncStrategy [main-EventThread] - Sync replicas 
> to http://host-a2:12345/solr/xxxx/
> 2015-05-14 17:03:23.221 INFO  PeerSync [main-EventThread] - PeerSync: 
> core=xxxx url=http://host-a2:12345/solr START 
> replicas=[http://host-b1:12345/solr/xxxx/, 
> http://host-a1:12345/solr/xxxx_shard1/] nUpdates=100
> 2015-05-14 17:03:23.238 INFO  PeerSync [main-EventThread] - PeerSync: 
> core=xxxx url=http://host-a2:12345/solr  Received 96 versions from 
> http://host-b1:12345/solr/xxxx/
> 2015-05-14 17:03:23.239 INFO  PeerSync [main-EventThread] - PeerSync: 
> core=xxxx url=http://host-a2:12345/solr  Our versions are newer. 
> ourLowThreshold=1501178223728263172 otherHigh=1501178223745040385
> 2015-05-14 17:03:23.385 WARN  PeerSync [main-EventThread] - PeerSync: 
> core=xxxx url=http://host-a2:12345/solr  exception talking to 
> http://host-a1:12345/solr/xxxx_shard1/, failed
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: 
> Expected mime type application/octet-stream but got text/html. <html>
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
> <title>Error 500 {msg=SolrCore 'xxxx_shard1' is not available due to init 
> failure: Could not load conf for core xxxx_shard1: Plugin init failure for 
> [schema.xml] fieldType "text_split_colon": Plugin init failure for 
> [schema.xml] analyzer/filter: Error loading class 'XXXXXXXXXXXXXX'. Schema 
> file is /configs/xxxx/schema.xml,trace=org.apache.solr.common.SolrException: 
> SolrCore 'xxxx_shard1' is not available due to init failure: Could not load 
> conf for core xxxx_shard1: Plugin init failure for [schema.xml] fieldType 
> "some_field_type": Plugin init failure for [schema.xml] analyzer/filter: 
> Error loading class 'XXXXXXXXXXXXXXX'. Schema file is /configs/xxxx/schema.xml
>       at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:745)
>       at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:299)
>       at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
>       at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
>       at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
>       at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
>       at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
>   ...
>   ...
>   ...
> {noformat}
> It looks as though the error handling is a bit brittle: it can tolerate 
> connection issues and 503 and 404 errors, but any other error causes a 
> cluster that needs to elect a leader while a node is in a bad state to fail.
> If simply adding support for 500 errors is seen as the best approach, that is 
> a simple fix and I can put up a patch quickly.
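
To make the tolerance rule in the description concrete, here is a minimal sketch in Java. It is not the actual PeerSync code: the class name PeerSyncFailurePolicy, the method countsAsSuccess, the tolerate500 flag and the NO_HTTP_STATUS constant are all hypothetical, and the sketch only models the behaviour as the description states it (connection failures, 503 and 404 are tolerated; 500 is the proposed addition).

```java
import java.net.ConnectException;

// Hypothetical sketch only; names and structure are assumptions,
// not the real org.apache.solr.update.PeerSync implementation.
public class PeerSyncFailurePolicy {

    // Assumed marker for "no HTTP response was received at all".
    public static final int NO_HTTP_STATUS = -1;

    /**
     * Decides whether a failed replica response should be counted as
     * success, so that one broken replica does not block leader election.
     *
     * @param error       the exception seen while talking to the replica
     * @param httpStatus  HTTP status of the response, or NO_HTTP_STATUS
     * @param tolerate500 whether to apply the proposed 500 handling
     */
    public static boolean countsAsSuccess(Throwable error, int httpStatus, boolean tolerate500) {
        // Connection problems: the replica is unreachable, so skip it.
        if (error instanceof ConnectException) {
            return true;
        }
        // Status codes the description says are already tolerated.
        if (httpStatus == 503 || httpStatus == 404) {
            return true;
        }
        // Proposed addition: treat a 500 (e.g. core failed to init) the same way.
        return tolerate500 && httpStatus == 500;
    }

    public static void main(String[] args) {
        Exception remote = new RuntimeException("SolrCore is not available due to init failure");
        System.out.println(countsAsSuccess(remote, 500, false)); // behaviour as reported
        System.out.println(countsAsSuccess(remote, 500, true));  // with the proposed change
    }
}
```

With tolerate500 set to false the 500 response fails the sync, which matches the reported behaviour; setting it to true models the proposed fix. Whether a blanket 500 should be tolerated, or only the "core not available" case, is a design question the sketch leaves open.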



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

