[
https://issues.apache.org/jira/browse/SOLR-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15119905#comment-15119905
]
Jérémie MONSINJON commented on SOLR-8129:
-----------------------------------------
I’m stress-testing a Solr cluster with Gatling for performance tests: 3 shards,
3 nodes per shard (1 leader, 2 replicas).
About 300 updates per second using CloudSolrServer (about 100 per shard) and 40
qps, with select queries sent round-robin to each node.
After 2 minutes of running the scenario, it seems that the leader of a shard
can't send an update to a replica.
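For context, here is a minimal SolrJ sketch of what the client side of this
scenario looks like (the real load is generated by Gatling; the class name and
document fields below are only placeholders): updates go through CloudSolrServer
so they reach the shard leaders, and selects are round-robined across the nodes
with LBHttpSolrServer.
{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.impl.LBHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class OffersLoadClient {
    public static void main(String[] args) throws Exception {
        // Updates are routed through the ZooKeeper-aware CloudSolrServer (about 300/s in the test).
        CloudSolrServer updateClient =
            new CloudSolrServer("solrNode013:2181,solrNode014:2181,solrNode015:2181");
        updateClient.setDefaultCollection("offers_full");

        // Select queries (about 40 qps in the test) are round-robined across the nodes.
        LBHttpSolrServer queryClient = new LBHttpSolrServer(
            "http://solrNode027:8983/solr/offers_full",
            "http://solrNode033:8983/solr/offers_full",
            "http://solrNode051:8983/solr/offers_full");

        // One update; the field names are placeholders, not the real offers_full schema.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "offer-1");
        updateClient.add(doc);
        updateClient.commit();

        // One select, load-balanced over the three URLs above.
        queryClient.query(new SolrQuery("*:*"));

        updateClient.shutdown();
        queryClient.shutdown();
    }
}
{code}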
REPLICA : 51
{quote}
2016-01-27 15:18:00.252 WARN (qtp1057941451-128982) [ ] o.e.j.h.HttpParser
badMessage: java.lang.IllegalStateException: too much data after closed for
HttpChannelOverHttp@428e80f{r=1098,c=false,a=IDLE,uri=-}
{quote}
LEADER : 33
{quote}
2016-01-27 15:18:00.245 ERROR
(updateExecutor-2-thread-3557-processing-http:////solrNode051:8983//solr//offers_full_shard3_replica1
x:offers_full_shard3_replica3 r:core_node10 n:solrNode033:8983_solr s:shard3
c:offers_full) [c:offers_full s:shard3 r:core_node10
x:offers_full_shard3_replica3] o.a.s.u.StreamingSolrClients error
org.apache.http.NoHttpResponseException: solrNode051:8983 failed to respond
2016-01-27 15:18:00.267 INFO (qtp1057941451-45551) [c:offers_full s:shard3
r:core_node10 x:offers_full_shard3_replica3] o.a.s.c.ZkController Put replica
core=offers_full_shard3_replica1 coreNodeName=core_node3 on
solrNode051:8983_solr into leader-initiated recovery.
{quote}
REPLICA : 51
{quote}
2016-01-27 15:18:00.276 INFO (qtp1057941451-129018) [ ]
o.a.s.h.a.CoreAdminHandler It has been requested that we recover:
core=offers_full_shard3_replica1
2016-01-27 15:18:00.277 INFO (Thread-117036) [c:offers_full s:shard3
r:core_node3 x:offers_full_shard3_replica1] o.a.s.u.DefaultSolrCoreState
Running recovery - first canceling any ongoing recovery
2016-01-27 15:18:08.296 INFO
(updateExecutor-2-thread-4594-processing-n:solrNode051:8983_solr
x:offers_full_shard3_replica1 s:shard3 c:offers_full r:core_node3)
[c:offers_full s:shard3 r:core_node3 x:offers_full_shard3_replica1]
o.a.s.c.RecoveryStrategy Attempting to PeerSync from
http://solrNode033:8983/solr/offers_full_shard3_replica3/ -
recoveringAfterStartup=false
2016-01-27 15:18:08.296 ERROR
(updateExecutor-2-thread-4594-processing-n:solrNode051:8983_solr
x:offers_full_shard3_replica1 s:shard3 c:offers_full r:core_node3)
[c:offers_full s:shard3 r:core_node3 x:offers_full_shard3_replica1]
o.a.s.u.PeerSync PeerSync: core=offers_full_shard3_replica1
url=http://solrNode051:8983/solr ERROR, update log not in ACTIVE or REPLAY
state. FSUpdateLog{state=BUFFERING,
tlog=tlog{file=/var/lib/solr5/data/offers_full_shard3_replica1/data/tlog/tlog.0000000000000004550
refcount=1}}
2016-01-27 15:18:08.297 WARN
(updateExecutor-2-thread-4594-processing-n:solrNode051:8983_solr
x:offers_full_shard3_replica1 s:shard3 c:offers_full r:core_node3)
[c:offers_full s:shard3 r:core_node3 x:offers_full_shard3_replica1]
o.a.s.u.PeerSync PeerSync: core=offers_full_shard3_replica1
url=http://solrNode051:8983/solr too many updates received since start -
startingUpdates no longer overlaps with our currentUpdates
2016-01-27 15:18:15.160 WARN
(updateExecutor-2-thread-4594-processing-n:solrNode051:8983_solr
x:offers_full_shard3_replica1 s:shard3 c:offers_full r:core_node3)
[c:offers_full s:shard3 r:core_node3 x:offers_full_shard3_replica1]
o.a.s.h.IndexFetcher File _d2r_1wk.liv did not match. expected checksum is
3620017595 and actual is checksum 442433238. expected length is 38405 and
actual length is 38405
2016-01-27 15:18:15.160 INFO
(updateExecutor-2-thread-4594-processing-n:solrNode051:8983_solr
x:offers_full_shard3_replica1 s:shard3 c:offers_full r:core_node3)
[c:offers_full s:shard3 r:core_node3 x:offers_full_shard3_replica1]
o.a.s.h.IndexFetcher Starting download (fullCopy=true) to
NRTCachingDirectory(MMapDirectory@/var/lib/solr5/data/offers_full_shard3_replica1/data/index.20160127151815143
lockFactory=org.apache.lucene.store.NativeFSLockFactory@721552d5;
maxCacheMB=48.0 maxMergeSizeMB=4.0)
{quote}
REPLICA : 27
{quote}
2016-01-27 15:20:54.009 INFO
(zkCallback-3-thread-198-processing-n:solrNode027:8983_solr) [ ]
o.a.s.c.c.ZkStateReader A live node change: WatchedEvent state:SyncConnected
type:NodeChildrenChanged path:/live_nodes, has occurred - updating...
(live nodes size: 11)
2016-01-27 15:20:54.016 INFO
(zkCallback-3-thread-198-processing-n:solrNode027:8983_solr) [c:offers_full
s:shard3 r:core_node7 x:offers_full_shard3_replica2]
o.a.s.c.ShardLeaderElectionContext Running the leader process for shard shard3
{quote}
REPLICA : 51
{quote}
2016-01-27 15:20:54.011 INFO
(zkCallback-3-thread-110-processing-n:solrNode051:8983_solr) [c:offers_full
s:shard3 r:core_node3 x:offers_full_shard3_replica1]
o.a.s.u.DefaultSolrCoreState Running recovery - first canceling any ongoing
recovery
{quote}
REPLICA : 27
{quote}
BECAME THE LEADER
{quote}
{quote}
2016-01-27 15:20:56.525 INFO
(zkCallback-3-thread-198-processing-n:solrNode027:8983_solr) [c:offers_full
s:shard3 r:core_node7 x:offers_full_shard3_replica2] o.a.s.c.SyncStrategy Sync
Success - now sync replicas to me
2016-01-27 15:21:08.793 ERROR (qtp1057941451-43187) [c:offers_full s:shard3
r:core_node7 x:offers_full_shard3_replica2]
o.a.s.u.p.DistributedUpdateProcessor Request says it is coming from leader, but
we are the leader:
update.distrib=FROMLEADER&distrib.from=http://solrNode033:8983/solr/offers_full_shard3_replica3/&wt=javabin&version=2
{quote}
(OLD) LEADER : 33
{quote}
2016-01-27 15:21:08.801 ERROR (qtp1057941451-45871) [c:offers_full s:shard3
r:core_node10 x:offers_full_shard3_replica3]
o.a.s.u.p.DistributedUpdateProcessor On core_node10, replica
http://solrNode027:8983/solr/offers_full_shard3_replica2/ now thinks it is the
leader! Failing the request to let the client retry!
org.apache.solr.common.SolrException:Service Unavailable
2016-01-27 15:21:09.712 INFO
(zkCallback-3-thread-36-processing-n:solrNode033:8983_solr-EventThread) [ ]
o.a.s.c.c.ConnectionManager Our previous ZooKeeper session was expired.
Attempting to reconnect to recover relationship with ZooKeeper...
2016-01-27 15:21:09.747 INFO
(zkCallback-3-thread-36-processing-n:solrNode033:8983_solr-EventThread)
[c:offers_full s:shard3 r:core_node10 x:offers_full_shard3_replica3]
o.a.s.c.ZkController publishing state=down
2016-01-27 15:21:09.760 INFO
(zkCallback-3-thread-36-processing-n:solrNode033:8983_solr-EventThread)
[c:offers_full s:shard3 r:core_node10 x:offers_full_shard3_replica3]
o.a.s.c.ZkController Replica core_node10 NOT in leader-initiated recovery, need
to wait for leader to see down state.
{quote}
(NEW) LEADER : 27
{quote}
2016-01-27 15:21:10.137 INFO
(zkCallback-3-thread-199-processing-n:solrNode027:8983_solr) [ ]
o.a.s.c.c.ZkStateReader A cluster state change: WatchedEvent
state:SyncConnected type:NodeDataChanged
path:/collections/offers_full/state.json for collection offers_full has
occurred - updating... (live nodes size: 10)
2016-01-27 15:21:10.139 INFO
(zkCallback-3-thread-199-processing-n:solrNode027:8983_solr) [ ]
o.a.s.c.c.ZkStateReader Updating data for offers_full from 1344 to 1345
2016-01-27 15:21:10.215 INFO (qtp1057941451-43418) [ ]
o.a.s.h.a.CoreAdminHandler Going to wait for coreNodeName: core_node10, state:
down, checkLive: null, onlyIfLeader: null, onlyIfLeaderActive: null
{quote}
(OLD) LEADER : 33
{quote}
2016-01-27 15:21:10.220 INFO
(zkCallback-3-thread-36-processing-n:solrNode033:8983_solr-EventThread)
[c:offers_full s:shard3 r:core_node10 x:offers_full_shard3_replica3]
o.a.s.c.ElectionContext cancelElection did not find election node to remove
/overseer_elect/election/238989758703863355-solrNode033:8983_solr-n_0000000158
2016-01-27 15:21:10.221 INFO
(zkCallback-3-thread-36-processing-n:solrNode033:8983_solr-EventThread)
[c:offers_full s:shard3 r:core_node10 x:offers_full_shard3_replica3]
o.a.s.c.LeaderElector Joined leadership election with path:
/overseer_elect/election/238989758703863364-solrNode033:8983_solr-n_0000000175
2016-01-27 15:21:10.238 INFO
(zkCallback-3-thread-36-processing-n:solrNode033:8983_solr-EventThread)
[c:offers_full s:shard3 r:core_node10 x:offers_full_shard3_replica3]
o.a.s.c.ZkController Register node as live in
ZooKeeper:/live_nodes/solrNode033:8983_solr
2016-01-27 15:21:10.290 INFO
(updateExecutor-2-thread-3542-processing-n:solrNode033:8983_solr
x:offers_full_shard3_replica3 s:shard3 c:offers_full r:core_node10)
[c:offers_full s:shard3 r:core_node10 x:offers_full_shard3_replica3]
o.a.s.c.RecoveryStrategy Sending prep recovery command to
http://solrNode027:8983/solr; WaitForState:
action=PREPRECOVERY&core=offers_full_shard3_replica2&nodeName=solrNode033:8983_solr&coreNodeName=core_node10&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true
{quote}
(NEW) LEADER : 27
{quote}
2016-01-27 15:21:10.329 INFO (qtp1057941451-43519) [ ] o.a.s.s.HttpSolrCall
[admin] webapp=null path=/admin/cores
params={nodeName=solrNode033:8983_solr&onlyIfLeaderActive=true&core=offers_full_shard3_replica2&coreNodeName=core_node10&action=PREPRECOVERY&checkLive=true&state=recovering&onlyIfLeader=true&wt=javabin&version=2}
status=0 QTime=2
{quote}
(OLD) LEADER : 33
{quote}
Process Recovering
{quote}
{quote}
2016-01-27 15:21:52.007 INFO
(zkCallback-3-thread-211-processing-n:solrNode033:8983_solr) [ ]
o.a.s.c.c.ZkStateReader A live node change: WatchedEvent state:SyncConnected
type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live
nodes size: 11)
2016-01-27 15:21:52.008 INFO
(zkCallback-3-thread-210-processing-n:solrNode033:8983_solr) [c:offers_full
s:shard3 r:core_node10 x:offers_full_shard3_replica3] o.a.s.c.ActionThrottle
The last leader attempt started 22366010ms ago.
2016-01-27 15:21:52.008 INFO
(zkCallback-3-thread-210-processing-n:solrNode033:8983_solr) [c:offers_full
s:shard3 r:core_node10 x:offers_full_shard3_replica3]
o.a.s.c.ShardLeaderElectionContext Running the leader process for shard shard3
2016-01-27 15:21:52.007 INFO
(zkCallback-3-thread-211-processing-n:solrNode033:8983_solr) [ ]
o.a.s.c.c.ZkStateReader A live node change: WatchedEvent state:SyncConnected
type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live
nodes size: 11)
2016-01-27 15:21:52.011 INFO
(zkCallback-3-thread-210-processing-n:solrNode033:8983_solr) [c:offers_full
s:shard3 r:core_node10 x:offers_full_shard3_replica3]
o.a.s.c.ShardLeaderElectionContext Checking if I should try and be the leader.
2016-01-27 15:21:52.011 INFO
(zkCallback-3-thread-210-processing-n:solrNode033:8983_solr) [c:offers_full
s:shard3 r:core_node10 x:offers_full_shard3_replica3]
o.a.s.c.ShardLeaderElectionContext There may be a better leader candidate than
us - going back into recovery
{quote}
(NEW) LEADER : 27
{quote}
2016-01-27 15:21:53.211 INFO
(zkCallback-3-thread-73-processing-n:solrNode027:8983_solr-EventThread) [ ]
o.a.s.c.c.ConnectionManager Watcher
org.apache.solr.common.cloud.ConnectionManager@478ca953
name:ZooKeeperConnection
Watcher:solrNode013:2181,solrNode014:2181,solrNode015:2181 got event
WatchedEvent state:Disconnected type:None path:null path:null type:None
2016-01-27 15:21:53.211 INFO
(zkCallback-3-thread-73-processing-n:solrNode027:8983_solr-EventThread) [ ]
o.a.s.c.c.ConnectionManager zkClient has disconnected
{quote}
All this time, replica 51 can't catch up and stays in the recovering state.
The old leader 33 stays in full recovery because 27 is the new leader.
For several minutes, node 33 tries to download all the segments but fails on
every checksum ("expected length is 137136 and actual length is 2096582", for
example).
Because it cannot keep its ZooKeeper session, node 27 stays down, so there is
no leader for the shard...
This behaviour happens every time I send too many QPS... Under 40 qps, we do
not have any problems...
Does it look similar to these ChaosMonkey failures?
> HdfsChaosMonkeyNothingIsSafeTest failures
> -----------------------------------------
>
> Key: SOLR-8129
> URL: https://issues.apache.org/jira/browse/SOLR-8129
> Project: Solr
> Issue Type: Bug
> Reporter: Yonik Seeley
> Attachments: fail.151005_064958, fail.151005_080319
>
>
> New HDFS chaos test in SOLR-8123 hits a number of types of failures,
> including shard inconsistency.