[
https://issues.apache.org/jira/browse/SOLR-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15119905#comment-15119905
]
Jérémie MONSINJON commented on SOLR-8129:
-----------------------------------------
I’m stress-testing a Solr cluster with Gatling for performance tests: 3 shards,
3 nodes per shard (1 leader, 2 replicas).
About 300 updates per second using CloudSolrServer (about 100 per shard) and 40
qps, with select queries sent round-robin to each node.
After 2 minutes of running the scenario, it seems that the leader of a shard
can't send an update to a replica.
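For context, here is a minimal SolrJ sketch of what the client side of this
scenario looks like (the real load is generated by Gatling; the class name and
document fields below are only placeholders): updates go through CloudSolrServer
so they reach the shard leaders, and selects are round-robined across the nodes
with LBHttpSolrServer.
{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.impl.LBHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class OffersLoadClient {
    public static void main(String[] args) throws Exception {
        // Updates are routed through the ZooKeeper-aware CloudSolrServer (about 300/s in the test).
        CloudSolrServer updateClient =
            new CloudSolrServer("solrNode013:2181,solrNode014:2181,solrNode015:2181");
        updateClient.setDefaultCollection("offers_full");

        // Select queries (about 40 qps in the test) are round-robined across the nodes.
        LBHttpSolrServer queryClient = new LBHttpSolrServer(
            "http://solrNode027:8983/solr/offers_full",
            "http://solrNode033:8983/solr/offers_full",
            "http://solrNode051:8983/solr/offers_full");

        // One update; the field names are placeholders, not the real offers_full schema.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "offer-1");
        updateClient.add(doc);
        updateClient.commit();

        // One select, load-balanced over the three URLs above.
        queryClient.query(new SolrQuery("*:*"));

        updateClient.shutdown();
        queryClient.shutdown();
    }
}
{code}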
REPLICA : 51
{quote}
2016-01-27 15:18:00.252 WARN (qtp1057941451-128982) [ ] o.e.j.h.HttpParser
badMessage: java.lang.IllegalStateException: too much data after closed for
HttpChannelOverHttp@428e80f{r=1098,c=false,a=IDLE,uri=-}
{quote}
LEADER : 33
{quote}
2016-01-27 15:18:00.245 ERROR
(updateExecutor-2-thread-3557-processing-http:////solrNode051:8983//solr//offers_full_shard3_replica1
x:offers_full_shard3_replica3 r:core_node10 n:solrNode033:8983_solr s:shard3
c:offers_full) [c:offers_full s:shard3 r:core_node10
x:offers_full_shard3_replica3] o.a.s.u.StreamingSolrClients error
org.apache.http.NoHttpResponseException: solrNode051:8983 failed to respond
2016-01-27 15:18:00.267 INFO (qtp1057941451-45551) [c:offers_full s:shard3
r:core_node10 x:offers_full_shard3_replica3] o.a.s.c.ZkController Put replica
core=offers_full_shard3_replica1 coreNodeName=core_node3 on
solrNode051:8983_solr into leader-initiated recovery.
{quote}
REPLICA : 51
{quote}
2016-01-27 15:18:00.276 INFO (qtp1057941451-129018) [ ]
o.a.s.h.a.CoreAdminHandler It has been requested that we recover:
core=offers_full_shard3_replica1
2016-01-27 15:18:00.277 INFO (Thread-117036) [c:offers_full s:shard3
r:core_node3 x:offers_full_shard3_replica1] o.a.s.u.DefaultSolrCoreState
Running recovery - first canceling any ongoing recovery
2016-01-27 15:18:08.296 INFO
(updateExecutor-2-thread-4594-processing-n:solrNode051:8983_solr
x:offers_full_shard3_replica1 s:shard3 c:offers_full r:core_node3)
[c:offers_full s:shard3 r:core_node3 x:offers_full_shard3_replica1]
o.a.s.c.RecoveryStrategy Attempting to PeerSync from
http://solrNode033:8983/solr/offers_full_shard3_replica3/ -
recoveringAfterStartup=false
2016-01-27 15:18:08.296 ERROR
(updateExecutor-2-thread-4594-processing-n:solrNode051:8983_solr
x:offers_full_shard3_replica1 s:shard3 c:offers_full r:core_node3)
[c:offers_full s:shard3 r:core_node3 x:offers_full_shard3_replica1]
o.a.s.u.PeerSync PeerSync: core=offers_full_shard3_replica1
url=http://solrNode051:8983/solr ERROR, update log not in ACTIVE or REPLAY
state. FSUpdateLog{state=BUFFERING,
tlog=tlog{file=/var/lib/solr5/data/offers_full_shard3_replica1/data/tlog/tlog.0000000000000004550
refcount=1}}
2016-01-27 15:18:08.297 WARN
(updateExecutor-2-thread-4594-processing-n:solrNode051:8983_solr
x:offers_full_shard3_replica1 s:shard3 c:offers_full r:core_node3)
[c:offers_full s:shard3 r:core_node3 x:offers_full_shard3_replica1]
o.a.s.u.PeerSync PeerSync: core=offers_full_shard3_replica1
url=http://solrNode051:8983/solr too many updates received since start -
startingUpdates no longer overlaps with our currentUpdates
2016-01-27 15:18:15.160 WARN
(updateExecutor-2-thread-4594-processing-n:solrNode051:8983_solr
x:offers_full_shard3_replica1 s:shard3 c:offers_full r:core_node3)
[c:offers_full s:shard3 r:core_node3 x:offers_full_shard3_replica1]
o.a.s.h.IndexFetcher File _d2r_1wk.liv did not match. expected checksum is
3620017595 and actual is checksum 442433238. expected length is 38405 and
actual length is 38405
2016-01-27 15:18:15.160 INFO
(updateExecutor-2-thread-4594-processing-n:solrNode051:8983_solr
x:offers_full_shard3_replica1 s:shard3 c:offers_full r:core_node3)
[c:offers_full s:shard3 r:core_node3 x:offers_full_shard3_replica1]
o.a.s.h.IndexFetcher Starting download (fullCopy=true) to
NRTCachingDirectory(MMapDirectory@/var/lib/solr5/data/offers_full_shard3_replica1/data/index.20160127151815143
lockFactory=org.apache.lucene.store.NativeFSLockFactory@721552d5;
maxCacheMB=48.0 maxMergeSizeMB=4.0)
{quote}
REPLICA : 27
{quote}
2016-01-27 15:20:54.009 INFO
(zkCallback-3-thread-198-processing-n:solrNode027:8983_solr) [ ]
o.a.s.c.c.ZkStateReader A live node change: WatchedEvent state:SyncConnected
type:NodeChildrenChanged path:/live_nodes, has occurred - updating...
(live nodes size: 11)
2016-01-27 15:20:54.016 INFO
(zkCallback-3-thread-198-processing-n:solrNode027:8983_solr) [c:offers_full
s:shard3 r:core_node7 x:offers_full_shard3_replica2]
o.a.s.c.ShardLeaderElectionContext Running the leader process for shard shard3
{quote}
REPLICA : 51
{quote}
2016-01-27 15:20:54.011 INFO
(zkCallback-3-thread-110-processing-n:solrNode051:8983_solr) [c:offers_full
s:shard3 r:core_node3 x:offers_full_shard3_replica1]
o.a.s.u.DefaultSolrCoreState Running recovery - first canceling any ongoing
recovery
{quote}
REPLICA : 27
{quote}
BECAME THE LEADER
{quote}
{quote}
2016-01-27 15:20:56.525 INFO
(zkCallback-3-thread-198-processing-n:solrNode027:8983_solr) [c:offers_full
s:shard3 r:core_node7 x:offers_full_shard3_replica2] o.a.s.c.SyncStrategy Sync
Success - now sync replicas to me
2016-01-27 15:21:08.793 ERROR (qtp1057941451-43187) [c:offers_full s:shard3
r:core_node7 x:offers_full_shard3_replica2]
o.a.s.u.p.DistributedUpdateProcessor Request says it is coming from leader, but
we are the leader:
update.distrib=FROMLEADER&distrib.from=http://solrNode033:8983/solr/offers_full_shard3_replica3/&wt=javabin&version=2
{quote}
(OLD) LEADER : 33
{quote}
2016-01-27 15:21:08.801 ERROR (qtp1057941451-45871) [c:offers_full s:shard3
r:core_node10 x:offers_full_shard3_replica3]
o.a.s.u.p.DistributedUpdateProcessor On core_node10, replica
http://solrNode027:8983/solr/offers_full_shard3_replica2/ now thinks it is the
leader! Failing the request to let the client retry!
org.apache.solr.common.SolrException:Service Unavailable
2016-01-27 15:21:09.712 INFO
(zkCallback-3-thread-36-processing-n:solrNode033:8983_solr-EventThread) [ ]
o.a.s.c.c.ConnectionManager Our previous ZooKeeper session was expired.
Attempting to reconnect to recover relationship with ZooKeeper...
2016-01-27 15:21:09.747 INFO
(zkCallback-3-thread-36-processing-n:solrNode033:8983_solr-EventThread)
[c:offers_full s:shard3 r:core_node10 x:offers_full_shard3_replica3]
o.a.s.c.ZkController publishing state=down
2016-01-27 15:21:09.760 INFO
(zkCallback-3-thread-36-processing-n:solrNode033:8983_solr-EventThread)
[c:offers_full s:shard3 r:core_node10 x:offers_full_shard3_replica3]
o.a.s.c.ZkController Replica core_node10 NOT in leader-initiated recovery, need
to wait for leader to see down state.
{quote}
(NEW) LEADER : 27
{quote}
2016-01-27 15:21:10.137 INFO
(zkCallback-3-thread-199-processing-n:solrNode027:8983_solr) [ ]
o.a.s.c.c.ZkStateReader A cluster state change: WatchedEvent
state:SyncConnected type:NodeDataChanged
path:/collections/offers_full/state.json for collection offers_full has
occurred - updating... (live nodes size: 10)
2016-01-27 15:21:10.139 INFO
(zkCallback-3-thread-199-processing-n:solrNode027:8983_solr) [ ]
o.a.s.c.c.ZkStateReader Updating data for offers_full from 1344 to 1345
2016-01-27 15:21:10.215 INFO (qtp1057941451-43418) [ ]
o.a.s.h.a.CoreAdminHandler Going to wait for coreNodeName: core_node10, state:
down, checkLive: null, onlyIfLeader: null, onlyIfLeaderActive: null
{quote}
(OLD) LEADER : 33
{quote}
2016-01-27 15:21:10.220 INFO
(zkCallback-3-thread-36-processing-n:solrNode033:8983_solr-EventThread)
[c:offers_full s:shard3 r:core_node10 x:offers_full_shard3_replica3]
o.a.s.c.ElectionContext cancelElection did not find election node to remove
/overseer_elect/election/238989758703863355-solrNode033:8983_solr-n_0000000158
2016-01-27 15:21:10.221 INFO
(zkCallback-3-thread-36-processing-n:solrNode033:8983_solr-EventThread)
[c:offers_full s:shard3 r:core_node10 x:offers_full_shard3_replica3]
o.a.s.c.LeaderElector Joined leadership election with path:
/overseer_elect/election/238989758703863364-solrNode033:8983_solr-n_0000000175
2016-01-27 15:21:10.238 INFO
(zkCallback-3-thread-36-processing-n:solrNode033:8983_solr-EventThread)
[c:offers_full s:shard3 r:core_node10 x:offers_full_shard3_replica3]
o.a.s.c.ZkController Register node as live in
ZooKeeper:/live_nodes/solrNode033:8983_solr
2016-01-27 15:21:10.290 INFO
(updateExecutor-2-thread-3542-processing-n:solrNode033:8983_solr
x:offers_full_shard3_replica3 s:shard3 c:offers_full r:core_node10)
[c:offers_full s:shard3 r:core_node10 x:offers_full_shard3_replica3]
o.a.s.c.RecoveryStrategy Sending prep recovery command to
http://solrNode027:8983/solr; WaitForState:
action=PREPRECOVERY&core=offers_full_shard3_replica2&nodeName=solrNode033:8983_solr&coreNodeName=core_node10&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true
{quote}
(NEW) LEADER : 27
{quote}
2016-01-27 15:21:10.329 INFO (qtp1057941451-43519) [ ] o.a.s.s.HttpSolrCall
[admin] webapp=null path=/admin/cores
params={nodeName=solrNode033:8983_solr&onlyIfLeaderActive=true&core=offers_full_shard3_replica2&coreNodeName=core_node10&action=PREPRECOVERY&checkLive=true&state=recovering&onlyIfLeader=true&wt=javabin&version=2}
status=0 QTime=2
{quote}
(OLD) LEADER : 33
{quote}
Process Recovering
{quote}
{quote}
2016-01-27 15:21:52.007 INFO
(zkCallback-3-thread-211-processing-n:solrNode033:8983_solr) [ ]
o.a.s.c.c.ZkStateReader A live node change: WatchedEvent state:SyncConnected
type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live
nodes size: 11)
2016-01-27 15:21:52.008 INFO
(zkCallback-3-thread-210-processing-n:solrNode033:8983_solr) [c:offers_full
s:shard3 r:core_node10 x:offers_full_shard3_replica3] o.a.s.c.ActionThrottle
The last leader attempt started 22366010ms ago.
2016-01-27 15:21:52.008 INFO
(zkCallback-3-thread-210-processing-n:solrNode033:8983_solr) [c:offers_full
s:shard3 r:core_node10 x:offers_full_shard3_replica3]
o.a.s.c.ShardLeaderElectionContext Running the leader process for shard shard3
2016-01-27 15:21:52.007 INFO
(zkCallback-3-thread-211-processing-n:solrNode033:8983_solr) [ ]
o.a.s.c.c.ZkStateReader A live node change: WatchedEvent state:SyncConnected
type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live
nodes size: 11)
2016-01-27 15:21:52.011 INFO
(zkCallback-3-thread-210-processing-n:solrNode033:8983_solr) [c:offers_full
s:shard3 r:core_node10 x:offers_full_shard3_replica3]
o.a.s.c.ShardLeaderElectionContext Checking if I should try and be the leader.
2016-01-27 15:21:52.011 INFO
(zkCallback-3-thread-210-processing-n:solrNode033:8983_solr) [c:offers_full
s:shard3 r:core_node10 x:offers_full_shard3_replica3]
o.a.s.c.ShardLeaderElectionContext There may be a better leader candidate than
us - going back into recovery
{quote}
(NEW) LEADER : 27
{quote}
2016-01-27 15:21:53.211 INFO
(zkCallback-3-thread-73-processing-n:solrNode027:8983_solr-EventThread) [ ]
o.a.s.c.c.ConnectionManager Watcher
org.apache.solr.common.cloud.ConnectionManager@478ca953
name:ZooKeeperConnection
Watcher:solrNode013:2181,solrNode014:2181,solrNode015:2181 got event
WatchedEvent state:Disconnected type:None path:null path:null type:None
2016-01-27 15:21:53.211 INFO
(zkCallback-3-thread-73-processing-n:solrNode027:8983_solr-EventThread) [ ]
o.a.s.c.c.ConnectionManager zkClient has disconnected
{quote}
All this time, replica 51 can't catch up and stays in the recovering state.
The old leader 33 stays in full recovery because 27 is the new leader.
For several minutes, node 33 tries to download all the segments but fails on
every checksum ("expected length is 137136 and actual length is 2096582", for
example).
Because it cannot keep its ZooKeeper session, node 27 stays down, so there is
no leader for the shard...
This behaviour happens every time I send too many QPS... Under 40 qps, we do
not have any problems...
Does it look similar to these ChaosMonkey failures?
> HdfsChaosMonkeyNothingIsSafeTest failures
> -----------------------------------------
>
> Key: SOLR-8129
> URL: https://issues.apache.org/jira/browse/SOLR-8129
> Project: Solr
> Issue Type: Bug
> Reporter: Yonik Seeley
> Attachments: fail.151005_064958, fail.151005_080319
>
>
> New HDFS chaos test in SOLR-8123 hits a number of types of failures,
> including shard inconsistency.