[ https://issues.apache.org/jira/browse/SOLR-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated SOLR-10398:
----------------------------------
    Description: 
I've seen a scenario where multiple LIRs (leader-initiated recovery requests) happen 
around the same time. In that case, even though PeerSync succeeded, the recovery still 
ended up failing, causing a full index fetch.

Sequence of events:
T1: The leader puts the replica into LIR and sets the replica's LIRState to DOWN
T2: The replica begins PeerSync and its LIRState changes
T3: The leader puts the replica into LIR again and the replica's LIRState is set back to DOWN
T4: The PeerSync triggered by the LIR at T1 succeeds, but the replica then examines its 
own LIRState, which is now DOWN, and fails, triggering a full replication (sketched below)
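
To make the interleaving concrete, here is a minimal standalone sketch of the race, 
written as plain Java rather than Solr's actual classes (the LirState enum and the 
lirState reference are stand-ins for the per-replica LIR state kept in ZooKeeper; the 
steps are laid out sequentially to mirror the timeline):
{code}
// Standalone illustration only -- not Solr's classes. "lirState" stands in for
// the per-replica leader-initiated-recovery state kept in ZooKeeper, and the
// comments mirror the T1..T4 timeline above.
import java.util.concurrent.atomic.AtomicReference;

public class LirRaceSketch {

  enum LirState { DOWN, RECOVERING, ACTIVE }

  // Stand-in for the replica's LIR znode.
  static final AtomicReference<LirState> lirState = new AtomicReference<>();

  public static void main(String[] args) {
    // T1: the leader puts the replica into LIR; its LIRState becomes DOWN.
    lirState.set(LirState.DOWN);

    // T2: the replica's recovery thread starts and changes the state before
    // attempting PeerSync.
    lirState.set(LirState.RECOVERING);
    boolean peerSyncSucceeded = doPeerSync();

    // T3: a second LIR request from the leader lands while PeerSync is still
    // running and resets the state to DOWN.
    lirState.set(LirState.DOWN);

    // T4: PeerSync succeeded, but the state the replica now reads back is DOWN,
    // so the attempt to register as active is rejected.
    System.out.println("PeerSync succeeded=" + peerSyncSucceeded
        + ", LIRState at publish time=" + lirState.get());
  }

  static boolean doPeerSync() {
    return true;  // in the reported scenario PeerSync itself succeeds
  }
}
{code}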

Log snippets:

T1 from the leader logs:
{code}
solr.log.2:12779:2017-03-23 03:03:18.706 INFO  (qtp1076677520-9812) [c:test 
s:shard73 r:core_node44 x:test_shard73_replica1] o.a.s.c.ZkController Put 
replica core=test_shard73_replica2 coreNodeName=core_node247 on 
server:8993_solr into leader-initiated recovery.
{code}

T2 from the replica logs:
{code}
solr.log.1:2017-03-23 03:03:26.724 INFO  (RecoveryThread-test_shard73_replica2) 
[c:test s:shard73 r:core_node247 x:test_shard73_replica2] 
o.a.s.c.RecoveryStrategy Attempting to PeerSync from 
http://server:8983/solr/test_shard73_replica1/ - recoveringAfterStartup=false
{code}

T3 from the leader logs:
{code}
solr.log.2:2017-03-23 03:03:43.268 INFO  (qtp1076677520-9796) [c:test s:shard73 
r:core_node44 x:test_shard73_replica1] o.a.s.c.ZkController Put replica 
core=test_shard73_replica2 coreNodeName=core_node247 on server:8993_solr into 
leader-initiated recovery.
{code}

T4 from the replica logs:
{code}
2017-03-23 03:05:38.009 INFO  (RecoveryThread-test_shard73_replica2) [c:test 
s:shard73 r:core_node247 x:test_shard73_replica2] o.a.s.c.RecoveryStrategy 
PeerSync Recovery was successful - registering as Active.
2017-03-23 03:05:38.012 ERROR (RecoveryThread-test_shard73_replica2) [c:test 
s:shard73 r:core_node247 x:test_shard73_replica2] o.a.s.c.RecoveryStrategy 
Error while trying to recover.:org.apache.solr.common.SolrException: Cannot 
publish state of core 'test_shard73_replica2' as active without recovering 
first!
 at org.apache.solr.cloud.ZkController.publish(ZkController.java:1179)
 at org.apache.solr.cloud.ZkController.publish(ZkController.java:1135)
 at org.apache.solr.cloud.ZkController.publish(ZkController.java:1131)
 at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:415)
 at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)

 2017-03-23 03:05:47.014 INFO  (RecoveryThread-test_shard73_replica2) [c:test 
s:shard73 r:core_node247 x:test_shard73_replica2] o.a.s.h.IndexFetcher Starting 
download to 
NRTCachingDirectory(MMapDirectory@/data4/test_shard73_replica2/data/index.20170323030546697
 lockFactory=org.apache.lucene.store.NativeFSLockFactory@4aa1e5c0; 
maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true
{code}
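
The failure at T4 is not in PeerSync itself but in the publish step that follows it: the 
replica re-reads its LIR state, finds it DOWN again, and refuses to register as active. 
Below is a heavily simplified sketch of that behaviour (plain Java with hypothetical 
names, not the real ZkController/RecoveryStrategy code):
{code}
// Illustrative sketch only -- hypothetical names, not the real ZkController or
// RecoveryStrategy code. Shows how a DOWN LIR state observed at publish time
// turns an otherwise successful PeerSync into a full index fetch.
public class PublishGuardSketch {

  enum LirState { DOWN, RECOVERING, ACTIVE }

  public static void main(String[] args) {
    // Set back to DOWN by the second LIR request (T3) while PeerSync was running.
    LirState lirStateAtPublishTime = LirState.DOWN;
    try {
      publishActive(lirStateAtPublishTime);
    } catch (IllegalStateException e) {
      // The recovery thread treats the failed publish as a failed recovery
      // attempt and falls back to replication (IndexFetcher with fullCopy=true).
      System.out.println("falling back to full replication: " + e.getMessage());
    }
  }

  static void publishActive(LirState lirState) {
    if (lirState == LirState.DOWN) {
      // Corresponds to the ERROR line in the T4 log above.
      throw new IllegalStateException(
          "Cannot publish state of core as active without recovering first!");
    }
    // otherwise: write ACTIVE to the cluster state (omitted here)
  }
}
{code}
In the T4 log this is exactly the transition from "PeerSync Recovery was successful" to 
the SolrException and the subsequent fullCopy=true download.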

I don't know yet what the best approach to tackle this problem is, but I'll post 
suggestions after doing some research. I wanted to create this Jira to track the issue.

> Multiple LIR requests can fail PeerSync even if it succeeds
> -----------------------------------------------------------
>
>                 Key: SOLR-10398
>                 URL: https://issues.apache.org/jira/browse/SOLR-10398
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Varun Thacker
>


