[jira] [Commented] (SOLR-11702) Redesign current LIR implementation

Cao Manh Dat (JIRA) Sun, 14 Jan 2018 21:53:22 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16325907#comment-16325907
 ]


Cao Manh Dat commented on SOLR-11702:
-------------------------------------

Thanks [~shalinmangar]
{quote}
1. DUP.setupRequest skips replicas having terms. If I understand correctly, 
this will mean that updates are no longer forwarded to replicas until they 
publish themselves in recovery? Is that right?
{quote}
Right. If a replica's term is lower than the leader's term, the leader stops 
sending updates to that replica.
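
As a rough illustration of that rule, the sketch below filters forwarding 
targets by term. The class {{TermFilterSketch}}, the method 
{{replicasToForward}} and the plain map of terms are hypothetical stand-ins 
chosen for this example; this is not the code in the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "skip replicas whose term is lower than the leader's".
// Names and data structures are assumptions, not the SOLR-11702 implementation.
class TermFilterSketch {

    /**
     * Returns the replicas the leader would still forward updates to.
     * Assumes the leader itself has an entry in the terms map.
     */
    static List<String> replicasToForward(Map<String, Long> terms, String leader) {
        long leaderTerm = terms.get(leader);
        List<String> targets = new ArrayList<>();
        for (Map.Entry<String, Long> e : terms.entrySet()) {
            if (e.getKey().equals(leader)) {
                continue; // the leader does not forward to itself
            }
            if (e.getValue() >= leaderTerm) {
                targets.add(e.getKey()); // replica is up to date with the leader's term
            }
            // replicas with a lower term are skipped until they catch up
        }
        return targets;
    }
}
```

With terms {{core_node1=2, core_node2=2, core_node3=1}} and {{core_node1}} as 
leader, only {{core_node2}} would be kept as a target.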

{quote}
2. CreateCollectionCmd – throw InterruptedException directly from the method 
instead of trying to handle it here
{quote}
The code that deletes old term nodes in CreateCollectionCmd handles the 
exception in exactly the same way as the code below it, so I do not understand 
the problem here.

{quote}
3. Mark LIR related classes/methods as deprecated – those are more likely to 
get attention right before 8.0 I think.
{quote}
Sure, this is a good idea.

{quote}
5. RecoveringCoreTermWatcher – Shouldn't lastTermDoRecovery be set after 
recovery completes? If not, how do we ensure that recoveries are stacked up?
{quote}
I do not see any problem with the current implementation: after we call 
{{doRecovery}}, the recovery process starts shortly afterwards.
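
To make the ordering concrete, here is a minimal sketch of a watcher that 
records the term before it requests recovery, so that repeated notifications 
for the same term do not request recovery twice. Only the names 
{{lastTermDoRecovery}} and {{doRecovery}} come from the discussion; the class 
{{TermWatcherSketch}}, its method signature and the counter are assumptions 
made for illustration, not the code in the patch.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; the real RecoveringCoreTermWatcher may differ.
class TermWatcherSketch {
    // Term of this replica for which recovery was last requested.
    private final AtomicLong lastTermDoRecovery = new AtomicLong(-1);
    // Counter used only to observe behavior in this sketch.
    int recoveriesRequested = 0;

    void onTermChanged(long myTerm, long leaderTerm) {
        if (myTerm >= leaderTerm) {
            return; // not behind the leader, nothing to do
        }
        if (lastTermDoRecovery.get() >= myTerm) {
            return; // recovery was already requested for this term
        }
        // Record the term first, then request recovery.
        lastTermDoRecovery.set(myTerm);
        doRecovery();
    }

    // Stand-in for the real call, which starts recovery asynchronously.
    void doRecovery() {
        recoveriesRequested++;
    }
}
```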

{quote}
6. RecoveringCoreTermWatcher catches NullPointerException. Do a null check 
instead.
{quote}
Sure!
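
For reference, the change amounts to something like the sketch below. The 
class {{NullCheckSketch}}, the {{Supplier}} lookup and the boolean return value 
are hypothetical and only illustrate replacing a caught NullPointerException 
with an explicit check.

```java
import java.util.function.Supplier;

// Hypothetical sketch; names and the return convention are assumptions.
class NullCheckSketch {

    /** Returns false when the looked-up object is missing, true otherwise. */
    static boolean onTermChanged(Supplier<String> coreLookup) {
        String core = coreLookup.get();
        if (core == null) {
            return false; // explicit null check instead of catching NullPointerException
        }
        // ... continue with the non-null core here ...
        return !core.isEmpty();
    }
}
```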

{quote}
7. RecoveryStrategy – why pingLeader? isn't it sufficient to use 
ZkStateReader.getLeaderRetry as we used to do earlier?
{quote}
Imagine this case, where there is a network partition between the leader and a 
replica:
* The leader increases the term of the replica.
* RecoveringCoreTermWatcher triggers the recovery process of the replica; the 
replica goes into recovery (hence increasing its term).
* The leader increases the term of the replica again (because it failed to send 
an update to the replica, and the replica's term is now equal to the leader's 
term).
* RecoveringCoreTermWatcher triggers the recovery process of the replica again; 
the replica goes into recovery (hence increasing its term).
* ... this process repeats until the network is healed.
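
The loop above can be reproduced with a toy simulation. Everything in it is an 
assumption made for illustration: the class {{PartitionLoopSketch}}, the two 
counters, and in particular the reading that a ping is meant to stop the 
replica from entering recovery while the leader is unreachable. It also models 
"the replica falls behind" as the leader's term moving ahead of the replica's, 
which is a simplification for this sketch and not the code in the patch.

```java
// Hypothetical simulation of the partition scenario; not the real implementation.
class PartitionLoopSketch {

    /**
     * Returns how many times the replica enters recovery while the network
     * stays partitioned for the given number of update attempts.
     */
    static int recoveriesDuringPartition(int updateAttempts, boolean pingLeaderFirst) {
        long leaderTerm = 1;
        long replicaTerm = 1;
        int recoveries = 0;
        for (int i = 0; i < updateAttempts; i++) {
            if (replicaTerm < leaderTerm) {
                continue; // leader no longer forwards updates to a replica that is behind
            }
            leaderTerm++; // the update failed, so the replica is now behind the leader
            // The term watcher notices the change and asks for recovery.
            boolean leaderReachable = false; // partitioned for the whole run
            if (pingLeaderFirst && !leaderReachable) {
                continue; // give up before touching the replica's term
            }
            replicaTerm = leaderTerm; // entering recovery raises the replica's term
            recoveries++;
        }
        return recoveries;
    }
}
```

Under these assumptions, five update attempts without the ping cause five 
recoveries, while with the ping the replica stays behind and no recovery is 
started until the leader becomes reachable.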

{quote}
8. ZkCollectionTerms – if getShard and remove methods need to be synchronized 
then seems like close can interfere. Perhaps better to synchronize on the terms 
map itself.
{quote}
This is a good idea.
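
A minimal sketch of what synchronizing on the terms map could look like is 
shown below. The classes {{CollectionTermsSketch}} and {{ShardTermsSketch}}, 
and the {{closed}} flag, are hypothetical and added for illustration; only the 
method names {{getShard}}, {{remove}} and {{close}} come from the review 
comment.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the per-shard terms object.
class ShardTermsSketch {
    final String shardId;
    boolean closed = false;

    ShardTermsSketch(String shardId) {
        this.shardId = shardId;
    }

    void close() {
        closed = true;
    }
}

// Hypothetical sketch: all three methods take the same lock (the map itself),
// so close() cannot interleave with getShard() or remove().
class CollectionTermsSketch implements AutoCloseable {
    private final Map<String, ShardTermsSketch> terms = new HashMap<>();
    private boolean closed = false; // guarded by the lock on terms

    ShardTermsSketch getShard(String shardId) {
        synchronized (terms) {
            if (closed) {
                throw new IllegalStateException("already closed");
            }
            return terms.computeIfAbsent(shardId, ShardTermsSketch::new);
        }
    }

    void remove(String shardId) {
        synchronized (terms) {
            ShardTermsSketch removed = terms.remove(shardId);
            if (removed != null) {
                removed.close();
            }
        }
    }

    @Override
    public void close() {
        synchronized (terms) {
            closed = true;
            terms.values().forEach(ShardTermsSketch::close);
            terms.clear();
        }
    }
}
```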

{quote}
9. Can you explain the purpose of "new".equals(cd.getCoreProperty("lirVersion", 
"new"))) used in various places?
{quote}
That flag is mostly used for testing rolling updates and can be removed in 
SOLR-11812.

> Redesign current LIR implementation
> -----------------------------------
>
>                 Key: SOLR-11702
>                 URL: https://issues.apache.org/jira/browse/SOLR-11702
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Cao Manh Dat
>            Assignee: Cao Manh Dat
>            Priority: Major
>         Attachments: SOLR-11702.patch, SOLR-11702.patch
>
>
> I recently looked into some problems related to races between LIR and 
> recovery. I would like to propose a totally new approach to solving the 
> SOLR-5495 problem, because patching the current implementation with band-aids 
> will lead to other problems (we cannot prove the correctness of the 
> implementation).
> Feel free to give comments/thoughts about this new scheme.
> https://docs.google.com/document/d/1dM2GKMULsS45ZMuvtztVnM2m3fdUeRYNCyJorIIisEo/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

