[jira] [Commented] (SOLR-12969) Solr replication failure

2018-11-12 Thread Cao Manh Dat (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683475#comment-16683475
 ] 

Cao Manh Dat commented on SOLR-12969:
-

Actually, I have another idea for making PeerSync independent of the tlog. The 
idea requires 2 changes:
* Versions should increase monotonically (instead of relying on clock time).
* Each commit point will contain a compressed representation of all the update 
versions that have been applied, e.g. 1..100,103..1000. 

With these changes:
* We can quickly work out which versions we missed, without computing a 
fingerprint or relying on the tlog. 
* We still need to think about FORCELEADER and similar cases, i.e. cases where 
different updates have the same update version.

But this approach requires a lot of changes and I'm not sure whether it is worth 
it. What we gain by going with the above approach is:
* A faster and more reliable PeerSync process.
* We don't have to compute a fingerprint any more.

Maybe one or two more cases will convince me to switch to the above approach. 
Any thoughts? [~shalinmangar] [~markrmil...@gmail.com] [~ysee...@gmail.com]
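
To make the version-range idea concrete, here is a minimal sketch of how a replica could compare its own compressed version list with the leader's. Everything in it is hypothetical: the class VersionRanges, its methods, and the "a..b,c..d" string format are illustrations only, not existing Solr APIs. It assumes versions are unique, monotonically increasing, and that each list is sorted and non-overlapping, so it does not address the FORCELEADER case mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch; not part of Solr. */
final class VersionRanges {
    private VersionRanges() {}

    /** Parses a compact list such as "1..100,103..1000" into inclusive [start, end] pairs. */
    public static List<long[]> parse(String compact) {
        List<long[]> ranges = new ArrayList<>();
        if (compact.trim().isEmpty()) {
            return ranges;
        }
        for (String part : compact.split(",")) {
            String[] bounds = part.trim().split("\\.\\.");
            long start = Long.parseLong(bounds[0]);
            long end = bounds.length > 1 ? Long.parseLong(bounds[1]) : start;
            ranges.add(new long[] {start, end});
        }
        return ranges;
    }

    /** Formats ranges back into the compact form; a single version prints as "n". */
    public static String format(List<long[]> ranges) {
        return ranges.stream()
            .map(r -> r[0] == r[1] ? Long.toString(r[0]) : r[0] + ".." + r[1])
            .collect(Collectors.joining(","));
    }

    /**
     * Returns the versions present in theirs but absent from ours.
     * Both inputs must be sorted by start and non-overlapping.
     */
    public static List<long[]> missing(List<long[]> ours, List<long[]> theirs) {
        List<long[]> result = new ArrayList<>();
        int i = 0;
        for (long[] t : theirs) {
            long cursor = t[0];
            // Skip our ranges that end before the part of t still to be checked.
            while (i < ours.size() && ours.get(i)[1] < cursor) {
                i++;
            }
            int j = i;
            while (cursor <= t[1]) {
                if (j < ours.size() && ours.get(j)[0] <= t[1]) {
                    long[] o = ours.get(j);
                    if (o[0] > cursor) {
                        // Gap between the cursor and our next range.
                        result.add(new long[] {cursor, o[0] - 1});
                    }
                    cursor = o[1] + 1;
                    j++;
                } else {
                    // None of our remaining ranges overlaps the rest of t.
                    result.add(new long[] {cursor, t[1]});
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<long[]> replica = parse("1..100,103..1000");
        List<long[]> leader = parse("1..1005");
        System.out.println(format(missing(replica, leader))); // prints 101..102,1001..1005
    }
}
```

The comparison is linear in the number of ranges rather than in the number of updates, which is where the "faster" part of the proposal would come from if most applied versions are contiguous.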

> Solr replication failure
> 
>
> Key: SOLR-12969
> URL: https://issues.apache.org/jira/browse/SOLR-12969
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public) 
> Components: replication (java)
> Reporter: Jeremy Smith
> Assignee: Cao Manh Dat
> Priority: Major
> Attachments: SOLR-12969.patch, SOLR-12969.patch, SOLR-12969.patch
>
>
> Under certain circumstances, replication fails between a leader and follower. 
>  The follower will not receive updates from the leader, even though the 
> leader has a newer version.  If the leader is restarted, it will get the 
> older version from the follower.
>  
> This was discussed on the [mailing 
> list|https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201810.mbox/%3CBYAPR04MB4406710795EA07E93BF80913ADCD0%40BYAPR04MB4406.namprd04.prod.outlook.com%3E]
>  and [~risdenk] [wrote a 
> script|https://github.com/risdenk/test-solr-start-stop-replica-consistency] 
> that demonstrates this error.  He also verified that the error occurs when 
> the script is run outside of docker.
>  
> Here is the scenario of the failure:
>  * A collection with 1 shard and 2 replicas
>  * Stop the non-leader replica (B)
>  * Index more than 100 documents to the collection
>  * Start replica B; it fails to do PeerSync and starts segment replication
>  * Index the 101st document to the collection
>  ** Leader's tlog: [1, 2, 3, ..., 100, 101]
>  ** Replica's tlog: [101]
>  * Stop replica B
>  * Index the 102nd document to the collection
>  * Start replica B; on doing PeerSync:
>  ** Leader's tlog: [1, 2, 3, ..., 100, 101, 102]
>  ** Replica's tlog: [101]
>  ** Leader's high (80th): 80
>  ** Replica's low: 101
>  ** By comparison: replica's low > leader's high => ALREADY_IN_SYNC



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12969) Solr replication failure

2018-11-11 Thread Cao Manh Dat (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683323#comment-16683323
 ] 

Cao Manh Dat commented on SOLR-12969:
-

There are actually 2 problems that I think we should fix here:
 * Since PeerSyncWithLeader is the case where a replica syncs with its leader, 
the result should not be ALREADY_IN_SYNC when the fingerprint comparison fails.
 * The PeerSync check for already_in_sync should be more robust: 
"{{ourLowThreshold > otherHigh && ourHighest >= otherHighest}}" instead of 
"{{ourLowThreshold > otherHigh}}". 
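
As a rough illustration of the second point, the sketch below evaluates both conditions with the numbers from the failure scenario in the issue description (replica's low 101, leader's high 80, replica's highest 101, leader's highest 102). The class and method names are hypothetical and this is not the actual PeerSync code; plain document numbers stand in for update versions, as they do in the scenario.

```java
/** Hypothetical sketch of the two conditions; not Solr's actual PeerSync code. */
final class AlreadyInSyncCheck {
    private AlreadyInSyncCheck() {}

    /** Current condition: compares only our low threshold with the other side's high threshold. */
    public static boolean originalCheck(long ourLowThreshold, long otherHigh) {
        return ourLowThreshold > otherHigh;
    }

    /** Proposed condition: additionally requires that we have seen the other side's newest update. */
    public static boolean proposedCheck(
            long ourLowThreshold, long ourHighest, long otherHigh, long otherHighest) {
        return ourLowThreshold > otherHigh && ourHighest >= otherHighest;
    }

    public static void main(String[] args) {
        // Replica's tlog is [101]; leader's tlog is [1..102] with a high threshold of 80.
        long ourLowThreshold = 101;
        long ourHighest = 101;
        long otherHigh = 80;
        long otherHighest = 102;

        // The current condition reports "in sync" although update 102 is missing.
        System.out.println(originalCheck(ourLowThreshold, otherHigh)); // prints true

        // The proposed condition rejects it because 101 < 102.
        System.out.println(
            proposedCheck(ourLowThreshold, ourHighest, otherHigh, otherHighest)); // prints false
    }
}
```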

 







[jira] [Commented] (SOLR-12969) Solr replication failure

2018-11-08 Thread Cao Manh Dat (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679971#comment-16679971
 ] 

Cao Manh Dat commented on SOLR-12969:
-

I confirm that this problem happens on different versions of Solr (maybe from 
the start, since this is a problem in the PeerSync logic). 

I attached a patch that fixes the problem, but I really don't like the way we 
currently do PeerSync. I may think about this for a couple of days before 
committing the patch.

P/S: The scenario in which this failure happens is tricky, and this bug has 
existed since the introduction of SolrCloud, so we don't have to rush it.







[jira] [Commented] (SOLR-12969) Solr replication failure

2018-11-06 Thread Kevin Risden (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677277#comment-16677277
 ] 

Kevin Risden commented on SOLR-12969:
-

[~caomanhdat] - Do you have any ideas here? I think you did work on replication.



