[ 
https://issues.apache.org/jira/browse/SOLR-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125589#comment-16125589
 ] 

Amrit Sarkar edited comment on SOLR-11069 at 8/14/17 12:14 PM:
---------------------------------------------------------------

Thank you, Erick, for clarifying the root cause. I see that LPV may well not be 
the issue we are facing here; pardon my limited testing on this.

Three things I tested, on a limited schedule, to see whether bootstrapping 
happens with Erick's patch on {{branch_6x}}:

1. Restart the source and target clusters at different intervals and see 
whether bootstrap happens.
2. On 2x2 source and target collections / clusters, shut down one leader node 
so that a follower becomes leader, and see whether bootstrap happens.
3. Observe the behaviour of source and target tlogs across all cores in both 
the source and target collections.

bq. 1. Restart source and target clusters, see if bootstrap is happening.
 
No bootstrap occurred except in the obvious cases where it is required. The 
combinations I tested:
1. CDCR stop, enable buffer, index X documents, then CDCR on, multiple 
restarts
2. CDCR stop, disable buffer, index X documents, then CDCR on, multiple 
restarts
3. CDCR stop, enable buffer, index X documents, then CDCR on, enable buffer 
again, multiple restarts
4. CDCR stop, disable buffer, index X documents, then CDCR on, disable buffer 
again, multiple restarts
5. The above 4 combinations, one after another, on a single set of source and 
target collections / clusters.

The expected behavior is observed: bootstrap happens when CDCR is turned on.
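
For anyone wanting to repeat the combinations above, here is a minimal sketch that lays out the CDCR API calls for each one. It is not the script used in the testing. The base URL and collection name are hypothetical, the indexing and restart steps are left as manual placeholders, and the mapping of "CDCR on" to the {{START}} action is an assumption.

```python
from urllib.parse import urlencode

# Hypothetical base URL and collection name, for illustration only.
SOURCE = "http://localhost:8983/solr/source_collection"


def cdcr_url(base, action):
    """Build a CDCR request-handler URL for the given action."""
    return f"{base}/cdcr?{urlencode({'action': action, 'wt': 'json'})}"


def combination(buffer_before, buffer_after=None):
    """Return the ordered steps for one test combination.

    buffer_before / buffer_after are 'ENABLEBUFFER' or 'DISABLEBUFFER';
    buffer_after is None when the buffer is not toggled again after START.
    """
    steps = [
        cdcr_url(SOURCE, "STOP"),
        cdcr_url(SOURCE, buffer_before),
        "index X documents",  # manual step
        cdcr_url(SOURCE, "START"),
    ]
    if buffer_after is not None:
        steps.append(cdcr_url(SOURCE, buffer_after))
    steps.append("restart nodes")  # manual step
    return steps


# Combinations 1-4 from the list above, in order.
COMBINATIONS = [
    combination("ENABLEBUFFER"),
    combination("DISABLEBUFFER"),
    combination("ENABLEBUFFER", "ENABLEBUFFER"),
    combination("DISABLEBUFFER", "DISABLEBUFFER"),
]

if __name__ == "__main__":
    for number, steps in enumerate(COMBINATIONS, start=1):
        print(f"Combination {number}:")
        for step in steps:
            print("  ", step)
```

Combination 5 would simply be these four lists run back to back against the same collections.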

bq. 2.  On 2x2 source and target collection - clusters, shut down one node / 
leader to get the other nodes / follower as leader, see if bootstrap is 
happening.

No bootstrap occurred except in the obvious cases where it is required. The 
combinations I tested:
1. CDCR stop, enable buffer, index X documents, then CDCR on, shut down the 
leader node
2. CDCR stop, disable buffer, index X documents, then CDCR on, shut down the 
leader node
3. CDCR stop, enable buffer, index X documents, then CDCR on, enable buffer 
again, shut down the leader node
4. CDCR stop, disable buffer, index X documents, then CDCR on, disable buffer 
again, shut down the leader node
5. The above 4 combinations, one after another, on a single set of source and 
target collections / clusters.

The expected behavior is observed: bootstrap happens when CDCR is turned on. 
{{COLLECTIONCHECKPOINT}} and {{LASTPROCESSEDVERSION}} are successfully 
transferred / referred to the corresponding newly elected leader.

bq. 3. Observe behaviour of source and target tlogs across all cores in both 
source and target collections.

This was peculiar and, as Erick stated in an offline discussion, I had the 
same observations:
a) When the buffer is enabled, all tlogs are kept on disk forever.
b) Once the buffer is disabled and no indexing is taking place, the tlogs 
remain as they are.
c) When a single document is indexed after that, the old tlogs get purged, but 
*it doesn't keep ONLY 10 tlogs as expected*; more remain, and the number 
gradually decreases as we index along.
d) At times only 1-2 tlogs are present in each core of the source collection, 
as Erick also observed, when we stop indexing altogether or index slowly. *I 
am not sure of the reason* and did not have a chance to look into it, but I 
speculate there is no need to keep 10 or any fixed number N of tlogs, only to 
keep track of the last processed tlog version, which could be in the 2nd, 10th 
or 30th tlog, depending on the situation.
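
To make the speculation in d) concrete, here is a minimal sketch of a retention rule driven by the last processed version instead of a fixed tlog count. It is only an illustration and not how Solr's update log is implemented. The {{Tlog}} class, its {{start_version}} field and the {{purgeable}} function are hypothetical, and versions are assumed to be positive and increasing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tlog:
    name: str
    start_version: int  # version of the first update in this file (assumed)


def purgeable(tlogs, last_processed_version):
    """Return the tlogs that lie entirely before the last processed version.

    The tlog containing the last processed version is the last one whose
    start_version is <= that version; everything older can be dropped.
    With an unknown version (-1) nothing is considered purgeable.
    """
    ordered = sorted(tlogs, key=lambda t: t.start_version)
    keep_from = 0
    for index, tlog in enumerate(ordered):
        if tlog.start_version <= last_processed_version:
            keep_from = index
    return ordered[:keep_from]


if __name__ == "__main__":
    logs = [Tlog("tlog.0", 100), Tlog("tlog.1", 200),
            Tlog("tlog.2", 300), Tlog("tlog.3", 400)]
    print([t.name for t in purgeable(logs, 250)])
    print([t.name for t in purgeable(logs, -1)])
```

Under such a rule the number of tlogs kept depends on how far the reader has progressed, which would be consistent with seeing sometimes 1-2 and sometimes many more.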




> LASTPROCESSEDVERSION for CDCR is flawed when buffering is enabled
> -----------------------------------------------------------------
>
>                 Key: SOLR-11069
>                 URL: https://issues.apache.org/jira/browse/SOLR-11069
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: CDCR
>    Affects Versions: 7.0
>            Reporter: Amrit Sarkar
>            Assignee: Erick Erickson
>         Attachments: SOLR-11069.patch
>
>
> The {{LASTPROCESSEDVERSION}} (abbreviated LPV) action for CDCR breaks down 
> due to a poorly initialised and maintained buffer log for either source or 
> target cluster core nodes.
> If the buffer is enabled for cores of either the source or target cluster, it 
> returns {{-1}}, *irrespective of the number of entries in the tlog read by the 
> {{leader}}* node of each shard of the respective collection of the respective 
> cluster. Once the buffer is disabled, it starts reporting the correct LPV for 
> each core.
> Due to the same flawed behavior, the Update Log Synchroniser may not work as 
> expected, i.e. it may provide an incorrect seek position for the 
> {{non-leader}} nodes to advance to. I am not sure whether this is intended 
> behavior for sync, but it surely doesn't feel right.


