[jira] [Commented] (SOLR-10751) Master/Slave IndexVersion conflict

2019-02-22 Thread Cao Manh Dat (JIRA)


[ https://issues.apache.org/jira/browse/SOLR-10751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774998#comment-16774998 ]

Cao Manh Dat commented on SOLR-10751:
-

tl;dr: I'm good with going with #2.

> Master/Slave IndexVersion conflict
> --
>
> Key: SOLR-10751
> URL: https://issues.apache.org/jira/browse/SOLR-10751
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 7.0
>Reporter: Tomás Fernández Löbbe
>Assignee: Tomás Fernández Löbbe
>Priority: Major
> Attachments: SOLR-10751.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I’ve been looking at some failures in the replica types tests. One strange 
> failure I noticed is that master and slave share the same version but have 
> different generations. The IndexFetcher code does more or less this:
> {code}
> masterVersion = fetchMasterVersion()
> masterGeneration = fetchMasterGeneration()
> if (masterVersion == 0 && slaveGeneration != 0 && forceReplication) {
>   delete my index
>   commit locally
>   return
> }
> if (masterVersion != slaveVersion) {
>   fetchIndexFromMaster(masterGeneration)
> } else {
>   // do nothing, master and slave are in sync.
> }
> {code}
> The problem I see happens with this sequence of events:
> # delete index in master (not a DBQ=\*:\*, I mean a complete removal of the 
> index files and a reload of the core)
> # replication happens in slave (it sees version 0, deletes the local index and 
> commits)
> # add a document in master and commit
> If the commit in master and the commit in the slave happen in the same 
> millisecond*, they both end up with the same version, but different indices.
> I think that in addition to checking for the same version, we should validate 
> that slave and master have the same generation and, if not, consider them not 
> in sync and proceed with the replication.
> True, this is a situation that's unlikely to happen in a real prod 
> environment and it's more likely to affect tests, but I think the change 
> makes sense.
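
A minimal, self-contained sketch of the check proposed in the description above (the class and method names are illustrative, not the actual {{IndexFetcher}} fields): the slave is only considered in sync when *both* the version (commit timestamp in milliseconds) and the generation match, which closes the same-millisecond window described in the issue.
{code:java}
public class SyncCheckSketch {
    // Proposed check: the version alone is not enough, the generation must match too.
    static boolean inSync(long masterVersion, long masterGeneration,
                          long slaveVersion, long slaveGeneration) {
        return masterVersion == slaveVersion && masterGeneration == slaveGeneration;
    }

    public static void main(String[] args) {
        long sameMillis = 1495700000000L; // both commits landed in the same millisecond
        // A version-only check would wrongly report "in sync"; the generation catches it.
        System.out.println(inSync(sameMillis, 5L, sameMillis, 2L)); // false -> replicate
        System.out.println(inSync(sameMillis, 5L, sameMillis, 5L)); // true  -> nothing to do
    }
}
{code}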






[jira] [Commented] (SOLR-10751) Master/Slave IndexVersion conflict

2019-02-22 Thread Cao Manh Dat (JIRA)


[ https://issues.apache.org/jira/browse/SOLR-10751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774996#comment-16774996 ]

Cao Manh Dat commented on SOLR-10751:
-

Hi [~tomasflobbe], here is some of my analysis of this case (assuming we go 
with #2 for TLOG replicas):

Case 1: The wipe-out is done by a DBQ. The DBQ will then be present in the 
replica's tlog, so in either of the later cases (the leader keeps serving, or 
the leader goes down) the replica is guaranteed to have enough data to 
continue. I think #2 is a great solution here: we avoid the cases where both 
the leader's and the replica's index are empty but their tlogs contain 
different things. The only downside is that a {{DBQ *:*}} will make TLOG 
replicas out of sync with the leader until the next commit happens on the 
leader; this change in behaviour should be noted to users.

Case 2: The wipe-out is done without a DBQ and the leader stays healthy until 
the next commit. We are still fine here, since commit versions are generated 
incrementally, so only updates after the next commit are copied over.

Case 3: The wipe-out is done without a DBQ and the leader goes down before 
finishing the next commit. The index of the shard is now unpredictable.




[jira] [Commented] (SOLR-10751) Master/Slave IndexVersion conflict

2019-02-21 Thread Tomás Fernández Löbbe (JIRA)


[ https://issues.apache.org/jira/browse/SOLR-10751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774781#comment-16774781 ]

Tomás Fernández Löbbe commented on SOLR-10751:
--

I created a PR with #2, still WIP. In the PR I only handle the version 0 case 
differently for PULL replicas; however, [~caomanhdat] did something related for 
TLOG replicas. For TLOG there is no commit, but the replica opens a new 
searcher and updates the commit point in the {{IndexFetcher}}. I'm guessing 
this is so that TLOG replicas show 0 results for searches, and also so that if 
one becomes the leader, the followers will replicate the empty index from it. 
I'm wondering if for TLOG replicas we would actually want the same behavior as 
PULLs, with no replication happening in the version 0 case?
 [~caomanhdat], [~shalinmangar], your input would be great.
 As for testing, both {{TestPullReplica}} and {{TestTlogReplica}} are disabled 
with {{@AwaitsFix}} at this point. I enabled {{TestPullReplica}} and it's in 
good shape. {{TestTlogReplica}} did have many failures; I'm going to take a 
look at those. {{ChaosMonkeyNothingIsSafeWithPullReplicasTest}} is also looking 
better (1 failure after 1k runs, and it's an object leak that actually seems 
related to this {{openNewSearcherAndUpdateCommitPoint}} code).
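
A rough, non-authoritative sketch of the branching being discussed here, assuming we go with #2. All names below are hypothetical except {{openNewSearcherAndUpdateCommitPoint}}, which is the method mentioned above (stubbed here so the sketch compiles); this is not the actual {{IndexFetcher}} code.
{code:java}
public class VersionZeroHandlingSketch {
    enum ReplicaType { PULL, TLOG, NRT }
    enum Outcome { KEEP_LOCAL_INDEX, EMPTY_LOCAL_INDEX }

    // stub: in IndexFetcher this opens a new searcher and moves the commit point
    static void openNewSearcherAndUpdateCommitPoint() { }

    // What a replica could do when the leader reports indexVersion == 0
    static Outcome onMasterVersionZero(ReplicaType type) {
        switch (type) {
            case PULL:
                // option #2: nothing to replicate, keep serving the local index as-is
                return Outcome.KEEP_LOCAL_INDEX;
            case TLOG:
                // behaviour described above for the current TLOG handling: no commit,
                // but a new searcher is opened and the commit point updated, so
                // searches end up returning 0 results
                openNewSearcherAndUpdateCommitPoint();
                return Outcome.EMPTY_LOCAL_INDEX;
            default:
                // NRT keeps today's forced-replication path (delete index + local commit)
                return Outcome.EMPTY_LOCAL_INDEX;
        }
    }

    public static void main(String[] args) {
        System.out.println(onMasterVersionZero(ReplicaType.PULL)); // KEEP_LOCAL_INDEX
        System.out.println(onMasterVersionZero(ReplicaType.TLOG)); // EMPTY_LOCAL_INDEX
    }
}
{code}
The open question above is whether the TLOG branch should instead behave like the PULL branch.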




[jira] [Commented] (SOLR-10751) Master/Slave IndexVersion conflict

2017-05-26 Thread Tomás Fernández Löbbe (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-10751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026695#comment-16026695 ]

Tomás Fernández Löbbe commented on SOLR-10751:
--

OK, I see now why this hasn't been a problem so far. Note that the "delete my 
index" only happens in the case of a "forced replication". Forced replications 
in Master/Slave can only happen on a retry, which should not happen if the 
master is returning version 0 (unless I'm misunderstanding something here, this 
code should never be executed if you are running Master/Slave). In SolrCloud 
mode, a forced replication can happen if the last attempt to replicate was 
unsuccessful. Until now, replication in SolrCloud was only used for recovery, 
and in Cloud mode it's "OK" to have different versions of the index; plus, in 
the particular test example I described in the issue, the replication would 
have been followed by the application of the buffered updates, so the indices 
would soon be in sync. This becomes an issue only now that we have TLOG and 
PULL replicas.

In any case, we need to fix it now for the new scenario. I also like your #2 
option (#1 sounds like too big of a change), and it should be easy to 
implement, although NRT replicas still need this logic, I believe.




[jira] [Commented] (SOLR-10751) Master/Slave IndexVersion conflict

2017-05-26 Thread Hoss Man (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-10751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026620#comment-16026620 ]

Hoss Man commented on SOLR-10751:
-

bq. so, "0" is not really the version of the index, but it's what the master 
responds to the slaves when there is no replicable index.

And to elaborate on our IRC conversation: at the point where we were theorizing 
about why the master might return "0" (before Tomás had found this particular 
bit of code and verified it matched our theory), I posed the following straw-man 
suggestion(s) for dealing with this special "sentinel" value of "I have no 
index"...

# We could change Solr core/updateHandler initialization so there is _never_ a 
situation where a Solr core is responding to requests but has no index / 
commitPoint -- thus completely eliminating the need for the sentinel value & 
special-case logic on slaves, because they will always have _something_ they 
can fetch
#* i.e.: on startup, if there is no index, create & commit immediately
# We could "fix" the semantics of replication on the slave side (see the sketch 
after this list)...
#* if the master returns indexVersion==0, the slave treats that as "master 
has nothing to replicate, I should do nothing" (and possibly 'fail' if the 
replication was explicitly requested vs. timer-based)
#* as opposed to the current logic, which is "master has nothing to replicate, 
I will blindly create my own arbitrary index independent of master (via 
deleteAll)"

I still think either one of these options would be a good idea -- depending on 
what we want the semantics to be:  

# Should a situation where an external force blows away the master index (or 
someone forces a node w/o an index to be a leader) cause slaves/replicas to 
*immediately* purge all data?
# Or should slaves/replicas keep what they've got until the master/leader 
actually has something for them to replicate?  

Personally, I think #2 makes more sense.

As a practical example: assume someone is doing classic master/slave 
replication and their master has a hardware failure. The slaves are still 
serving queries just fine. Rather than swap out an existing slave to be the 
new master, the admin creates an entirely new server to be the master and plans 
on rebuilding the index -- but by reusing the master.company.com hostname, the 
new node starts receiving /replication requests immediately from the existing 
slaves. Should those slaves really immediately delete all docs from their 
local indexes even though the master is explicitly telling them "I have nothing 
for you to replicate"? ... That sounds like a bug to me.

On the flip side: if chaos has rained down on a SolrCloud cluster, and a new 
leader w/o any index at all has popped up -- I think it's "ok" for replicas to 
serve stale data until the leader has new data for them ... but if we think 
that in the cloud case it's important that all replicas should _immediately_ 
"recover" the "theoretically empty if it did exist" version of the index from 
their leader, then perhaps the leader election code should involve a special 
case to force a commit on the leader if it has no existing commit points?



Either way, I *ALSO* have the vague impression that Tomás's primary suggestion 
of always checking generation is correct as well ... but it seems so obvious 
that I'm not sure whether there is some good reason, which I'm oblivious to, 
why the code doesn't already do that?



[jira] [Commented] (SOLR-10751) Master/Slave IndexVersion conflict

2017-05-25 Thread Tomás Fernández Löbbe (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-10751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025410#comment-16025410 ]

Tomás Fernández Löbbe commented on SOLR-10751:
--

[~hossman] and I had a conversation about this on IRC yesterday, and his 
concern was "Why is master creating an index with version 0 and the slave is 
not?". After investigating some more, I noticed this code in the 
{{ReplicationHandler}}:
{code:java}
if (commitPoint != null && replicationEnabled.get()) {
    //
    // There is a race condition here.  The commit point may be changed / deleted by the time
    // we get around to reserving it.  This is a very small window though, and should not result
    // in a catastrophic failure, but will result in the client getting an empty file list for
    // the CMD_GET_FILE_LIST command.
    //
    core.getDeletionPolicy().setReserveDuration(commitPoint.getGeneration(), reserveCommitDuration);
    rsp.add(CMD_INDEX_VERSION, IndexDeletionPolicyWrapper.getCommitTimestamp(commitPoint));
    rsp.add(GENERATION, commitPoint.getGeneration());
} else {
    // This happens when replication is not configured to happen after startup and no commit/optimize
    // has happened yet.
    rsp.add(CMD_INDEX_VERSION, 0L);
    rsp.add(GENERATION, 0L);
}
{code}
so, "0" is not really the version of the index, but it's what the master 
responds to the slaves when there is no replicable index.
