[jira] [Comment Edited] (SOLR-8372) Canceled recovery can lead to data loss

Yonik Seeley (JIRA) Wed, 09 Dec 2015 19:23:13 -0800

    [ https://issues.apache.org/jira/browse/SOLR-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049944#comment-15049944 ]


Yonik Seeley edited comment on SOLR-8372 at 12/10/15 3:21 AM:
--------------------------------------------------------------

I've been thinking about possible ways to deal with this:
- stop updates at a higher level... if the distributed update processor knows 
it's not in the right state to accept updates, then reject them
  -- this has problems with race conditions unless the check/reject is done 
with the bucket lock held
- keep buffering updates when recovery is canceled.  When another call to 
bufferUpdates() is made, reset the starting position so we know where replay 
needs to start from.
- introduce a new state into UpdateLog (the current states are REPLAYING, 
BUFFERING, APPLYING_BUFFERED, ACTIVE)
  -- this new state would do what?  Silently drop updates it receives?  Throw 
an exception?  The latter would seem to complicate things further if it could 
possibly cause another node to put us into LIR again.
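
A minimal sketch of the first option (check/reject with the bucket lock 
held), for illustration only.  This is not Solr code: the class 
RejectUnderLockSketch, the single "bucket" object, and the acceptingUpdates 
flag are all hypothetical names.  The point is only that the state check and 
the update are done under the same lock, so the state cannot change between 
the two.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model, not Solr's distributed update processor.
class RejectUnderLockSketch {
    // Stand-in for one version bucket; real code would have many.
    private final Object bucket = new Object();
    private boolean acceptingUpdates = true;                 // guarded by bucket
    private final List<String> applied = new ArrayList<>();  // guarded by bucket

    // The state change takes the same lock the update path uses.
    void setAcceptingUpdates(boolean accepting) {
        synchronized (bucket) {
            acceptingUpdates = accepting;
        }
    }

    // Check and apply under one lock: no window between "checked" and "applied".
    boolean tryAdd(String update) {
        synchronized (bucket) {
            if (!acceptingUpdates) {
                return false;  // rejected; the caller decides how to report it
            }
            applied.add(update);
            return true;
        }
    }

    List<String> applied() {
        synchronized (bucket) {
            return new ArrayList<>(applied);
        }
    }
}
```

If the check were done before taking the lock, a state change could slip in 
between the check and the add, which is the race mentioned above.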

In anticipation of really hairy scenarios, keeping updates might be more 
useful than dropping them.  So perhaps the "keep buffering" option is 
simplest, since it also avoids introducing another state?  We should normally 
receive only a few more updates anyway: those that were in the pipeline when 
something happened to our recovery attempt (like the leader dying). 
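
A rough sketch of the "keep buffering" idea, as a toy model and not the 
actual UpdateLog.  Only the four state names and the bufferUpdates() name come 
from the discussion above; the class KeepBufferingSketch, cancelRecovery(), 
the in-memory lists, and the replayStart field are assumptions made for 
illustration.  It also assumes that the next recovery's index replication 
covers everything logged before the new starting position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of the "keep buffering" option; not Solr's UpdateLog.
class KeepBufferingSketch {
    enum State { REPLAYING, BUFFERING, APPLYING_BUFFERED, ACTIVE }

    private State state = State.ACTIVE;
    private final List<String> log = new ArrayList<>();    // stand-in for the tlog
    private final List<String> index = new ArrayList<>();  // stand-in for the index
    private int replayStart = -1;  // first log position to replay; -1 if none

    synchronized State state() { return state; }

    synchronized List<String> indexed() { return new ArrayList<>(index); }

    // Each call resets the starting position, so a retried recovery
    // replays only what arrived after its own start.
    synchronized void bufferUpdates() {
        state = State.BUFFERING;
        replayStart = log.size();
    }

    // Canceled recovery: deliberately stay in BUFFERING instead of
    // dropping buffered updates and returning to ACTIVE.
    synchronized void cancelRecovery() {
        // no state change
    }

    synchronized void add(String update) {
        log.add(update);                // always logged
        if (state == State.ACTIVE) {
            index.add(update);          // applied only when active
        }
    }

    // Replays everything logged since the last bufferUpdates() call.
    synchronized int applyBufferedUpdates() {
        if (state != State.BUFFERING) {
            throw new IllegalStateException("not buffering");
        }
        state = State.APPLYING_BUFFERED;
        int replayed = 0;
        for (int i = replayStart; i < log.size(); i++) {
            index.add(log.get(i));
            replayed++;
        }
        replayStart = -1;
        state = State.ACTIVE;
        return replayed;
    }
}
```

In this model an update that arrives after cancelRecovery() is still only 
buffered, so it never looks like a normally applied update, and a second 
bufferUpdates() call moves the replay start past it.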



was (Author: [email protected]):
I've been thinking about possible ways to deal with this:
- stop updates at a higher level... if the distributed update processor knows 
it's not in the right state to accept updates, then reject them
  -- this has problems with race conditions unless the check/reject is with the 
bucket lock held
- keep buffering updates when recovery is canceled.  When another call to 
bufferUpdates() is made, reset the starting position so we know where replay 
needs to start from.
- introduce a new state into UpdateLog (the current states are REPLAYING, 
BUFFERING, APPLYING_BUFFERED, ACTIVE)
  -- this new state would do what?  Silently drop updates it receives?  Throw 
an exception?  The latter would seem to complicate things further if it could 
possibly cause another node to put us into LIR again.

In really hairy scenarios, one might think that keeping updates might be useful 
rather than dropping them.  So perhaps the "keep buffering" option may be 
simplest as it also avoids introducing another state?  It should normally only 
a a few more updates coming in that were in the pipeline when something 
happened to our recovery attempt anyway (like the leader dying)? 


> Canceled recovery can lead to data loss
> ---------------------------------------
>
>                 Key: SOLR-8372
>                 URL: https://issues.apache.org/jira/browse/SOLR-8372
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Yonik Seeley
>
> A recovery via index replication tells the update log to start buffering 
> updates.  If that recovery is canceled for whatever reason by the replica, 
> the RecoveryStrategy calls ulog.dropBufferedUpdates() which stops buffering 
> and places the UpdateLog back in active mode.  If updates come from the 
> leader after this point (and before RecoveryStrategy retries recovery), 
> the update will be processed as normal and added to the transaction log. If 
> the server is bounced, those last updates to the transaction log look normal 
> (no FLAG_GAP) and can be used to determine who is more up to date. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
