[ 
https://issues.apache.org/jira/browse/SOLR-12305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16550682#comment-16550682
 ] 

Shalin Shekhar Mangar commented on SOLR-12305:
----------------------------------------------

+1 LGTM

> Making buffering tlog bounded for faster recovery
> -------------------------------------------------
>
>                 Key: SOLR-12305
>                 URL: https://issues.apache.org/jira/browse/SOLR-12305
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Cao Manh Dat
>            Assignee: Cao Manh Dat
>            Priority: Major
>         Attachments: SOLR-12305.patch
>
>
> The current recovery process has 2 main problems (pointed out by 
> [~shalinmangar] ) which make it may never finish.
>  # The replay updates process is too slow, we do it in a single-thread 
> fashion. Therefore if the more updates get appended at a faster rate, the 
> replay process will be never finished
>  # The buffering tlog is unbounded, we keep adding more entries to buffering 
> tlog and waiting for them to get replayed. If we have a way to reduce the 
> number of updates in buffering tlog, even when replay process is slow it will 
> eventually finish.
> I come up with a solution for the second problem which is described on this 
> link:
> [https://docs.google.com/document/d/14DCkYRvYnQmactyWek3nYtUVdpu_CVIA4ZBTfQigjlU/edit?usp=sharing]
> In short, the document presents a modification for current recovery process 
> (section 3: algorithm) and also proof the correctness of the modification 
> (section 1 and 2). There are some pros in this approach
>  * Making buffering tlog bounded.
>  * It will automatically throttle updates from the leader, imagine this case
>  ** We have a shard with a leader and a replica. When leader sends replica an 
> update.
>  ** If the replica is healthy, the leader will have to wait for the replica 
> to finish process that updates before return to users. Let's call the total 
> time for an update is T0
>  ** If the replica is recovering, in the current code, the replica will only 
> append that update to its buffering tlog (which is much faster than 
> indexing), so the total time for an update is T1 < T0. Therefore the rate of 
> incoming updates will be higher in this case.
>  ** In above design, T1 will be subequal to T0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to