[jira] [Commented] (SOLR-11815) TLOG leaders going down and rejoining as a replica do fullCopy when not needed

Shaun Sabo (JIRA) Wed, 03 Jan 2018 14:50:17 -0800

    [ https://issues.apache.org/jira/browse/SOLR-11815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16310421#comment-16310421 ]


Shaun Sabo commented on SOLR-11815:
-----------------------------------

Yup, that log line was what sent us looking at this path. 

If isIndexStale stops reporting checksum differences or file size mismatches, 
then it no longer does anything useful and could apparently be removed 
entirely. 

> TLOG leaders going down and rejoining as a replica do fullCopy when not needed
> ------------------------------------------------------------------------------
>
>                 Key: SOLR-11815
>                 URL: https://issues.apache.org/jira/browse/SOLR-11815
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: replication (java)
>    Affects Versions: 7.2
>         Environment: Oracle JDK 1.8
> Ubuntu 16.04
>            Reporter: Shaun Sabo
>            Assignee: Ishan Chattopadhyaya
>
> I am running a collection with a persistently high volume of writes. When the 
> leader goes down and recovers, it rejoins as a replica and asks the new leader 
> for the files to sync. The isIndexStale check finds that some files 
> differ in size and checksum, which forces a fullCopy. Since our indexes are 
> rather large, a rolling restart results in large amounts of data 
> transfer and, in some cases, disk space contention.
> I do not believe the fullCopy is necessary given the circumstances. 
> Repro Steps:
> 1. A collection/shard with 1 leader and 1 replica is accepting writes.
>     - Pull interval is 30 seconds
>     - Hard commit interval is 60 seconds
> 2. The replica executes an index pull, which completes.
> 3. The leader process hard commits (the replica's index now lags behind).
> 4. The leader process is killed (SIGTERM).
> 5. The replica takes over as the new leader.
> 6. The new leader applies the TLOG since the last pull (the cores are now 
> binary-divergent).
> 7. The former leader comes back as the new replica.
> 8. The new replica initiates recovery.
>     - Recovery detects that the generation and version are behind and that a 
> check is necessary
> 9. isIndexStale() detects that a segment exists on both the new replica and 
> the new leader, but that the size and checksum differ.
>     - This flags fullCopy on
> 10. The entire index is pulled regardless of what changed.
> The majority of files should not have changes, but everything gets pulled 
> because of the first file found with a mismatched checksum. 
> Relevant Code:
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java#L516-L518
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java#L1105-L1126
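
To make the behaviour in steps 9 and 10 concrete, here is a minimal, 
self-contained sketch. It is not the IndexFetcher code: the class 
StaleCheckSketch, the FileMeta type, and both method names are hypothetical 
and exist only for illustration. The first method mimics the reported 
behaviour (one shared file with a different size or checksum flags a full 
copy); the second shows one possible alternative, assuming it is safe to 
fetch only the files that are missing or differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StaleCheckSketch {

    /** Hypothetical per-file metadata: length in bytes plus a checksum. */
    static final class FileMeta {
        final long size;
        final long checksum;

        FileMeta(long size, long checksum) {
            this.size = size;
            this.checksum = checksum;
        }

        /** True only if the other entry exists and has the same size and checksum. */
        boolean matches(FileMeta other) {
            return other != null && size == other.size && checksum == other.checksum;
        }
    }

    /** Reported behaviour: any file present on both sides that differs flags a full copy. */
    static boolean fullCopyNeeded(Map<String, FileMeta> local, Map<String, FileMeta> leader) {
        for (Map.Entry<String, FileMeta> e : leader.entrySet()) {
            FileMeta mine = local.get(e.getKey());
            if (mine != null && !mine.matches(e.getValue())) {
                return true; // first mismatch is enough
            }
        }
        return false;
    }

    /** Possible alternative (an assumption, not a confirmed fix): list only missing or differing files. */
    static List<String> filesToFetch(Map<String, FileMeta> local, Map<String, FileMeta> leader) {
        List<String> toFetch = new ArrayList<>();
        for (Map.Entry<String, FileMeta> e : leader.entrySet()) {
            if (!e.getValue().matches(local.get(e.getKey()))) {
                toFetch.add(e.getKey());
            }
        }
        return toFetch;
    }

    public static void main(String[] args) {
        // Made-up file names, sizes and checksums, purely for illustration.
        Map<String, FileMeta> local = new LinkedHashMap<>();
        local.put("_0.cfs", new FileMeta(100, 11));
        local.put("_1.cfs", new FileMeta(200, 22));
        local.put("_2.cfs", new FileMeta(300, 33));

        Map<String, FileMeta> leader = new LinkedHashMap<>();
        leader.put("_0.cfs", new FileMeta(100, 11));
        leader.put("_1.cfs", new FileMeta(200, 22));
        leader.put("_2.cfs", new FileMeta(310, 44)); // same name, diverged content
        leader.put("_3.cfs", new FileMeta(50, 55));  // new on the leader

        System.out.println(fullCopyNeeded(local, leader)); // prints "true"
        System.out.println(filesToFetch(local, leader));   // prints "[_2.cfs, _3.cfs]"
    }
}
```

In this made-up example, two of the four leader files would need to be 
transferred, whereas the first-mismatch rule pulls all four.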



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

