[ https://issues.apache.org/jira/browse/SOLR-11815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16310351#comment-16310351 ]
Ishan Chattopadhyaya edited comment on SOLR-11815 at 1/3/18 10:11 PM:
----------------------------------------------------------------------

I tried to reproduce the scenario by modifying TestTlogReplicas#testKillLeader(). I was unable to do so, since ChaosMonkey#stop() and ChaosMonkey#kill() both seem to kill the leader (instead of performing a graceful shutdown).

I could reproduce it the hard way, using Docker: https://github.com/chatman/solr-grafana-docker/tree/master/tlog-restart

In short, here's my test scenario (https://github.com/chatman/solr-grafana-docker/blob/master/tlog-restart/indexing.sh):
# create a collection with two TLOG replicas
# add a few docs
# commit
# add a few docs
# commit
# add a few docs
# stop the leader
# add a few docs (to the new leader)
# commit
# bring back the old leader
# observe the old leader's logs (it tries a fullCopy=true)

A full copy happens due to the mismatch, but this is wasteful, since all but the most recent segments are the same.

> TLOG leaders going down and rejoining as a replica do fullCopy when not needed
> ------------------------------------------------------------------------------
>
> Key: SOLR-11815
> URL: https://issues.apache.org/jira/browse/SOLR-11815
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Components: replication (java)
> Affects Versions: 7.2
> Environment: Oracle JDK 1.8
>              Ubuntu 16.04
> Reporter: Shaun Sabo
> Assignee: Ishan Chattopadhyaya
>
> I am running a collection with a persistent high volume of writes. When the
> leader goes down and recovers, it joins as a replica and asks the new leader
> for the files to Sync.
> The isIndexStale check is finding that some files
> differ in size and checksum which forces a fullCopy. Since our indexes are
> rather large, a rolling restart is resulting in large amounts of data
> transfer, and in some cases disk space contention issues.
> I do not believe the fullCopy is necessary given the circumstances.
> Repro Steps:
> 1. collection/shard with 1 leader and 1 replica are accepting writes
>    - Pull interval is 30 seconds
>    - Hard Commit interval is 60 seconds
> 2. Replica executes an index pull and completes.
> 3. Leader process Hard Commits (replica index is delayed)
> 4. leader process is killed (SIGTERM)
> 5. Replica takes over as new leader
> 6. New leader applies TLOG since last pull (cores are binary-divergent now)
> 7. Former leader comes back as New Replica
> 8. New replica initiates recovery
>    - Recovery detects that the generation and version are behind and a check
>      is necessary
> 9. isIndexStale() detects that a segment exists on both the New Replica and
>    New Leader but that the size and checksum differ.
>    - This triggers fullCopy to be flagged on
> 10. Entirety of index is pulled regardless of changes
> The majority of files should not have changes, but everything gets pulled
> because of the first file it finds with a mismatched checksum.
> Relevant Code:
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java#L516-L518
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java#L1105-L1126
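
To make the difference between the two behaviours concrete, here is a minimal sketch of an all-or-nothing staleness check next to a per-file comparison. It is only an illustration of the idea in the description, not the IndexFetcher code: the class StaleFileDiff, the FileMeta type, both method names and the file names and values in main() are hypothetical, and it assumes each side can be summarised as a map from file name to size and checksum.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; this is not Solr's IndexFetcher implementation.
final class StaleFileDiff {

    /** Size and checksum of one index file (assumed to be known for both sides). */
    static final class FileMeta {
        final long size;
        final long checksum;

        FileMeta(long size, long checksum) {
            this.size = size;
            this.checksum = checksum;
        }

        /** True only if the other file exists and matches in both size and checksum. */
        boolean sameContentAs(FileMeta other) {
            return other != null && size == other.size && checksum == other.checksum;
        }
    }

    /** All-or-nothing check: true as soon as one file present on both sides differs. */
    static boolean anySharedFileDiffers(Map<String, FileMeta> local,
                                        Map<String, FileMeta> leader) {
        for (Map.Entry<String, FileMeta> e : leader.entrySet()) {
            FileMeta mine = local.get(e.getKey());
            if (mine != null && !mine.sameContentAs(e.getValue())) {
                return true; // a single mismatch is enough to flag the whole index
            }
        }
        return false;
    }

    /** Per-file alternative: leader files that are missing locally or differ. */
    static Set<String> filesToFetch(Map<String, FileMeta> local,
                                    Map<String, FileMeta> leader) {
        Set<String> toFetch = new TreeSet<>();
        for (Map.Entry<String, FileMeta> e : leader.entrySet()) {
            if (!e.getValue().sameContentAs(local.get(e.getKey()))) {
                toFetch.add(e.getKey());
            }
        }
        return toFetch;
    }

    public static void main(String[] args) {
        // Made-up file names, sizes and checksums, purely for demonstration.
        Map<String, FileMeta> local = new HashMap<>();
        local.put("_0.cfs", new FileMeta(100L, 11L));
        local.put("_1.cfs", new FileMeta(200L, 22L));
        local.put("_2.cfs", new FileMeta(50L, 33L));

        Map<String, FileMeta> leader = new HashMap<>();
        leader.put("_0.cfs", new FileMeta(100L, 11L)); // identical on both sides
        leader.put("_1.cfs", new FileMeta(200L, 22L)); // identical on both sides
        leader.put("_2.cfs", new FileMeta(60L, 44L));  // same name, diverged content
        leader.put("_3.cfs", new FileMeta(10L, 55L));  // only on the leader

        System.out.println("any shared file differs: " + anySharedFileDiffers(local, leader));
        System.out.println("files to fetch: " + filesToFetch(local, leader));
    }
}
```

With the sample data, the first check reports a mismatch because of _2.cfs alone, while the per-file comparison would fetch only _2.cfs and _3.cfs and leave the two identical files in place. Whether that is safe in Solr depends on details this sketch ignores, such as how same-named files with different content are replaced on disk.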