[ https://issues.apache.org/jira/browse/SOLR-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150469#comment-16150469 ]
Amrit Sarkar commented on SOLR-11278: ------------------------------------- Alright! There are two issues in the troubled CDCR test : 1. Why bootstrap is called two seperate times in single life cycle -> first time bootstrap is called when we enable CDCR, and 1000 documents are boostrapped to target -> the second one is when bootstrapping is happening in background and new updates are getting shoved down to source. Once the first boostrap is successfully completed, target sees (I speculate) it is much behind and issues bootstrap again, which I don't really understand and putting extra logging doesn't really helps too. I am seeing more. OR all this is wrong, and boostrap is happening just one time in a weird manner. -> as above listed, the first bootstrap and indexing into source cluster happens simulteneouly at one point. How it is suppose to behave that time. according to code: CdcrRequestHandler:BootstrapCallable:call():: {code} ........ @Override public Boolean call() throws Exception { boolean success = false; Throwable exception = null; UpdateLog ulog = core.getUpdateHandler().getUpdateLog(); // we start buffering updates as a safeguard however we do not expect // to receive any updates from the source during bootstrap ulog.bufferUpdates(); try { commitOnLeader(masterUrl); ........ {code} 2. Why bootstrap issues at the second time (if it is) fails miserably -> when the bootstrap is issued, it calls boostrapCallable to apply buffer updates and if "fetchIndex is not successful OR boostrapCallable is closed", drop those updates from UpdateLog in {{finally}} block -> in this case, when bootstrap gets issued :: it checks whether it can get the LOCK :: {code} private void handleBootstrapAction(SolrQueryRequest req, SolrQueryResponse rsp) throws IOException, SolrServerException { ....................... Runnable runnable = () -> { Lock recoveryLock = req.getCore().getSolrCoreState().getRecoveryLock(); boolean locked = recoveryLock.tryLock(); SolrCoreState coreState = core.getSolrCoreState(); try { if (!locked) { log.info("we reached this point :: CANCEL BOOTSTRAP, locked :: " + locked); handleCancelBootstrap(req, rsp); } {code} and if not, it issues CANCEL BOOTSTRAP, which happend exactly in this case. CANCEL_BOOSTRAP is issued and applied QUIETLY. -> in the background within the bootstrap, applyBufferUpdates, in ulog, {{tlog}} is mysteriously NULL, suggesting no updates were recieved, umm why? and return bootstrapFuture as null and set the STATE of replication as ACTIVE. according to code :: UpdatesLog.java :: {code} /** Returns the Future to wait on, or null if no replay was needed */ public Future<RecoveryInfo> applyBufferedUpdates() { // recovery trips this assert under some race - even when // it checks the state first // assert state == State.BUFFERING; // block all updates to eliminate race conditions // reading state and acting on it in the update processor versionInfo.blockUpdates(); try { cancelApplyBufferUpdate = false; log.info("applyBufferedUpdates :: state :: start :: " + state); if (state != State.BUFFERING){ return null; } operationFlags &= ~FLAG_GAP; // handle case when no log was even created because no updates // were received. if (tlog == null) { log.info("applyBufferedUpdates :: state :: middle 1 :: " + state); state = State.ACTIVE; return null; } log.info("applyBufferedUpdates :: state :: middle 2 :: " + state); {code} -> now in finally block, the fetchIndex is successfull but "closed" is true (DUE TO CANCEL BOOTSTRAP apprently) which is initialised to false:: it tries to drop the buffer updates from ulog where :: {code} /** Returns true if we were able to drop buffered updates and return to the ACTIVE state */ public boolean dropBufferedUpdates() { versionInfo.blockUpdates(); try { log.info("dropBufferedUpdates :: state :: start " + state); if (state != State.BUFFERING){ return false; } ....... {code} STATE is ACTIVE and returns {{false}}. faling the "assert dropped" in BootstrapCallable and eventually emitting: "bootstrap operation failed". Once bootstrap fails, in CdcrBootsrapTest::waitTargetToSync :: it waits for 120 seconds, and recieves constant numFound from target, == 1000 / 1001 / 1100, any number basically before boostrap failed and eventually Assertion at CdcrBootsrapTest fails. There are still some unanswered questions like: 1. Why in the first place two boostrap command are issued OR we think we two are issued and it one wrapped up in an awkward life time. Why the conventional forwarding not happening? 2. When it tries to drop the buffer updates, why the tlog is null? It has recieved updates from source, but why it is NULL in target cluster. Is there a synchrnization issue? so the matter of the question is this lock :: {code} Lock recoveryLock = req.getCore().getSolrCoreState().getRecoveryLock(); boolean locked = recoveryLock.tryLock(); {code} More insights will be deeply appreciated. I believe we are very close. > CdcrBootstrapTest failing in branch_6_6 > --------------------------------------- > > Key: SOLR-11278 > URL: https://issues.apache.org/jira/browse/SOLR-11278 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: CDCR > Reporter: Amrit Sarkar > Assignee: Varun Thacker > Attachments: SOLR-11278-cancel-bootstrap-on-stop.patch, > SOLR-11278.patch, test_results > > > I ran beast for 10 rounds: > ant beast -Dtestcase=CdcrBootstrapTest -Dtests.multiplier=2 -Dtests.slow=true > -Dtests.locale=vi -Dtests.timezone=Asia/Yekaterinburg -Dtests.asserts=true > -Dtests.file.encoding=US-ASCII -Dbeast.iters=10 > and seeing following failure: > {code} > [beaster] [01:37:16.282] FAILURE 153s | > CdcrBootstrapTest.testBootstrapWithSourceCluster <<< > [beaster] > Throwable #1: java.lang.AssertionError: Document mismatch on > target after sync expected:<2000> but was:<1000> > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org