[ 
https://issues.apache.org/jira/browse/SOLR-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150469#comment-16150469
 ] 

Amrit Sarkar commented on SOLR-11278:
-------------------------------------

Alright!

There are two issues in the troubled CDCR test :

1. Why bootstrap is called two seperate times in single life cycle

-> first time bootstrap is called when we enable CDCR, and 1000 documents are 
boostrapped to target

-> the second one is when bootstrapping is happening in background and new 
updates are getting shoved down to source. Once the first boostrap is 
successfully completed, target sees (I speculate) it is much behind and issues 
bootstrap again, which I don't really understand and putting extra logging 
doesn't really helps too. I am seeing more. OR all this is wrong, and boostrap 
is happening just one time in a weird manner.

-> as above listed, the first bootstrap and indexing into source cluster 
happens simulteneouly at one point. How it is suppose to behave that time.

according to code: CdcrRequestHandler:BootstrapCallable:call()::

{code}
........
 @Override
    public Boolean call() throws Exception {
      boolean success = false;
      Throwable exception = null;
      UpdateLog ulog = core.getUpdateHandler().getUpdateLog();
      // we start buffering updates as a safeguard however we do not expect
      // to receive any updates from the source during bootstrap
      ulog.bufferUpdates();
      try {
        commitOnLeader(masterUrl);
........
{code}

2. Why bootstrap issues at the second time (if it is) fails miserably

-> when the bootstrap is issued, it calls boostrapCallable to apply buffer 
updates and if "fetchIndex is not successful OR boostrapCallable is closed", 
drop those updates from UpdateLog in {{finally}} block

-> in this case, when bootstrap gets issued :: it checks whether it can get the 
LOCK ::

{code}
private void handleBootstrapAction(SolrQueryRequest req, SolrQueryResponse rsp) 
throws IOException, SolrServerException {
    .......................

    Runnable runnable = () -> {
      Lock recoveryLock = req.getCore().getSolrCoreState().getRecoveryLock();
      boolean locked = recoveryLock.tryLock();
      SolrCoreState coreState = core.getSolrCoreState();
      try {
        if (!locked)  {
          log.info("we reached this point :: CANCEL BOOTSTRAP, locked :: " + 
locked);
          handleCancelBootstrap(req, rsp);
        } 
{code}

and if not, it issues CANCEL BOOTSTRAP, which happend exactly in this case. 
CANCEL_BOOSTRAP is issued and applied QUIETLY.

->  in the background within the bootstrap,  applyBufferUpdates, in ulog, 
{{tlog}} is mysteriously NULL, suggesting no updates were recieved, umm why? 
and return bootstrapFuture as null and set the STATE of replication as ACTIVE.

according to code :: UpdatesLog.java ::

{code}
/** Returns the Future to wait on, or null if no replay was needed */
  public Future<RecoveryInfo> applyBufferedUpdates() {
    // recovery trips this assert under some race - even when
    // it checks the state first
    // assert state == State.BUFFERING;

    // block all updates to eliminate race conditions
    // reading state and acting on it in the update processor
    versionInfo.blockUpdates();
    try {
      cancelApplyBufferUpdate = false;
      log.info("applyBufferedUpdates :: state :: start :: " + state);
      if (state != State.BUFFERING){
        return null;
      }
      operationFlags &= ~FLAG_GAP;

      // handle case when no log was even created because no updates
      // were received.
      if (tlog == null) {
        log.info("applyBufferedUpdates :: state :: middle 1 :: " + state);
        state = State.ACTIVE;
        return null;
      }
      log.info("applyBufferedUpdates :: state :: middle 2 :: " + state);
{code}

-> now in finally block, the fetchIndex is successfull but "closed" is true 
(DUE TO CANCEL BOOTSTRAP apprently) which is initialised to false:: it tries to 
drop the buffer updates from ulog where ::

{code}
 /** Returns true if we were able to drop buffered updates and return to the 
ACTIVE state */
  public boolean dropBufferedUpdates() {
    versionInfo.blockUpdates();
    try {
      log.info("dropBufferedUpdates :: state :: start " + state);
      if (state != State.BUFFERING){
        return false;
      }
.......
{code}

STATE is ACTIVE and returns {{false}}.

faling the "assert dropped" in BootstrapCallable and eventually emitting: 
"bootstrap operation failed".

Once bootstrap fails, in CdcrBootsrapTest::waitTargetToSync :: it waits for 120 
seconds, and recieves constant numFound from target, == 1000 / 1001 / 1100, any 
number basically before boostrap failed and eventually Assertion at 
CdcrBootsrapTest fails.

There are still some unanswered questions like:

1. Why in the first place two boostrap command are issued OR we think we two 
are issued and it one wrapped up in an awkward life time. Why the conventional 
forwarding not happening?
2. When it tries to drop the buffer updates, why the tlog is null? It has 
recieved updates from source, but why it is NULL in target cluster. Is there a 
synchrnization issue?

so the matter of the question is this lock ::

{code}
Lock recoveryLock = req.getCore().getSolrCoreState().getRecoveryLock();
      boolean locked = recoveryLock.tryLock();
{code}

More insights will be deeply appreciated. I believe we are very close.

> CdcrBootstrapTest failing in branch_6_6
> ---------------------------------------
>
>                 Key: SOLR-11278
>                 URL: https://issues.apache.org/jira/browse/SOLR-11278
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: CDCR
>            Reporter: Amrit Sarkar
>            Assignee: Varun Thacker
>         Attachments: SOLR-11278-cancel-bootstrap-on-stop.patch, 
> SOLR-11278.patch, test_results
>
>
> I ran beast for 10 rounds:
> ant beast -Dtestcase=CdcrBootstrapTest -Dtests.multiplier=2 -Dtests.slow=true 
> -Dtests.locale=vi -Dtests.timezone=Asia/Yekaterinburg -Dtests.asserts=true 
> -Dtests.file.encoding=US-ASCII -Dbeast.iters=10
> and seeing following failure:
> {code}
>   [beaster] [01:37:16.282] FAILURE  153s | 
> CdcrBootstrapTest.testBootstrapWithSourceCluster <<<
>   [beaster]    > Throwable #1: java.lang.AssertionError: Document mismatch on 
> target after sync expected:<2000> but was:<1000>
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to