[ 
https://issues.apache.org/jira/browse/SOLR-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-9941:
---------------------------
    Attachment: SOLR-9941.patch

bq. In TestRecovery, we kinda mixed up single instance code path along with 
SolrCloud mode code path, so we end up with the fail.

yeah, that was the piece i wasn't clear about -- makes sense.

based on Dat's comments, I re-did my previous idea as a new independent 
{{testNewDBQAndDocMatchingOldDBQDuringLogReplay}} method:

* the "client" only sends regular updates (w/o fake versions) as it would to a 
leader/single-node instance
* confirms that a DBQ which arrives during recovery is processed correctly, 
even if the docs being deleted haven't been added yet as part of recovery
* Also includes a DBQ in the tlog which may (randomly) be applied before/after 
the new DBQ arrives
** just to sanity check that the new delete doesn't muck with the replay of old 
deletes
* the DBQ from the tlog also matches a doc which isn't added until recovery is 
in progress (randomly before/after that DBQ is applied) to verify that doc 
isn't deleted by mistake

...i've added this new test method to the patch to help with the coverage of 
the affected code paths.
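
For context, here's a rough sketch of the shape of that test -- it is *not* the 
exact code in the patch: the ids, field names, and the crash/replay-blocking 
steps are illustrative stand-ins, and only the generic test-framework helpers 
({{assertU}}, {{adoc}}, {{delQ}}, {{commit}}, {{assertJQ}}, {{req}}) are real:

{code:java}
public void testNewDBQAndDocMatchingOldDBQDuringLogReplay() throws Exception {
  // before the simulated crash: uncommitted updates accumulate in the tlog
  assertU(adoc("id", "A1", "val_i", "1"));     // this add will only exist in the tlog
  assertU(delQ("val_i:[100 TO 200]"));         // the "old" DBQ that sits in the tlog

  // ... simulate a crash + restart with log replay blocked, so recovery is in progress ...

  // the "client" now sends plain updates (no fake _version_ values), exactly as
  // it would to a leader / single-node instance:

  // a new DBQ matching a doc whose add hasn't been replayed yet -- it must still
  // delete that doc once replay catches up with the add
  assertU(delQ("val_i:1"));

  // a new doc matching the *old* DBQ from the tlog -- it must NOT be deleted,
  // regardless of whether that DBQ happens to be replayed before or after it
  assertU(adoc("id", "A2", "val_i", "150"));

  // ... unblock replay and wait for it to finish ...
  assertU(commit());

  assertJQ(req("q", "id:A1"), "/response/numFound==0");  // deleted by the new DBQ
  assertJQ(req("q", "id:A2"), "/response/numFound==1");  // survived the old DBQ
}
{code}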

bq. Unless someone has any objections or suggests some modifications, I'd like 
to commit this after the 6.4 branch is cut (or alternatively, commit this to 
master and wait for backporting after the 6.4 branch is cut).

The sooner it gets committed to master, the sooner jenkins starts helping us do 
randomized testing -- since Dat thinks the change makes sense and there are no 
objections, I would encourage committing to master early and waiting to backport.



> log replay redundantly (pre-)applies DBQs as if they were out of order
> ----------------------------------------------------------------------
>
>                 Key: SOLR-9941
>                 URL: https://issues.apache.org/jira/browse/SOLR-9941
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>         Attachments: SOLR-9941.hoss-test-experiment.patch, SOLR-9941.patch, 
> SOLR-9941.patch, SOLR-9941.patch, SOLR-9941.patch, SOLR-9941.patch
>
>
> There's kind of an odd situation that arises when a Solr node starts up 
> (after a crash) and tries to recover from its tlog that causes deletes to be 
> redundantly & excessively applied -- at a minimum it causes really confusing 
> log messages....
> * {{UpdateLog.init(...)}} creates {{TransactionLog}} instances for the most 
> recent log files found (based on numRecordsToKeep) and then builds a 
> {{RecentUpdates}} instance from them
> * Delete entries from the {{RecentUpdates}} are used to populate 2 lists:
> ** {{deleteByQueries}}
> ** {{oldDeletes}} (for deleteById).
> * Then when {{UpdateLog.recoverFromLog}} is called, a {{LogReplayer}} is used 
> to replay any (uncommitted) {{TransactionLog}} entries
> ** during replay {{UpdateLog}} delegates to the UpdateRequestProcessorChain 
> for the various adds/deletes, etc...
> ** when an add makes it to {{RunUpdateProcessor}} it delegates to 
> {{DirectUpdateHandler2}}, which (independent of the fact that we're in log 
> replay) calls {{UpdateLog.getDBQNewer}} for every add, looking for any 
> "Reordered" deletes that have a version greater then the add
> *** if it finds _any_ DBQs "newer" then the document being added, it does a 
> low level {{IndexWriter.updateDocument}} and then immediately executes _all_ 
> the newer DBQs ... _once per add_
> ** these deletes are *also* still executed as part of the normal tlog replay, 
> because they are in the tlog.
> Which means if you are recovering from a tlog with 90 addDocs, followed by 5 
> DBQs, then *each* of those 5 DBQs will be executed 91 times (once per add, 
> plus once during the normal replay) -- and for 90 of those executions, a DUH2 
> INFO log message will say {{"Reordered DBQs detected. ..."}} even though the 
> only reason they are out of order is because Solr is deliberately applying 
> them out of order.
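> A heavily simplified sketch of that per-add behavior (illustrative pseudocode 
> only -- not the actual {{DirectUpdateHandler2}} / {{UpdateLog}} code):
> {code:java}
> // illustrative pseudocode of the replay-time add path, not the real DUH2 code;
> // only getDBQNewer / updateDocument correspond to the real methods named above
> void addDuringLogReplay(AddUpdateCommand cmd) throws IOException {
>   List<DBQ> newerDBQs = ulog.getDBQNewer(cmd.getVersion()); // checked for EVERY add
>   if (newerDBQs != null) {
>     // the "reordered" path: add the doc, then immediately re-run every newer DBQ
>     log.info("Reordered DBQs detected. ...");
>     writer.updateDocument(idTerm, cmd.getLuceneDocument());
>     for (DBQ dbq : newerDBQs) {
>       deleteByQuery(dbq.q);   // executed once per add that takes this path...
>     }
>   } else {
>     writer.updateDocument(idTerm, cmd.getLuceneDocument());
>   }
>   // ...and each DBQ runs once more when the replayer reaches its own tlog entry,
>   // so a tlog of 90 adds + 5 DBQs executes each DBQ 91 times during recovery.
> }
> {code}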
> * At a minimum we should improve the log messages
> * Ideally we should stop (pre-emptively) applying these deletes during tlog 
> replay.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
