Hoss Man updated SOLR-5944:
    Attachment: SOLR-5944.patch

Ok -- i've been reviewing Ishan's latest patch.

In a conversation I had with him offline, he mentioned that he hadn't been 
doing any further work on this locally since his last update, so w/o any risk 
of collision I went ahead and made a bunch of small updates directly...

{panel:title=Small changes I made directly in latest patch}
* misc javadoc tweaks & improvements

* some logging cleanup & template improvements
** Any time we log something that includes a complex object (that we expect to be 
toString()-ified), we should _definitely_ be using templated log messages, to 
avoid potentially expensive calls to toString() when the log level makes the 
logging method a no-op.
** I probably missed some, but i tried to catch as many as i could

* some simple method renames:
** UpdateLog.addUpdateCommandFromTlog -> convertTlogEntryToAddUpdateCommand
*** older name was ambiguous because of the dual meaning of "add"; now the verb is explicit
** AtomicUpdateDocumentMerger.getSearcherNonStoredDVs -> 

* refactored some duplicate code into new helper methods:
** DirectUpdateHandler2.updateDocOrDocValues

* refactored some loops where methods were being called unnecessarily in every iteration
** DocumentBuilder.toDocument
** AtomicUpdateDocumentMerger.isInPlaceUpdate (IndexSchema.getUniqueKeyField 
and IndexWriter.getFieldNames())

* added some comments containing important details/reasoning that had only been 
captured in jira comments so far
** AtomicUpdateDocumentMerger.doInPlaceUpdateMerge

* increased usage of expectThrows, beefed up asserts related to expected exceptions

* added some nocommit comments regarding some questionable lines of code 
(either new questions, or things I've provided feedback on before that keep 
slipping through the cracks)

* updated to master/98f75723f3bc6a718f1a7b47a50b820c4fb408f6
{panel}
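To make the templated-log-message point above concrete: with a parameterized call (eg: slf4j's {{log.debug("doc={}", doc)}}), the argument's toString() is only ever invoked if the level is enabled. Here's a rough self-contained sketch with a made-up stand-in logger (none of these classes are real Solr/slf4j classes):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for an object whose toString() is expensive (eg: a SolrInputDocument).
class ExpensiveToString {
  static final AtomicInteger CALLS = new AtomicInteger();
  @Override public String toString() {
    CALLS.incrementAndGet();
    return "expensive";
  }
}

// Minimal stand-in logger showing why templated messages matter.
class TemplateLogger {
  final boolean debugEnabled;
  TemplateLogger(boolean debugEnabled) { this.debugEnabled = debugEnabled; }

  // BAD: the caller concatenates, forcing toString() even when debug is off.
  void debugConcat(String msg) {
    if (debugEnabled) System.out.println(msg);
  }

  // GOOD: formatting (and thus toString()) only happens if the level is enabled.
  void debug(String template, Object arg) {
    if (debugEnabled) System.out.println(template.replace("{}", arg.toString()));
  }
}
```

With debug disabled, the concatenating call still pays for toString(), the templated call never does.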


More in-depth/concrete responses/comments to the previous patch are listed 
below -- anything mentioned below that still seems problematic to me should 
also have nocommit comments in the code associated with them, but i wanted to 
elaborate a bit here in the jira comments.

(The reverse is not necessarily true however: I added nocommits for some small 
things that I didn't bother to add jira comments about.)

FWIW: I was initially concerned that changing all these _existing_ dynamicFields 
from indexed=true to indexed=false might be weakening other tests, but I did a 
quick grep for their usage in other test methods (prior to this jira) and the 
only impacts i found were in AtomicUpdatesTest, where the only concern was 
docValues=true and stored=false.  So changing these to indexed=false doesn't 
impact that test at all (they are never queried against)

Based on the "dvo" naming (which i think a reasonable person would read as 
"doc values only") making these fields indexed=false seems like a good idea.



* In general, i think there's a gap in the DUP logic related to how the results 
of fetchFullUpdateFromLeader are used when replicas never get dependent updates
** see detailed nocommit comments in the body of fetchFullUpdateFromLeader for details
*** they apply to both call stacks where the method is used, but i wanted to write it once in one place

* versionAdd
** {{if (fetchedFromLeader == null)}}
*** I'm a little concerned/confused by the "else" block of this condition
**** grep code for "nocommit: OUTER IF", "nocommit: OUTER ELSE" & "nocommit: 
*** at this point, we've replaced our "in-place" update with a regular update 
using the full SolrInputDocument from the leader (by overwriting cmd.solrDoc, 
cmd.prevVersion, cmd.setVersion(...))
*** but this if/else is itself wrapped in a bigger {{if 
(cmd.isInPlaceUpdate())}} whose else clause does some sanity checking / 
processing logic for "// non inplace update, i.e. full document update"
*** shouldn't the logic in that outer else clause also be applied to our "no 
longer an in-place" update?

* fetchFullUpdateFromLeader
** this method still calls {{forceUpdateCollection}} ... note shalin's comments 
back in August...{quote}
Under no circumstances should we be calling `forceUpdateCollection` in the 
indexing code path. It is just too dangerous in the face of high indexing 
rates. Instead we should do what the DUP is already doing i.e. calling 
getLeaderRetry to figure out the current leader. If the current replica is 
partitioned from the leader then we have other mechanisms for taking care of it 
and the replica has no business trying to determine this.{quote}



* I previously asked about LUCENE-7344...{quote}
* so what's our Solr solution / workaround?  do we have any?
{color:green}Our workaround is to reopen an RT searcher during every DBQ (in 
the DUH2), as per Yonik's suggestion.{color}{quote}
** I gather, reading this patch and knowing the context of this issue, that the 
DUH2 line i added a "nocommit: LUCENE-7344" to is the line in question?
** in general, any code that exists purely to work around some other bug should 
have a big, bold, loud comment explaining the purpose for its existence and a 
link to the known bug
** specifically: the fact that {{grep LUCENE-7344 SOLR-5944.patch}} didn't 
match *ANY* lines in the last patch was a big red flag to me
*** the code should have a comment making it clear where/how/why/what the work 
around is
*** the test(s) written to ensure the workaround works should also have 
comments referring to the issue at hand (offhand, i personally couldn't identify 
which tests those are)



* addField
** previous feedback/response...{quote}
* if {{true==forInPlaceUpdate}} and {{createFields(...)}} returns anything that 
is *not* a NumericDocValuesField (or returns more than one) shouldn't that trip 
an assert or something? (ie: doesn't that mean this SchemaField isn't valid for 
using with an in place update, and the code shouldn't have gotten this far?)
** {color:green}This is fine, since createFields() generates both NDV and non 
NDV fields for an indexable field, and the intent is to drop the non NDV one. 
Added a comment to this effect{color}{quote}
** I couldn't make sense of this until i realized what you're describing is 
evidently a bug in TrieField: SOLR-9809
** I've updated addField with a comment to make it clear we're working around 
that bug
** unless i'm misunderstanding something, fixing that bug could theoretically 
allow us to throw away most of the new DocumentBuilder logic in this patch 
(that's conditional on forInPlaceUpdate)
*** notable exception: optimizing away the default field construction
*** also: leaving the new code in will be handy for the purposes of 
asserting that we only have the expected NumericDocValuesField instances



* applyPartialUpdates
** in reply to some previous questions/feedback...{quote}
{color:green}...It is my understanding that due to the incref/decref being 
inside the synchronized(ulog) block, no other thread's action would affect 
whatever happens to these tlogs once we've entered the block....{color}{quote}
*** the problem in previous patches was that the incref/decref were *NOT* 
inside the sync block -- which is why i asked the questions.  In this patch you 
moved the {{getEntryFromTLog}} call inside the same sync block where the 
{{List<TransactionLog>}} is created, so _now_ as {{getEntryFromTLog}} loops 
over the TransactionLog, we're safe -- but that wasn't happening before, hence 
the question.
**** I _think_ the way the code is written right now is safe/valid, but i'd feel 
better about it if we could get some more eyeballs to review the sync usage.
*** Even if it is currently correct/valid, I still think it would be good to 
make the {{getEntryFromTLog}} method itself declared as {{synchronized}} -- it 
should have no impact on performance/correctness since the only existing usage 
is the {{synchronized (this)}} block, but it will protect us against future dev 
mistakes since the _only_ valid usage of that method is when you have a lock on 
{{this}} (otherwise some other thread might be decrefing/closing the 
TransactionLog instances used in that method).
** some (different) feedback about this method from before...{quote}
"in general i'm wondering if lookupLogs should be created outside of the while 
loop, so that there is a consistent set of "logs" for the duration of the 
method call ... what happens right now if some other thread changes 
tlog/prevMapLog/prevMapLog2 in between iterations of the while loop?" 
{color:green}<-- I added them inside the while block so that if there have been 
multiple commits during the course of iterating the while loop, those tlogs 
wouldn't have grown stale. Do you think we should pull it out outside the while 
loop?{color}{quote}
*** The concern i have is that AFAICT rebuilding the {{Arrays.asList(tlog, 
prevMapLog, prevMapLog2)}} in every iteration of the while loop seems to be at 
cross purposes with the reason why the while loop might have more than one 
iteration
*** as you said: if commits happen in other threads while the code in the loop 
is evaluating an entry to apply partial updates, then the _next_ iteration of 
the loop (if any) will get the updated tlog with new data
**** but the _reason_ there might be a "next" iteration of the loop is if the 
{{prevPointer}} of our current {{IN_PLACE}} update identified an earlier 
(ancestor) {{IN_PLACE}} update -- having _newer_ commits on the "next" 
iteration of the loop doesn't help us. (does it?)
*** in other words: the purpose of the while loop is to be able to iterate 
backwards through time in the tlogs when many {{IN_PLACE}} updates are applied 
on top of each other, while the purpose you listed for defining {{lookupLogs}} 
inside the loop is to support moving forward in time as new tlogs are created 
-- which actually _decreases_ the likelihood that we'll find some ancestor 
entry we have a cascading dependency on.
*** does that make sense?
**** Am i misunderstanding/missing some possible benefit of having the "newer" 
tlogs available as the while loop iterates back in time over older and older 
entries?
** assuming i'm correct about both of the above points...
*** {{List<TransactionLog> lookupLogs}} should be created once, outside of any 
looping, and never redefined to ensure that even if subsequent commits occur in 
concurrent threads, we can still trace the chain of dependent updates back as 
far as possible using the oldest known tlogs available to us when the method 
was called.
*** {{List<TransactionLog> lookupLogs}} and any calls to {{getEntryFromTLog}} 
that use that List *MUST* happen in the same sync block.
*** ergo the sync block must wrap the while loop ... ie: we might as well 
declare the entire method synchronized.
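To make that locking discipline concrete, here's a rough self-contained sketch of the shape being argued for (the class and method names loosely mirror UpdateLog's, but these are made-up simplified stand-ins, not the real Solr classes): {{lookupLogs}} built once outside the loop, and every touch of it under the same monitor, so no other thread can decref/close a log between building the list and reading an entry from it.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for a ref-counted transaction log.
class TLog {
  private int refCount = 1;
  private boolean closed;
  synchronized void incref() { refCount++; }
  synchronized void decref() { if (--refCount == 0) closed = true; }
  synchronized Object lookup(long pointer) {
    if (closed) throw new IllegalStateException("log already closed");
    return pointer >= 0 ? "entry@" + pointer : null;
  }
}

class MiniUpdateLog {
  private final TLog tlog = new TLog();
  private final TLog prevMapLog = new TLog();
  private final TLog prevMapLog2 = new TLog();

  // synchronized: the only valid usage is while holding the lock on `this`,
  // otherwise another thread may decref/close a TLog mid-iteration
  synchronized Object getEntryFromTLog(long pointer, List<TLog> lookupLogs) {
    for (TLog log : lookupLogs) {
      log.incref();
      try {
        Object entry = log.lookup(pointer);
        if (entry != null) return entry;
      } finally {
        log.decref();
      }
    }
    return null;
  }

  Object applyPartialUpdates(long startPointer) {
    synchronized (this) {
      // built once, outside any loop: later commits in other threads can't
      // swap in newer tlogs and hide the older entries we're chasing
      List<TLog> lookupLogs = Arrays.asList(tlog, prevMapLog, prevMapLog2);
      long pointer = startPointer;
      Object last = null;
      while (pointer >= 0) {        // walk the prevPointer chain back in time
        last = getEntryFromTLog(pointer, lookupLogs);
        pointer -= 1;               // stand-in for following the entry's prevPointer
      }
      return last;
    }
  }
}
```

Since the sync block has to wrap the whole while loop anyway, declaring the method synchronized costs nothing extra.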



* isInPlaceUpdate
** regarding some previous feedback...{quote}
couldn't this list of fields be created by the caller and re-used at least for 
the entire request (ie: when adding multiple docs) ? {color:green}The set 
returned is precomputed upon the opening of a searcher. The only cost I see is 
to create a new unmodifiableSet every time. I'd prefer to take up this 
optimization later, if needed.{color}{quote}
*** the method you're talking about isn't even used in this patch -- as you 
noted in the comment you made immediately after that one, you changed it to use 
{{IndexWriter.getFieldNames()}} instead.
*** {{IndexWriter.getFieldNames()}} delegates to a synchronized method, which 
means there is definitely overhead in calling it over and over for every field 
name in every document in the request.
*** even if you don't think it's worth refactoring this method so that 
{{IndexWriter.getFieldNames()}} only needs to be called once per 
SolrQueryRequest, it seems irresponsible not to at least call it *once* in this 
method (ie: once per doc) before looping over all the fields ... so i made that 
change as well.
** regarding some other previous feedback...{quote}
* {{if (indexFields.contains(fieldName) == false && 
** why does it matter one way or the other if it's a dynamicField? 
{color:green}Changed the logic to check in the IW for presence of field. Added 
a comment: "// if dynamic field and this field doesn't exist, DV update can't 
work"{color}{quote}
*** that doesn't really answer my question at all: *why* does it matter if it's 
a dynamic field?
*** Doesn't the same problematic situation happen if it's an explicitly defined 
{{<field .. />}} in the schema.xml, but just doesn't happen to exist yet in any 
docs? -- ie: if the very first update to the index is: ...
*** even if i'm wrong: consider this code (and the comment you added) from the 
point of view of someone completely unfamiliar with this issue.  if they come 
along, and see this code jumping through some hoops related to dynamicFields, 
and the only comment is (essentially) "doesn't work for dynamic fields" that 
comment isn't going to help them make any sense of the code -- it would be like 
if some code had a comment that said {{// set the value of the variable 
will_not_work to true}}, it doesn't explain _why_ that line of code exists. 
**** I've been working on this issue with you for months, I'm intimately 
familiar with this patch, but reading this code now I can't recall any 
particular reason *why* "DV update can't work" specifically in the case of 
"*dynamic field* _and this field doesn't exist_"
**** i thought the only known problem like this was just "this field doesn't 
yet exist in the index" in general? (and that we had a check for that problem 
at a higher level than this method)
*** if i am wrong, and there is an important distinction between dynamicFields 
and "static" fields when dealing with the situation of in-place updates being 
attempted against an index that doesn't yet contain them, then that makes me 
very concerned that most of the existing testing depends solely on 
dynamicFields (or in some cases solely on static fields)
**** an easy solution would be to replace {{schema-inplace-updates.xml}} with 2 
variants: one that _only_ has the {{<dynamicField/>}} declarations needed to 
match all the field names used in the tests, and one that has an explicit 
{{<field />}} declaration for every field name used in the tests.  and then all 
the tests could randomize between {{schema-inplace-updates-dynamicfields.xml}} 
and {{schema-inplace-updates-staticfields.xml}} in their {{@BeforeClass}} methods
** explicitly throwing {{new RuntimeException}} is a big red flag.
*** if we're going to catch {{IOException}} and wrap it in a 
{{RuntimeException}} we should at least use {{SolrException}} instead and use a 
message that provides context
*** since the only caller of this method ({{DUP.getUpdatedDocument}}) already 
declares that it throws {{IOException}}, why not just let this method propagate 
the {{IOException}} as is?
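On the "call it once per doc before looping" point: here's a rough self-contained sketch of the difference (the writer class is a made-up stand-in; the real {{IndexWriter.getFieldNames()}} delegates to a synchronized method, which is what the counter is simulating):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for IndexWriter: getFieldNames() is synchronized, so every call
// pays locking overhead -- the counter makes the call count visible.
class MockWriter {
  final AtomicInteger calls = new AtomicInteger();
  synchronized Set<String> getFieldNames() {
    calls.incrementAndGet();
    return new HashSet<>(Arrays.asList("id", "inplace_updatable_float"));
  }
}

class InPlaceCheck {
  // BAD: one synchronized call per field name in the document
  static boolean perField(MockWriter w, List<String> fields) {
    for (String f : fields) {
      if (!w.getFieldNames().contains(f)) return false;
    }
    return true;
  }

  // BETTER: fetch once per document (once per SolrQueryRequest would be better still)
  static boolean hoisted(MockWriter w, List<String> fields) {
    Set<String> names = w.getFieldNames();
    for (String f : fields) {
      if (!names.contains(f)) return false;
    }
    return true;
  }
}
```

For a doc with N candidate fields, the first shape takes the lock N times, the second takes it once.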

* doInPlaceUpdateMerge
** a previous question...{quote}
why are we copying all supported DVs over? {color:green}If we update dv1 in one 
update, and then update dv2, and again update dv1 (without commits in between), 
the last update would fetch from the tlog the partial doc for dv2 update. If 
that doc doesn't copy over the previous updates to dv1, then a full resolution 
(by following previous pointers) would need to be done to calculate the dv1 
value. Hence, I decided to copy over older DV updates.{color}{quote}
*** that makes perfect sense -- that explanation should be part of the code 
comment so a future dev doesn't have the same mistaken impression i did that it 
was copying DVs that aren't needed -- so i added it to the code as a new java comment



* in general there was a lot of unnecessary stuff in this schema copy/pasted 
from schema-minimal-atomic-stress.xml
** even some comments that referred to test classes that were using 
schema-minimal-atomic-stress.xml and had nothing to do with this config
* my previous suggestion was "add a new, small, purpose built & well commented 
schema-inplace-atomic-updates.xml file just for the new InPlace tests – 
containing only the fields/dynamicFields needed" ... it's not "small and 
purpose built" if it's got useless stuff copy/pasted from other files.
** leaving unused fields/types here defeats the entire point of my suggestion: 
"having smaller single purpose test configs makes it a lot easier for people 
reviewing tests to understand what the minimum expectations are for the test to 
pass"
** i went ahead and cleaned up everything that wasn't needed for the new tests, 
and added a few comments as appropriate to make it clear to future readers why 
some stuff is the way it is.

* regarding the copyField testing & schema comment you had...
** i had to re-read the schema and the test involved at least 10 times before i 
could make sense of it
** the crux of my confusion was that in the second copyField:
*** the src field named {{copyfield_not_updateable_src}} *did* support in place 
updates (in spite of its name)
*** the dest field named {{copyfield_updatable_target_i}} *did not* support in 
place updates (in spite of its name)
** i renamed the fields used by the test to try and be more clear what was 
going on


* previous feedback...{quote}
* why is {{// assert that in-place update is retained}} at the end of the test, 
instead of when the (first) log replay happens and \*:\* -> numFound=3 is 
asserted? (ie: before "A2" is added) {color:green}FIXED{color}
* what about mixing in-place updates with deletes (both by id and query?) 
{color:green}FIXED{color}{quote}
* neither of those responses made any sense to me.
** RE: {{// assert that in-place update is retained}}
*** the latest patch included an additional copy of the same assert -- but it 
was being checked after several other updates
*** I couldn't see any reason why that same assert (that the in-place update is 
retained) shouldn't happen as soon as recovery is finished, before doc "A2" was 
added
**** So I added this -- and everything seems to be working fine.
** RE: mixing in-place updates with deletes
*** i don't understand how this is "FIXED" ... there have been no additions to 
this test since the last patch that relate to deletes at all ???
*** as far as i can tell, there are still no tests actually verifying that log 
replay will do the correct thing when deletes are intermixed with in-place 
updates, for example:
**** DBQs against a field that had in-place updates
**** a delete by id of a document in between two "atomic updates" of that 
document (to ensure the replay recognizes that the second update can't be done 
in-place)
**** etc...



* lifecycle of requests
** in my last feedback, i mentioned that there are a lot of 
LocalSolrQueryRequest that were created and never closed, and said...{quote}
the convenience methods that build up UpdateCommands (ulogCommit & 
getDeleteUpdate & getAddUpdate) should be refactored to take in a 
SolrQueryRequest which should be closed when the logical bit of code related to 
testing that UpdateCommand is finished. something like... {code}
try (SolrQueryRequest req = req()) {
  ulog.add(getAddUpdate(req, ...));
  // TODO: anything else that needs to be part of this same request for testing
}
// TODO ... assert stuff about ulog
{code}{quote}
** in your updated patch, the SolrQueryRequest's are getting auto-closed -- but 
that's happening in the static helper methods (ie: {{getAddUpdate}}, 
{{getDeleteUpdate}}) that then return an UpdateCommand which still references & 
uses the SolrRequest.  So the lifecycle of the request is still violated, just 
in a diff way
*** now instead of not closing them, we're closing them before we're done using them
** any logic that's being tested that would normally be part of the request -- 
like adding an UpdateCommand to the ulog -- needs to happen before the 
SolrQueryRequest is closed, or the test isn't valid -- it could be hiding bugs 
that exist in the normal lifecycle (ie: related to re-opened searchers, etc...)
** based on the current code, the easiest fix may be...
*** rename {{getAddUpdate}}, {{getDeleteUpdate}} --> {{applyAddUpdate}}, 
{{applyDeleteUpdate}}
*** make both methods take in an {{UpdateLog}} param and call the appropriate 
{{ulog.foo(...)}} method inside their try blocks
*** change {{applyDeleteUpdate}} to return void
*** change {{applyAddUpdate}} to return {{cmd.getIndexedId()}} so it can be 
used in subsequent {{ulog.lookup(...)}} calls for assertions about the state of 
the ulog
** but i'm not certain, because i think these static methods are also used in 
other tests?
*** so perhaps my original suggestion is still the only valid one: caller needs 
to pass in the SolrQueryRequest ?
** grep test for "nocommit: req lifecycle bug" 
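Here's a rough self-contained sketch of the lifecycle rule being argued for -- everything that is logically "part of the request" happens inside the try-with-resources, and only the id (not anything still holding the request) escapes for later assertions. All of these classes are made-up stand-ins, not the real Solr types:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for SolrQueryRequest: work that is logically "part of the request"
// must happen before close(), or the test exercises an invalid lifecycle.
class MockRequest implements AutoCloseable {
  boolean closed;
  void use() {
    if (closed) throw new IllegalStateException("request used after close()");
  }
  @Override public void close() { closed = true; }
}

// Stand-in for the UpdateLog.
class MockUlog {
  final List<String> ids = new ArrayList<>();
  void add(MockRequest req, String id) {
    req.use();           // the ulog add reads from the request
    ids.add(id);
  }
  boolean lookup(String id) { return ids.contains(id); }
}

class LifecycleHelpers {
  // Shape suggested above: the helper takes the ulog, applies the command
  // inside the try block, and returns only the id for later assertions --
  // nothing that still references the now-closed request escapes.
  static String applyAddUpdate(MockUlog ulog, String id) {
    try (MockRequest req = new MockRequest()) {
      ulog.add(req, id);
      return id;
    }
  }
}
```

The broken pattern in the patch is the inverse: the helper closes the request and *then* hands back a command that still uses it.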

* regarding using expectThrows:{quote}
{color:green}Couldn't use expectThrows as suggested, since it required 
variables like prevPointer, prevVersion, partialDoc etc. to be final variables, 
and changing them to such would make things ugly. Maybe I missed 
something.{color}{quote}
** i'm not sure what you tried, but variables used in lambda bodies must be 
*effectively* final, and in this case all of the variables you're referring to 
are effectively final at the point where we need to be using expectThrows.
*** as a general rule of thumb: if you want to use {{expectThrows}} and the 
variables you need to assert things about aren't effectively final, that's a 
code smell that you should consider breaking up your unit test into more 
discrete test methods.
*** perhaps this is exactly the situation you ran into? you tried using 
expectThrows before splitting up the method and had difficulties, but now that 
the test methods are more discrete I was able to switch it to use 
expectThrows() no problem.
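For reference, here's a rough self-contained sketch of the pattern (a minimal {{expectThrows}} is inlined so the example runs standalone -- the real one comes from LuceneTestCase -- and {{CodedException}}/{{VersionedStore}} are made-up stand-ins for SolrException and the code under test):

```java
// Minimal stand-in for LuceneTestCase.expectThrows(), so the pattern is runnable here.
class Expect {
  interface ThrowingRunnable { void run() throws Throwable; }
  static <T extends Throwable> T expectThrows(Class<T> type, ThrowingRunnable r) {
    try {
      r.run();
    } catch (Throwable t) {
      if (type.isInstance(t)) return type.cast(t);
      throw new AssertionError("unexpected exception type: " + t, t);
    }
    throw new AssertionError("expected " + type.getSimpleName() + " was not thrown");
  }
}

// Stand-in for SolrException with an HTTP-style code.
class CodedException extends RuntimeException {
  final int code;
  CodedException(int code, String msg) { super(msg); this.code = code; }
  int code() { return code; }
}

class VersionedStore {
  long version = 5;
  void update(long expectedVersion) {
    if (expectedVersion != version) {
      throw new CodedException(409, "version conflict: expected=" + expectedVersion
                                    + " actual=" + version);
    }
  }
}
```

Because {{prevVersion}}-style locals are assigned once, they're effectively final and usable in the lambda as-is -- and the returned exception lets us assert {{e.code() == 409}} directly instead of grepping the message.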

* regarding DBQ testing: {quote}
what about testing DBQs in the ulog and how they affect partial updates? 
{color:green}Any DBQ causes ulog to be cleared after it (irrespective of 
in-place updates, irrespective of whether any documents were deleted or not). 
Only that effect can be tested, and I've added a test for it.{color}{quote}
** i meant doing DBQs against a field/value that has in-place updates already in 
the tlog, and verifying that subsequent ulog.lookup calls do/don't match on that 
docId depending on whether the DBQ matched the old/new value after the in-place 
update
*** the test you added (testApplyPartialUpdatesWithDBQ) just does a DBQ against 
the uniqueKey field -- in spite of its name, there isn't a single partial 
update anywhere in the test!
** if there's really nothing we can do with an UpdateLog after adding *any* DBQ 
to it (ie: all lookups will fail), then that's not at all obvious from this 
test -- instead:
*** if the behavior of a DBQ is completely independent of any in-place updates, 
the method should be renamed
*** the asserts should make it clear the ulog has been cleared -- add more docs 
first and then check that all lookups fail, verify directly that a new RT 
searcher really has been opened (compare the currently used RT searcher to an 
instance from before the DBQ?), etc...



* some previous feedback i gave on one particular test...{quote}
please use expectThrows here, and actually validate that the Exception we get 
is a SolrException with code=409 {color:green}FIXED: checked message contains 
...{color}{quote}
** that's a really weak assertion considering it's more correct, less brittle, 
and less code, to just use something like {{assertEquals(409, 
exception.code())}} like i suggested.
** I went ahead and audited all the new test code to ensure all of the 
expectThrows usage includes an assert that it has the expected error code (400, 
409, 500, etc...) in addition to whatever existing message assertions make sense

* getFieldValueRTG
** this method seems to have a lot of complexity, and requires a lot of 
complexity in all of the caller code, to work around the fact that it's using 
really low level access and dealing directly with the 
StoredField/NumericDocValuesField/LegacyField type objects that exist in the 
SolrInputDocuments that exist in the tlog
** it seems like it would be _MASSIVELY_ simpler to just initialize a static 
SolrClient in the \@BeforeClass method along the lines of: {{solrClient = new 
EmbeddedSolrServer(h.getCoreContainer(), h.coreName);}} and then just use 
{{SolrClient.getById}} when we need values from an RTG
*** grep other solr/core tests for {{EmbeddedSolrServer}} to see examples of this pattern

* testReplay
** i previously gave the following feedback...{quote}
testReplay3, testReplay4, testReplay6, testReplay7
* these should have better names {color:green}Fixed{color}{quote}
** instead of giving the methods descriptive names making it clear what they 
tested, you seem to have combined them into one big generic {{testReplay}}.
** that's pretty much the opposite of what i suggested
** if {{testReplay}} fails, it won't be particularly clear what aspect of the 
functionality encountered a problem w/o reading through all of the steps, and if 
it fails early in the test (ie: the first {{checkReplay}} call) we won't have 
any idea if only that particular sequence failed, or if the other sequences 
further down in the test would also fail
** that's the main value add in having discrete, independent, tests: being able 
to see when more than one of them fails as a result of some change.
*** {{testReplay}} as written was weaker than the 4 previous individual tests 
-- even if they didn't have particularly distinctive names.
*** at least before, if something caused testReplay3 and testReplay6 to fail, 
but testReplay4 and testReplay7 still passed, that would have given us a 
starting point for investigating what those methods had in common, and how they 
differed from the ones that passed
** I went ahead and re-split this into 4 distinct test methods

* previous comment...{quote}
I'd still like to see some 100% randomized testing using checkReplay.  It's a 
really powerful helper method -- we should be taking full advantage of it. 
{color:green}TODO: Can be added later{color}{quote}
** considering how many weird edge case bugs we've found (both in the new solr 
code and in the existing IndexWriter level DB updating code) over the lifespan 
of this jira, that required very particular and specific circumstances to 
manifest, I'm in no way comfortable committing this patch as is with the level 
of testing it has currently
** that's why i keep saying we need more (solid) randomized & reproducible (ie: 
not cloud based where node routing might introduce subtle non-reproducibility) 
testing, and {{checkReplay}} is the starting point of doing that -- it already 
takes care of the hard work, we just need some scaffolding to generate a random 
sequence of updates to pass to it, and a loop to do lots of iterations with 
diff random sequences.
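The scaffolding i have in mind is roughly the following self-contained sketch -- {{checkReplay}} here is just a callback stand-in for the real helper, and the {{Op}} enum is a made-up placeholder for whatever update objects the real sequences would contain. The key property is reproducibility: everything is driven off explicit seeds, so any failing sequence can be re-run exactly:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Consumer;

// Sketch of scaffolding to drive a checkReplay-style helper with random update
// sequences. Op is a placeholder for real update commands.
class RandomReplaySequences {
  enum Op { ADD, INPLACE_UPDATE, DELETE_BY_ID, DELETE_BY_QUERY, COMMIT }

  // Reproducible: the same seed always yields the same sequence.
  static List<Op> randomSequence(long seed, int len) {
    Random r = new Random(seed);
    Op[] ops = Op.values();
    List<Op> seq = new ArrayList<>();
    for (int i = 0; i < len; i++) {
      seq.add(ops[r.nextInt(ops.length)]);
    }
    return seq;
  }

  // Loop doing lots of iterations with different random sequences; on a
  // failure the per-iteration seed is what you'd log to reproduce it.
  static void run(long masterSeed, int iters, Consumer<List<Op>> checkReplay) {
    Random master = new Random(masterSeed);
    for (int i = 0; i < iters; i++) {
      long seed = master.nextLong();
      checkReplay.accept(randomSequence(seed, 1 + master.nextInt(50)));
    }
  }
}
```

In the real test the master seed would come from the test framework's {{random()}}, which already gives us seed reporting/reuse for free.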



* some previous feedback & responses...
** {quote}"operations" comment is bogus (it's not just for queries) 
{color:green}FIXED{color}{quote}
*** how was this fixed? the operations variable in the last patch still had the 
exact same comment indicating it's a number of queries -- i corrected it
** {quote}
I'm not convinced the "{{synchronize \{...\}; commit stuff; synchronize \{ ... 
\};}}" sequence is actually thread safe...
* T-W1: commit sync block 1: newCommittedModel = copy(model), version = 
* T-W2: updates a doc and adds it to model
* T-W1: commit
* T-W1: commit sync block 2: committedModel = newCommittedModel
* T-R3: read sync block: get info from committedModel
* T-R3: query for doc
* ...
{color:green}Copied the logic straight from TestStressReorder. Maybe we can 
revisit both together, later, in light of your concerns?{color}{quote}
*** broken code is broken code.  It's bad enough it exists at all, it shouldn't 
be copied if we know it's broken.
*** I've described in detail a specific example of how the synchronization 
approach in the code is invalid according to my understanding of the code and 
the docs for the relevant methods -- if you can explain to me some way(s) in 
which my example is inaccurate, or involves some misunderstanding of the 
relevant documentation then great! -- we can leave the code the way it is.  
Otherwise it's broken and needs to be fixed.
*** there's no point in adding (more!) tests to Solr that we know for a fact 
are unreliable as designed.
** {quote}
having at least one pass over the model checking every doc at the end of the 
test seems like a good idea no matter what {color:green}FIXED{color}{quote}
*** how is this fixed?
*** the very last thing in the {{stressTest}} method is still the loop that 
joins on all the threads.
** {quote}I'm certain the existing "synchronized (model)" block is not thread 
safe relative to the synchronized blocks that copy the model into 
committedModel, because the "model.put(...)" calls can change the iterator and 
trigger a ConcurrentModificationException {color:green}Never saw a CME occur in 
thousands of rounds of this test. I think if a CME were to be thrown ever, it 
would've been quite frequent.{color}{quote}
*** FWIW: Just because you never saw a CME in your testing doesn't make the 
test valid -- it could be more likely to trigger on diff OS/architectures
**** You can't *disprove* a concurrency bug in code just by saying "i ran it a 
bunch and never saw a problem"
**** Hell: I could write a test that does {{assertTrue(0 != 
random().nextInt())}} and run it for thousands of rounds w/o ever getting a 
failure -- but that doesn't make the test correct.
*** That said: I reviewed this particular concern again just to be sure and I'm 
pretty sure my initial concern was invalid: I'm guessing i didn't notice before 
that {{model}} is a {{ConcurrentHashMap}} so {{HashMap}}'s copy constructor 
will get a weakly consistent iterator and never throw CME.
**** I added a comment making this obvious to the next person to make the same 
mistake i did.
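The distinction is easy to demonstrate in isolation -- the following self-contained sketch (a made-up helper, not test code from the patch) iterates a map while mutating it, the way a writer thread could mutate {{model}} while a commit copies it. A {{ConcurrentHashMap}} iterator is weakly consistent and never throws CME; a plain {{HashMap}} iterator is fail-fast:

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Iterating (eg: via a copy constructor) while the map is modified: weakly
// consistent iterators (ConcurrentHashMap) tolerate it, fail-fast ones (HashMap) don't.
class WeaklyConsistentCopy {
  static boolean copyWhileMutating(Map<Integer, Integer> model) {
    try {
      Iterator<Map.Entry<Integer, Integer>> it = model.entrySet().iterator();
      int i = 1000;
      while (it.hasNext()) {
        it.next();
        if (i < 1005) model.put(i++, i);  // mutate mid-iteration, as a writer thread might
      }
      return true;    // no CME: iterator tolerated the concurrent modification
    } catch (ConcurrentModificationException e) {
      return false;   // fail-fast iterator tripped
    }
  }
}
```

This is exactly why {{HashMap}}'s copy constructor over the {{ConcurrentHashMap}} model is safe here, and why the same pattern over a plain map would not be.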



* starting with a random index to help test edge cases
** a key piece of feedback i gave about this test before was:{quote}
* even if we're only going to test -one- _a few_ docs, we should ensure there 
are a random num docs in the index (some before the doc we're editing, and some 
after) {color:green}FIXED{color}
* _2 docs before/after is not a random number ... random means random: we need 
to test edge cases of first docid in index, last docid in index, first/last 
docid in segment, etc..._{quote}
** I mentioned this about multiple test methods, including 
** this is how outOfOrderUpdatesIndividualReplicaTest started in the last 
patch:{code}
  private void outOfOrderUpdatesIndividualReplicaTest() throws Exception {

    index("id", 0, "title_s", "title0", "id_i", 0);

    float ratings = 1;

    // update doc, set
    index("id", 0, "ratings", map("set", ratings));
{code}
*** the index contained exactly one document: id=0
*** that document started w/o a value in the field we want to update in place
*** then we do an in-place update of the document to set a value on it, and 
then "..." more testing of id=0
** this is what the start of that method looks like in the current 
patch:{code}
  private void outOfOrderUpdatesIndividualReplicaTest() throws Exception {

    index("id", 0, "title_s", "title0", "id_i", 0);

    float inplace_updatable_float = 1;

    // Adding random number of docs before adding the doc to be tested
    int numDocsBefore = random().nextInt(1000);
    for (int i=0; i<numDocsBefore; i++) {
      index("id", 1000 + i, "title_s", "title" + (1000+i), "id_i", 1000 + i);
    }

    // update doc, set
    index("id", 0, "inplace_updatable_float", map("set", inplace_updatable_float));

    // Adding random number of docs after adding the doc to be tested
    int numDocsAfter = random().nextInt(1000);
    for (int i=0; i<numDocsAfter; i++) {
      index("id", 5000 + i, "title_s", "title" + (5000+i), "id_i", 5000 + i);
    }
{code}
*** the test still starts out by adding doc id=0 first before any other docs
*** that means that we're still only really testing an in-place update of the 
first document in the index
*** we're only getting marginal benefit out of the "randomized" index -- namely 
that the size of the index is random. but the doc getting updated in-place is 
always the *first* doc in the index.  we're never testing an in-place update of 
a doc in the second segment, or the last doc in a segment, or the last doc in 
an index, etc...
** as written, this basically defeats the entire point of why we should use a 
random index
** that said: at least it gives us *some* randomization
** most other test methods in this class still don't even have that
** ensureRtgWorksWithPartialUpdatesTest does seem to have good initial 
index setup code:
*** adding a random number of docs
*** then add a "special" doc to be tested (w/o the field to be updated inplace)
*** then add some more random documents
** why not refactor that index creation code from 
ensureRtgWorksWithPartialUpdatesTest into a helper method and use it in all of 
the other test methods?
*** these tests all still use an index containing exactly 1 doc and should 
also have a random index:
**** outOfOrderDBQsTest 
**** outOfOrderDeleteUpdatesIndividualReplicaTest
**** reorderedDBQsWithInPlaceUpdatesShouldNotThrowReplicaInLIRTest
**** delayedReorderingFetchesMissingUpdateFromLeaderTest
*** docValuesUpdateTest could use the same index building helper method even if 
it doesn't follow the same pattern of only testing a single "special" doc

* ExecutorService
** I feel like i've mentioned this every time i've given feedback: {quote}
* if we are going to use an ExecutorService, then the result of 
awaitTermination has to be checked {color:green}FIXED{color}
* ... and shutdown & awaitTermination have to be called in that order
{quote}
** the latest patch had 2 places where threadpools were shut down & 
awaitTermination called correctly -- but there were still 4 more that were broken.
** I went ahead and fixed them.
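For reference, the pattern being asked for is roughly this (a minimal sketch using only java.util.concurrent; the 30-second timeout is an assumed value, not one from the patch):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

public class PoolShutdown {
  /** Shut down a pool in the required order, and check the result. */
  public static void shutdownAndAwait(ExecutorService pool) throws InterruptedException {
    pool.shutdown();  // 1) must come first: stop accepting new tasks
    // 2) then wait -- and actually check the boolean result
    if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
      pool.shutdownNow();  // give up: cancel anything still running
      throw new IllegalStateException("thread pool did not terminate in time");
    }
  }
}
```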

* docValuesUpdateTest
** knowing when all the updates have propagated...
*** I had previously (twice) made the following suggestion: {quote}
* In general, this seems like a convoluted way to try and kill two birds with 
one stone: 1) make sure all replicas have opened searchers with the new docs; 
2) give us something to compare to later to ensure the update was truly in place.
{quote}
** i really think the first problem should be addressed the way i suggested:
bq. * if initially no docs have a rating value, then make the (first) test 
query be for {{rating:\[\* TO \*\]}} and execute it in a retry loop until the 
numFound matches numDocs. {color:green}Done{color}
bq. * likewise if we ensure all ratings have a value such that abs(ratings) < 
X, then the second update can use an increment such that abs(inc) > X\*3 and we 
can use {{-ratings:\[-X TO X\]}} as the query in a retry loop.
** the second problem should be solved by either using the segments API, or by 
checking the docids on _every_ replica (w/o any retries) ... _after_ 
independently verifying the searcher has been re-opened.
** in the current patch you updated the {{// Keep querying until the commit is 
distributed and expected number of results are obtained}} code to use a range 
query like i suggested, but that code is useless because it assumes that those 
docs won't match the query until the updates propagate, which isn't true the 
way the test is currently written...
*** the index starts off with every document having a value in the field being 
queried against, so even if none of the in-place updates happen, the query will 
still succeed.
*** either all of the docs need to have no-values in the field before the 
in-place updates are attempted, or the test query needs to be tightened to only 
match the docs if they contain the updated values
**** example: currently all docs start with a value of {{101.0}} and the 
in-place updates set every doc to a value <= {{5.0}}, so we can make the 
verification of the first update be {{"inplace_updatable_float:\[\* TO 5.0\]}}
** The verification of the second update seems mostly ok, but it is a 
little more complicated than it needs to be and i think there is at least one 
subtle bug...
*** {{Math.max(Collections.max(valuesList), 
Math.abs(Collections.min(valuesList)));}} ?
**** {{Collections.max(valuesList)}} might also be negative; we need to take 
the {{abs}} of both values before deciding which is bigger.
*** in general it seems overly complicated to worry about determining X 
**** the code that does the original update 
({{valuesList.add(r.nextFloat()*5.0f);}}) ensures that the values are all 
between 0 and 5.0 -- let's assume we tweak that to randomly use -5.0 sometimes 
so our range is -5.0 to 5.0
**** the code that does the second update can just use {{inc = atLeast(10) * 
(r.nextBoolean() ? 1 : -1);}}
**** then the code verifying the second update propagates can just be 
{{"+inplace_updatable_float:\[\* TO \*\] -inplace_updatable_float:\[-15 TO 15\]"}}
** in both cases however, only executing the query against {{clientToTest}} is 
a bad idea.  We need to be running the verification against *ALL* the replicas, 
otherwise we can get false positives from the subsequent validation (ie: the 
thread scheduling might result in {{clientToTest}} being completely finished 
with all the updates and the commits, while some other replica is still working 
on it and hasn't committed yet when we do our validation queries).
*** there's also the problem that the {{clients}} list includes the LEADER, so 
1/3 of the time these queries aren't even hitting a replica and are useless.
** the second part of my suggestion doesn't seem to have been implemented, nor 
did it get any comment:{quote}
* the second problem should be solved by either using the segments API, or by 
checking the docids on _every_ replica (w/o any retries) ... _after_ 
independently verifying the searcher has been re-opened.
{quote}
*** ditto the problem of picking a random {{clients}} entry and 1/3 of the time 
getting the leader, making the {{matchResults()}} call useless.
*** even if we switch the randomization to use {{NONLEADERS}} there's no good 
reason to be comparing with only one replica -- looping over {{NONLEADERS}} is 
at worst "not harder" than picking a random replica, and makes the test twice 
as likely to catch any potential bugs that might cause a replica to get out of 
sync.
* getInternalDocIds
** this looks like it exists because of this prior feedback: {quote}
checking {{fl=\[docid\]}} in cloud RTG is currently broken (SOLR-9289) but we 
can/should be checking via the segments API anyway (if we have a general helper 
method for comparing the segments API responses of multiple replicas between 
multiple calls, it could be re-used in every test in this 
class) {color:green}Done by opening a new RT searcher and getting the docid 
for it.{color}
{quote}
** but SOLR-9289 has been fixed since July ... there's no reason to jump 
through hoops like this instead of just doing an RTG request with 
{{fl=\[docid\]}} ... it just makes the test more brittle to refactoring (into a 
true cloud test) later.

* ensureRtgWorksWithPartialUpdatesTest
** as mentioned previously: comparing results from {{LEADER}} with results from 
a random {{clients}} entry is useless 1/3 of the time since that list contains 
the LEADER.
*** the test will be a lot stronger, and no harder to write, if it instead 
loops over the clients and does a getById & asserts the results inside the loop.
** see comments above regarding getInternalDocIds - rather than fixing that 
method it might be easier to just do direct {{getById}} calls here
*** already doing {{LEADER.getById("100");}} -- just add {{fl=\[docid\]}} to 
those and add some new similar {{getById}} calls to the replicas to compare 
with the existing {{getById}} validation calls to them as well 
*** although: shouldn't *ALL* the {{getById}} calls in this method be replaced 
with {{getReplicaValue}} or {{assertReplicaValue}} calls ... isn't that what 
those methods are for? don't they ensure that RTG calls to replicas aren't 
internally forwarded to the LEADER?

* addDocAndGetVersion...{quote}
{{synchronized (cloudClient)}}
* why are we synchronized on cloud client but updating via LEADER?
* why are we synchronized at all?
{quote}
** Nothing was fixed about this -- it's still synchronizing on cloudClient ... 
does it actually need any synchronization at all?
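To illustrate the general concern (with stand-in classes, not Solr code): a {{synchronized}} block only provides mutual exclusion if every writer locks the object whose state is actually being mutated -- locking some unrelated object (like cloudClient, while updating via LEADER) guards nothing:

```java
public class LockDemo {
  static class Counter {
    private int n = 0;
    void inc() { n++; }
    int get() { return n; }
  }

  /** All threads lock the shared Counter itself, so no increments are lost. */
  public static int runWithLock(int threads, int perThread) throws InterruptedException {
    Counter shared = new Counter();
    Thread[] workers = new Thread[threads];
    for (int t = 0; t < threads; t++) {
      workers[t] = new Thread(() -> {
        for (int i = 0; i < perThread; i++) {
          synchronized (shared) {  // lock the object actually being mutated
            shared.inc();
          }
        }
      });
      workers[t].start();
    }
    for (Thread w : workers) w.join();
    return shared.get();
  }
}
```

Locking any object other than {{shared}} here would leave the increments racy, which is the question being raised about synchronizing on cloudClient.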


I plan to keep working on this issue for at least a few more days -- starting 
with some of the randomized test improvements that I mentioned (above) being 
really concerned about.

> Support updates of numeric DocValues
> ------------------------------------
>                 Key: SOLR-5944
>                 URL: https://issues.apache.org/jira/browse/SOLR-5944
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Ishan Chattopadhyaya
>            Assignee: Shalin Shekhar Mangar
>         Attachments: DUP.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, 
> TestStressInPlaceUpdates.eb044ac71.beast-167-failure.stdout.txt, 
> TestStressInPlaceUpdates.eb044ac71.beast-587-failure.stdout.txt, 
> TestStressInPlaceUpdates.eb044ac71.failures.tar.gz, defensive-checks.log.gz, 
> hoss.62D328FA1DEA57FD.fail.txt, hoss.62D328FA1DEA57FD.fail2.txt, 
> hoss.62D328FA1DEA57FD.fail3.txt, hoss.D768DD9443A98DC.fail.txt, 
> hoss.D768DD9443A98DC.pass.txt
> LUCENE-5189 introduced support for updates to numeric docvalues. It would be 
> really nice to have Solr support this.

This message was sent by Atlassian JIRA
