[ 
https://issues.apache.org/jira/browse/LUCENE-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated LUCENE-7344:
-----------------------------
    Attachment: LUCENE-7344.patch
                LUCENE-7344_hoss_experiment.patch


Something yonik pointed out in SOLR-5944 is that Solr could possibly work 
around this issue if we always opened an NRT Searcher/Reader prior to doing any 
deleteByQuery calls -- but that ideally we could do that conditionally based on 
knowing if/when any updateDocVal calls had happened to the IndexWriter since 
the last time an NRT Searcher/Reader was opened.

I didn't really understand that comment at the time, but talking about it 
offline with shalin/ishan a bit I realized it's essentially just doing what 
Shai suggested was the "To properly solve this problem..." solution, but by 
putting the burden on the caller to "force" the IndexWriter to reopen it's 
LeafReaders to return an uptodate NRT Reader, and then IndexWriter will use 
those uptodate LeafReaders for applying subsequent deletes.

The LUCENE-7344.patch I'm attaching here includes a proof of concept of this 
workaround -- written largely to convince myself it would actually work....

In this LUCENE-7344.patch, I have:

* reverted Shai's changes in previous patches to BufferedUpdatesStream (so that 
Deletes by Query are always applied before updateDocValues just like on master 
today)
* updated TestNumericDocValuesUpdates so that an (unused) NRT reader is always 
opened before any deleteDocuments(Query) call that depends on a previous 
updateDocVals call (see {{FORCE_NEW_READER_HACK_PRE = true}})
* Updated biasedMixOfRandomUpdates to assert the sequence numbers returned by 
all IndexWriter method calls have the expected values
** This is unrelated to the other changes, It's just something I realized was a 
good idea to include as a sanity check in this test.

...with these changes, all tests seem to pass reliably.

Assuming we document that forcing a reader reopen like this is required for 
users in this situation (ie: a caveat on the updateDocVals javadocs that doing 
something like this is required in order for subsequent deleteDocuments(Query) 
calls to succeed if the delete Query refers to the updated DocValues field) 
that could get us most of the way towards a low-impact (on 
performance/complexity of IndexWriter) solution unless/until Shai can a more 
optimized solution working -- at which point we could just remove the caveat 
from the javadocs.

But...

I'm leary of a solution that:
# relies on users to read javadocs carefully
# relies on callers of deleteDocuments(Query) to know/realize if/when that 
Query involves DocValues *and* know if any _other_ callers may have used 
update(Binary|Numeric)DocValues
#* particularly since the "Query" may have come from a upstream code
# results in silent unexpected data loss when the rules spelled out in the 
javadocs aren't followed.

So with that in mind, I started wondering what it would take to:
* Provide a method to indicate if there are updateDocVal calls "pending", so 
callers could conditionally decide to force reopen the NRT reader before doing 
a deleteDocuments(Query)
* Throw an exception if a caller attempts to violate the "rules" by calling a 
deleteDocuments with a Query that involves a DocValues field with "pending" 
updates w/o forcting the NRT reader to reopen.

The first seemed like it would be somewhat straightforward, and not strictly 
neccessary for corectness, So I tried tackling the second one.

This experiment can be seen in the attached LUCENE-7344_hoss_experiment.patch 
(apply *after* LUCENE-7344.patch)

The crux of the change is adding new code to 
BufferedUpdatesStream.applyQueryDeletes to check if there are any DVUpdates, 
and if there are it wraps the LeafReader in a FilterLeafReader that throws an 
Exception if LeafReader.get(Binary|Numeric)DocValues(...) is called on any of 
the field names used in the DVUpdates -- so any usage of the DocValues on those 
fields in the deletion Query will trigger an exception and the callers will 
know something is wrong.

For the most part this experiment seems viable -- except that the new Exception 
also gets thrown on writer.close() (or when opening an NRT reader) when there 
have been dvUpdates _after_ a deleteDocuments(Query) that uses the same 
DocValues field -- for example:
{code}
    writer.addDocument(doc(0));
    writer.deleteDocuments(DocValuesRangeQuery.newLongRange("val", 5L, 10L, 
true, true));
    writer.updateNumericDocValue(new Term("id", "doc-0"), "val",  7L);
    
    // ...this would cause my new error checking code to throw 
CanNotDeleteByQueryOnUpdatedDocValues
    writer.close(); 
{code}

I'm not really sure why that's happening, since (afaict) the 
deleteDocuments(Query) should have already been applied and been taken into 
account by a (reopened) reader, so it seems like the queriesIter passed to 
applyQueryDeletes when the writer is closed should be empty? ... but whatever 
the reason, it can also be worked around by forcing a new reader _after_ any 
deleteDocuments(Query) that will later be updated by an updateDocVal (or, 
depending on how you look at it, _before_ the updateDocVal -- toe-MAY-toe, 
toe-MAH-toe) ... see {{FORCE_NEW_READER_HACK_POST = true}}

----

In any case, I'm still hoping Shai's approach at an "internal" fix pans out -- 
but I wanted to toss this out as a strawman solution that seems better then 
"don't call these methods in this ways or you may silently lose data"



> Deletion by query of uncommitted docs not working with DV updates
> -----------------------------------------------------------------
>
>                 Key: LUCENE-7344
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7344
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Ishan Chattopadhyaya
>         Attachments: LUCENE-7344.patch, LUCENE-7344.patch, LUCENE-7344.patch, 
> LUCENE-7344.patch, LUCENE-7344.patch, LUCENE-7344.patch, 
> LUCENE-7344_hoss_experiment.patch
>
>
> When DVs are updated, delete by query doesn't work with the updated DV value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to