[
https://issues.apache.org/jira/browse/LUCENE-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hoss Man updated LUCENE-7344:
-----------------------------
Attachment: LUCENE-7344.patch
LUCENE-7344_hoss_experiment.patch
Something yonik pointed out in SOLR-5944 is that Solr could possibly work
around this issue if we always opened an NRT Searcher/Reader prior to doing any
deleteByQuery calls -- but that ideally we could do that conditionally based on
knowing if/when any updateDocVal calls had happened to the IndexWriter since
the last time an NRT Searcher/Reader was opened.
I didn't really understand that comment at the time, but talking about it
offline with shalin/ishan a bit I realized it's essentially just doing what
Shai suggested was the "To properly solve this problem..." solution, but by
putting the burden on the caller to "force" the IndexWriter to reopen it's
LeafReaders to return an uptodate NRT Reader, and then IndexWriter will use
those uptodate LeafReaders for applying subsequent deletes.
The LUCENE-7344.patch I'm attaching here includes a proof of concept of this
workaround -- written largely to convince myself it would actually work....
In this LUCENE-7344.patch, I have:
* reverted Shai's changes in previous patches to BufferedUpdatesStream (so that
Deletes by Query are always applied before updateDocValues just like on master
today)
* updated TestNumericDocValuesUpdates so that an (unused) NRT reader is always
opened before any deleteDocuments(Query) call that depends on a previous
updateDocVals call (see {{FORCE_NEW_READER_HACK_PRE = true}})
* Updated biasedMixOfRandomUpdates to assert the sequence numbers returned by
all IndexWriter method calls have the expected values
** This is unrelated to the other changes, It's just something I realized was a
good idea to include as a sanity check in this test.
...with these changes, all tests seem to pass reliably.
Assuming we document that forcing a reader reopen like this is required for
users in this situation (ie: a caveat on the updateDocVals javadocs that doing
something like this is required in order for subsequent deleteDocuments(Query)
calls to succeed if the delete Query refers to the updated DocValues field)
that could get us most of the way towards a low-impact (on
performance/complexity of IndexWriter) solution unless/until Shai can a more
optimized solution working -- at which point we could just remove the caveat
from the javadocs.
But...
I'm leary of a solution that:
# relies on users to read javadocs carefully
# relies on callers of deleteDocuments(Query) to know/realize if/when that
Query involves DocValues *and* know if any _other_ callers may have used
update(Binary|Numeric)DocValues
#* particularly since the "Query" may have come from a upstream code
# results in silent unexpected data loss when the rules spelled out in the
javadocs aren't followed.
So with that in mind, I started wondering what it would take to:
* Provide a method to indicate if there are updateDocVal calls "pending", so
callers could conditionally decide to force reopen the NRT reader before doing
a deleteDocuments(Query)
* Throw an exception if a caller attempts to violate the "rules" by calling a
deleteDocuments with a Query that involves a DocValues field with "pending"
updates w/o forcting the NRT reader to reopen.
The first seemed like it would be somewhat straightforward, and not strictly
neccessary for corectness, So I tried tackling the second one.
This experiment can be seen in the attached LUCENE-7344_hoss_experiment.patch
(apply *after* LUCENE-7344.patch)
The crux of the change is adding new code to
BufferedUpdatesStream.applyQueryDeletes to check if there are any DVUpdates,
and if there are it wraps the LeafReader in a FilterLeafReader that throws an
Exception if LeafReader.get(Binary|Numeric)DocValues(...) is called on any of
the field names used in the DVUpdates -- so any usage of the DocValues on those
fields in the deletion Query will trigger an exception and the callers will
know something is wrong.
For the most part this experiment seems viable -- except that the new Exception
also gets thrown on writer.close() (or when opening an NRT reader) when there
have been dvUpdates _after_ a deleteDocuments(Query) that uses the same
DocValues field -- for example:
{code}
writer.addDocument(doc(0));
writer.deleteDocuments(DocValuesRangeQuery.newLongRange("val", 5L, 10L,
true, true));
writer.updateNumericDocValue(new Term("id", "doc-0"), "val", 7L);
// ...this would cause my new error checking code to throw
CanNotDeleteByQueryOnUpdatedDocValues
writer.close();
{code}
I'm not really sure why that's happening, since (afaict) the
deleteDocuments(Query) should have already been applied and been taken into
account by a (reopened) reader, so it seems like the queriesIter passed to
applyQueryDeletes when the writer is closed should be empty? ... but whatever
the reason, it can also be worked around by forcing a new reader _after_ any
deleteDocuments(Query) that will later be updated by an updateDocVal (or,
depending on how you look at it, _before_ the updateDocVal -- toe-MAY-toe,
toe-MAH-toe) ... see {{FORCE_NEW_READER_HACK_POST = true}}
----
In any case, I'm still hoping Shai's approach at an "internal" fix pans out --
but I wanted to toss this out as a strawman solution that seems better then
"don't call these methods in this ways or you may silently lose data"
> Deletion by query of uncommitted docs not working with DV updates
> -----------------------------------------------------------------
>
> Key: LUCENE-7344
> URL: https://issues.apache.org/jira/browse/LUCENE-7344
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Ishan Chattopadhyaya
> Attachments: LUCENE-7344.patch, LUCENE-7344.patch, LUCENE-7344.patch,
> LUCENE-7344.patch, LUCENE-7344.patch, LUCENE-7344.patch,
> LUCENE-7344_hoss_experiment.patch
>
>
> When DVs are updated, delete by query doesn't work with the updated DV value.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]