Hey, I don't think I understand the email well, but I'll try my best.
In your printed docs, I see that the flag data is still tokenized. See the string that you printed: stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>.

What does your code for adding the doc look like? Are you using StringField for adding the field to the doc? I think this is why, when you re-add the field with a StringField, the test works.

Lucene's StandardTokenizer for 9.11 uses the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29 <http://unicode.org/reports/tr29/>. That standard treats "-" as a word breaker. I guess that is what is breaking your code.

You are using Lucene's NRT for your search. In general, for debugging such cases, I add an IndexWriter.commit() after I'm done updating the doc and see if it fixes things. If it does, then it has something to do with NRT, deleting docs, etc. If not, that means your query/data is wrong somewhere. This is how I debugged your first problem.

Best,
Gautam Worah.

On Sat, Aug 10, 2024 at 4:17 AM Wojtek <woj...@unir.se> wrote:

> Addendum, the output is:
>
> ```
> maxDoc: 3
> maxDoc (after second flag): 3
> Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\FLAG> stored<uid:1>>
> Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\ANSWERED> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\FLAG> stored<uid:1>>
> Term search: 0 items: []
> ```
>
> Though after a bit more digging, I think I found the issue in the James-Lucene code, in the update method
> (https://github.com/apache/james-project/pull/2342/files#diff-a7c2a3c5cdb7e4a2914c899409991e27df6b25ad54488f197bc533193e3a03d0R1267).
>
> There is a comment there that UID values are missing from the retrieved document and that they have to be re-added (otherwise an exception about the type being NULL is thrown while trying to update):
>
> ```
> // somehow the document getting from the search lost DocValues data for the uid field, we need to re-define the field with proper DocValues.
> long uidValue = doc.getField("uid").numericValue().longValue();
> doc.removeField("uid");
> doc.add(new NumericDocValuesField(UID_FIELD, uidValue));
> doc.add(new LongPoint(UID_FIELD, uidValue));
> doc.add(new StoredField(UID_FIELD, uidValue));
> ```
>
> It seems that the `ID_FIELD` is somehow also missing (even though it shows up in the debugging `.toString()` output), thus later on the term search with `new Term(ID_FIELD, doc.get(ID_FIELD))` yields 0 results.
>
> When I re-add the field manually, like the UID fields:
>
> ```
> final String text = doc.get(ID_FIELD);
> doc.add(new StringField(ID_FIELD, text, Store.YES));
> ```
>
> then subsequent updating works (because the term then matches the `ID_FIELD`).
>
> So the question seems to boil down to:
>
> 1) why do we have to re-define those fields, as they seem to be missing from the document retrieved with:
>
> ```
> TopDocs docs = searcher.search(queryBuilder.build(), 100000);
> ScoreDoc[] sDocs = docs.scoreDocs;
> for (ScoreDoc sDoc : sDocs) {
>     Document doc = searcher.doc(sDoc.doc);
> ```
>
> 2) if they are missing, why are they included in the document (`.toString()`) output?
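A note on questions 1) and 2): a Document that comes back from searcher.doc(...) / searcher.storedFields().document(...) contains only the stored values; doc values, points, and the original StringField/TextField definitions are not part of the stored data, so they have to be re-created before the document can be re-indexed. The `.toString()` output prints a generic field type reconstructed by the stored-fields reader, which is why it can say "indexed,tokenized" even for a field that was originally indexed with StringField. If such a reconstructed field is re-indexed as-is, the id value gets analyzed and "flags-1-1" is split on the "-", which would explain the failing term search. Below is a minimal sketch of the re-add step; the field names ("id", "uid") follow the James snippets above, everything else is illustrative and not the project's actual code:

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

final class ReindexSketch {

    // Rebuilds the index-time field definitions on a Document that was retrieved
    // through stored fields (which only carry the stored values), then re-indexes it.
    static void reindex(IndexWriter writer, Document doc) throws IOException {
        // Re-add the id as an untokenized StringField so "flags-1-1" stays one exact term.
        String id = doc.get("id");
        doc.removeField("id");
        doc.add(new StringField("id", id, Store.YES));

        // Re-add the uid with its doc values, point and stored representations.
        long uid = doc.getField("uid").numericValue().longValue();
        doc.removeField("uid");
        doc.add(new NumericDocValuesField("uid", uid));
        doc.add(new LongPoint("uid", uid));
        doc.add(new StoredField("uid", uid));

        // Delete whatever currently matches the id term, then add the rebuilt document.
        writer.updateDocument(new Term("id", id), doc);
    }
}
```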
> On 2024-08-10T12:09:29.000+02:00, Wojtek <woj...@unir.se> wrote:
>
> > Thank you Gautam!
> >
> > This works. Now I went back to Lucene and I'm hitting the wall.
> >
> > In James they set up the document with "id" constructed as "flags-<mailboxId>-<uid>" (e.g. "<id:flags-1-1>").
> >
> > I run the code that updates the documents with flags and afterwards check the result. The simple code I use opens a new reader from the writer (so it should be OK and should have the new state):
> >
> > ```
> > try (IndexReader reader = DirectoryReader.open(luceneMessageSearchIndex.writer)) {
> >     System.out.println("maxDoc: " + reader.maxDoc());
> >     IndexSearcher searcher = new IndexSearcher(reader);
> >     System.out.println("maxDoc (after second flag): " + reader.maxDoc());
> >     // starting from "1" to avoid main mail document
> >     for (int i = 1; i < reader.maxDoc(); i++) {
> >         System.out.println(reader.storedFields().document(i));
> >     }
> >     var idQuery = new TermQuery(new Term("id", "flags-1-1"));
> >     var search = searcher.search(idQuery, 10000);
> >     System.out.println("Term search: " + search.scoreDocs.length + " items: " + Arrays.toString(search.scoreDocs));
> > }
> > ```
> >
> > and the output is the following:
> >
> > ```
> > try (IndexReader reader = DirectoryReader.open(luceneMessageSearchIndex.writer)) {
> >     System.out.println("maxDoc: " + reader.maxDoc());
> >     IndexSearcher searcher = new IndexSearcher(reader);
> >     System.out.println("maxDoc (after second flag): " + reader.maxDoc());
> >     // starting from "1" to avoid main mail document
> >     for (int i = 1; i < reader.maxDoc(); i++) {
> >         System.out.println(reader.storedFields().document(i));
> >     }
> >     var idQuery = new TermQuery(new Term("id", "flags-1-1"));
> >     var search = searcher.search(idQuery, 10000);
> >     System.out.println("Term search: " + search.scoreDocs.length + " items: " + Arrays.toString(search.scoreDocs));
> > }
> > ```
> >
> > So even though I search for the term "flags-1-1", it yields 0 results (but there are already 2 documents with such an ID).
> >
> > The gist of the issue is that, for some reason, when trying to update the flags document, instead of being updated (deleted/re-added) it is only being added. My reasoning is that there is an issue with the term matching the field, so the update "fails" (it adds a new document for the same term) when updating the document:
> > https://github.com/apache/james-project/pull/2342/files#diff-a7c2a3c5cdb7e4a2914c899409991e27df6b25ad54488f197bc533193e3a03d0R1267
> >
> > The code looks OK; while debugging, the term yields "id: flags-1-1", so it looks correct (but that is only a visual string comparison).
> >
> > I thought that it could be the same issue with the tokenizer, but everywhere in the code StringField is used for the id of the flags:
> >
> > ```
> > private Document createFlagsDocument(MailboxMessage message) {
> >     Document doc = new Document();
> >     doc.add(new StringField(ID_FIELD, "flags-" + message.getMailboxId().serialize() + "-" + Long.toString(message.getUid().asLong()), Store.YES));
> >     …
> > ```
> >
> > So the update based on
> >
> > ```
> > new Term(ID_FIELD, doc.get(ID_FIELD))
> > ```
> >
> > should hit that exact document - correct?
> >
> > Any pointers on how to debug this and see how/where the comparison is done would be greatly appreciated; maybe then I could figure out why it doesn't match the documents, which causes the update to fail. (I've been at it for a couple of days now, and while I learned a great deal about Lucene, starting from absolutely zero knowledge, I think I'm in over my head; stepping into Lucene with a debugger doesn't help much as I don't know exactly what/where to look for :) )
> >
> > w.
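On the question of how to see where the comparison happens: rather than comparing strings by eye, one can dump the terms that were actually indexed under the id field. If the value was analyzed, it shows up as separate tokens ("flags", "1") instead of the single exact term "flags-1-1" that the TermQuery is looking for. A minimal sketch (a debugging aid only, not part of the James code; it assumes a reader opened over the same index as in the snippet above):

```java
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

final class DumpIdTerms {

    // Prints every term indexed under "id" across all segments, with its doc frequency.
    static void dump(IndexReader reader) throws IOException {
        for (LeafReaderContext leaf : reader.leaves()) {
            Terms terms = leaf.reader().terms("id");
            if (terms == null) {
                continue; // this segment has no indexed "id" field
            }
            TermsEnum it = terms.iterator();
            for (BytesRef term = it.next(); term != null; term = it.next()) {
                System.out.println("id term: " + term.utf8ToString()
                        + " (docFreq=" + it.docFreq() + ")");
            }
        }
    }
}
```

If the dump shows "flags" and "1" instead of a single "flags-1-1" term, the field was analyzed at index time, which is exactly the StringField-vs-tokenization situation described in the replies above.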
> > On 2024-08-10T10:21:21.000+02:00, Gautam Worah <worah.gau...@gmail.com> wrote:
> >
> >> Hey,
> >>
> >> Use a StringField instead of a TextField for the title and your test will pass.
> >>
> >> Tokenization, which is enabled for TextFields, is breaking your fancy title into tokens split by spaces, which is causing your docs to not match.
> >>
> >> https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/document/StringField.html
> >>
> >> Best,
> >> Gautam Worah.
> >>
> >> On Sat, Aug 10, 2024 at 12:05 AM Wojtek <woj...@unir.se> wrote:
> >>
> >>> Hi Froh,
> >>>
> >>> thank you for the information.
> >>>
> >>> I updated the code and re-opened the reader - it seems that the update is reflected and a search for the old document doesn't yield anything, but the search for the new term fails.
> >>>
> >>> I output all documents (there are 2) and the second one has the new title, but when searching for it no document is found, even though it's the same string that was used to update the title.
> >>>
> >>> On 2024-08-10T01:21:39.000+02:00, Michael Froh <msf...@gmail.com> wrote:
> >>>
> >>>> Hi Wojtek,
> >>>>
> >>>> Thank you for linking to your test code!
> >>>>
> >>>> When you open an IndexReader, it is locked to the view of the Lucene directory at the time that it's opened. If you make changes, you'll need to open a new IndexReader before those changes are visible. I see that you tried creating a new IndexSearcher, but unfortunately that's not sufficient.
> >>>>
> >>>> Hope that helps!
> >>>>
> >>>> Froh
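To make Froh's point concrete: a reader (even one opened from the writer) is a point-in-time snapshot, so after further updates you need a fresh reader rather than a new IndexSearcher over the old one. A minimal sketch of the refresh step, with illustrative names only:

```java
import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;

final class NrtRefreshSketch {

    // Returns a reader that reflects the writer's latest (possibly uncommitted) changes.
    static DirectoryReader refresh(IndexWriter writer, DirectoryReader current) throws IOException {
        DirectoryReader newer = DirectoryReader.openIfChanged(current, writer);
        if (newer == null) {
            return current; // nothing changed since 'current' was opened
        }
        current.close();    // release the stale point-in-time snapshot
        return newer;       // near-real-time view that sees the recent updates
    }
}
```

In practice, SearcherManager wraps this acquire/refresh/release cycle and is usually the easier option.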
> >>>> On Fri, Aug 9, 2024 at 3:25 PM Wojtek <woj...@unir.se> wrote:
> >>>>
> >>>>> Hi all!
> >>>>>
> >>>>> There is an effort in Apache James to update to a more modern version of Lucene (ref: https://github.com/apache/james-project/pull/2342). I'm digging into the issue as others have done, but I'm stumped - it seems that `org.apache.lucene.index.IndexWriter#updateDocument` doesn't update the document.
> >>>>>
> >>>>> Documentation (https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/index/IndexWriter.html#updateDocument(org.apache.lucene.index.Term,java.lang.Iterable)) states:
> >>>>>
> >>>>> "Updates a document by first deleting the document(s) containing term and then adding the new document. The delete and then add are atomic as seen by a reader on the same index (flush may happen only after the add)."
> >>>>>
> >>>>> Here is a simple test with it: https://github.com/woj-tek/lucene-update-test/blob/master/src/test/java/se/unir/AppTest.java but it fails.
> >>>>>
> >>>>> Any guidance would be appreciated because I (and others) have been hitting a wall with it :)
> >>>>>
> >>>>> --
> >>>>> Wojtek
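Putting the thread together, here is a minimal, self-contained sketch (class, field names and values invented for illustration) of the behaviour the quoted Javadoc describes: when the document is keyed by an untokenized StringField, updateDocument really does replace the previous document, and a reader opened from the writer afterwards sees exactly one hit for the exact term, dash included:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public final class UpdateDocumentSketch {

    public static void main(String[] args) throws Exception {
        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {

            // Key the document on an untokenized StringField: the exact value,
            // dashes included, is indexed as a single term.
            Document doc = new Document();
            doc.add(new StringField("id", "flags-1-1", Store.YES));
            doc.add(new TextField("title", "first title", Store.YES));
            writer.addDocument(doc);

            // updateDocument = delete-by-term + add, atomic for readers of this index.
            Document updated = new Document();
            updated.add(new StringField("id", "flags-1-1", Store.YES));
            updated.add(new TextField("title", "second title", Store.YES));
            writer.updateDocument(new Term("id", "flags-1-1"), updated);

            // A reader opened from the writer after the update sees exactly one document.
            try (DirectoryReader reader = DirectoryReader.open(writer)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                int hits = searcher.count(new TermQuery(new Term("id", "flags-1-1")));
                System.out.println("hits=" + hits + ", numDocs=" + reader.numDocs()); // expect 1, 1
            }
        }
    }
}
```

If the id were a TextField instead, StandardAnalyzer would split it on the dashes (per UAX#29), the Term("id", "flags-1-1") would match nothing, and the "update" would effectively become a plain add, consistent with the duplicated flags documents shown in the addendum output above.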