Addendum, the output is:

```
maxDoc: 3
maxDoc (after second flag): 3
Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\FLAG> stored<uid:1>>
Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\ANSWERED> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\FLAG> stored<uid:1>>
Term search: 0 items: []
```

Though after a bit more digging I think I found the issue in the James-Lucene code, in the update method (https://github.com/apache/james-project/pull/2342/files#diff-a7c2a3c5cdb7e4a2914c899409991e27df6b25ad54488f197bc533193e3a03d0R1267). There is a comment there noting that the UID values are missing from the retrieved document and have to be re-added (otherwise an exception about the type being NULL is thrown while trying to update):

```
// somehow the document getting from the search lost DocValues data for the uid field, we need to re-define the field with proper DocValues.
long uidValue = doc.getField("uid").numericValue().longValue();
doc.removeField("uid");
doc.add(new NumericDocValuesField(UID_FIELD, uidValue));
doc.add(new LongPoint(UID_FIELD, uidValue));
doc.add(new StoredField(UID_FIELD, uidValue));
```

It seems that the `ID_FIELD` is somehow also missing (even though it is printed by the debugging `.toString()`), so later on the term search with `new Term(ID_FIELD, doc.get(ID_FIELD))` yields 0 results. When I re-add the field manually, like the UID fields:

```
final String text = doc.get(ID_FIELD);
doc.add(new StringField(ID_FIELD, text, Store.YES));
```

then subsequent updating works (because the term then matches the ID_FIELD).

So the question seems to boil down to:

1) Why do we have to re-define those fields, as they seem to be missing from the document retrieved via:

```
TopDocs docs = searcher.search(queryBuilder.build(), 100000);
ScoreDoc[] sDocs = docs.scoreDocs;
for (ScoreDoc sDoc : sDocs) {
    Document doc = searcher.doc(sDoc.doc);
```

2) If they are missing, why are they included in the document's `.toString()` output?

On 2024-08-10T12:09:29.000+02:00, Wojtek <woj...@unir.se> wrote:
> Thank you Gautam!
>
> This works. Now I went back to Lucene and I'm hitting a wall.
>
> In James they set the document "id" constructed as "flags-<mailboxId>-<uid>" (e.g. "<id:flags-1-1>"). I run the code that updates the documents with flags and afterwards check the result.
>
> The simple code I use opens a new reader from the writer (so it should be OK and should have the new state):
>
> ```
> try (IndexReader reader = DirectoryReader.open(luceneMessageSearchIndex.writer)) {
>     System.out.println("maxDoc: " + reader.maxDoc());
>     IndexSearcher searcher = new IndexSearcher(reader);
>     System.out.println("maxDoc (after second flag): " + reader.maxDoc());
>     // starting from "1" to avoid main mail document
>     for (int i = 1; i < reader.maxDoc(); i++) {
>         System.out.println(reader.storedFields().document(i));
>     }
>     var idQuery = new TermQuery(new Term("id", "flags-1-1"));
>     var search = searcher.search(idQuery, 10000);
>     System.out.println("Term search: " + search.scoreDocs.length + " items: " + Arrays.toString(search.scoreDocs));
> }
> ```
>
> and the output is the one shown in the addendum at the top of this message.
>
> So even though I search for the term "flags-1-1", it yields 0 results (but there are 2 documents with such an ID already).
>
> The gist of the issue is that for some reason, when trying to update the flags document, instead of updating it (deleting/adding) it is only being added. My reasoning is that there is an issue with the term matching the field, so the update "fails" (it adds a new document for the same term) when updating the document:
> https://github.com/apache/james-project/pull/2342/files#diff-a7c2a3c5cdb7e4a2914c899409991e27df6b25ad54488f197bc533193e3a03d0R1267
>
> The code looks OK, and while debugging the term yields: "id: flags-1-1", so it looks OK (but that's only a visual string comparison). I thought it could be the same issue with the tokenizer, but everywhere in the code StringField is used for the id of the flags:
>
> ```
> private Document createFlagsDocument(MailboxMessage message) {
>     Document doc = new Document();
>     doc.add(new StringField(ID_FIELD, "flags-" + message.getMailboxId().serialize() + "-" + Long.toString(message.getUid().asLong()), Store.YES));
>     …
> ```
>
> So the update based on
>
> ```
> new Term(ID_FIELD, doc.get(ID_FIELD))
> ```
>
> should hit that exact document - correct?
>
> Any pointers on how to debug this and see how/where the comparison is done, so I could maybe figure out why it doesn't match the documents (which causes the update to fail), would be greatly appreciated! (I've been at it for a couple of days now, and while I learned a great deal about Lucene, starting from absolutely zero knowledge, I think I'm in over my head, and stepping into Lucene with a debugger doesn't help much as I don't know exactly what/where to look for :) )
>
> w.
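A possible explanation for questions 1) and 2) above: as far as I can tell, a `Document` fetched through `searcher.doc(...)` or `reader.storedFields().document(...)` is rebuilt from the stored values only. DocValues and points are separate index structures, not stored fields, so they are simply not part of what comes back (hence the missing `uid` DocValues), and string fields come back as generic stored fields whose type no longer records "untokenized StringField", even though `toString()` still prints the stored value. Note that the `toString()` output in the addendum shows the `id` field as `tokenized`, although it was originally indexed as an untokenized `StringField`. The sketch below is a minimal, self-contained illustration, not James code; the field names merely mirror the thread, and it assumes Lucene 9.x with the `StoredFields` API used above:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class StoredFieldsRoundTrip {
    public static void main(String[] args) throws Exception {
        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {

            Document doc = new Document();
            doc.add(new StringField("id", "flags-1-1", Field.Store.YES)); // indexed as a single, untokenized term
            doc.add(new NumericDocValuesField("uid", 1L));                // doc values only, never stored
            doc.add(new LongPoint("uid", 1L));                            // points only, never stored
            doc.add(new StoredField("uid", 1L));                          // the only part of "uid" that is stored
            writer.addDocument(doc);
            writer.commit();

            try (IndexReader reader = DirectoryReader.open(writer)) {
                Document retrieved = reader.storedFields().document(0);
                // Only stored values survive the round trip: the DocValues and Point "views" of uid
                // are gone, and "id" comes back as a generic stored field whose type does not say
                // "untokenized StringField" any more.
                for (IndexableField f : retrieved.getFields()) {
                    System.out.println(f.name() + " -> " + f.getClass().getSimpleName()
                            + ", tokenized=" + f.fieldType().tokenized()
                            + ", docValuesType=" + f.fieldType().docValuesType()
                            + ", pointDimensionCount=" + f.fieldType().pointDimensionCount());
                }
            }
        }
    }
}
```

If that is what is happening here, it would also explain the failing term search after an update: re-indexing the retrieved document as-is turns the formerly untokenized `id` into an analyzed field, so the exact term `flags-1-1` may no longer exist in the index, and the next `updateDocument(new Term(ID_FIELD, ...), doc)` deletes nothing and just adds another document.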
> On 2024-08-10T10:21:21.000+02:00, Gautam Worah <worah.gau...@gmail.com> wrote:
>> Hey,
>>
>> Use a StringField instead of a TextField for the title and your test will pass.
>>
>> Tokenization, which is enabled for TextFields, is breaking your fancy title into tokens split by spaces, which is causing your docs to not match.
>>
>> https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/document/StringField.html
>>
>> Best,
>> Gautam Worah.
>>
>> On Sat, Aug 10, 2024 at 12:05 AM Wojtek <woj...@unir.se> wrote:
>>> Hi Froh,
>>>
>>> thank you for the information.
>>>
>>> I updated the code and re-opened the reader - it seems that the update is reflected and the search for the old document doesn't yield anything, but the search for the new term fails.
>>>
>>> I output all documents (there are 2) and the second one has the new title, but when searching for it no document is found, even though it's the same string that was used to update the title.
>>>
>>> On 2024-08-10T01:21:39.000+02:00, Michael Froh <msf...@gmail.com> wrote:
>>>> Hi Wojtek,
>>>>
>>>> Thank you for linking to your test code!
>>>>
>>>> When you open an IndexReader, it is locked to the view of the Lucene directory at the time that it's opened. If you make changes, you'll need to open a new IndexReader before those changes are visible. I see that you tried creating a new IndexSearcher, but unfortunately that's not sufficient.
>>>>
>>>> Hope that helps!
>>>>
>>>> Froh
>>>>
>>>> On Fri, Aug 9, 2024 at 3:25 PM Wojtek <woj...@unir.se> wrote:
>>>>> Hi all!
>>>>>
>>>>> There is an effort in Apache James to update to a more modern version of Lucene (ref: https://github.com/apache/james-project/pull/2342). I'm digging into the issue as others have done, but I'm stumped - it seems that `org.apache.lucene.index.IndexWriter#updateDocument` doesn't update the document.
>>>>>
>>>>> The documentation (https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/index/IndexWriter.html#updateDocument(org.apache.lucene.index.Term,java.lang.Iterable)) states:
>>>>>
>>>>> Updates a document by first deleting the document(s) containing term and then adding the new document. The delete and then add are atomic as seen by a reader on the same index (flush may happen only after the add).
>>>>>
>>>>> Here is a simple test with it:
>>>>> https://github.com/woj-tek/lucene-update-test/blob/master/src/test/java/se/unir/AppTest.java
>>>>> but it fails.
>>>>>
>>>>> Any guidance would be appreciated because I (and others) have been hitting a wall with it :)
>>>>>
>>>>> --
>>>>> Wojtek
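Putting the advice from the thread together (use a `StringField` for exact-match ids, build the replacement document from scratch rather than from a retrieved stored-fields document, and open a fresh reader after the update), here is a minimal sketch of the expected `updateDocument` behaviour. It is not the James test case; the titles and the in-memory directory are invented for illustration:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class UpdateDocumentSketch {
    public static void main(String[] args) throws Exception {
        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {

            // Initial document: the id is a StringField, so it is indexed as one untokenized term.
            Document original = new Document();
            original.add(new StringField("id", "flags-1-1", Store.YES));
            original.add(new TextField("title", "first fancy title", Store.YES));
            writer.addDocument(original);
            writer.commit();

            // Replacement document built from scratch (not from a retrieved stored-fields Document).
            Document updated = new Document();
            updated.add(new StringField("id", "flags-1-1", Store.YES));
            updated.add(new TextField("title", "second fancy title", Store.YES));

            // Delete-then-add keyed on the exact term; this works because "id" is untokenized.
            writer.updateDocument(new Term("id", "flags-1-1"), updated);

            // A reader opened before the update would not see it; open a fresh (NRT) reader afterwards.
            try (IndexReader reader = DirectoryReader.open(writer)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                System.out.println("numDocs: " + reader.numDocs()); // expected: 1, not 2
                var hits = searcher.search(new TermQuery(new Term("id", "flags-1-1")), 10);
                System.out.println("hits for id term: " + hits.totalHits);
            }
        }
    }
}
```

If the id were indexed as a `TextField` instead, `new Term("id", "flags-1-1")` would no longer correspond to a single indexed term, so the delete half of `updateDocument` could match nothing and the add would simply append a second document, which is the symptom described earlier in the thread.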