Hi,
thank you for the reply, and apologies for being somewhat "all over the
place".
Regarding "tokenization" - should it happen if I use StringField?
When the document is created (before writing) I see in the debugger
that it's not tokenized and is of type StringField:
```
doc = {Document@4830}
"Document<stored,indexed,omitNorms,indexOptions=DOCS<id:flags-1-1>>"
fields = {ArrayList@5920} size = 1
0 = {StringField@5922}
"stored,indexed,omitNorms,indexOptions=DOCS<id:flags-1-1>"
```
But once in the update method (where the document is retrieved) I see
it changes to StoredField and is already marked "tokenized":
```
doc = {Document@6526}
"Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>
stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>
stored,indexed,omitNorms,indexOptions=DOCS<flags:\FLAG>
docValuesType=NUMERIC<uid:1> LongPoint <uid:1> stored<uid:1>>"
fields = {ArrayList@6548} size = 6
0 = {StoredField@6550}
"stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>"
1 = {StoredField@6551}
"stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>"
2 = {StringField@6552}
"stored,indexed,omitNorms,indexOptions=DOCS<flags:\FLAG>"
3 = {NumericDocValuesField@6553} "docValuesType=NUMERIC<uid:1>"
4 = {LongPoint@6554} "LongPoint <uid:1>"
5 = {StoredField@6555} "stored<uid:1>"
```
The code that adds the documents is a method implemented in James,
`org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex#add`
(https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1240),
and it looks fairly straightforward:
```
public Mono<Void> add(MailboxSession session, Mailbox mailbox, MailboxMessage membership) {
    return Mono.fromRunnable(Throwing.runnable(() -> {
        Document doc = createMessageDocument(session, membership);
        Document flagsDoc = createFlagsDocument(membership);
        writer.addDocument(doc);
        writer.addDocument(flagsDoc);
    }));
}
```
similar to the actual method that creates the flags document
(https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1290):
```
private Document createFlagsDocument(MailboxMessage message) {
    Document doc = new Document();
    doc.add(new StringField(ID_FIELD, "flags-" + message.getMailboxId().serialize()
        + "-" + Long.toString(message.getUid().asLong()), Store.YES));
    doc.add(new StringField(MAILBOX_ID_FIELD, message.getMailboxId().serialize(), Store.YES));
    doc.add(new NumericDocValuesField(UID_FIELD, message.getUid().asLong()));
    doc.add(new LongPoint(UID_FIELD, message.getUid().asLong()));
    doc.add(new StoredField(UID_FIELD, message.getUid().asLong()));
    indexFlags(doc, message.createFlags());
    return doc;
}
```
As you can see, `StringField` is used when creating the document, and
to the best of my knowledge (and based on what I was told) it _should
not_ be tokenized (?).
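To make the distinction concrete, here is a stdlib-only sketch (not actual Lucene code) of why this matters: a StringField is indexed as one single term, whereas an analyzed field is split on word breaks, and Gautam noted earlier in this thread that the StandardTokenizer's UAX#29 rules treat "-" as a break. The split("-") below is only a rough stand-in for that behavior:

```java
import java.util.*;

// Not Lucene code - a stdlib sketch of single-term vs tokenized indexing.
public class TokenizationSketch {
    // StringField-like: the entire value becomes exactly one indexed term.
    static List<String> asSingleTerm(String value) {
        return List.of(value);
    }

    // Analyzed-field-like: rough stand-in for word-break tokenization on "-".
    static List<String> analyzed(String value) {
        return Arrays.asList(value.split("-"));
    }

    public static void main(String[] args) {
        String id = "flags-1-1";
        System.out.println(asSingleTerm(id)); // [flags-1-1]
        System.out.println(analyzed(id));     // [flags, 1, 1]
        // An exact term lookup for "flags-1-1" only succeeds where the whole
        // value was indexed as a single term:
        System.out.println(asSingleTerm(id).contains(id)); // true
        System.out.println(analyzed(id).contains(id));     // false
    }
}
```

If the id field really were analyzed at index time, a `TermQuery(new Term(ID_FIELD, "flags-1-1"))` would be looking for a term that does not exist in the index, which would match the 0-hit searches below.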
The update (in which the document can't be updated because the Term
doesn't seem to find it) is done in
`org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex#update()`
(https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1259):
```
private void update(MailboxId mailboxId, MessageUid uid, Flags f) throws IOException {
    try (IndexReader reader = DirectoryReader.open(writer)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        BooleanQuery.Builder queryBuilder = new BooleanQuery.Builder();
        queryBuilder.add(new TermQuery(new Term(MAILBOX_ID_FIELD, mailboxId.serialize())),
            BooleanClause.Occur.MUST);
        queryBuilder.add(createQuery(MessageRange.one(uid)), BooleanClause.Occur.MUST);
        queryBuilder.add(new PrefixQuery(new Term(FLAGS_FIELD, "")), BooleanClause.Occur.MUST);
        TopDocs docs = searcher.search(queryBuilder.build(), 100000);
        ScoreDoc[] sDocs = docs.scoreDocs;
        for (ScoreDoc sDoc : sDocs) {
            Document doc = searcher.doc(sDoc.doc);
            doc.removeFields(FLAGS_FIELD);
            indexFlags(doc, f);
            // somehow the document we get from the search lost the DocValues data
            // for the uid field; we need to re-define the field with proper DocValues
            long uidValue = doc.getField("uid").numericValue().longValue();
            doc.removeField("uid");
            doc.add(new NumericDocValuesField(UID_FIELD, uidValue));
            doc.add(new LongPoint(UID_FIELD, uidValue));
            doc.add(new StoredField(UID_FIELD, uidValue));
            writer.updateDocument(new Term(ID_FIELD, doc.get(ID_FIELD)), doc);
        }
    }
}
```
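For what it's worth, the symptom of "update adds instead of replaces" is consistent with how `updateDocument` is documented to work: delete by term, then add. Here is a stdlib-only simulation (not Lucene code) of those semantics, where each "document" is just the set of terms indexed for its id field; if the indexed terms don't contain the exact term passed to the update, the delete is a no-op and documents accumulate:

```java
import java.util.*;

// Not Lucene code - a stdlib sketch of IndexWriter#updateDocument semantics:
// delete every document whose indexed terms contain the given term, then add
// the new document.
public class UpdateSketch {
    // each "document" is modeled as the set of terms indexed for its id field
    final List<Set<String>> docs = new ArrayList<>();

    void updateDocument(String term, Set<String> newDoc) {
        docs.removeIf(d -> d.contains(term)); // delete-by-term
        docs.add(newDoc);                     // then add
    }

    public static void main(String[] args) {
        String id = "flags-1-1";

        // id indexed as a single term: second update replaces the first doc
        UpdateSketch untokenized = new UpdateSketch();
        untokenized.updateDocument(id, Set.of(id));
        untokenized.updateDocument(id, Set.of(id));
        System.out.println(untokenized.docs.size()); // 1

        // id broken into tokens at index time: the exact term "flags-1-1"
        // matches nothing, so the delete misses and the doc is duplicated
        UpdateSketch tokenized = new UpdateSketch();
        tokenized.updateDocument(id, Set.of("flags", "1"));
        tokenized.updateDocument(id, Set.of("flags", "1"));
        System.out.println(tokenized.docs.size());   // 2
    }
}
```

The duplicated-flags-document behavior described further down in this thread looks exactly like the second case.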
I was wondering whether the Lucene/writer configuration might be the
culprit (which would result in tokenizing even a StringField), but it
looks fairly straightforward:
```
this.directory = directory;
this.writer = new IndexWriter(this.directory,
    createConfig(createAnalyzer(lenient), dropIndexOnStart));
```
where createConfig looks like this:
```
protected IndexWriterConfig createConfig(Analyzer analyzer, boolean dropIndexOnStart) {
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    if (dropIndexOnStart) {
        config.setOpenMode(OpenMode.CREATE);
    } else {
        config.setOpenMode(OpenMode.CREATE_OR_APPEND);
    }
    return config;
}
```
and createAnalyzer like this:
```
protected Analyzer createAnalyzer(boolean lenient) {
if (lenient) {
return new LenientImapSearchAnalyzer();
} else {
return new StrictImapSearchAnalyzer();
}
}
```
On 2024-08-10T21:04:15.000+02:00, Gautam Worah
<[email protected]> wrote:
> Hey,
>
> I don't think I understand the email well but I'll try my best.
>
> In your printed docs, I see that the flag data is still tokenized. See the
> string that you printed:
> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>.
> What does your code for adding the doc look like?
> Are you using StringField for adding the field to the doc?
> I think this is why when you re-add the field with a StringField, the test
> works.
>
> Lucene's StandardTokenizer for 9.11 uses the Unicode Text Segmentation
> algorithm, as specified in Unicode Standard Annex #29
> <http://unicode.org/reports/tr29/>. That standard contains a "-" as a word
> breaker. I guess that is what is breaking your code.
>
> You are using Lucene's NRT for your search. In general, for debugging such
> cases, I add an IndexWriter.commit() after you are done updating the doc,
> and see if it fixes things. If it does, then it has something to do with
> NRT, and deleting docs etc. If not, then that means that your query/data is
> wrong somewhere. This is how I debugged your first problem.
>
> Best,
> Gautam Worah.
>
> On Sat, Aug 10, 2024 at 4:17 AM Wojtek <[email protected]> wrote:
>
>> Addendum, output is:
>>
>> ```
>> maxDoc: 3
>> maxDoc (after second flag): 3
>> Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>
>> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>
>> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\FLAG>
>> stored<uid:1>>
>> Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>
>> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>
>> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\ANSWERED>
>> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\FLAG>
>> stored<uid:1>>
>> Term search: 0 items: []
>> ```
>>
>> Though after a bit more digging I think I found the issue in the
>> James-Lucene code, in the update method
>> (https://github.com/apache/james-project/pull/2342/files#diff-a7c2a3c5cdb7e4a2914c899409991e27df6b25ad54488f197bc533193e3a03d0R1267).
>>
>> There is a comment there that UID values are missing from the
>> retrieved document and have to be re-added (otherwise an exception
>> about the type being NULL is thrown while trying to update):
>>
>> ```
>> // somehow the document getting from the search lost DocValues data
>> // for the uid field, we need to re-define the field with proper DocValues.
>> long uidValue = doc.getField("uid").numericValue().longValue();
>> doc.removeField("uid");
>> doc.add(new NumericDocValuesField(UID_FIELD, uidValue));
>> doc.add(new LongPoint(UID_FIELD, uidValue));
>> doc.add(new StoredField(UID_FIELD, uidValue));
>> ```
>>
>> It seems that the `ID_FIELD` is somehow also missing (even though it's
>> output in the debugging `.toString()`), thus later on the term search
>> with `new Term(ID_FIELD, doc.get(ID_FIELD))` yields 0 results.
>>
>> When I re-add the field manually, like the UID fields:
>>
>> ```
>> final String text = doc.get(ID_FIELD);
>> doc.add(new StringField(ID_FIELD, text, Store.YES));
>> ```
>>
>> then subsequent updating works (because the term subsequently matches
>> the ID_FIELD).
>>
>> So the question seems to boil down to:
>>
>> 1) why do we have to re-define those fields, as they seem to be
>> missing from the retrieved searched document with:
>>
>> ```
>> TopDocs docs = searcher.search(queryBuilder.build(), 100000);
>> ScoreDoc[] sDocs = docs.scoreDocs;
>> for (ScoreDoc sDoc : sDocs) {
>>     Document doc = searcher.doc(sDoc.doc);
>> ```
>>
>> 2) if they are missing, why are they included in the document
>> (`.toString()`) output?
>>
>> On 2024-08-10T12:09:29.000+02:00, Wojtek <[email protected]> wrote:
>>
>>> Thank you Gautam!
>>>
>>> This works. Now I went back to Lucene and I'm hitting the wall.
>>>
>>> In James they set the document with "id" being constructed as
>>> "flag-<uid>-<uid>" (e.g. "<id:flags-1-1>").
>>>
>>> I run the code that updates the documents with flags and afterwards
>>> check the result. The simple code I use opens a new reader from the
>>> writer (so it should be OK and should have the new state):
>>>
>>> ```
>>> try (IndexReader reader = DirectoryReader.open(luceneMessageSearchIndex.writer)) {
>>>     System.out.println("maxDoc: " + reader.maxDoc());
>>>     IndexSearcher searcher = new IndexSearcher(reader);
>>>     System.out.println("maxDoc (after second flag): " + reader.maxDoc());
>>>     // starting from "1" to avoid main mail document
>>>     for (int i = 1; i < reader.maxDoc(); i++) {
>>>         System.out.println(reader.storedFields().document(i));
>>>     }
>>>     var idQuery = new TermQuery(new Term("id", "flags-1-1"));
>>>     var search = searcher.search(idQuery, 10000);
>>>     System.out.println("Term search: " + search.scoreDocs.length +
>>>         " items: " + Arrays.toString(search.scoreDocs));
>>> }
>>> ```
>>>
>>> and the output is following:
>>>
>>> ```
>>> try (IndexReader reader = DirectoryReader.open(luceneMessageSearchIndex.writer)) {
>>>     System.out.println("maxDoc: " + reader.maxDoc());
>>>     IndexSearcher searcher = new IndexSearcher(reader);
>>>     System.out.println("maxDoc (after second flag): " + reader.maxDoc());
>>>     // starting from "1" to avoid main mail document
>>>     for (int i = 1; i < reader.maxDoc(); i++) {
>>>         System.out.println(reader.storedFields().document(i));
>>>     }
>>>     var idQuery = new TermQuery(new Term("id", "flags-1-1"));
>>>     var search = searcher.search(idQuery, 10000);
>>>     System.out.println("Term search: " + search.scoreDocs.length +
>>>         " items: " + Arrays.toString(search.scoreDocs));
>>> }
>>> ```
>>>
>>> So even though I search for the term "flags-1-1" it yields 0 results
>>> (but there are 2 documents with such an ID already).
>>>
>>> The gist of the issue is that for some reason, when trying to update
>>> the flags document, instead of updating it (deleting/adding) it's
>>> only being added. My reasoning is that there is an issue with the
>>> term matching the field, so the update "fails" (it adds a new
>>> document for the same term) when updating the document:
>>> https://github.com/apache/james-project/pull/2342/files#diff-a7c2a3c5cdb7e4a2914c899409991e27df6b25ad54488f197bc533193e3a03d0R1267
>>> The code looks OK; while debugging, the term yields "id:flags-1-1",
>>> so it looks OK (but it's only a visual string comparison). I thought
>>> that it could be the same issue with the tokenizer, but everywhere in
>>> the code StringField is used for the id of the flags:
>>>
>>> ```
>>> private Document createFlagsDocument(MailboxMessage message) {
>>>     Document doc = new Document();
>>>     doc.add(new StringField(ID_FIELD, "flags-" +
>>>         message.getMailboxId().serialize() + "-" +
>>>         Long.toString(message.getUid().asLong()), Store.YES));
>>>     …
>>> ```
>>>
>>> So the update based on
>>>
>>> ```
>>> new Term(ID_FIELD, doc.get(ID_FIELD))
>>> ```
>>>
>>> should hit that exact document - correct?
>>>
>>> Any pointers on how to debug this and see how/where the comparison is
>>> done, so I could maybe try to figure out why it doesn't match the
>>> documents (which causes the update to fail), would be greatly
>>> appreciated!
>>>
>>> (I've been at it for a couple of days now, and while I learned a
>>> great deal about Lucene, starting from absolutely zero knowledge, I
>>> think I'm in over my head, and stepping into Lucene with a debugger
>>> doesn't help much as I don't know exactly what/where to look for :) )
>>>
>>> w.
>>>
>>> On 2024-08-10T10:21:21.000+02:00, Gautam Worah
>>> <[email protected]> wrote:
>>>
>>>> Hey,
>>>>
>>>> Use a StringField instead of a TextField for the title and your test
>>>> will pass.
>>>>
>>>> Tokenization, which is enabled for TextFields, is breaking your
>>>> fancy title into tokens split by spaces, which is causing your docs
>>>> to not match.
>>>>
>>>> https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/document/StringField.html
>>>>
>>>> Best,
>>>> Gautam Worah.
>>>>
>>>> On Sat, Aug 10, 2024 at 12:05 AM Wojtek <[email protected]> wrote:
>>>>
>>>>> Hi Froh,
>>>>>
>>>>> thank you for the information.
>>>>>
>>>>> I updated the code and re-opened the reader - it seems that the
>>>>> update is reflected, and the search for the old document doesn't
>>>>> yield anything, but the search for the new term fails.
>>>>>
>>>>> I output all documents (there are 2) and the second one has the new
>>>>> title, but when searching for it no document is found, even though
>>>>> it's the same string that was used to update the title.
>>>>>
>>>>> On 2024-08-10T01:21:39.000+02:00, Michael Froh
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hi Wojtek,
>>>>>>
>>>>>> Thank you for linking to your test code!
>>>>>>
>>>>>> When you open an IndexReader, it is locked to the view of the
>>>>>> Lucene directory at the time that it's opened. If you make
>>>>>> changes, you'll need to open a new IndexReader before those
>>>>>> changes are visible. I see that you tried creating a new
>>>>>> IndexSearcher, but unfortunately that's not sufficient.
>>>>>>
>>>>>> Hope that helps!
>>>>>>
>>>>>> Froh
>>>>>>
>>>>>> On Fri, Aug 9, 2024 at 3:25 PM Wojtek <[email protected]> wrote:
>>>>>>
>>>>>>> Hi all!
>>>>>>>
>>>>>>> There is an effort in Apache James to update to a more modern
>>>>>>> version of Lucene (ref:
>>>>>>> https://github.com/apache/james-project/pull/2342). I'm digging
>>>>>>> into the issue as others have done, but I'm stumped - it seems
>>>>>>> that `org.apache.lucene.index.IndexWriter#updateDocument` doesn't
>>>>>>> update the document.
>>>>>>>
>>>>>>> Documentation
>>>>>>> (https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/index/IndexWriter.html#updateDocument(org.apache.lucene.index.Term,java.lang.Iterable))
>>>>>>> states:
>>>>>>>
>>>>>>> Updates a document by first deleting the document(s) containing
>>>>>>> term and then adding the new document. The delete and then add
>>>>>>> are atomic as seen by a reader on the same index (flush may
>>>>>>> happen only after the add).
>>>>>>>
>>>>>>> Here is a simple test with it:
>>>>>>> https://github.com/woj-tek/lucene-update-test/blob/master/src/test/java/se/unir/AppTest.java
>>>>>>> but it fails.
>>>>>>>
>>>>>>> Any guidance would be appreciated because I (and others) have
>>>>>>> been hitting a wall with it :)
>>>>>>>
>>>>>>> --
>>>>>>> Wojtek
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>> For additional commands, e-mail: [email protected]