Addendum, the output is:

```
maxDoc: 3
maxDoc (after second flag): 3
Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\FLAG> stored<uid:1>>
Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\ANSWERED> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\FLAG> stored<uid:1>>
Term search: 0 items: []
```

Though after a bit more digging I think I found the issue in the James-Lucene code, in the update method (https://github.com/apache/james-project/pull/2342/files#diff-a7c2a3c5cdb7e4a2914c899409991e27df6b25ad54488f197bc533193e3a03d0R1267). There is a comment there noting that the UID values are missing from the retrieved document and have to be re-added (otherwise an exception about the type being NULL is thrown while trying to update):

```
// somehow the document getting from the search lost DocValues data for the uid field, we need to re-define the field with proper DocValues.
long uidValue = doc.getField("uid").numericValue().longValue();
doc.removeField("uid");
doc.add(new NumericDocValuesField(UID_FIELD, uidValue));
doc.add(new LongPoint(UID_FIELD, uidValue));
doc.add(new StoredField(UID_FIELD, uidValue));
```

It seems that the `ID_FIELD` is somehow also missing (even though it is printed by the debugging `.toString()`), so later on the term search with `new Term(ID_FIELD, doc.get(ID_FIELD))` yields 0 results. When I re-add the field manually, like the UID fields:

```
final String text = doc.get(ID_FIELD);
doc.add(new StringField(ID_FIELD, text, Store.YES));
```

then subsequent updating works (because the term then matches the ID_FIELD).

So the question seems to boil down to:

1) Why do we have to re-define those fields, as they seem to be missing from the document retrieved via:

```
TopDocs docs = searcher.search(queryBuilder.build(), 100000);
ScoreDoc[] sDocs = docs.scoreDocs;
for (ScoreDoc sDoc : sDocs) {
    Document doc = searcher.doc(sDoc.doc);
```

2) If they are missing, why are they included in the document's `.toString()` output?

On 2024-08-10T12:09:29.000+02:00, Wojtek <woj...@unir.se> wrote:
> Thank you Gautam!
>
> This works. Now I went back to Lucene and I'm hitting a wall.
>
> In James they set the document "id" constructed as "flags-<mailboxId>-<uid>" (e.g. "<id:flags-1-1>"). I run the code that updates the documents with flags and afterwards check the result.
>
> The simple code I use opens a new reader from the writer (so it should be OK and should have the new state):
>
> ```
> try (IndexReader reader = DirectoryReader.open(luceneMessageSearchIndex.writer)) {
>     System.out.println("maxDoc: " + reader.maxDoc());
>     IndexSearcher searcher = new IndexSearcher(reader);
>     System.out.println("maxDoc (after second flag): " + reader.maxDoc());
>     // starting from "1" to avoid main mail document
>     for (int i = 1; i < reader.maxDoc(); i++) {
>         System.out.println(reader.storedFields().document(i));
>     }
>     var idQuery = new TermQuery(new Term("id", "flags-1-1"));
>     var search = searcher.search(idQuery, 10000);
>     System.out.println("Term search: " + search.scoreDocs.length + " items: " + Arrays.toString(search.scoreDocs));
> }
> ```
>
> and the output is the one shown in the addendum at the top of this message.
>
> So even though I search for the term "flags-1-1", it yields 0 results (but there are 2 documents with such an ID already).
>
> The gist of the issue is that for some reason, when trying to update the flags document, instead of updating it (deleting/adding) it is only being added. My reasoning is that there is an issue with the term matching the field, so the update "fails" (it adds a new document for the same term) when updating the document:
> https://github.com/apache/james-project/pull/2342/files#diff-a7c2a3c5cdb7e4a2914c899409991e27df6b25ad54488f197bc533193e3a03d0R1267
>
> The code looks OK, and while debugging the term yields: "id: flags-1-1", so it looks OK (but that's only a visual string comparison). I thought it could be the same issue with the tokenizer, but everywhere in the code StringField is used for the id of the flags:
>
> ```
> private Document createFlagsDocument(MailboxMessage message) {
>     Document doc = new Document();
>     doc.add(new StringField(ID_FIELD, "flags-" + message.getMailboxId().serialize() + "-" + Long.toString(message.getUid().asLong()), Store.YES));
>     …
> ```
>
> So the update based on
>
> ```
> new Term(ID_FIELD, doc.get(ID_FIELD))
> ```
>
> should hit that exact document - correct?
>
> Any pointers on how to debug this and see how/where the comparison is done, so I could maybe figure out why it doesn't match the documents (which causes the update to fail), would be greatly appreciated! (I've been at it for a couple of days now, and while I learned a great deal about Lucene, starting from absolutely zero knowledge, I think I'm in over my head, and stepping into Lucene with a debugger doesn't help much as I don't know exactly what/where to look for :) )
>
> w.
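A possible explanation for questions 1) and 2) above: as far as I can tell, a `Document` fetched through `searcher.doc(...)` or `reader.storedFields().document(...)` is rebuilt from the stored values only. DocValues and points are separate index structures, not stored fields, so they are simply not part of what comes back (hence the missing `uid` DocValues), and string fields come back as generic stored fields whose type no longer records "untokenized StringField", even though `toString()` still prints the stored value. Note that the `toString()` output in the addendum shows the `id` field as `tokenized`, although it was originally indexed as an untokenized `StringField`. The sketch below is a minimal, self-contained illustration, not James code; the field names merely mirror the thread, and it assumes Lucene 9.x with the `StoredFields` API used above:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class StoredFieldsRoundTrip {
    public static void main(String[] args) throws Exception {
        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {

            Document doc = new Document();
            doc.add(new StringField("id", "flags-1-1", Field.Store.YES)); // indexed as a single, untokenized term
            doc.add(new NumericDocValuesField("uid", 1L));                // doc values only, never stored
            doc.add(new LongPoint("uid", 1L));                            // points only, never stored
            doc.add(new StoredField("uid", 1L));                          // the only part of "uid" that is stored
            writer.addDocument(doc);
            writer.commit();

            try (IndexReader reader = DirectoryReader.open(writer)) {
                Document retrieved = reader.storedFields().document(0);
                // Only stored values survive the round trip: the DocValues and Point "views" of uid
                // are gone, and "id" comes back as a generic stored field whose type does not say
                // "untokenized StringField" any more.
                for (IndexableField f : retrieved.getFields()) {
                    System.out.println(f.name() + " -> " + f.getClass().getSimpleName()
                            + ", tokenized=" + f.fieldType().tokenized()
                            + ", docValuesType=" + f.fieldType().docValuesType()
                            + ", pointDimensionCount=" + f.fieldType().pointDimensionCount());
                }
            }
        }
    }
}
```

If that is what is happening here, it would also explain the failing term search after an update: re-indexing the retrieved document as-is turns the formerly untokenized `id` into an analyzed field, so the exact term `flags-1-1` may no longer exist in the index, and the next `updateDocument(new Term(ID_FIELD, ...), doc)` deletes nothing and just adds another document.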
> On 2024-08-10T10:21:21.000+02:00, Gautam Worah <worah.gau...@gmail.com> wrote:
>> Hey,
>>
>> Use a StringField instead of a TextField for the title and your test will pass.
>>
>> Tokenization, which is enabled for TextFields, is breaking your fancy title into tokens split by spaces, which is causing your docs to not match.
>>
>> https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/document/StringField.html
>>
>> Best,
>> Gautam Worah.
>>
>> On Sat, Aug 10, 2024 at 12:05 AM Wojtek <woj...@unir.se> wrote:
>>> Hi Froh,
>>>
>>> thank you for the information.
>>>
>>> I updated the code and re-opened the reader - it seems that the update is reflected and the search for the old document doesn't yield anything, but the search for the new term fails.
>>>
>>> I output all documents (there are 2) and the second one has the new title, but when searching for it no document is found, even though it's the same string that was used to update the title.
>>>
>>> On 2024-08-10T01:21:39.000+02:00, Michael Froh <msf...@gmail.com> wrote:
>>>> Hi Wojtek,
>>>>
>>>> Thank you for linking to your test code!
>>>>
>>>> When you open an IndexReader, it is locked to the view of the Lucene directory at the time that it's opened. If you make changes, you'll need to open a new IndexReader before those changes are visible. I see that you tried creating a new IndexSearcher, but unfortunately that's not sufficient.
>>>>
>>>> Hope that helps!
>>>>
>>>> Froh
>>>>
>>>> On Fri, Aug 9, 2024 at 3:25 PM Wojtek <woj...@unir.se> wrote:
>>>>> Hi all!
>>>>>
>>>>> There is an effort in Apache James to update to a more modern version of Lucene (ref: https://github.com/apache/james-project/pull/2342). I'm digging into the issue as others have done, but I'm stumped - it seems that `org.apache.lucene.index.IndexWriter#updateDocument` doesn't update the document.
>>>>>
>>>>> The documentation (https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/index/IndexWriter.html#updateDocument(org.apache.lucene.index.Term,java.lang.Iterable)) states:
>>>>>
>>>>> Updates a document by first deleting the document(s) containing term and then adding the new document. The delete and then add are atomic as seen by a reader on the same index (flush may happen only after the add).
>>>>>
>>>>> Here is a simple test with it:
>>>>> https://github.com/woj-tek/lucene-update-test/blob/master/src/test/java/se/unir/AppTest.java
>>>>> but it fails.
>>>>>
>>>>> Any guidance would be appreciated because I (and others) have been hitting a wall with it :)
>>>>>
>>>>> --
>>>>> Wojtek
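Putting the advice from the thread together (use a `StringField` for exact-match ids, build the replacement document from scratch rather than from a retrieved stored-fields document, and open a fresh reader after the update), here is a minimal sketch of the expected `updateDocument` behaviour. It is not the James test case; the titles and the in-memory directory are invented for illustration:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class UpdateDocumentSketch {
    public static void main(String[] args) throws Exception {
        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {

            // Initial document: the id is a StringField, so it is indexed as one untokenized term.
            Document original = new Document();
            original.add(new StringField("id", "flags-1-1", Store.YES));
            original.add(new TextField("title", "first fancy title", Store.YES));
            writer.addDocument(original);
            writer.commit();

            // Replacement document built from scratch (not from a retrieved stored-fields Document).
            Document updated = new Document();
            updated.add(new StringField("id", "flags-1-1", Store.YES));
            updated.add(new TextField("title", "second fancy title", Store.YES));

            // Delete-then-add keyed on the exact term; this works because "id" is untokenized.
            writer.updateDocument(new Term("id", "flags-1-1"), updated);

            // A reader opened before the update would not see it; open a fresh (NRT) reader afterwards.
            try (IndexReader reader = DirectoryReader.open(writer)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                System.out.println("numDocs: " + reader.numDocs()); // expected: 1, not 2
                var hits = searcher.search(new TermQuery(new Term("id", "flags-1-1")), 10);
                System.out.println("hits for id term: " + hits.totalHits);
            }
        }
    }
}
```

If the id were indexed as a `TextField` instead, `new Term("id", "flags-1-1")` would no longer correspond to a single indexed term, so the delete half of `updateDocument` could match nothing and the add would simply append a second document, which is the symptom described earlier in the thread.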