I'm confused as to what could be happening. Google led me to this StackOverflow link: https://stackoverflow.com/questions/36402235/lucene-stringfield-gets-tokenized-when-doc-is-retrieved-and-stored-again which references some longstanding issues about fields changing their "types" when retrieved. The docs mention: `NOTE: only the content of a field is returned if that field was stored during indexing. Metadata like boost, omitNorm, IndexOptions, tokenized, etc., are not preserved.`
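That note from the docs can be reproduced in a minimal, self-contained sketch (assuming lucene-core 9.x on the classpath; the field name and value mirror the James case but are otherwise illustrative):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;

public class StoredFieldRoundTrip {
    public static void main(String[] args) throws Exception {
        try (ByteBuffersDirectory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // StringField: indexed as a single, untokenized term.
            doc.add(new StringField("id", "flags-1-1", Store.YES));
            writer.addDocument(doc);
            writer.commit();

            try (IndexReader reader = DirectoryReader.open(writer)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(new TermQuery(new Term("id", "flags-1-1")), 10);
                System.out.println(hits.totalHits.value); // 1: the exact term is in the index

                // The retrieved document carries only stored content; the original
                // field type (StringField, untokenized) is not preserved.
                Document stored = searcher.storedFields().document(hits.scoreDocs[0].doc);
                System.out.println(stored.getField("id").getClass().getSimpleName()); // StoredField
            }
        }
    }
}
```

Re-adding such a retrieved document through `updateDocument()` therefore re-indexes `id` with whatever field type the stored-fields reader reconstructs, not with the original `StringField`.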
Can you check what `doc.get(ID_FIELD)` returns, and whether it looks right? Maybe try a simple `new TermQuery(new Term("id", "flags-1-1"))` query during the update and see if it returns the correct answer? If the value is not right, you may have to use the original stored value (https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/search/IndexSearcher.html#storedFields()) when crafting the `updateDocument()` call.

Best,
Gautam Worah.

On Sat, Aug 10, 2024 at 3:12 PM Wojtek <woj...@unir.se> wrote:
> Hi,
>
> thank you for the reply and apologies for being somewhat "all over the place".
>
> Regarding "tokenization" - should it happen if I use StringField?
>
> When the document is created (before writing) I see in the debugger that it's not tokenized and is of type StringField:
>
> ```
> doc = {Document@4830} "Document<stored,indexed,omitNorms,indexOptions=DOCS<id:flags-1-1>>"
>  fields = {ArrayList@5920} size = 1
>   0 = {StringField@5922} "stored,indexed,omitNorms,indexOptions=DOCS<id:flags-1-1>"
> ```
>
> But once in the update method (document having been retrieved) I see it changes to StoredField and is already "tokenized":
>
> ```
> doc = {Document@6526} "Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1> stored,indexed,omitNorms,indexOptions=DOCS<flags:\FLAG> docValuesType=NUMERIC<uid:1> LongPoint <uid:1> stored<uid:1>>"
>  fields = {ArrayList@6548} size = 6
>   0 = {StoredField@6550} "stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>"
>   1 = {StoredField@6551} "stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>"
>   2 = {StringField@6552} "stored,indexed,omitNorms,indexOptions=DOCS<flags:\FLAG>"
>   3 = {NumericDocValuesField@6553} "docValuesType=NUMERIC<uid:1>"
>   4 = {LongPoint@6554} "LongPoint <uid:1>"
>   5 = {StoredField@6555} "stored<uid:1>"
> ```
>
> The code that adds the documents - it's a method
implemented in James:
> `org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex#add`
> (https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1240)
> and looks fairly straightforward:
>
> ```
> public Mono<Void> add(MailboxSession session, Mailbox mailbox, MailboxMessage membership) {
>     return Mono.fromRunnable(Throwing.runnable(() -> {
>         Document doc = createMessageDocument(session, membership);
>         Document flagsDoc = createFlagsDocument(membership);
>         writer.addDocument(doc);
>         writer.addDocument(flagsDoc);
>     }));
> }
> ```
>
> similarly to the actual method that creates the flags document
> (https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1290):
>
> ```
> private Document createFlagsDocument(MailboxMessage message) {
>     Document doc = new Document();
>     doc.add(new StringField(ID_FIELD, "flags-" + message.getMailboxId().serialize() + "-" + Long.toString(message.getUid().asLong()), Store.YES));
>     doc.add(new StringField(MAILBOX_ID_FIELD, message.getMailboxId().serialize(), Store.YES));
>     doc.add(new NumericDocValuesField(UID_FIELD, message.getUid().asLong()));
>     doc.add(new LongPoint(UID_FIELD, message.getUid().asLong()));
>     doc.add(new StoredField(UID_FIELD, message.getUid().asLong()));
>     indexFlags(doc, message.createFlags());
>     return doc;
> }
> ```
>
> As you can see, `StringField` is used when creating the document and, to the best of my knowledge and based on what I was told, it _should_ not be tokenized (?).
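The key property is that `StringField` bypasses the analyzer entirely at index time. For contrast, here is a sketch (assuming lucene-core 9.x on the classpath; `StandardAnalyzer` stands in for the James analyzers purely for illustration) of what an analyzed field would do to the same id value:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class IdTokenization {
    public static void main(String[] args) throws Exception {
        List<String> tokens = new ArrayList<>();
        try (StandardAnalyzer analyzer = new StandardAnalyzer();
             TokenStream ts = analyzer.tokenStream("id", "flags-1-1")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                tokens.add(term.toString());
            }
            ts.end();
        }
        // UAX#29 treats '-' as a word break, so the single id becomes three terms.
        System.out.println(tokens); // [flags, 1, 1]
    }
}
```

An exact `TermQuery(new Term("id", "flags-1-1"))` can only match if the whole string was indexed as one term, which is exactly what `StringField` guarantees and tokenization destroys.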
> Update (in which the document can't be updated because the Term seems not to find it) is done in
> `org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex#update()`
> (https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1259):
>
> ```
> private void update(MailboxId mailboxId, MessageUid uid, Flags f) throws IOException {
>     try (IndexReader reader = DirectoryReader.open(writer)) {
>         IndexSearcher searcher = new IndexSearcher(reader);
>         BooleanQuery.Builder queryBuilder = new BooleanQuery.Builder();
>         queryBuilder.add(new TermQuery(new Term(MAILBOX_ID_FIELD, mailboxId.serialize())), BooleanClause.Occur.MUST);
>         queryBuilder.add(createQuery(MessageRange.one(uid)), BooleanClause.Occur.MUST);
>         queryBuilder.add(new PrefixQuery(new Term(FLAGS_FIELD, "")), BooleanClause.Occur.MUST);
>
>         TopDocs docs = searcher.search(queryBuilder.build(), 100000);
>         ScoreDoc[] sDocs = docs.scoreDocs;
>         for (ScoreDoc sDoc : sDocs) {
>             Document doc = searcher.doc(sDoc.doc);
>             doc.removeFields(FLAGS_FIELD);
>             indexFlags(doc, f);
>             // somehow the document getting from the search lost DocValues data for the uid field, we need to re-define the field with proper DocValues.
>             long uidValue = doc.getField("uid").numericValue().longValue();
>             doc.removeField("uid");
>             doc.add(new NumericDocValuesField(UID_FIELD, uidValue));
>             doc.add(new LongPoint(UID_FIELD, uidValue));
>             doc.add(new StoredField(UID_FIELD, uidValue));
>
>             writer.updateDocument(new Term(ID_FIELD, doc.get(ID_FIELD)), doc);
>         }
>     }
> }
> ```
>
> I was wondering if the Lucene/writer configuration could be the culprit (which would result in tokenizing even a StringField) but it looks fairly straightforward:
>
> ```
> this.directory = directory;
> this.writer = new IndexWriter(this.directory, createConfig(createAnalyzer(lenient), dropIndexOnStart));
> ```
>
> where createConfig looks like this:
>
> ```
> protected IndexWriterConfig createConfig(Analyzer analyzer, boolean dropIndexOnStart) {
>     IndexWriterConfig config = new IndexWriterConfig(analyzer);
>     if (dropIndexOnStart) {
>         config.setOpenMode(OpenMode.CREATE);
>     } else {
>         config.setOpenMode(OpenMode.CREATE_OR_APPEND);
>     }
>     return config;
> }
> ```
>
> and createAnalyzer like this:
>
> ```
> protected Analyzer createAnalyzer(boolean lenient) {
>     if (lenient) {
>         return new LenientImapSearchAnalyzer();
>     } else {
>         return new StrictImapSearchAnalyzer();
>     }
> }
> ```
>
> On 2024-08-10T21:04:15.000+02:00, Gautam Worah <worah.gau...@gmail.com> wrote:
>
> > Hey,
> >
> > I don't think I understand the email well but I'll try my best.
> >
> > In your printed docs, I see that the flag data is still tokenized. See the string that you printed: `DOCS<id:flags-1-1> stored,indexed,tokenized,omitNorms`. What does your code for adding the doc look like? Are you using StringField for adding the field to the doc?
> >
> > I think this is why, when you re-add the field with a StringField, the test works.
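Given that metadata loss, one defensive variant of the update loop (a sketch, not the actual James patch; it reuses the field-name constants and the `indexFlags` helper from the code above, and `rebuildFlagsDocument` is a hypothetical name) is to rebuild the flags document from stored values instead of mutating the retrieved one:

```
// Sketch: build a fresh Document from the retrieved document's stored values,
// so each field is re-created with its intended type (StringField stays a
// single untokenized term, and uid regains its doc-values/point fields).
private Document rebuildFlagsDocument(Document retrieved, Flags flags) {
    Document fresh = new Document();
    fresh.add(new StringField(ID_FIELD, retrieved.get(ID_FIELD), Store.YES));
    fresh.add(new StringField(MAILBOX_ID_FIELD, retrieved.get(MAILBOX_ID_FIELD), Store.YES));
    long uid = retrieved.getField(UID_FIELD).numericValue().longValue();
    fresh.add(new NumericDocValuesField(UID_FIELD, uid));
    fresh.add(new LongPoint(UID_FIELD, uid));
    fresh.add(new StoredField(UID_FIELD, uid));
    indexFlags(fresh, flags);
    return fresh;
}

// ... and in update():
// writer.updateDocument(new Term(ID_FIELD, doc.get(ID_FIELD)), rebuildFlagsDocument(doc, f));
```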
> > Lucene's StandardTokenizer for 9.11 uses the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29 (http://unicode.org/reports/tr29/). That standard treats "-" as a word breaker. I guess that is what is breaking your code.
> >
> > You are using Lucene's NRT for your search. In general, for debugging such cases, I add an IndexWriter.commit() after you are done updating the doc, and see if it fixes things. If it does, then it has something to do with NRT, deleting docs, etc. If not, then your query/data is wrong somewhere. This is how I debugged your first problem.
> >
> > Best,
> > Gautam Worah.
> >
> > On Sat, Aug 10, 2024 at 4:17 AM Wojtek <woj...@unir.se> wrote:
> >
> >> Addendum, the output is:
> >>
> >> ```
> >> maxDoc: 3
> >> maxDoc (after second flag): 3
> >> Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\FLAG> stored<uid:1>>
> >> Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\ANSWERED> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\FLAG> stored<uid:1>>
> >> Term search: 0 items: []
> >> ```
> >>
> >> Though after a bit more digging I think I found the issue in the James-Lucene code, in the update method
> >> (https://github.com/apache/james-project/pull/2342/files#diff-a7c2a3c5cdb7e4a2914c899409991e27df6b25ad54488f197bc533193e3a03d0R1267).
> >>
> >> There is a comment there that UID values are
missing from the retrieved document and have to be re-added (otherwise an exception about the type being NULL is thrown while trying to update):
> >>
> >> ```
> >> // somehow the document getting from the search lost DocValues data for the uid field, we need to re-define the field with proper DocValues.
> >> long uidValue = doc.getField("uid").numericValue().longValue();
> >> doc.removeField("uid");
> >> doc.add(new NumericDocValuesField(UID_FIELD, uidValue));
> >> doc.add(new LongPoint(UID_FIELD, uidValue));
> >> doc.add(new StoredField(UID_FIELD, uidValue));
> >> ```
> >>
> >> It seems that the `ID_FIELD` is somehow also missing (even though it's output in the debugging `.toString()`), thus later on the term search with `new Term(ID_FIELD, doc.get(ID_FIELD))` yields 0 results.
> >>
> >> When I re-add the field manually, like the UID fields:
> >>
> >> ```
> >> final String text = doc.get(ID_FIELD);
> >> doc.add(new StringField(ID_FIELD, text, Store.YES));
> >> ```
> >>
> >> then subsequent updating works (because the term subsequently matches the ID_FIELD).
> >>
> >> So the question seems to boil down to:
> >>
> >> 1) why do we have to re-define those fields, as they seem to be missing from the searched document retrieved with:
> >>
> >> ```
> >> TopDocs docs = searcher.search(queryBuilder.build(), 100000);
> >> ScoreDoc[] sDocs = docs.scoreDocs;
> >> for (ScoreDoc sDoc : sDocs) {
> >>     Document doc = searcher.doc(sDoc.doc);
> >> ```
> >>
> >> 2) if they are missing, why are they included in the document (`.toString()`) output?
> >>
> >> On 2024-08-10T12:09:29.000+02:00, Wojtek <woj...@unir.se> wrote:
> >>> Thank you Gautam!
> >>>
> >>> This works.
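Both questions point at the same mechanism: `searcher.doc(...)` returns a reconstruction built from stored values only (doc-values and point data are simply gone), and its `toString()` prints the reconstructed stored-field types rather than the originals. A self-contained sketch of the resulting update failure (assuming lucene-core 9.x; names and values are illustrative):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;

public class UpdateWithRetrievedDoc {
    // Count hits for the exact, untokenized id term.
    static long countExact(IndexWriter writer) throws Exception {
        try (IndexReader reader = DirectoryReader.open(writer)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            return searcher.search(new TermQuery(new Term("id", "flags-1-1")), 10).totalHits.value;
        }
    }

    public static void main(String[] args) throws Exception {
        try (ByteBuffersDirectory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("id", "flags-1-1", Store.YES));
            writer.addDocument(doc);
            writer.commit();
            long before = countExact(writer);
            System.out.println(before); // 1: StringField indexed the id verbatim

            // Retrieve the stored-field reconstruction and feed it back in.
            Document retrieved;
            try (IndexReader reader = DirectoryReader.open(writer)) {
                retrieved = new IndexSearcher(reader).storedFields().document(0);
            }
            writer.updateDocument(new Term("id", "flags-1-1"), retrieved);
            writer.commit();

            // The reconstructed field is re-analyzed, so the exact term is gone:
            // any later updateDocument on that term adds instead of replacing.
            long after = countExact(writer);
            System.out.println(after); // 0
        }
    }
}
```

This matches the observed symptom: the first update still deletes the original (the exact term matched), but re-indexes the id tokenized, so every subsequent update finds nothing to delete and just adds a new document.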
> >>> Now I went back to Lucene and I'm hitting the wall.
> >>>
> >>> In James they set the document "id" constructed as "flags-<mailboxId>-<uid>" (e.g. "<id:flags-1-1>").
> >>>
> >>> I run the code that updates the documents with flags and afterwards check the result. The simple code I use opens a new reader from the writer (so it should be OK and have the new state):
> >>>
> >>> ```
> >>> try (IndexReader reader = DirectoryReader.open(luceneMessageSearchIndex.writer)) {
> >>>     System.out.println("maxDoc: " + reader.maxDoc());
> >>>     IndexSearcher searcher = new IndexSearcher(reader);
> >>>     System.out.println("maxDoc (after second flag): " + reader.maxDoc());
> >>>     // starting from "1" to avoid the main mail document
> >>>     for (int i = 1; i < reader.maxDoc(); i++) {
> >>>         System.out.println(reader.storedFields().document(i));
> >>>     }
> >>>     var idQuery = new TermQuery(new Term("id", "flags-1-1"));
> >>>     var search = searcher.search(idQuery, 10000);
> >>>     System.out.println("Term search: " + search.scoreDocs.length + " items: " + Arrays.toString(search.scoreDocs));
> >>> }
> >>> ```
> >>>
> >>> and the output is following:
> >>>
> >>> ```
> >>> try (IndexReader reader = DirectoryReader.open(luceneMessageSearchIndex.writer)) {
> >>>     System.out.println("maxDoc: " + reader.maxDoc());
> >>>     IndexSearcher searcher = new IndexSearcher(reader);
> >>>     System.out.println("maxDoc (after second flag): " +
reader.maxDoc());
> >>>     // starting from "1" to avoid the main mail document
> >>>     for (int i = 1; i < reader.maxDoc(); i++) {
> >>>         System.out.println(reader.storedFields().document(i));
> >>>     }
> >>>     var idQuery = new TermQuery(new Term("id", "flags-1-1"));
> >>>     var search = searcher.search(idQuery, 10000);
> >>>     System.out.println("Term search: " + search.scoreDocs.length + " items: " + Arrays.toString(search.scoreDocs));
> >>> }
> >>> ```
> >>>
> >>> So even though I search for the term "flags-1-1" it yields 0 results (but there are 2 documents with such an ID already).
> >>>
> >>> The gist of the issue is that for some reason, when trying to update the flags document, instead of being updated (deleted/added) it's only being added. My reasoning is that there is an issue with the term matching the field, so the update "fails" (it adds a new document for the same term) when updating the document:
> >>> https://github.com/apache/james-project/pull/2342/files#diff-a7c2a3c5cdb7e4a2914c899409991e27df6b25ad54488f197bc533193e3a03d0R1267
> >>>
> >>> The code looks OK; while debugging, the term yields: "id: flags-1-1", so it looks OK (but it's only a visual string comparison).
> >>> I thought that it could be the same issue with the tokenizer, but everywhere in the code StringField is used for the id of the flags:
> >>>
> >>> ```
> >>> private Document createFlagsDocument(MailboxMessage message) {
> >>>     Document doc = new Document();
> >>>     doc.add(new StringField(ID_FIELD, "flags-" + message.getMailboxId().serialize() + "-" + Long.toString(message.getUid().asLong()), Store.YES));
> >>>     …
> >>> ```
> >>>
> >>> So the update based on
> >>>
> >>> ```
> >>> new Term(ID_FIELD, doc.get(ID_FIELD))
> >>> ```
> >>>
> >>> should hit that exact document - correct?
> >>>
> >>> Any pointers on how to debug this and see how/where the comparison is done, so I could maybe figure out why it doesn't match the documents (which causes the update to fail), will be greatly appreciated! (I've been at it for a couple of days now and while I learned a great deal about Lucene, starting from absolutely zero knowledge, I think I'm in over my head, and stepping into Lucene with a debugger doesn't help much as I don't know exactly what/where to look for :) )
> >>>
> >>> w.
> >>>
> >>> On 2024-08-10T10:21:21.000+02:00, Gautam Worah <worah.gau...@gmail.com> wrote:
> >>>
> >>>> Hey,
> >>>>
> >>>> Use a StringField instead of a TextField for the title and your test will pass.
> >>>>
> >>>> Tokenization, which is enabled for TextFields, is breaking your fancy title into tokens split by spaces, which is causing your docs to not match.
> >>>> https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/document/StringField.html
> >>>>
> >>>> Best,
> >>>>
> >>>> Gautam Worah.
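For the visibility half of the problem (separate from the tokenization issue), the usual NRT pattern is to reopen the reader after each write; a sketch under that assumption (variable names illustrative):

```
// Sketch: refresh an NRT reader after an update so searches see the change.
writer.updateDocument(new Term("id", "flags-1-1"), newDoc);
DirectoryReader newer = DirectoryReader.openIfChanged(reader, writer);
if (newer != null) {
    reader.close();
    reader = newer;                       // searches against the old reader keep the old view
    searcher = new IndexSearcher(reader); // a new IndexSearcher must wrap the new reader
}
```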
> >>>>
> >>>> On Sat, Aug 10, 2024 at 12:05 AM Wojtek <woj...@unir.se> wrote:
> >>>>
> >>>>> Hi Froh,
> >>>>>
> >>>>> thank you for the information.
> >>>>>
> >>>>> I updated the code and re-opened the reader - it seems that the update is reflected and a search for the old document doesn't yield anything, but the search for the new term fails.
> >>>>>
> >>>>> I output all documents (there are 2) and the second one has the new title, but when searching for it no document is found, even though it's the same string that was used to update the title.
> >>>>>
> >>>>> On 2024-08-10T01:21:39.000+02:00, Michael Froh <msf...@gmail.com> wrote:
> >>>>>
> >>>>>> Hi Wojtek,
> >>>>>>
> >>>>>> Thank you for linking to your test code!
> >>>>>>
> >>>>>> When you open an IndexReader, it is locked to the view of the Lucene directory at the time that it's opened. If you make changes, you'll need to open a new IndexReader before those changes are visible. I see that you tried creating a new IndexSearcher, but unfortunately that's not sufficient.
> >>>>>>
> >>>>>> Hope that helps!
> >>>>>>
> >>>>>> Froh
> >>>>>>
> >>>>>> On Fri, Aug 9, 2024 at 3:25 PM Wojtek <woj...@unir.se> wrote:
> >>>>>>
> >>>>>>> Hi all!
> >>>>>>>
> >>>>>>> There is an effort in Apache James to update to a more modern version of Lucene (ref: https://github.com/apache/james-project/pull/2342). I'm
I'm > >>>>>>> > >>>>>>> digging > >>>>>>> > >>>>>>> into the > >>>>>>> > >>>>>>> issue as other have done > >>>>>>> > >>>>>>> but I'm stumped - it seems that > >>>>>>> > >>>>>>> `org.apache.lucene.index.IndexWriter#updateDocument` > >>>>>>> > >>>>>>> doesn't > >>>>>>> > >>>>>>> update > >>>>>>> > >>>>>>> the document. > >>>>>>> > >>>>>>> Documentation > >>>>>>> > >>>>>>> ( > >> > >> > https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/index/IndexWriter.html#updateDocument(org.apache.lucene.index.Term,java.lang.Iterable) > >> > >>>>> ) > >>>>> > >>>>>>> states: > >>>>>>> > >>>>>>> Updates a document by first deleting the document(s) > >>>>>>> > >>>>>>> containing > >>>>>>> > >>>>>>> term > >>>>>>> > >>>>>>> and then adding the new > >>>>>>> > >>>>>>> document. The delete and then add are atomic as seen by > >>>>>>> a > >>>>>>> > >>>>>>> reader > >>>>>>> > >>>>>>> on the > >>>>>>> > >>>>>>> same index (flush may happen > >>>>>>> > >>>>>>> only after the add). > >>>>>>> > >>>>>>> Here is a simple test with it: > >> > >> > https://github.com/woj-tek/lucene-update-test/blob/master/src/test/java/se/unir/AppTest.java > >> > >>>>>>> but it fails. > >>>>>>> > >>>>>>> Any guidance would be appreciated because I (and > >>>>>>> others) > >>>>>>> > >>>>>>> have > >>>>>>> > >>>>>>> been hitting > >>>>>>> > >>>>>>> wall with it :) > >>>>>>> > >>>>>>> -- > >>>>>>> > >>>>>>> Wojtek > >> > > >>>>>>> > >>>>>>> --------------------------------------------------------------------- > >>>>>>> > >>>>>>> To unsubscribe, e-mail: > >>>>>>> > >>>>>>> java-user-unsubscr...@lucene.apache.org > >>>>>>> > >>>>>>> For additional commands, e-mail: > >>>>>>> > >>>>>>> java-user-h...@lucene.apache.org > >