Hi, thank you for the reply, and apologies for being somewhat "all over the place".
Regarding "tokenization" - should it happen if I use StringField? When the document is created (before writing) i see in the debugger it's not tokenized and is of type StringField: ``` doc = {Document@4830} "Document<stored,indexed,omitNorms,indexOptions=DOCS<id:flags-1-1>>" fields = {ArrayList@5920} size = 1 0 = {StringField@5922} "stored,indexed,omitNorms,indexOptions=DOCS<id:flags-1-1>" ``` But once in the update method (document being retrieved) I see it changes to StoredField and is already "tokenized": ``` doc = {Document@6526} "Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1> stored,indexed,omitNorms,indexOptions=DOCS<flags:\FLAG> docValuesType=NUMERIC<uid:1> LongPoint <uid:1> stored<uid:1>>" fields = {ArrayList@6548} size = 6 0 = {StoredField@6550} "stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>" 1 = {StoredField@6551} "stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>" 2 = {StringField@6552} "stored,indexed,omitNorms,indexOptions=DOCS<flags:\FLAG>" 3 = {NumericDocValuesField@6553} "docValuesType=NUMERIC<uid:1>" 4 = {LongPoint@6554} "LongPoint <uid:1>" 5 = {StoredField@6555} "stored<uid:1>" ``` The code that adds the documents - it's a method implemented in James: `org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex#add` ( https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1240 ) that looks fairly straightforward: ``` public Mono<Void> add(MailboxSession session, Mailbox mailbox, MailboxMessage membership) { return Mono.fromRunnable(Throwing.runnable(() -> { Document doc = createMessageDocument(session, membership); Document flagsDoc = createFlagsDocument(membership); writer.addDocument(doc); writer.addDocument(flagsDoc); })); } ``` similarly to actual method that creates the flags 
(https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1290):

```java
private Document createFlagsDocument(MailboxMessage message) {
    Document doc = new Document();
    doc.add(new StringField(ID_FIELD, "flags-" + message.getMailboxId().serialize() + "-" + Long.toString(message.getUid().asLong()), Store.YES));
    doc.add(new StringField(MAILBOX_ID_FIELD, message.getMailboxId().serialize(), Store.YES));
    doc.add(new NumericDocValuesField(UID_FIELD, message.getUid().asLong()));
    doc.add(new LongPoint(UID_FIELD, message.getUid().asLong()));
    doc.add(new StoredField(UID_FIELD, message.getUid().asLong()));
    indexFlags(doc, message.createFlags());
    return doc;
}
```

As you can see, `StringField` is used when the document is created, and to the best of my knowledge (and based on what I was told) it _should not_ be tokenized (?). The update, in which the document cannot be updated because the Term does not seem to find it, is done in `org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex#update()` (https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1259):

```java
private void update(MailboxId mailboxId, MessageUid uid, Flags f) throws IOException {
    try (IndexReader reader = DirectoryReader.open(writer)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        BooleanQuery.Builder queryBuilder = new BooleanQuery.Builder();
        queryBuilder.add(new TermQuery(new Term(MAILBOX_ID_FIELD, mailboxId.serialize())), BooleanClause.Occur.MUST);
        queryBuilder.add(createQuery(MessageRange.one(uid)), BooleanClause.Occur.MUST);
        queryBuilder.add(new PrefixQuery(new Term(FLAGS_FIELD, "")), BooleanClause.Occur.MUST);

        TopDocs docs = searcher.search(queryBuilder.build(), 100000);
        ScoreDoc[] sDocs = docs.scoreDocs;
        for (ScoreDoc sDoc : sDocs) {
            Document doc = searcher.doc(sDoc.doc);
            doc.removeFields(FLAGS_FIELD);
            indexFlags(doc, f);
            // somehow the document getting from the search lost DocValues data for the uid field,
            // we need to re-define the field with proper DocValues.
            long uidValue = doc.getField("uid").numericValue().longValue();
            doc.removeField("uid");
            doc.add(new NumericDocValuesField(UID_FIELD, uidValue));
            doc.add(new LongPoint(UID_FIELD, uidValue));
            doc.add(new StoredField(UID_FIELD, uidValue));
            writer.updateDocument(new Term(ID_FIELD, doc.get(ID_FIELD)), doc);
        }
    }
}
```

I was wondering whether the Lucene/writer configuration could be the culprit (which would result in tokenizing even a StringField), but it looks fairly straightforward:

```java
this.directory = directory;
this.writer = new IndexWriter(this.directory, createConfig(createAnalyzer(lenient), dropIndexOnStart));
```

where createConfig looks like this:

```java
protected IndexWriterConfig createConfig(Analyzer analyzer, boolean dropIndexOnStart) {
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    if (dropIndexOnStart) {
        config.setOpenMode(OpenMode.CREATE);
    } else {
        config.setOpenMode(OpenMode.CREATE_OR_APPEND);
    }
    return config;
}
```

and createAnalyzer like this:

```java
protected Analyzer createAnalyzer(boolean lenient) {
    if (lenient) {
        return new LenientImapSearchAnalyzer();
    } else {
        return new StrictImapSearchAnalyzer();
    }
}
```

On 2024-08-10T21:04:15.000+02:00, Gautam Worah <worah.gau...@gmail.com> wrote:

> Hey,
>
> I don't think I understand the email well but I'll try my best.
>
> In your printed docs, I see that the flag data is still tokenized. See the
> string that you printed: DOCS<id:flags-1-1> stored,indexed,tokenized,omitNorms.
> What does your code for adding the doc look like?
> Are you using StringField for adding the field to the doc?
> I think this is why when you re-add the field with a StringField, the test
> works.
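Digging a bit further, the "tokenized" flag on the retrieved fields seems to be an artifact of how Lucene rebuilds stored fields rather than anything the analyzer did: the stored-fields reader reconstructs every string field as a generic StoredField, whose field type does not reflect the original StringField. A minimal standalone sketch (plain Lucene 9.x with an in-memory ByteBuffersDirectory and StandardAnalyzer as stand-ins for the James setup; the class and method names here are mine, not from the James code):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.store.ByteBuffersDirectory;

public class StoredFieldTypeDemo {

    // Indexes one StringField, then reports the tokenized() flag of the
    // field as reconstructed from the stored-fields view of the index.
    static boolean retrievedFieldReportsTokenized() throws Exception {
        try (ByteBuffersDirectory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // StringField is indexed as a single term, never analyzed.
            doc.add(new StringField("id", "flags-1-1", Store.YES));
            writer.addDocument(doc);
            try (DirectoryReader reader = DirectoryReader.open(writer)) {
                IndexableField retrieved = reader.storedFields().document(0).getField("id");
                // The stored-fields reader rebuilds fields from a generic
                // template, so this flag no longer reflects the original type.
                return retrieved.fieldType().tokenized();
            }
        }
    }
}
```

If that returns true (as I observe), then the debugger output above is misleading: the index itself still holds the untokenized term, and only the reconstructed in-memory Document claims "tokenized".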
> Lucene's StandardTokenizer for 9.11 uses the Unicode Text Segmentation
> algorithm, as specified in Unicode Standard Annex #29
> (http://unicode.org/reports/tr29/). That standard contains "-" as a word
> breaker. I guess that is what is breaking your code.
>
> You are using Lucene's NRT for your search. In general, for debugging such
> cases, I add an IndexWriter.commit() after you are done updating the doc,
> and see if it fixes things. If it does, then it has something to do with
> NRT, and deleting docs etc. If not, then that means that your query/data
> is wrong somewhere. This is how I debugged your first problem.
>
> Best,
> Gautam Worah.
>
> On Sat, Aug 10, 2024 at 4:17 AM Wojtek <woj...@unir.se> wrote:
>
>> Addendum, the output is:
>>
>> ```
>> maxDoc: 3
>> maxDoc (after second flag): 3
>> Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>
>>   stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>
>>   stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\FLAG>
>>   stored<uid:1>>
>> Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>
>>   stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>
>>   stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\ANSWERED>
>>   stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\FLAG>
>>   stored<uid:1>>
>> Term search: 0 items: []
>> ```
>>
>> Though after a bit more digging I think I found the issue in the
>> James-Lucene code, in the update method
>> (https://github.com/apache/james-project/pull/2342/files#diff-a7c2a3c5cdb7e4a2914c899409991e27df6b25ad54488f197bc533193e3a03d0R1267).
>> There is a comment there that the UID values are missing from the
>> retrieved document and have to be re-added (otherwise an exception
>> about the type being NULL is thrown while trying to update):
>> ```
>> // somehow the document getting from the search lost DocValues data for the uid field,
>> // we need to re-define the field with proper DocValues.
>> long uidValue = doc.getField("uid").numericValue().longValue();
>> doc.removeField("uid");
>> doc.add(new NumericDocValuesField(UID_FIELD, uidValue));
>> doc.add(new LongPoint(UID_FIELD, uidValue));
>> doc.add(new StoredField(UID_FIELD, uidValue));
>> ```
>>
>> It seems that the `ID_FIELD` is somehow also missing (even though it's
>> output by the debugging `.toString()`), thus later on the term search
>> with `new Term(ID_FIELD, doc.get(ID_FIELD))` yields 0 results.
>>
>> When I re-add the field manually, like the UID fields:
>>
>> ```
>> final String text = doc.get(ID_FIELD);
>> doc.add(new StringField(ID_FIELD, text, Store.YES));
>> ```
>>
>> then subsequent updating works (because the term subsequently matches
>> the ID_FIELD).
>>
>> So the question seems to boil down to:
>>
>> 1) Why do we have to re-define those fields? They seem to be missing
>> from the document retrieved by the search with:
>>
>> ```
>> TopDocs docs = searcher.search(queryBuilder.build(), 100000);
>> ScoreDoc[] sDocs = docs.scoreDocs;
>> for (ScoreDoc sDoc : sDocs) {
>>     Document doc = searcher.doc(sDoc.doc);
>> ```
>>
>> 2) If they are missing, why are they included in the document's
>> `.toString()` output?
>>
>> On 2024-08-10T12:09:29.000+02:00, Wojtek <woj...@unir.se> wrote:
>>
>>> Thank you Gautam!
>>>
>>> This works. Now I went back to Lucene and I'm hitting the wall.
>>>
>>> In James they set the document "id" constructed as
>>> "flags-<mailboxId>-<uid>" (e.g. "<id:flags-1-1>").
>>>
>>> I run the code that updates the documents with flags and afterwards
>>> check the result.
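On questions 1) and 2) above: as far as I can tell, `searcher.doc()` can only return what was *stored*. Doc values and points live in separate index structures and are not stored fields, so they cannot survive the round trip; `.toString()` then prints the reconstructed fields with template flags. A small sketch that seems to reproduce this in isolation (plain Lucene 9.x, in-memory directory; the names are mine):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

public class StoredRoundTripDemo {

    // Adds the same three-field "uid" combination James uses and counts
    // how many "uid" fields survive the stored-fields round trip.
    static int uidFieldsAfterRetrieval() throws Exception {
        try (ByteBuffersDirectory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new NumericDocValuesField("uid", 1L)); // doc values: not a stored field
            doc.add(new LongPoint("uid", 1L));             // point: not a stored field
            doc.add(new StoredField("uid", 1L));           // the only one that is stored
            writer.addDocument(doc);
            try (DirectoryReader reader = DirectoryReader.open(writer)) {
                // Only the StoredField comes back from the stored-fields view.
                return reader.storedFields().document(0).getFields("uid").length;
            }
        }
    }
}
```

If this returns 1, it explains why the James code has to re-add the NumericDocValuesField and LongPoint before calling updateDocument.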
The simple code I use opens a new reader from the writer (so it
>>> should be OK and should have the new state):
>>>
>>> ```
>>> try (IndexReader reader = DirectoryReader.open(luceneMessageSearchIndex.writer)) {
>>>     System.out.println("maxDoc: " + reader.maxDoc());
>>>     IndexSearcher searcher = new IndexSearcher(reader);
>>>     System.out.println("maxDoc (after second flag): " + reader.maxDoc());
>>>     // starting from "1" to avoid main mail document
>>>     for (int i = 1; i < reader.maxDoc(); i++) {
>>>         System.out.println(reader.storedFields().document(i));
>>>     }
>>>     var idQuery = new TermQuery(new Term("id", "flags-1-1"));
>>>     var search = searcher.search(idQuery, 10000);
>>>     System.out.println("Term search: " + search.scoreDocs.length + " items: " + Arrays.toString(search.scoreDocs));
>>> }
>>> ```
>>>
>>> and the output ends with:
>>>
>>> ```
>>> Term search: 0 items: []
>>> ```
>>>
>>> So even though I search for the term "flags-1-1" it yields 0 results
>>> (but there are 2 documents with such an ID already).
>>>
>>> The gist of the issue is that for some reason, when trying to update
>>> the flags document, instead of updating it (deleting/adding) it is
>>> only being added. My reasoning is that there is an issue with the
>>> term matching the field, so the update "fails" (it adds a new
>>> document for the same term) when updating the document:
>>> https://github.com/apache/james-project/pull/2342/files#diff-a7c2a3c5cdb7e4a2914c899409991e27df6b25ad54488f197bc533193e3a03d0R1267
>>>
>>> The code looks OK; while debugging, the term yields "id: flags-1-1",
>>> so it looks fine (but that is only a visual string comparison). I
>>> thought it could be the same issue with the tokenizer, but everywhere
>>> in the code StringField is used for the id of the flags document:
>>>
>>> ```
>>> private Document createFlagsDocument(MailboxMessage message) {
>>>     Document doc = new Document();
>>>     doc.add(new StringField(ID_FIELD, "flags-" + message.getMailboxId().serialize() + "-" + Long.toString(message.getUid().asLong()), Store.YES));
>>>     …
>>> ```
>>>
>>> So the update based on
>>>
>>> ```
>>> new Term(ID_FIELD, doc.get(ID_FIELD))
>>> ```
>>>
>>> should hit that exact document - correct?
>>>
>>> Any pointers on how to debug this and see how/where the comparison is
>>> done, so I could maybe try to figure out why it doesn't match the
>>> documents (which causes the update to fail), will be greatly appreciated!
>>> (I've been at it for a couple of days now, and while I learned a
>>> great deal about Lucene, starting from absolutely zero knowledge, I
>>> think I'm in over my head; stepping into Lucene with the debugger
>>> doesn't help much as I don't know exactly what/where to look for :) )
>>>
>>> w.
>>>
>>> On 2024-08-10T10:21:21.000+02:00, Gautam Worah <worah.gau...@gmail.com> wrote:
>>>
>>>> Hey,
>>>>
>>>> Use a StringField instead of a TextField for the title and your test
>>>> will pass. Tokenization, which is enabled for TextFields, is
>>>> breaking your fancy title into tokens split by spaces, which is
>>>> causing your docs to not match.
>>>>
>>>> https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/document/StringField.html
>>>>
>>>> Best,
>>>> Gautam Worah.
>>>>
>>>> On Sat, Aug 10, 2024 at 12:05 AM Wojtek <woj...@unir.se> wrote:
>>>>
>>>>> Hi Froh,
>>>>>
>>>>> thank you for the information.
>>>>>
>>>>> I updated the code and re-opened the reader - it seems that the
>>>>> update is reflected, and a search for the old document doesn't
>>>>> yield anything, but the search for the new term fails.
>>>>>
>>>>> I output all documents (there are 2) and the second one has the new
>>>>> title, but when searching for it no document is found, even though
>>>>> it's the same string that was used to update the title.
>>>>>
>>>>> On 2024-08-10T01:21:39.000+02:00, Michael Froh <msf...@gmail.com> wrote:
>>>>>
>>>>>> Hi Wojtek,
>>>>>>
>>>>>> Thank you for linking to your test code!
>>>>>>
>>>>>> When you open an IndexReader, it is locked to the view of the
>>>>>> Lucene directory at the time that it's opened.
>>>>>> If you make changes, you'll need to open a new IndexReader before
>>>>>> those changes are visible. I see that you tried creating a new
>>>>>> IndexSearcher, but unfortunately that's not sufficient.
>>>>>>
>>>>>> Hope that helps!
>>>>>>
>>>>>> Froh
>>>>>>
>>>>>> On Fri, Aug 9, 2024 at 3:25 PM Wojtek <woj...@unir.se> wrote:
>>>>>>
>>>>>>> Hi all!
>>>>>>>
>>>>>>> There is an effort in Apache James to update to a more modern
>>>>>>> version of Lucene (ref:
>>>>>>> https://github.com/apache/james-project/pull/2342). I'm digging
>>>>>>> into the issue as others have done, but I'm stumped - it seems
>>>>>>> that `org.apache.lucene.index.IndexWriter#updateDocument` doesn't
>>>>>>> update the document.
>>>>>>>
>>>>>>> The documentation
>>>>>>> (https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/index/IndexWriter.html#updateDocument(org.apache.lucene.index.Term,java.lang.Iterable))
>>>>>>> states:
>>>>>>>
>>>>>>> Updates a document by first deleting the document(s) containing
>>>>>>> term and then adding the new document. The delete and then add
>>>>>>> are atomic as seen by a reader on the same index (flush may
>>>>>>> happen only after the add).
>>>>>>>
>>>>>>> Here is a simple test with it:
>>>>>>> https://github.com/woj-tek/lucene-update-test/blob/master/src/test/java/se/unir/AppTest.java
>>>>>>> but it fails.
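For what it's worth, the documented behaviour is reproducible in isolation once both conditions discussed in this thread hold: the id field stays a real StringField (so the Term matches the indexed token verbatim) and the reader is opened from the writer *after* the update. A minimal sketch (plain Lucene 9.x, in-memory directory; class and field names are mine, not from the test code linked above):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;

public class UpdateDemo {

    // Adds a doc, updates it by term, and returns {liveDocCount, hitsForNewTitle}
    // as seen by a reader opened from the writer after the update.
    static long[] updateAndCount() throws Exception {
        try (ByteBuffersDirectory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("id", "flags-1-1", Store.YES));
            doc.add(new StringField("title", "old", Store.YES));
            writer.addDocument(doc);

            Document updated = new Document();
            updated.add(new StringField("id", "flags-1-1", Store.YES));
            updated.add(new StringField("title", "new", Store.YES));
            // Atomic delete-then-add: the term matches because the id was
            // indexed verbatim, never run through an analyzer.
            writer.updateDocument(new Term("id", "flags-1-1"), updated);

            // The reader must be opened (or reopened) after the update;
            // a reader opened earlier keeps its point-in-time view.
            try (DirectoryReader reader = DirectoryReader.open(writer)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                long hits = searcher.search(new TermQuery(new Term("title", "new")), 10).totalHits.value;
                return new long[] {reader.numDocs(), hits};
            }
        }
    }
}
```

If numDocs stays at 1 and the new title is found, the delete-then-add worked; when the update "fails" and only adds, numDocs grows instead, which matches the symptom in the James flags update.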
>>>>>>> Any guidance would be appreciated because I (and others) have
>>>>>>> been hitting a wall with it :)
>>>>>>>
>>>>>>> --
>>>>>>> Wojtek
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org