I'm confused as to what could be happening.
Google led me to this StackOverflow link:
https://stackoverflow.com/questions/36402235/lucene-stringfield-gets-tokenized-when-doc-is-retrieved-and-stored-again
which references some longstanding issues about fields changing their
"types" and so on.
The docs mention: `NOTE: only the content of a field is returned if that
field was stored during indexing. Metadata like boost, omitNorm,
IndexOptions, tokenized, etc., are not preserved.`
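That loss of the `tokenized` metadata matters here because of the "-" in the id. As discussed further down the thread, StandardTokenizer follows the UAX#29 word-break rules, which treat "-" as a break. You can sanity-check that break behavior without Lucene at all, since the JDK's `BreakIterator` follows essentially the same UAX#29 word boundaries; a small illustrative sketch (class and method names are mine):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakDemo {
    // Split text at UAX#29 word boundaries and keep only the
    // segments that contain a letter or digit (i.e. the "words").
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "-" is a word break, so the id splits into three tokens.
        System.out.println(words("flags-1-1")); // [flags, 1, 1]
    }
}
```

So if the id field ever gets re-indexed as tokenized, the single term "flags-1-1" no longer exists in the index, and an exact TermQuery for it cannot match.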
Can you check what `doc.get(ID_FIELD)` returns, and whether it looks right?
Maybe try a simple `new TermQuery(new Term("id", "flags-1-1"))` query
during the update and see if it returns the correct answer?
If the value is not right, you may have to use the original stored
value:
https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/search/IndexSearcher.html#storedFields()
for crafting the `updateDocument()` call.
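Concretely, it may be safer to treat the retrieved document as read-only stored values and build a brand-new document with properly typed fields, rather than mutating the `Document` the searcher returns (whose field types are lost). Something along these lines - an untested sketch, with field names and helpers taken from the James code:

```
// Rebuild the flags document from stored values instead of reusing
// the searcher's Document, whose StringField/DocValues types are gone.
Document stored = searcher.storedFields().document(sDoc.doc);
String id = stored.get(ID_FIELD);
String mailboxId = stored.get(MAILBOX_ID_FIELD);
long uidValue = stored.getField(UID_FIELD).numericValue().longValue();

Document fresh = new Document();
fresh.add(new StringField(ID_FIELD, id, Store.YES));
fresh.add(new StringField(MAILBOX_ID_FIELD, mailboxId, Store.YES));
fresh.add(new NumericDocValuesField(UID_FIELD, uidValue));
fresh.add(new LongPoint(UID_FIELD, uidValue));
fresh.add(new StoredField(UID_FIELD, uidValue));
indexFlags(fresh, f);

writer.updateDocument(new Term(ID_FIELD, id), fresh);
```

That way the update term is built from the stored value, and the re-added document keeps the non-tokenized `StringField` semantics.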
Best,
Gautam Worah.
On Sat, Aug 10, 2024 at 3:12 PM Wojtek <[email protected]> wrote:
> Hi,
>
> thank you for the reply and apologies for being somewhat "all over the
> place".
>
> Regarding "tokenization" - should it happen if I use StringField?
>
> When the document is created (before writing) I see in the debugger
> that it's not tokenized and is of type StringField:
>
> ```
> doc = {Document@4830} "Document<stored,indexed,omitNorms,indexOptions=DOCS<id:flags-1-1>>"
>   fields = {ArrayList@5920} size = 1
>     0 = {StringField@5922} "stored,indexed,omitNorms,indexOptions=DOCS<id:flags-1-1>"
> ```
>
> But once in the update method (document being retrieved) I see it
> changes to StoredField and is already "tokenized":
>
> ```
> doc = {Document@6526} "Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>
> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>
> stored,indexed,omitNorms,indexOptions=DOCS<flags:\FLAG>
> docValuesType=NUMERIC<uid:1> LongPoint <uid:1> stored<uid:1>>"
>   fields = {ArrayList@6548} size = 6
>     0 = {StoredField@6550} "stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>"
>     1 = {StoredField@6551} "stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>"
>     2 = {StringField@6552} "stored,indexed,omitNorms,indexOptions=DOCS<flags:\FLAG>"
>     3 = {NumericDocValuesField@6553} "docValuesType=NUMERIC<uid:1>"
>     4 = {LongPoint@6554} "LongPoint <uid:1>"
>     5 = {StoredField@6555} "stored<uid:1>"
> ```
>
> The code that adds the documents is a method implemented in James,
> `org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex#add`
> (https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1240),
> which looks fairly straightforward:
>
> ```
> public Mono<Void> add(MailboxSession session, Mailbox mailbox, MailboxMessage membership) {
>     return Mono.fromRunnable(Throwing.runnable(() -> {
>         Document doc = createMessageDocument(session, membership);
>         Document flagsDoc = createFlagsDocument(membership);
>         writer.addDocument(doc);
>         writer.addDocument(flagsDoc);
>     }));
> }
> ```
>
> similarly to the actual method that creates the flags document
> (https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1290):
> ```
> private Document createFlagsDocument(MailboxMessage message) {
>     Document doc = new Document();
>     doc.add(new StringField(ID_FIELD, "flags-" +
>             message.getMailboxId().serialize() + "-" +
>             Long.toString(message.getUid().asLong()), Store.YES));
>     doc.add(new StringField(MAILBOX_ID_FIELD, message.getMailboxId().serialize(), Store.YES));
>     doc.add(new NumericDocValuesField(UID_FIELD, message.getUid().asLong()));
>     doc.add(new LongPoint(UID_FIELD, message.getUid().asLong()));
>     doc.add(new StoredField(UID_FIELD, message.getUid().asLong()));
>     indexFlags(doc, message.createFlags());
>     return doc;
> }
> ```
>
> As you can see, `StringField` is used when creating the document, and
> to the best of my knowledge (and based on what I was told) it _should_
> not be tokenized (?).
>
> The update (in which the document can't be updated because the Term
> does not seem to find it) is done in
> `org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex#update()`
> (https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1259):
>
> ```
> private void update(MailboxId mailboxId, MessageUid uid, Flags f) throws IOException {
>     try (IndexReader reader = DirectoryReader.open(writer)) {
>         IndexSearcher searcher = new IndexSearcher(reader);
>         BooleanQuery.Builder queryBuilder = new BooleanQuery.Builder();
>         queryBuilder.add(new TermQuery(new Term(MAILBOX_ID_FIELD, mailboxId.serialize())), BooleanClause.Occur.MUST);
>         queryBuilder.add(createQuery(MessageRange.one(uid)), BooleanClause.Occur.MUST);
>         queryBuilder.add(new PrefixQuery(new Term(FLAGS_FIELD, "")), BooleanClause.Occur.MUST);
>         TopDocs docs = searcher.search(queryBuilder.build(), 100000);
>         ScoreDoc[] sDocs = docs.scoreDocs;
>         for (ScoreDoc sDoc : sDocs) {
>             Document doc = searcher.doc(sDoc.doc);
>             doc.removeFields(FLAGS_FIELD);
>             indexFlags(doc, f);
>             // somehow the document getting from the search lost DocValues data
>             // for the uid field, we need to re-define the field with proper DocValues.
>             long uidValue = doc.getField("uid").numericValue().longValue();
>             doc.removeField("uid");
>             doc.add(new NumericDocValuesField(UID_FIELD, uidValue));
>             doc.add(new LongPoint(UID_FIELD, uidValue));
>             doc.add(new StoredField(UID_FIELD, uidValue));
>             writer.updateDocument(new Term(ID_FIELD, doc.get(ID_FIELD)), doc);
>         }
>     }
> }
> ```
>
> I was wondering whether the Lucene/writer configuration is the culprit
> (i.e. something that would tokenize even a StringField), but it looks
> fairly straightforward:
>
> ```
> this.directory = directory;
> this.writer = new IndexWriter(this.directory,
>         createConfig(createAnalyzer(lenient), dropIndexOnStart));
> ```
>
> where createConfig looks like this:
>
> ```
> protected IndexWriterConfig createConfig(Analyzer analyzer, boolean dropIndexOnStart) {
>     IndexWriterConfig config = new IndexWriterConfig(analyzer);
>     if (dropIndexOnStart) {
>         config.setOpenMode(OpenMode.CREATE);
>     } else {
>         config.setOpenMode(OpenMode.CREATE_OR_APPEND);
>     }
>     return config;
> }
> ```
>
> and createAnalyzer like this:
>
> ```
> protected Analyzer createAnalyzer(boolean lenient) {
>     if (lenient) {
>         return new LenientImapSearchAnalyzer();
>     } else {
>         return new StrictImapSearchAnalyzer();
>     }
> }
> ```
>
> On 2024-08-10T21:04:15.000+02:00, Gautam Worah
> <[email protected]> wrote:
>
> > Hey,
> >
> > I don't think I understand the email well but I'll try my best.
> >
> > In your printed docs, I see that the flag data is still tokenized. See
> > the string that you printed: DOCS<id:flags-1-1> is marked
> > stored,indexed,tokenized,omitNorms. What does your code for adding the
> > doc look like? Are you using StringField for adding the field to the
> > doc?
> >
> > I think this is why when you re-add the field with a StringField, the
> > test works.
> >
> > Lucene's StandardTokenizer for 9.11 uses the Unicode Text Segmentation
> > algorithm, as specified in Unicode Standard Annex #29
> > (http://unicode.org/reports/tr29/). That standard contains "-" as a
> > word breaker. I guess that is what is breaking your code.
> >
> > You are using Lucene's NRT for your search. In general, for debugging
> > such cases, I add an IndexWriter.commit() after you are done updating
> > the doc, and see if it fixes things.
> >
> > If it does, then it has something to do with NRT, and deleting docs
> > etc. If not, then that means that your query/data is wrong somewhere.
> > This is how I debugged your first problem.
> >
> > Best,
> >
> > Gautam Worah.
> >
> > On Sat, Aug 10, 2024 at 4:17 AM Wojtek <[email protected]> wrote:
> >
> >> Addendum, output is:
> >>
> >> ```
> >> maxDoc: 3
> >> maxDoc (after second flag): 3
> >> Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>
> >> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>
> >> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\FLAG>
> >> stored<uid:1>>
> >> Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>
> >> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>
> >> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\ANSWERED>
> >> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\FLAG>
> >> stored<uid:1>>
> >> Term search: 0 items: []
> >> ```
> >>
> >> Though after a bit more digging I think I found the issue in the
> >> James-Lucene code, in the update method
> >> (https://github.com/apache/james-project/pull/2342/files#diff-a7c2a3c5cdb7e4a2914c899409991e27df6b25ad54488f197bc533193e3a03d0R1267).
> >>
> >> There is a comment there that UID values are missing from the
> >> retrieved document and have to be re-added (otherwise an exception
> >> about the type being NULL is thrown while trying to update):
> >>
> >> ```
> >> // somehow the document getting from the search lost DocValues data
> >> // for the uid field, we need to re-define the field with proper DocValues.
> >> long uidValue = doc.getField("uid").numericValue().longValue();
> >> doc.removeField("uid");
> >> doc.add(new NumericDocValuesField(UID_FIELD, uidValue));
> >> doc.add(new LongPoint(UID_FIELD, uidValue));
> >> doc.add(new StoredField(UID_FIELD, uidValue));
> >> ```
> >>
> >> It seems that the `ID_FIELD` is somehow also missing (even though
> >> it's output in the debugging `.toString()`), thus later on the term
> >> search with `new Term(ID_FIELD, doc.get(ID_FIELD))` yields 0 results.
> >>
> >> When I re-add the field manually, like the UID fields:
> >>
> >> ```
> >> final String text = doc.get(ID_FIELD);
> >> doc.add(new StringField(ID_FIELD, text, Store.YES));
> >> ```
> >>
> >> then subsequent updating works (because the term then matches the
> >> ID_FIELD).
> >>
> >> So the question seems to boil down to:
> >>
> >> 1) why do we have to re-define those fields, as they seem to be
> >> missing from the document retrieved with:
> >>
> >> ```
> >> TopDocs docs = searcher.search(queryBuilder.build(), 100000);
> >> ScoreDoc[] sDocs = docs.scoreDocs;
> >> for (ScoreDoc sDoc : sDocs) {
> >>     Document doc = searcher.doc(sDoc.doc);
> >> ```
> >>
> >> 2) if they are missing, why are they included in the document
> >> (`.toString()`) output?
> >>
> >> On 2024-08-10T12:09:29.000+02:00, Wojtek <[email protected]> wrote:
> >>
> >>> Thank you Gautam!
> >>>
> >>> This works. Now I went back to Lucene and I'm hitting a wall.
> >>>
> >>> In James they set the document "id" constructed as
> >>> "flags-<mailboxId>-<uid>" (e.g. "<id:flags-1-1>").
> >>>
> >>> I run the code that updates the documents with flags and afterwards
> >>> check the result. The simple code I use opens a new reader from the
> >>> writer (so it should be OK and should have the new state):
> >>>
> >>> ```
> >>> try (IndexReader reader = DirectoryReader.open(luceneMessageSearchIndex.writer)) {
> >>>     System.out.println("maxDoc: " + reader.maxDoc());
> >>>     IndexSearcher searcher = new IndexSearcher(reader);
> >>>     System.out.println("maxDoc (after second flag): " + reader.maxDoc());
> >>>     // starting from "1" to avoid main mail document
> >>>     for (int i = 1; i < reader.maxDoc(); i++) {
> >>>         System.out.println(reader.storedFields().document(i));
> >>>     }
> >>>     var idQuery = new TermQuery(new Term("id", "flags-1-1"));
> >>>     var search = searcher.search(idQuery, 10000);
> >>>     System.out.println("Term search: " + search.scoreDocs.length +
> >>>             " items: " + Arrays.toString(search.scoreDocs));
> >>> }
> >>> ```
> >>>
> >>> and the output is the following:
> >>>
> >>> ```
> >>> maxDoc: 3
> >>> maxDoc (after second flag): 3
> >>> Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>
> >>> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>
> >>> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\FLAG>
> >>> stored<uid:1>>
> >>> Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>
> >>> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>
> >>> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\ANSWERED>
> >>> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\FLAG>
> >>> stored<uid:1>>
> >>> Term search: 0 items: []
> >>> ```
> >>>
> >>> So even though I search for the term "flags-1-1" it yields 0 results
> >>> (but there are 2 documents with such an ID already).
> >>>
> >>> The gist of the issue is that for some reason, when trying to update
> >>> the flags document, instead of being updated (deleted/re-added) it
> >>> is only being added. My reasoning is that there is an issue with the
> >>> term matching the field, so the update "fails" (it adds a new
> >>> document for the same term) when updating the document:
> >>
> >>
> https://github.com/apache/james-project/pull/2342/files#diff-a7c2a3c5cdb7e4a2914c899409991e27df6b25ad54488f197bc533193e3a03d0R1267
> >>
> >>> The code looks OK; while debugging, the term yields "id:flags-1-1",
> >>> so it looks right (but that's only a visual string comparison). I
> >>> thought it could be the same issue with the tokenizer, but
> >>> everywhere in the code StringField is used for the id of the flags:
> >>>
> >>> ```
> >>> private Document createFlagsDocument(MailboxMessage message) {
> >>>     Document doc = new Document();
> >>>     doc.add(new StringField(ID_FIELD, "flags-" +
> >>>             message.getMailboxId().serialize() + "-" +
> >>>             Long.toString(message.getUid().asLong()), Store.YES));
> >>>     …
> >>> ```
> >>>
> >>> So the update based on
> >>>
> >>> ```
> >>> new Term(ID_FIELD, doc.get(ID_FIELD))
> >>> ```
> >>>
> >>> should hit that exact document - correct?
> >>>
> >>> Any pointers on how to debug this and see how/where the comparison
> >>> is done, so I could maybe figure out why it doesn't match the
> >>> documents (which causes the update to fail), would be greatly
> >>> appreciated!
> >>>
> >>> (I've been at it for a couple of days now, and while I learned a
> >>> great deal about Lucene, starting from absolutely zero knowledge, I
> >>> think I'm in over my head, and stepping into Lucene with a debugger
> >>> doesn't help much as I don't know exactly what/where to look for :) )
> >>>
> >>> w.
> >>>
> >>> On 2024-08-10T10:21:21.000+02:00, Gautam Worah
> >>>
> >>> <[email protected]> wrote:
> >>>
> >>>> Hey,
> >>>>
> >>>> Use a StringField instead of a TextField for the title and your
> >>>> test will pass.
> >>>>
> >>>> Tokenization, which is enabled for TextFields, is breaking your
> >>>> fancy title into tokens split by spaces, which is causing your docs
> >>>> to not match.
> >>>>
> >>>> https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/document/StringField.html
> >>>>
> >>>> Best,
> >>>>
> >>>> Gautam Worah.
> >>>>
> >>>> On Sat, Aug 10, 2024 at 12:05 AM Wojtek <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Hi Froh,
> >>>>>
> >>>>> thank you for the information.
> >>>>>
> >>>>> I updated the code and re-opened the reader - it seems that the
> >>>>> update is reflected and the search for the old document doesn't
> >>>>> yield anything, but the search for the new term fails.
> >>>>>
> >>>>> I output all documents (there are 2) and the second one has the
> >>>>> new title, but when searching for it no document is found, even
> >>>>> though it's the same string that was used to update the title.
> >>>>>
> >>>>> On 2024-08-10T01:21:39.000+02:00, Michael Froh
> >>>>>
> >>>>> <[email protected]>
> >>>>>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Wojtek,
> >>>>>>
> >>>>>> Thank you for linking to your test code!
> >>>>>>
> >>>>>> When you open an IndexReader, it is locked to the view of the
> >>>>>> Lucene directory at the time that it's opened. If you make
> >>>>>> changes, you'll need to open a new IndexReader before those
> >>>>>> changes are visible. I see that you tried creating a new
> >>>>>> IndexSearcher, but unfortunately that's not sufficient.
> >>>>>>
> >>>>>> Hope that helps!
> >>>>>>
> >>>>>> Froh
> >>>>>>
> >>>>>> On Fri, Aug 9, 2024 at 3:25 PM Wojtek <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi all!
> >>>>>>>
> >>>>>>> There is an effort in Apache James to update to a more modern
> >>>>>>> version of Lucene (ref:
> >>>>>>> https://github.com/apache/james-project/pull/2342). I'm digging
> >>>>>>> into the issue as others have done, but I'm stumped - it seems
> >>>>>>> that `org.apache.lucene.index.IndexWriter#updateDocument`
> >>>>>>> doesn't update the document.
> >>>>>>>
> >>>>>>> Documentation
> >>>>>>> (https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/index/IndexWriter.html#updateDocument(org.apache.lucene.index.Term,java.lang.Iterable))
> >>>>>>> states:
> >>>>>>>
> >>>>>>> Updates a document by first deleting the document(s) containing
> >>>>>>> term and then adding the new document. The delete and then add
> >>>>>>> are atomic as seen by a reader on the same index (flush may
> >>>>>>> happen only after the add).
> >>>>>>>
> >>>>>>> Here is a simple test with it:
> >>>>>>> https://github.com/woj-tek/lucene-update-test/blob/master/src/test/java/se/unir/AppTest.java
> >>>>>>> but it fails.
> >>>>>>>
> >>>>>>> Any guidance would be appreciated because I (and others) have
> >>>>>>> been hitting a wall with it :)
> >>>>>>>
> >>>>>>> --
> >>>>>>>
> >>>>>>> Wojtek
> >>>>>>>
> >>>>>>> ---------------------------------------------------------------------
> >>>>>>> To unsubscribe, e-mail: [email protected]
> >>>>>>> For additional commands, e-mail: [email protected]