Hi,
thank you for the reply, and apologies for being somewhat "all over the
place".
Regarding "tokenization" - should it happen if I use StringField?
When the document is created (before writing) I see in the debugger
that it's not tokenized and is of type StringField:
```
doc = {Document@4830}
"Document<stored,indexed,omitNorms,indexOptions=DOCS<id:flags-1-1>>"
fields = {ArrayList@5920} size = 1
0 = {StringField@5922}
"stored,indexed,omitNorms,indexOptions=DOCS<id:flags-1-1>"
```
But once in the update method (where the document is retrieved) I see
it changes to StoredField and is already marked "tokenized":
```
doc = {Document@6526}
"Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>
stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>
stored,indexed,omitNorms,indexOptions=DOCS<flags:\FLAG>
docValuesType=NUMERIC<uid:1> LongPoint <uid:1> stored<uid:1>>"
fields = {ArrayList@6548} size = 6
0 = {StoredField@6550}
"stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>"
1 = {StoredField@6551}
"stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>"
2 = {StringField@6552}
"stored,indexed,omitNorms,indexOptions=DOCS<flags:\FLAG>"
3 = {NumericDocValuesField@6553} "docValuesType=NUMERIC<uid:1>"
4 = {LongPoint@6554} "LongPoint <uid:1>"
5 = {StoredField@6555} "stored<uid:1>"
```
The code that adds the documents is a method implemented in James,
`org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex#add`
(https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1240),
and it looks fairly straightforward:
```
public Mono<Void> add(MailboxSession session, Mailbox mailbox, MailboxMessage membership) {
    return Mono.fromRunnable(Throwing.runnable(() -> {
        Document doc = createMessageDocument(session, membership);
        Document flagsDoc = createFlagsDocument(membership);
        writer.addDocument(doc);
        writer.addDocument(flagsDoc);
    }));
}
```
similar to the actual method that creates the flags document
(https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1290):
```
private Document createFlagsDocument(MailboxMessage message) {
    Document doc = new Document();
    doc.add(new StringField(ID_FIELD, "flags-" + message.getMailboxId().serialize()
        + "-" + Long.toString(message.getUid().asLong()), Store.YES));
    doc.add(new StringField(MAILBOX_ID_FIELD, message.getMailboxId().serialize(), Store.YES));
    doc.add(new NumericDocValuesField(UID_FIELD, message.getUid().asLong()));
    doc.add(new LongPoint(UID_FIELD, message.getUid().asLong()));
    doc.add(new StoredField(UID_FIELD, message.getUid().asLong()));
    indexFlags(doc, message.createFlags());
    return doc;
}
```
As you can see, `StringField` is used when creating the document, and
to the best of my knowledge (and based on what I was told) it _should
not_ be tokenized (?).
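To make the distinction concrete, here is a stdlib-only sketch (not actual Lucene code) of why this matters: a StringField is indexed as one single term, whereas an analyzed field is split on word breaks, and Gautam noted earlier in this thread that the StandardTokenizer's UAX#29 rules treat "-" as a break. The split("-") below is only a rough stand-in for that behavior:

```java
import java.util.*;

// Not Lucene code - a stdlib sketch of single-term vs tokenized indexing.
public class TokenizationSketch {
    // StringField-like: the entire value becomes exactly one indexed term.
    static List<String> asSingleTerm(String value) {
        return List.of(value);
    }

    // Analyzed-field-like: rough stand-in for word-break tokenization on "-".
    static List<String> analyzed(String value) {
        return Arrays.asList(value.split("-"));
    }

    public static void main(String[] args) {
        String id = "flags-1-1";
        System.out.println(asSingleTerm(id)); // [flags-1-1]
        System.out.println(analyzed(id));     // [flags, 1, 1]
        // An exact term lookup for "flags-1-1" only succeeds where the whole
        // value was indexed as a single term:
        System.out.println(asSingleTerm(id).contains(id)); // true
        System.out.println(analyzed(id).contains(id));     // false
    }
}
```

If the id field really were analyzed at index time, a `TermQuery(new Term(ID_FIELD, "flags-1-1"))` would be looking for a term that does not exist in the index, which would match the 0-hit searches below.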
The update (in which the document can't be updated because the Term
doesn't seem to find it) is done in
`org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex#update()`
(https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1259):
```
private void update(MailboxId mailboxId, MessageUid uid, Flags f) throws IOException {
    try (IndexReader reader = DirectoryReader.open(writer)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        BooleanQuery.Builder queryBuilder = new BooleanQuery.Builder();
        queryBuilder.add(new TermQuery(new Term(MAILBOX_ID_FIELD, mailboxId.serialize())),
            BooleanClause.Occur.MUST);
        queryBuilder.add(createQuery(MessageRange.one(uid)), BooleanClause.Occur.MUST);
        queryBuilder.add(new PrefixQuery(new Term(FLAGS_FIELD, "")), BooleanClause.Occur.MUST);
        TopDocs docs = searcher.search(queryBuilder.build(), 100000);
        ScoreDoc[] sDocs = docs.scoreDocs;
        for (ScoreDoc sDoc : sDocs) {
            Document doc = searcher.doc(sDoc.doc);
            doc.removeFields(FLAGS_FIELD);
            indexFlags(doc, f);
            // somehow the document we get from the search lost the DocValues data
            // for the uid field; we need to re-define the field with proper DocValues
            long uidValue = doc.getField("uid").numericValue().longValue();
            doc.removeField("uid");
            doc.add(new NumericDocValuesField(UID_FIELD, uidValue));
            doc.add(new LongPoint(UID_FIELD, uidValue));
            doc.add(new StoredField(UID_FIELD, uidValue));
            writer.updateDocument(new Term(ID_FIELD, doc.get(ID_FIELD)), doc);
        }
    }
}
```
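For what it's worth, the symptom of "update adds instead of replaces" is consistent with how `updateDocument` is documented to work: delete by term, then add. Here is a stdlib-only simulation (not Lucene code) of those semantics, where each "document" is just the set of terms indexed for its id field; if the indexed terms don't contain the exact term passed to the update, the delete is a no-op and documents accumulate:

```java
import java.util.*;

// Not Lucene code - a stdlib sketch of IndexWriter#updateDocument semantics:
// delete every document whose indexed terms contain the given term, then add
// the new document.
public class UpdateSketch {
    // each "document" is modeled as the set of terms indexed for its id field
    final List<Set<String>> docs = new ArrayList<>();

    void updateDocument(String term, Set<String> newDoc) {
        docs.removeIf(d -> d.contains(term)); // delete-by-term
        docs.add(newDoc);                     // then add
    }

    public static void main(String[] args) {
        String id = "flags-1-1";

        // id indexed as a single term: second update replaces the first doc
        UpdateSketch untokenized = new UpdateSketch();
        untokenized.updateDocument(id, Set.of(id));
        untokenized.updateDocument(id, Set.of(id));
        System.out.println(untokenized.docs.size()); // 1

        // id broken into tokens at index time: the exact term "flags-1-1"
        // matches nothing, so the delete misses and the doc is duplicated
        UpdateSketch tokenized = new UpdateSketch();
        tokenized.updateDocument(id, Set.of("flags", "1"));
        tokenized.updateDocument(id, Set.of("flags", "1"));
        System.out.println(tokenized.docs.size());   // 2
    }
}
```

The duplicated-flags-document behavior described further down in this thread looks exactly like the second case.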
I was wondering whether the Lucene/writer configuration might be the
culprit (which would result in tokenizing even a StringField), but it
looks fairly straightforward:
```
this.directory = directory;
this.writer = new IndexWriter(this.directory,
    createConfig(createAnalyzer(lenient), dropIndexOnStart));
```
where createConfig looks like this:
```
protected IndexWriterConfig createConfig(Analyzer analyzer, boolean dropIndexOnStart) {
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    if (dropIndexOnStart) {
        config.setOpenMode(OpenMode.CREATE);
    } else {
        config.setOpenMode(OpenMode.CREATE_OR_APPEND);
    }
    return config;
}
```
and createAnalyzer like this:
```
protected Analyzer createAnalyzer(boolean lenient) {
if (lenient) {
return new LenientImapSearchAnalyzer();
} else {
return new StrictImapSearchAnalyzer();
}
}
```
On 2024-08-10T21:04:15.000+02:00, Gautam Worah
<[email protected]> wrote:
> Hey,
>
> I don't think I understand the email well but I'll try my best.
>
> In your printed docs, I see that the flag data is still tokenized. See the
> string that you printed:
> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>.
> What does your code for adding the doc look like?
> Are you using StringField for adding the field to the doc?
> I think this is why when you re-add the field with a StringField, the test
> works.
>
> Lucene's StandardTokenizer for 9.11 uses the Unicode Text Segmentation
> algorithm, as specified in Unicode Standard Annex #29
> <http://unicode.org/reports/tr29/>. That standard contains a "-" as a word
> breaker. I guess that is what is breaking your code.
>
> You are using Lucene's NRT for your search. In general, for debugging such
> cases, I add an IndexWriter.commit() after you are done updating the doc,
> and see if it fixes things. If it does, then it has something to do with
> NRT, and deleting docs etc. If not, then that means that your query/data is
> wrong somewhere. This is how I debugged your first problem.
>
> Best,
> Gautam Worah.
>
> On Sat, Aug 10, 2024 at 4:17 AM Wojtek <[email protected]> wrote:
>
>> Addendum, output is:
>>
>> ```
>> maxDoc: 3
>> maxDoc (after second flag): 3
>> Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>
>> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>
>> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\FLAG>
>> stored<uid:1>>
>> Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>
>> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>
>> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\ANSWERED>
>> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<flags:\FLAG>
>> stored<uid:1>>
>> Term search: 0 items: []
>> ```
>>
>> Though after a bit more digging I think I found the issue in the
>> James-Lucene code, in the update method
>> (https://github.com/apache/james-project/pull/2342/files#diff-a7c2a3c5cdb7e4a2914c899409991e27df6b25ad54488f197bc533193e3a03d0R1267).
>>
>> There is a comment there that UID values are missing from the
>> retrieved document and have to be re-added (otherwise an exception
>> about the type being NULL is thrown while trying to update):
>>
>> ```
>> // somehow the document getting from the search lost DocValues data
>> // for the uid field, we need to re-define the field with proper DocValues.
>> long uidValue = doc.getField("uid").numericValue().longValue();
>> doc.removeField("uid");
>> doc.add(new NumericDocValuesField(UID_FIELD, uidValue));
>> doc.add(new LongPoint(UID_FIELD, uidValue));
>> doc.add(new StoredField(UID_FIELD, uidValue));
>> ```
>>
>> It seems that the `ID_FIELD` is somehow also missing (even though it's
>> output in the debugging `.toString()`), thus later on the term search
>> with `new Term(ID_FIELD, doc.get(ID_FIELD))` yields 0 results.
>>
>> When I re-add the field manually, like the UID fields:
>>
>> ```
>> final String text = doc.get(ID_FIELD);
>> doc.add(new StringField(ID_FIELD, text, Store.YES));
>> ```
>>
>> then subsequent updating works (because the term subsequently matches
>> the ID_FIELD).
>>
>> So the question seems to boil down to:
>>
>> 1) why do we have to re-define those fields, as they seem to be
>> missing from the retrieved searched document with:
>>
>> ```
>> TopDocs docs = searcher.search(queryBuilder.build(), 100000);
>> ScoreDoc[] sDocs = docs.scoreDocs;
>> for (ScoreDoc sDoc : sDocs) {
>>     Document doc = searcher.doc(sDoc.doc);
>> ```
>>
>> 2) if they are missing, why are they included in the document
>> (`.toString()`) output?
>>
>> On 2024-08-10T12:09:29.000+02:00, Wojtek <[email protected]> wrote:
>>
>>> Thank you Gautam!
>>>
>>> This works. Now I went back to Lucene and I'm hitting the wall.
>>>
>>> In James they set the document with "id" being constructed as
>>> "flag-<uid>-<uid>" (e.g. "<id:flags-1-1>").
>>>
>>> I run the code that updates the documents with flags and afterwards
>>> check the result. The simple code I use opens a new reader from the
>>> writer (so it should be OK and should have the new state):
>>>
>>> ```
>>> try (IndexReader reader = DirectoryReader.open(luceneMessageSearchIndex.writer)) {
>>>     System.out.println("maxDoc: " + reader.maxDoc());
>>>     IndexSearcher searcher = new IndexSearcher(reader);
>>>     System.out.println("maxDoc (after second flag): " + reader.maxDoc());
>>>     // starting from "1" to avoid main mail document
>>>     for (int i = 1; i < reader.maxDoc(); i++) {
>>>         System.out.println(reader.storedFields().document(i));
>>>     }
>>>     var idQuery = new TermQuery(new Term("id", "flags-1-1"));
>>>     var search = searcher.search(idQuery, 10000);
>>>     System.out.println("Term search: " + search.scoreDocs.length +
>>>         " items: " + Arrays.toString(search.scoreDocs));
>>> }
>>> ```
>>>
>>> and the output is following:
>>>
>>> ```
>>> try (IndexReader reader = DirectoryReader.open(luceneMessageSearchIndex.writer)) {
>>>     System.out.println("maxDoc: " + reader.maxDoc());
>>>     IndexSearcher searcher = new IndexSearcher(reader);
>>>     System.out.println("maxDoc (after second flag): " + reader.maxDoc());
>>>     // starting from "1" to avoid main mail document
>>>     for (int i = 1; i < reader.maxDoc(); i++) {
>>>         System.out.println(reader.storedFields().document(i));
>>>     }
>>>     var idQuery = new TermQuery(new Term("id", "flags-1-1"));
>>>     var search = searcher.search(idQuery, 10000);
>>>     System.out.println("Term search: " + search.scoreDocs.length +
>>>         " items: " + Arrays.toString(search.scoreDocs));
>>> }
>>> ```
>>>
>>> So even though I search for the term "flags-1-1" it yields 0 results
>>> (but there are 2 documents with such an ID already).
>>>
>>> The gist of the issue is that for some reason, when trying to update
>>> the flags document, instead of updating it (deleting/adding) it's
>>> only being added. My reasoning is that there is an issue with the
>>> term matching the field, so the update "fails" (it adds a new
>>> document for the same term) when updating the document:
>>> https://github.com/apache/james-project/pull/2342/files#diff-a7c2a3c5cdb7e4a2914c899409991e27df6b25ad54488f197bc533193e3a03d0R1267
>>> The code looks OK; while debugging, the term yields "id:flags-1-1",
>>> so it looks OK (but it's only a visual string comparison). I thought
>>> that it could be the same issue with the tokenizer, but everywhere in
>>> the code StringField is used for the id of the flags:
>>>
>>> ```
>>> private Document createFlagsDocument(MailboxMessage message) {
>>>     Document doc = new Document();
>>>     doc.add(new StringField(ID_FIELD, "flags-" +
>>>         message.getMailboxId().serialize() + "-" +
>>>         Long.toString(message.getUid().asLong()), Store.YES));
>>>     …
>>> ```
>>>
>>> So the update based on
>>>
>>> ```
>>> new Term(ID_FIELD, doc.get(ID_FIELD))
>>> ```
>>>
>>> should hit that exact document - correct?
>>>
>>> Any pointers on how to debug this and see how/where the comparison is
>>> done, so I could maybe try to figure out why it doesn't match the
>>> documents (which causes the update to fail), would be greatly
>>> appreciated!
>>>
>>> (I've been at it for a couple of days now, and while I learned a
>>> great deal about Lucene, starting from absolutely zero knowledge, I
>>> think I'm in over my head, and stepping into Lucene with a debugger
>>> doesn't help much as I don't know exactly what/where to look for :) )
>>>
>>> w.
>>>
>>> On 2024-08-10T10:21:21.000+02:00, Gautam Worah
>>> <[email protected]> wrote:
>>>
>>>> Hey,
>>>>
>>>> Use a StringField instead of a TextField for the title and your test
>>>> will pass.
>>>>
>>>> Tokenization, which is enabled for TextFields, is breaking your
>>>> fancy title into tokens split by spaces, which is causing your docs
>>>> to not match.
>>>>
>>>> https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/document/StringField.html
>>>>
>>>> Best,
>>>> Gautam Worah.
>>>>
>>>> On Sat, Aug 10, 2024 at 12:05 AM Wojtek <[email protected]> wrote:
>>>>
>>>>> Hi Froh,
>>>>>
>>>>> thank you for the information.
>>>>>
>>>>> I updated the code and re-opened the reader - it seems that the
>>>>> update is reflected, and the search for the old document doesn't
>>>>> yield anything, but the search for the new term fails.
>>>>>
>>>>> I output all documents (there are 2) and the second one has the new
>>>>> title, but when searching for it no document is found, even though
>>>>> it's the same string that was used to update the title.
>>>>>
>>>>> On 2024-08-10T01:21:39.000+02:00, Michael Froh
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hi Wojtek,
>>>>>>
>>>>>> Thank you for linking to your test code!
>>>>>>
>>>>>> When you open an IndexReader, it is locked to the view of the
>>>>>> Lucene directory at the time that it's opened. If you make
>>>>>> changes, you'll need to open a new IndexReader before those
>>>>>> changes are visible. I see that you tried creating a new
>>>>>> IndexSearcher, but unfortunately that's not sufficient.
>>>>>>
>>>>>> Hope that helps!
>>>>>>
>>>>>> Froh
>>>>>>
>>>>>> On Fri, Aug 9, 2024 at 3:25 PM Wojtek <[email protected]> wrote:
>>>>>>
>>>>>>> Hi all!
>>>>>>>
>>>>>>> There is an effort in Apache James to update to a more modern
>>>>>>> version of Lucene (ref:
>>>>>>> https://github.com/apache/james-project/pull/2342). I'm digging
>>>>>>> into the issue as others have done, but I'm stumped - it seems
>>>>>>> that `org.apache.lucene.index.IndexWriter#updateDocument` doesn't
>>>>>>> update the document.
>>>>>>>
>>>>>>> Documentation
>>>>>>> (https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/index/IndexWriter.html#updateDocument(org.apache.lucene.index.Term,java.lang.Iterable))
>>>>>>> states:
>>>>>>>
>>>>>>> Updates a document by first deleting the document(s) containing
>>>>>>> term and then adding the new document. The delete and then add
>>>>>>> are atomic as seen by a reader on the same index (flush may
>>>>>>> happen only after the add).
>>>>>>>
>>>>>>> Here is a simple test with it:
>>>>>>> https://github.com/woj-tek/lucene-update-test/blob/master/src/test/java/se/unir/AppTest.java
>>>>>>> but it fails.
>>>>>>>
>>>>>>> Any guidance would be appreciated because I (and others) have
>>>>>>> been hitting a wall with it :)
>>>>>>>
>>>>>>> --
>>>>>>> Wojtek
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>> For additional commands, e-mail: [email protected]