See below.

> On 05/23/2023 2:14 AM MDT s...@fea.st wrote:
> 
> I had been using the lucene FTS plugin since a decade now and it has done me 
> well. Thought of upgrading to the new & current stuff and came across the 
> flatcurve plugin which seems very promising (xapian on the other hand was 
> creating indexes larger than my mailboxes themselves). I am using following 
> configuration in dovecot.conf:
> 
> fts = flatcurve
> fts_filters_en = lowercase english-possessive stopwords
> fts_languages = en
> fts_tokenizers = generic email-address

^^^ FTS input is being tokenized, so the phrase "/home/johndoe/render.php" will 
be indexed not as a full string but instead separately as "home", "johndoe", 
"render", and "php".

See: 
https://doc.dovecot.org/settings/plugin/fts-plugin/#plugin_setting-fts-fts_tokenizers

This has nothing to do with flatcurve (or any FTS driver) - Dovecot will never 
send the full "/home/johndoe/render.php" to the driver to be indexed.


> fts_autoindex = no
> fts_enforced = yes
> 
> A search command like this:
> 
> doveadm -D search -u j...@doe.com mailbox INBOX SUBJECT 
> "/home/johndoe/render.php"
> 
> should show the messages with subject: "CRON: /home/johndoe/render.php OK" 
> but produces a lot of extra undesired results and I think the second line in 
> this debug output indicates the reason:
> 
> May 23 07:44:13 doveadm(j...@doe.com): Debug: fts-flatcurve(INBOX): Query 
> (hdr_subject:/home/johndoe/render.php*) matches=0 uids=

This is correct, since "/home/johndoe/render.php" was not indexed so there 
should be zero results.


> May 23 07:44:13 doveadm(j...@doe.com): Debug: fts-flatcurve(INBOX): Query 
> (hdr_subject:php* AND hdr_subject:render* AND hdr_subject:johndoe* AND 
> hdr_subject:home*) matches=272 

And this is also correct, as the search phrase is attempted by searching both 
its full string and also all of its tokenized components.  (Both the original 
text and all search terms are processed through the tokenizer before passing to 
a FTS driver.)


> I tried rebuilding the indexes with "fts_flatcurve_substring_search = yes" 
> too but that didn't change anything. It works as expected with lucene plugin 
> because in that case header search is performed via dovecot indexes instead 
> of FTS. May be I am not doing something right in configuring this new FTS? 

I'm not a lucene expert... but with the old lucene plugin, you were almost 
certainly using it without Dovecot tokenization support, since the plugin 
predates it (I think) - using Dovecot tokenization would have required 
'use_libfts' to be present in the fts_lucene setting (which I doubt was ever 
documented).  I believe Dovecot was just doing simple white-space tokenization 
instead, so lucene code/library was likely receiving the full string and doing 
internal tokenization.

michael
_______________________________________________
dovecot mailing list -- dovecot@dovecot.org
To unsubscribe send an email to dovecot-le...@dovecot.org

Reply via email to