On Sat, May 25, 2019, at 22:19, Dilyan Palauzov wrote:
> Hello Bron,
> For me it is still not absolutely clear how things work with the
> Xapian search backend.
> Does search_batchsize have any impact during compacting? Does this
> setting say how many new messages have to arrive before they are
> indexed together in Xapian?
No, search_batchsize just means that when you're indexing a brand
new mailbox with millions of emails, it will put that many emails at
a time into a single transactional write to the search index. During
compacting, this value is not used.
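For reference, that knob lives in imapd.conf next to the rest of the
search setup; something like this (values illustrative, and I believe
8192 is the shipped default):

  search_engine: xapian
  # messages per indexing transaction when bulk-indexing a mailbox
  search_batchsize: 8192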
> What are the use cases for calling "squatter -F" (in compact mode,
> filter the resulting database to only include messages which are not
> expunged in mailboxes with existing name/uidvalidity) and "squatter
> -X" (reindex all messages before compacting; this mode reads all the
> lists of messages indexed by the listed tiers, and re-indexes them
> into a temporary database before compacting that into place)?
-F is useful to run occasionally so that your search indexes don't
grow forever. When emails are expunged, their matching terms aren't
removed from the xapian indexes, so the database will be bigger than
necessary, and when you search for a term that appears in deleted
emails, it will cause extra IO and conversations DB lookups on the
document id.
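As a concrete (illustrative) example using our tier names, an
occasional filtering run looks roughly like:

  squatter -v -a -t data,archive -z archive -F

It's the same shape as a normal tier compact, just with -F added so
expunged messages get dropped on the way through.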
> Why should one keep the index of deleted and expunged messages, and
> how does one delete references to messages that are both expunged and
> expired (after cyr_expire -X0, so removed from the hard disk) while
> keeping the index for messages that are still on the hard disk but
> which the user has expunged (double-deleted)?
I'm not sure I understand your question here. Deleting from xapian
databases is slow, particularly in the compacted form, which is
designed to be efficient on the assumption that you never write to it
again. On top of that, since we're de-duplicating by GUID, you would
need to do a conversations db lookup for every deleted email to check
the refcount before cleaning up the associated record.
> How does re-compacting (as in
> https://fastmail.blog/2014/12/01/email-search-system/) differ from
> re-indexing (as in the manual page of master/squatter)?
"re-compacting" - just means combining multiple databases together
into a single compacted database - so the terms in all the source
databases are compacted together into a destination database. I used
"re-compacting" because the databases are already all compacted, so
it's just combining them rather than gaining the initial space
saving of the first compact.
"re-indexing" involves parsing the email again and creating terms
from the source document. When you "reindex" a set of xapian
directories, the squatter reads the cyrus.indexed.db for each of the
source directories to know which emails it claims to cover, and
reads each of those emails in order to index them again.
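In terms of the command line, the only difference is the -X flag
(tier names here are just examples):

  squatter -a -t temp,data -z data      # re-compact: merge the existing databases as-is
  squatter -a -t temp,data -z data -X   # re-index: re-parse the covered emails, then compact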
> What gets indexed? For a mailbox receiving only reports (dkim, dmarc,
> mta-sts, arf, mtatls), some of which are archived (zip, gzip), the
> Xapian index grows very fast.
This would be because these emails often contain unique identifiers,
which do indeed take a lot of space. We have had lots of debates
over what exactly should be indexed - for example, should you index
sha1 values (e.g. git commit identifiers)? They're completely
random, and hence all 40 characters need to be indexed each time!
But - it's very handy to be able to search your email for a known
identifier and see where it was referenced... so we decided to
include them.
We try not to index GPG parts or other opaque blobs where nobody will
be interested in searching for the phrase. Likewise we don't index
MIME boundaries, because they're substructure, not something a user
would know to search for.
We have a work in progress on the master branch to index attachments
using an external tool to extract text from the attachment where
possible, which will increase index sizes even more if enabled!
> How can I remove a tier that contains no data but is mentioned in
> the .xapianactive files?
If you run a compact which includes that tier as a source and not as
a destination, then it should remove that tier from every
.xapianactive file, at which point you can remove it from your
imapd.conf.
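So if the tier you want to drop is called, say, "old", just add it to
the source list of a compact you would be running anyway (names
illustrative):

  squatter -a -t old,temp,meta -z data

Once no .xapianactive mentions "old" any more, its searchpartition
entries can come out of imapd.conf.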
> How can I rename a tier?
The whole point of tier names not being paths on disk is so you can
change the disk path without having to rename the tier. Tier names
are IDs, so you're not supposed to rename them.
Having said that, you could add a new tier, compact everything
across to that tier, then remove the old tier.
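A sketch of that dance, with made-up tier and path names:

  # 1. imapd.conf: define the replacement tier
  newtiersearchpartition-default: /var/cyrus/search-newtier

  # 2. compact everything, including the tier being retired, across to it
  squatter -a -t oldtier,temp,data -z newtier

  # 3. once no .xapianactive mentions oldtier any more, remove its
  #    oldtiersearchpartition-* lines (and update defaultsearchtier if it pointed there)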
> How can I efficiently prepend a new tier in the .xapianactive file?
> “squatter -t X -z Y -o” does add the defaultsearchtier to the
> .xapianactive files, but it first has to duplicate all existing files
> with rsync. This is not efficient, as big files have to be copied.
I'm afraid that's what we have right now. Again, tiers are supposed
to be set up at the start and not fiddled with afterwards, so the
system isn't designed to allow you to quickly add a new tier.
> > What it does under the hood is create a new database and copy all
> > the documents over from the source databases, then compress the end
> > result into the most compact and fastest xapian format, which is
> > designed to never be written again. This compressed file is then
> > stored under the target database name, and in an exclusively locked
> > operation the new database is moved into place and the old tiers are
> > removed from the xapianactive, such that all new searches look into
> > the single destination database instead of the multiple source
> > databases.
> I do not get this. The number of tiers to check does not go down
> after merging, and with three tiers the number of databases is three
> most of the time.
Not if you're compacting frequently. We do the following (a rough cron
sketch of this schedule follows the list):
* hourly
- check if tmpfs is > 50% full - quit if not.
- run squatter -a -o -t temp -z data
* daily
- regardless of tmpfs size, compact everything on temp and meta down to data
- squatter -a -t temp,meta -z data
* weekly on Sunday - re-compact all data partitions together
- squatter -a -t temp,meta,data -z data
* And finally, once per week once the re-compact is done, check if
we need to filter and recompact the archive, if so:
- squatter -a -t data,archive -z archive -F
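As a very rough crontab sketch (times are arbitrary; our real jobs loop
over users, take a lock, and do the tmpfs-fullness check in a wrapper
script rather than inline):

  # hourly: flush tmpfs down to data once it is getting full
  0 * * * *  /usr/local/bin/search-hourly  # hypothetical wrapper around: squatter -a -o -t temp -z data
  # daily: compact temp and meta down to data
  0 3 * * *  squatter -a -t temp,meta -z data
  # Sunday: re-compact all data partitions together
  0 4 * * 0  squatter -a -t temp,meta,data -z data
  # Sunday, later: filter and re-compact the archive if needed
  0 6 * * 0  squatter -a -t data,archive -z archive -F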
Since today is Monday, most users will have two data databases, so the
xapianactive might be something like:
might be something like:
temp:66 data:52 data:51 archive:2
Later in the week, it might be:
temp:70 data:66 data:55 data:54 data:53 data:52 data:51 archive:2
And then maybe it will re-compact on Sunday and the user will have
temp:74 archive:3
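For orientation, each tier:N entry maps to a generation-numbered
database directory under that tier's search partition for the user, so
(with made-up paths and hashing) temp:66 above lives somewhere like:

  /var/cyrus/search-temp/a/user/anna/xapian.66

The exact layout depends on your partition paths and hash settings.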
> What happens if squatter is terminated during Xapian compacting,
> apart from leaving temporary files behind? Will rerunning it just
> start from the beginning?
The source databases will still be listed in the .xapianactive file,
so yes - a new compact run will take those same source databases and
start again.
> Is the idea to have three tiers like this:
> At run time, new messages are indexed by Xapian in squatter-rolling
> mode on tmpfs/RAM, say on tier T1.
That's certainly what we do, since indexing is too IO-intensive otherwise.
> Regularly, the RAM database is compacted to the hard disk (tier T2),
> say T1 and T2 are merged into T2. The database on the hard disk is
> read-only, and searching it is fast, as the database is “compact”.
As above - during the week we don't even merge T2 back together; we
compact from T1 to a single small database on T2, leading to multiple
databases existing on T2 at once.
> Only if two compactions of the same sources or destination happen in
> parallel does the merge fail and get skipped for that user. The merge
> is retried whenever merging T1 and T2 is next scheduled.
Yes - though that's pretty rare on our systems because we use a lock
around the cron task, so the only time this would happen is if you
ran a manual compaction at the same time as the cron job.
> As the databases in T2 get bigger, merging T1 and T2 takes more and
> more time. So one more Xapian tier is created, T3. Less regularly,
> T2 and T3 are merged into T3. This process takes a while. But
> afterwards, T2 is again small, so merging T1 and T2 into T2 is fast.
Yes, that's what we do. This is also the time that we filter the DB,
so the T3 database only contains emails which were still alive at
the time of compaction.
> How many tiers make sense, apart from having one more for power-off events?
Having another one for power-off events doesn't make heaps of sense
unless you have a fast disk. That's kind of what our "meta" partition
is: an SSD RAID1 that's faster than the "data" partition, which is a
spinning SATA RAID1 set.
When we power off a server, we run a task to compact all the temp
partitions down - it used to be to meta, but we found that
compacting straight to data was plenty fast, so we just do that now!
If you power off a server without copying the indexes off tmpfs,
they are of course lost. This means that you need to run squatter -i
on the server after reboot to index all the recent messages again!
So we always run a squatter -i after a crash or power outage before
bringing that server back into production.
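For us that recovery pass is essentially just a verbose incremental run
over everything, something like:

  squatter -v -i

With -i, anything already present in an on-disk tier is skipped, so it
only has to re-parse whatever was lost from tmpfs.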
Cheers,
Bron.
--
Bron Gondwana, CEO, FastMail Pty Ltd
br...@fastmailteam.com