I've been trying to rebuild conversations databases at FastMail to add the new 
pre-sorted thread data, and I'm hitting problems with one user who has a 50,000 
message conversation.  Yes, really.

We've talked about splitting conversations at a reasonable size limit, 100, 
250, 512, whatever.  Heaps less than 50k.

I am using thrid and thread as the terminology rather than cid and 
conversation, since we are planning to migrate to that naming to be compatible 
with X-GM-THRID and friends.

Here's what we have now in a user.conversations database file:


$COUNTED_FLAGS  \Draft \Flagged $IsMailingList $IsNotification $HasAttachment 
$HasTD
$FOLDER_NAMES (brong.net!user.brong brong.net!user.brong.#addressbooks.Default 
...)
...
<1854861326.738774.1447379931850.javamail....@ltx1-app10519.prod.linkedin.com>  
0 6fb7d5120b8f9034 1480422704
<1854883027.206388843.1338533743634.javamail.cb...@ednabay.apple.com> 0 
f9bcee0472ff4195 1480422709
<1855265052.15228760.1482500694926.javamail.r...@ninus.ocn.ne.jp> 0 
eda2e001f73a4211 1482500704
<1855785634.142624.1477413929314.javamail....@ela4-app8372.prod.linkedin.com> 0 
806410893c2c8322 1480422756
<1856.121.45.220.115.1183170122.squir...@mail.brong.net>  0 5ead2028d08d72a7 
1480422707
<1856.121.45.220.115.1183170123.squir...@mail.brong.net>  0 5ead2028d08d72a7 
1480422730
<1856079392.76211482248409804.javamail.gilthunderhead_svc@natthundprodapp>  0 
f584cc80149c7c6b 1482248425
...
B00014f0fa1c61ce2 0 (415658 2 2 0 (0 0 0 0 0 0) ((63 410424 1 1 0) (65 415658 1 
1 0)) ((=?utf-8?Q?TECH4U.COM.AU?= NIL TECH4U.COM.AU tech4u.com.au 1405567846 
2)) 4-BayN54LMicroServer$269|GTX7704GBGamingGraphicsCard$479 226322 
((00014f0fa1c61ce21f770dc217003c49945534bb 2 1405567870 3857938598)))
B00016bdc28aa5245 0 (410448 1 1 0 (0 0 0 0 0 0) ((35 410448 1 1 0)) ((root NIL 
root pushme-pullyou.brong.net 1081144803 1)) IAMREPORT 2558 
((00016bdc28aa5245918448904c56ee64ea94aade 1 1081144803 4293330183)))
B000191314cae6c63 0 (410434 1 1 0 (0 0 0 0 0 0) ((27 410434 1 1 0)) (("NZMB 
Diplomacy Judge" NIL judge gem.win.co.nz 1085955619 1)) 
newbies2-S1902MPressfromEtoF 2482 ((000191314cae6c63a1aa90f99cf9d12bf93daa30 1 
1085955665 753666634)))
...
Baa150752ffefb9da 0 (410439 1 1 0 (0 0 0 0 0 0) ((15 410439 1 1 0)) (("Bron 
Gondwana" NIL brong h-r-s.com 1073614242 1)) GoCRF2.1differencesfromGoCRF2.0 
7731 ((aa150752ffefb9da6937f06f0eed2f7d752a5ebb 1 1073614242 3871373670)))
Baa150a2e47254b77 0 (415662 2 2 0 (0 0 0 0 0 0) ((63 410424 1 1 0) (65 415662 1 
1 0)) (("The Economist" NIL TheEconomist execnews.eu 1413923577 2)) 
ExecutiveSubscriptionPlan:12weeksarenowonly$15 41388 
((aa150a2e47254b77bc18aa68f31ddfa109d7a97c 2 1413923585 1048183371)))
...
Bfffe3c76241b60e7 0 (410450 1 1 0 (0 0 1 0 0 0) ((49 410450 1 1 0)) (("Glenn 
Satchell" NIL Glenn.Satchell uniq.com.au 1257130670 1)) SAGE-AUNameChangeSurvey 
4756 ((fffe3c76241b60e70467d613ca7c215eb36d3d07 1 1257130712 933063352)))
Bfffe7e672838ff33 0 (410439 1 1 0 (0 0 1 0 0 0) ((15 410439 1 1 0)) (("Martin 
Schulze" NIL joey infodrom.org 1084909593 1)) DebianWeeklyNews-May18th,2004 
15255 ((fffe7e672838ff335904d710a0899213e19c6100 1 1084911167 530699332)))
...
Fbrong.net!user.brong 0 (477061 8867 8)
Fbrong.net!user.brong.#addressbooks 0 (83829 0 0)
Fbrong.net!user.brong.#addressbooks.Default 0 (476964 894 894)
...
G00007419809aaf1c805bc61a108a8c1be6119bdc:39:67734  
G00014f0fa1c61ce21f770dc217003c49945534bb:63:51202  
G00014f0fa1c61ce21f770dc217003c49945534bb:65:51201  
...
G4207e7a6f8d69b5adc64fcd662624b89ce12553b:46:2726 
G4207fa24a2af8d5e30fddfa1f8d2255fd490eabe:46:6390 
G420809b31bb8763e39aa5b8108c7db0d1583e62b:39:70128[2] 
G42091728414b358a853d509897b2e09a4bcdc2f1:39:64229[3.1.1.1] 
G420985a784ab45dcf9117648dd96365af1678f29:0:65742
...
Gfffe5071df3cb3128c5a28ca80ff26707b88f63e:66:55738[1] 
Gfffe7852e86f37bcd988631e012419515068c89d:12:1692 
Gfffe7e672838ff335904d710a0899213e19c6100:15:268  
Gfffec0b414fc89a13d6cafe6a32fe3e76bdedfe9:43:8337 
Gffff232c08876dbd70f50d281c593a53020ca498:7:14893 


So that's my current conversations database.  Let's look at each field in more 
detail:

'$' keys (variables)
---

$COUNTED_FLAGS => offsets into the counters in the B keys (see later)
$FOLDER_NAMES => mappings from folder numbers in the B keys (see later)

'<' keys (message ids)
---

Key is Message-ID (from Message-Id, In-Reply-To, References and X-ME-Message-Id 
headers)

Value is: version thrid timestamp, space separated atoms

version is always 0
thrid is the rest of the B key (absent leading 'B'), a 64 bit value hex encoded
timestamp is the unix time_t as decimal of the internaldate of the latest 
message seen referencing this message-id.
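
As a rough illustration of the layout just described, here is a minimal Python sketch that splits a '<' value into its three atoms and derives the matching 'B' key.  The names MsgidRecord and parse_msgid_value are made up for this example and are not part of Cyrus.

```python
from typing import NamedTuple

class MsgidRecord(NamedTuple):
    version: int
    thrid: str      # 16 hex digits: the 'B' key without its leading 'B'
    timestamp: int  # unix time_t of the latest internaldate seen

def parse_msgid_value(value: str) -> MsgidRecord:
    """Split a '<' value into (version, thrid, timestamp)."""
    version, thrid, timestamp = value.split()
    if len(thrid) != 16:
        raise ValueError(f"expected a 64 bit hex thrid, got {thrid!r}")
    int(thrid, 16)  # raises ValueError if thrid is not valid hex
    return MsgidRecord(int(version), thrid, int(timestamp))

# Value taken from one of the squirrelmail message-ids in the dump above.
rec = parse_msgid_value("0 5ead2028d08d72a7 1480422707")
thread_key = "B" + rec.thrid
```

Here thread_key is the key you would look up to find the thread record for that message-id.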

'B' keys (threads)
---

One key per thread.  The key is 'B' followed by the hex encoded 64 bit thread 
id, which matches the 64 bit value in the cyrus.index for each message in the 
thread.

The value is quite detailed, so let's look at the TECH4U one:

B00014f0fa1c61ce2 0 (415658 2 2 0 (0 0 0 0 0 0) ((63 410424 1 1 0) (65 415658 1 
1 0)) ((=?utf-8?Q?TECH4U.COM.AU?= NIL TECH4U.COM.AU tech4u.com.au 1405567846 
2)) 4-BayN54LMicroServer$269|GTX7704GBGamingGraphicsCard$479 226322 
((00014f0fa1c61ce21f770dc217003c49945534bb 2 1405567870 3857938598)))

The value is the version followed by a dlist of (HIGHESTMODSEQ COUNT EXISTS 
UNSEEN (FLAGS) (FOLDERS) (SENDERS) SUBJECT SIZE (THREAD))

HIGHESTMODSEQ - the highest modseq of any message in this thread (including 
expunges)
COUNT - the total count of messages in this thread (including expunged messages)
EXISTS - the total count of unexpunged messages in this thread across all 
folders.
UNSEEN - the total number of messages in this thread which do not have the 
\Seen flag set for the owner (system_flags)

FLAGS: for each item in $COUNTED_FLAGS, in that order, the count of unexpunged 
messages which have that flag set (user_flags or system_flags).  $COUNTED_FLAGS 
is set at database creation, and the only way to change it is to rebuild the 
entire database.

FOLDERS: for each folder a list containing (NUMBER HIGHESTMODSEQ COUNT EXISTS 
UNSEEN). You can see that the TECH4U email has messages in two folders, number 
63 and 65, and in each case there is a single message which still exists. 
NUMBER is an offset into the $FOLDER_NAMES list of this folder, so '0' is my 
INBOX, brong.net!user.brong, and so on. HIGHESTMODSEQ is the highest modseq of 
the messages in each of those folders, so you can see that the more recent 
message is the one in folder 65. The next three number fields are the same as 
for the entire conversation, but only for the counts in this one particular 
folder.
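
To make the FOLDERS entries concrete, here is a small Python sketch that names the five fields and picks out the folder holding the most recently modified message.  FolderEntry and folder_entries are hypothetical names for this example, and the folder list is only the first two entries visible in the $FOLDER_NAMES line above (the rest is elided there and is not guessed at here).

```python
from typing import NamedTuple

class FolderEntry(NamedTuple):
    number: int         # offset into $FOLDER_NAMES
    highestmodseq: int
    count: int
    exists: int
    unseen: int

def folder_entries(raw):
    """Turn raw (NUMBER HIGHESTMODSEQ COUNT EXISTS UNSEEN) items into tuples."""
    return [FolderEntry(*map(int, item)) for item in raw]

# The two per-folder entries from the TECH4U thread.
entries = folder_entries([("63", "410424", "1", "1", "0"),
                          ("65", "415658", "1", "1", "0")])
newest = max(entries, key=lambda e: e.highestmodseq)

# Only the first two names are shown in the dump; the remainder is elided.
known_folder_names = ["brong.net!user.brong",
                      "brong.net!user.brong.#addressbooks.Default"]
inbox_name = known_folder_names[0]  # NUMBER 0 is the INBOX
```

With the TECH4U data, newest is the entry for folder 65, matching the description above.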

SENDERS: for each sender which has been mentioned in the conversation, the 
name, route, mailbox and domain (see the address structure in the IMAP ENVELOPE 
definition), followed by the timestamp of the latest internaldate of a message 
with that sender, and a count of the total number of messages with that sender.  
This is used for FastMail XCONV commands, but will not be used for JMAP.

SUBJECT: a normalised version of the subject of every message in this thread.  
The subject is used for every match except that done from X-ME-Message-Id, 
which bypasses normal subject checks.

SIZE: the sum in bytes of the sizes of all the unexpunged messages in this 
entire thread, across all folders.

THREAD: a list of all the messages in this thread (GUID EXISTS INTERNALDATE 
MSGID) where GUID is the digest.sha1 value of the message itself, EXISTS is the 
total number of records across all folders with this particular GUID, 
INTERNALDATE is the maximum of the internaldates of those messages, and MSGID 
is a 32 bit crc32 of the Message-Id header.  Optionally there is a 5th field 
INREPLYTO which is the 32 bit crc32 of the In-Reply-To header, but is only 
present on drafts.  This is used for the JMAP thread sorting algorithm, and is 
stored in pre-sorted order for fast getThreads.
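
Putting the whole 'B' value together, here is a minimal Python sketch that reads the TECH4U value into nested lists and unpacks the ten fields.  It is not the Cyrus dlist parser: it assumes only parentheses, bare atoms and double-quoted strings appear (as in the dump above) and ignores anything else the real format may allow.  parse_dlist is a name invented for this example.

```python
import re

# Tokens: "(", ")", a double-quoted string, or a bare atom.
TOKEN = re.compile(r'\(|\)|"(?:[^"\\]|\\.)*"|[^\s()]+')

def parse_dlist(text):
    """Parse parenthesised atoms into nested Python lists of strings."""
    stack = [[]]
    for tok in TOKEN.findall(text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unexpected ')'")
            finished = stack.pop()
            stack[-1].append(finished)
        elif len(tok) >= 2 and tok.startswith('"') and tok.endswith('"'):
            stack[-1].append(tok[1:-1])  # strip quotes; escapes left as-is
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("missing ')'")
    return stack[0]

# From the $COUNTED_FLAGS line in the dump above.
COUNTED_FLAGS = (r"\Draft \Flagged $IsMailingList $IsNotification "
                 r"$HasAttachment $HasTD").split()

value = (
    "0 (415658 2 2 0 (0 0 0 0 0 0) ((63 410424 1 1 0) (65 415658 1 1 0)) "
    "((=?utf-8?Q?TECH4U.COM.AU?= NIL TECH4U.COM.AU tech4u.com.au 1405567846 2)) "
    "4-BayN54LMicroServer$269|GTX7704GBGamingGraphicsCard$479 226322 "
    "((00014f0fa1c61ce21f770dc217003c49945534bb 2 1405567870 3857938598)))"
)

version, body = parse_dlist(value)
(modseq, count, exists, unseen,
 flags, folders, senders, subject, size, thread) = body

# Pair each counter with the flag it counts, in $COUNTED_FLAGS order.
flag_counts = dict(zip(COUNTED_FLAGS, map(int, flags)))
```

Every leaf is left as a string; converting the numeric fields to int is up to the caller.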


'F' keys (folder metadata)
---

Some simple metadata for each folder.  Let's look at my INBOX:

Fbrong.net!user.brong 0 (477061 8867 8)

VERSION (always 0) followed by a dlist of (HIGHESTMODSEQ EXISTS UNSEEN)

HIGHESTMODSEQ - the highest modseq of any conversation which is present in this 
folder (including expunges)
EXISTS - the number of conversations which have a non-expunged message within 
this folder
UNSEEN - the number of conversations which have both a non-expunged message 
within this folder, and have a non-zero unseen count. NOTE: it is NOT necessary 
for the unseen message to be present in the folder, just that there is an 
unseen message somewhere in the conversation, and that the conversation is also 
in this folder.
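
The EXISTS and UNSEEN rules can be sketched in Python as follows.  This is only an illustration of the counting rule as described, not how Cyrus maintains the counters; folder_counters and the sample threads are invented for the example.

```python
def folder_counters(threads):
    """Compute per-folder conversation counts.

    threads: iterable of (thread_unseen, [(folder_number, folder_exists), ...])
    where thread_unseen is the UNSEEN count for the whole thread.
    """
    counters = {}
    for thread_unseen, folders in threads:
        for number, folder_exists in folders:
            if folder_exists == 0:
                continue  # no non-expunged message in this folder
            c = counters.setdefault(number, {"exists": 0, "unseen": 0})
            c["exists"] += 1
            # Counts as unseen if ANY message in the thread is unseen,
            # even if that message lives in a different folder.
            if thread_unseen > 0:
                c["unseen"] += 1
    return counters

sample = [
    (1, [(0, 1), (5, 1)]),  # one unseen message somewhere; in folders 0 and 5
    (0, [(0, 2)]),          # fully read thread in folder 0
    (2, [(5, 0)]),          # only expunged messages left in folder 5
]
result = folder_counters(sample)
```

In this sample the first thread bumps UNSEEN for both folder 0 and folder 5, regardless of which folder actually holds the unseen message.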

'G' keys (guid mappings)
---

These keys have no value at all; all the data is stored in the key:

G42091728414b358a853d509897b2e09a4bcdc2f1:39:64229[3.1.1.1] 
G420985a784ab45dcf9117648dd96365af1678f29:0:65742

Key is:

'G' GUID ':' NUMBER ':' UID [ '[' PARTSPEC ']' ]

So for a message GUID there is no trailing partspec.  NUMBER is a folder number 
per the $FOLDER_NAMES above, and UID is the value of the UID field in 
cyrus.index for the non-expunged record with this GUID (digest.sha1).  This is 
equivalent to Gmail's X-GM-MSGID, but I'm not going to reuse the term MSGID 
because it's insanely overloaded in the email space.
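
The key grammar above translates fairly directly into a regular expression.  A minimal Python sketch follows, assuming the GUID is always 40 lowercase hex digits (a sha1) as in the dump; G_KEY and parse_g_key are names made up for this example.

```python
import re

# 'G' GUID ':' NUMBER ':' UID [ '[' PARTSPEC ']' ]
G_KEY = re.compile(
    r"^G(?P<guid>[0-9a-f]{40}):(?P<number>\d+):(?P<uid>\d+)"
    r"(?:\[(?P<partspec>[^\]]*)\])?$"
)

def parse_g_key(key):
    """Return (guid, folder_number, uid, partspec); partspec may be None."""
    m = G_KEY.match(key.strip())
    if m is None:
        raise ValueError(f"not a G key: {key!r}")
    return (m["guid"], int(m["number"]), int(m["uid"]), m["partspec"])

part = parse_g_key("G42091728414b358a853d509897b2e09a4bcdc2f1:39:64229[3.1.1.1]")
whole = parse_g_key("G420985a784ab45dcf9117648dd96365af1678f29:0:65742")
```

A partspec of None means the key refers to the message itself rather than to one of its parts.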

You can have multiple emails with the same GUID, for example let's find the 
GUIDs of that TECH4U email (note in the THREAD part of the 'B' key it has just 
a single GUID which is present twice...)

G00014f0fa1c61ce21f770dc217003c49945534bb:63:51202  
G00014f0fa1c61ce21f770dc217003c49945534bb:65:51201

And there we have it: UID 51202 in folder 63 and UID 51201 in folder 65.  (Yes, 
this is me creating a giant folder and copying a ton of old archived rubbish 
into it, then copying the whole lot to another folder for testing purposes.  
Every email in those two folders has two copies.)
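
Finding every copy of a GUID is a prefix scan over the 'G' keys.  Here is a minimal Python sketch, assuming the keys are available in sorted order (as the dump above suggests); copies_of is a hypothetical helper, not a Cyrus function, and it skips keys that carry a partspec.

```python
from bisect import bisect_left

def copies_of(sorted_keys, guid):
    """Return [(folder_number, uid), ...] for whole-message keys of guid."""
    prefix = f"G{guid}:"
    i = bisect_left(sorted_keys, prefix)  # first key >= prefix
    found = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        _, number, rest = sorted_keys[i].split(":", 2)
        if "[" not in rest:  # no partspec, so this is the message itself
            found.append((int(number), int(rest)))
        i += 1
    return found

# A few keys copied from the dump above, already in sorted order.
keys = [
    "G00007419809aaf1c805bc61a108a8c1be6119bdc:39:67734",
    "G00014f0fa1c61ce21f770dc217003c49945534bb:63:51202",
    "G00014f0fa1c61ce21f770dc217003c49945534bb:65:51201",
    "G4207e7a6f8d69b5adc64fcd662624b89ce12553b:46:2726",
]
tech4u = copies_of(keys, "00014f0fa1c61ce21f770dc217003c49945534bb")
```

For the TECH4U GUID this yields the two (folder, UID) pairs described above.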

...

And that's the structure of the conversations DB as it exists now.  I will 
follow up to this email with a description of how I want to change this to 
support the additional features we want while not losing everything else.

-- 
  Bron Gondwana
  br...@fastmail.fm
