On Wed, Mar 18, 2015, at 09:00 AM, Jeroen van Meeuwen (Kolab Systems) wrote:
> On 2015-03-14 22:48, Bron Gondwana wrote:
> > On Sun, Mar 15, 2015, at 07:18 AM, Jeroen van Meeuwen (Kolab Systems) wrote:
> >> How, though, do you "ensure" that a mailbox for a new user in such
> >> business is created on the same backend as all the other users of
> >> said business?
> >
> > If the business already exists, the create user code will fetch the
> > server name from the business database table and make that the
> > creation server.
> >
> > There's a cron job which runs every hour and looks for users who
> > aren't on the right server, so if we import a user to the business,
> > they get moved.
>
> Right -- so you seem to "agree" that "one business" is limited to "one
> backend server", which is precisely what the larger businesses that
> are our customers need to work around, when the number of mailboxes is
> typically "tens of thousands", and the mechanism you describe "stops
> working".
Exactly.  It's a limit that we want to avoid, hence looking for a
murder-that-scales.

> > "happy" is a relative term.  You can get most of the benefit from
> > using foolstupidclients, but otherwise you're paying O(N) for the
> > number of users - and taking 4 times as long to do every list
> > command is not ideal.
>
> Sure -- the majority of the processing delays seem to lie on the
> client side taking off the wire what is being dumped on it, however.

With over a million mailboxes in a single mailboxes.db I was seeing
parsing cost go up, particularly with DLIST.  I've written a dlist_sax
interface, which cuts out some of the cost, but it's still not free.
The easiest way to make things more efficient is to not do them at
all ;)

> You're far better entitled to speak to what is in a mailboxes.db
> and/or its in-memory representation by the time you get to scanning
> the complete list for items to which a user might have access, I just
> have to say we've not found this particular part to be as problematic
> for tens of thousands of users (yet).

It's going to hurt when you get to millions.  That's our issue.  If we
merged all the mailboxes.db across all our servers into one place,
that would be a huge database.

> For frontends specifically ("discrete murder"), we're able to use
> tmpfs for mailboxes.db (and some other stuff of course), solving a
> bit of the I/O constraints, but it's still a list of folders with
> parameters containing whether the user has access, and what I meant
> was perhaps the list can (in addition) be inverted to be a list of
> users with folders (and rights?).

That's pretty much exactly the idea.  That, and avoiding the SPOF that
the murder master is right now.  They're kind of separate goals; we
could do one without the other.

> We promote a standby frontend not otherwise used, to become the new
> mupdate server.  The interruption is a matter of seconds this way,
> unless of course you're in the typical stalemate.

Hmm.... so maybe it's affordable.  It scales up with number-of-servers
as well though.  Making sure it's up to date costs at least O(number
of backends).

> > Interesting.  Does it also handle the case where the same mailbox
> > gets accidentally created on two servers which aren't replica pairs
> > though?  Or do you get a mailbox fork?
>
> The race condition is not addressed with it, like it is not addressed
> currently.

I'm not 100% happy living with unaddressed race conditions.
Addressing this would be an important part of making FastMail happy to
run it.

> It solely makes the MUPDATE server not reject the reservation request
> from a server that uses the same "servername" if it already has an
> entry for the same "servername!partition", so that the replica
> successfully creates the local copy -- after which replication is
> happy.

Yeah, that makes sense.  Of course, the backend should probably not be
"reserving" so much.  There are two things conflated here:

1) I'm running cmd_create in an IMAPd and I want to see if this folder
   already exists.

2) I'm a replica backend getting a copy of an existing folder (or
   indeed, a backend which already has a folder) and I'm informing
   mupdate of the fact.

Those two should be treated differently.  The first is "does this
already exist", which is a legitimate question to ask.  The second
should always succeed.  MUPDATE is a representation of facts, and the
backends are the masters of those facts.
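To make that split concrete, here's a minimal standalone sketch -- all
names hypothetical, this is not the current mupdate code, just the
shape I mean.  A toy in-memory "murder DB" keyed by (mboxname,
location), with an existence check that's allowed to refuse, and a
fact assertion that is never refused:

    /* Hypothetical sketch only -- not the current mupdate API. */
    #include <stdio.h>
    #include <string.h>

    #define MAXRECORDS 64

    struct mbrecord {
        char name[64];
        char location[64];      /* "servername!partition" */
    };

    static struct mbrecord murder_db[MAXRECORDS];
    static int nrecords = 0;

    /* Case 1: cmd_create asking "does this folder already exist
     * anywhere?"  A legitimate question, and "refused" is a
     * legitimate answer. */
    static int murder_check_create(const char *name)
    {
        for (int i = 0; i < nrecords; i++)
            if (!strcmp(murder_db[i].name, name))
                return -1;      /* exists somewhere: refuse */
        return 0;               /* free to create */
    }

    /* Case 2: a backend (or its replica, via sync_server) informing
     * mupdate of a fact.  Never refused: the backends are the masters
     * of the facts, so just upsert one record per (name, location). */
    static void murder_assert_fact(const char *name, const char *location)
    {
        for (int i = 0; i < nrecords; i++)
            if (!strcmp(murder_db[i].name, name) &&
                !strcmp(murder_db[i].location, location))
                return;         /* already known: fine */
        if (nrecords == MAXRECORDS) return;  /* toy bound */
        snprintf(murder_db[nrecords].name, 64, "%s", name);
        snprintf(murder_db[nrecords].location, 64, "%s", location);
        nrecords++;
    }

    int main(void)
    {
        murder_assert_fact("user.brong", "imap1!default");
        /* the replica asserting its own copy must also succeed */
        murder_assert_fact("user.brong", "imap1-replica!default");
        /* but a fresh CREATE of the same name gets refused */
        printf("create user.brong: %s\n",
               murder_check_create("user.brong") ? "refused" : "ok");
        return 0;
    }

The point being that the second operation never needs the RESERVE
dance at all.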
> So this would build a scenario in which:
>
> "pair-1-replica-1.example.org" and "pair-1-replica-2.example.org"
> present themselves as "pair-1.example.org".
>
> A DNS IN A RR is created for the fail-over address(es) for
> "pair-1.example.org" and attached to whichever replica in the pair is
> considered the active node.
>
> Both replicas would be configured to replicate to one another, which
> works in a PoC scenario but may seem to require lmtpd/AF_INET
> delivery.

So they both have the same server name in mupdate.  My plan is that
they have different server names in mupdate.  There's a separate
channel somehow to say which is the primary out of those servers,
which can be switched however (failover tooling) based on which
servers are up, but the murder has the facts about where the mailbox
really exists.  It may even have statuscache.  Man, how awesome would
distributed statuscache be?

So there are multiple records for the same mailbox, with different
server names, in the murder DB.

> > Sounds compelling.  The only problem I can see is if startup is
> > really expensive.  There's also a problem with "in-memory" with
> > separate processes.
>
> I suppose another problem is updates to mailboxes.db, although I
> suppose this would mean updating the in-memory lookup tree then
> syncing it to disk.

Absolutely.

> Would using shared memory address the in-memory problem?  Admittedly
> I've never coded any such, so I'm out of my comfort zone (again).

I'm not really comfortable with it either.  I'd prefer a mailboxes
daemon with its own query language over a unix socket, because it
punts a lot of the synchronisation problems.

> > The minimum viable product for the fast LIST is basically this:
> >
> > * convert mupdated to use an sqlite file with the reverse indexes
> >   built into it instead of the mailboxes.db
> > * convert the LIST code and mboxlist_lookup to use the sqlite file
> > * even if not in a murder, also write mboxlist_* updates to the
> >   sqlite file
> > * leave all the existing murder stuff apart from this
> >
> > sqlite is already embedded for other things, so we don't add any
> > dependencies.
>
> I've had many issues with parallel (write) access by multiple
> processes to a single sqlite database file, though, and needing to
> vacuum the database file after not at all too many mutations
> (thousands) as well, in order to keep things from slowing down.

Another reason to have a single thread doing the writes :)

> Is using SQLite for mailboxes.db not going to enter this sort of
> problem space?

Perhaps.  We'd have to see how it copes in reality of course.
FastMail is big enough to test this pretty well!  (There's a rough
sketch of the sort of schema I mean at the end of this mail.)

> I can't find the actual patch file, so I must have dropped it, but
> it's imap/mupdate.c line 1609, comparing the m->location found, if
> any, to the const char *location passed along to cmd_set(), and if
> they're (exactly) equal, not bailing.

Sure.  As I said above, I think the real solution is that sync_server
creating a mailbox should always be allowed to assert the fact to the
murder.  It's not a "please may I", it's a "this is how it is".

Bron.

--
  Bron Gondwana
  br...@fastmail.fm
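P.S. The promised sketch.  Illustrative only -- none of these tables
or column names exist in Cyrus today -- but it shows the shape of the
sqlite file I have in mind: the usual mboxname-to-location map (with
room for one row per replica), plus the inverted users-to-folders list
from earlier in this thread, so LIST for a user is one indexed lookup
rather than a scan of every mailbox:

    /* Illustrative schema sketch.  Build with: cc list.c -lsqlite3 */
    #include <stdio.h>
    #include <sqlite3.h>

    static const char *schema =
        "CREATE TABLE IF NOT EXISTS mailboxes ("
        "  mboxname TEXT NOT NULL,"
        "  location TEXT NOT NULL,"        /* servername!partition */
        "  acl      TEXT NOT NULL,"
        /* multiple records per mailbox, one per server holding it */
        "  PRIMARY KEY (mboxname, location));"
        "CREATE TABLE IF NOT EXISTS user_folders ("
        "  userid   TEXT NOT NULL,"
        "  mboxname TEXT NOT NULL,"
        "  rights   TEXT NOT NULL,"
        /* the primary key doubles as the index LIST walks */
        "  PRIMARY KEY (userid, mboxname));";

    static int print_row(void *arg, int ncols, char **vals, char **cols)
    {
        (void)arg; (void)ncols; (void)cols;
        printf("%s (%s)\n", vals[0], vals[1]);
        return 0;
    }

    int main(void)
    {
        sqlite3 *db;
        if (sqlite3_open("mailboxes.sqlite", &db) != SQLITE_OK)
            return 1;
        sqlite3_exec(db, schema, NULL, NULL, NULL);

        /* mboxlist_* updates would maintain both tables in one
         * transaction, ideally from the single writing daemon */
        sqlite3_exec(db,
            "BEGIN;"
            "INSERT OR REPLACE INTO mailboxes VALUES"
            "  ('user.brong', 'imap1!default', 'brong lrswipkxtecda');"
            "INSERT OR REPLACE INTO user_folders VALUES"
            "  ('brong', 'user.brong', 'lrswipkxtecda');"
            "COMMIT;", NULL, NULL, NULL);

        /* LIST for one user: an indexed lookup, not an O(all
         * mailboxes) scan of mailboxes.db */
        sqlite3_exec(db,
            "SELECT mboxname, rights FROM user_folders"
            " WHERE userid = 'brong';",
            print_row, NULL, NULL);

        return sqlite3_close(db);
    }

Whether it vacuums itself into the ground at our update rates is
exactly the thing we'd be testing.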