Re: sync_client behaviour improvement planning

2018-02-27 Thread Bron Gondwana
Followup on this - I pushed a commit to master which included a few
comments highlighting places in the code where changes would help.
ellie is going to work on the changes, to be reviewed by Partha.  None
of these changes alter the replication protocol; all they change is:
1) which mailboxes are batched together when multiple mailboxes are
   being checked at once (SYNC GET MAILBOXES)
2) what gets done in interesting cases (e.g. when a folder named in the
   sync log is present on the replica but not on the master)
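As a rough illustration of (1), here is a minimal sketch in Python (not the
actual sync_client C code) of batching that never splits one user's
mailboxes across batches.  The names batch_key and batch_by_user are
hypothetical, the 'user.<userid>' naming is an assumption about internal
mailbox names, and the limit of 1000 is taken from the figure mentioned in
this thread.

```python
MAX_BATCH = 1000  # split threshold mentioned in this thread; treated as an assumption


def batch_key(mboxname):
    """Hypothetical helper: group key for a mailbox name.

    Assumes internal names look like 'user.<userid>[.<sub>...]'; anything
    else is treated as its own group.
    """
    parts = mboxname.split(".")
    if len(parts) >= 2 and parts[0] == "user":
        return ("user", parts[1])
    return ("other", mboxname)


def batch_by_user(mailboxes, max_batch=MAX_BATCH):
    """Yield lists of mailbox names without splitting one user's mailboxes."""
    groups = {}  # dicts keep insertion order, so first-seen order is preserved
    for name in mailboxes:
        groups.setdefault(batch_key(name), []).append(name)

    batch = []
    for group in groups.values():
        # Start a new batch rather than splitting this user's group.
        if batch and len(batch) + len(group) > max_batch:
            yield batch
            batch = []
        batch.extend(group)
    if batch:
        yield batch
```

In this sketch a user with more than max_batch mailboxes ends up in a single
oversized batch, which trades batch size for keeping a rename's old and new
names in the same request.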
Cheers,

Bron.


On Mon, 26 Feb 2018, at 18:28, Bron Gondwana wrote:
> Hey - here's me posting something to the public list instead of
> internal FastMail slack.  We've been really bad at making our random
> ruminations public, sorry.
>
> Tomorrow (Tues 27th) 2pm Melbourne time, I'm going to be meeting with
> ellie and maybe Partha in the Melbourne office with a whiteboard and a
> screen to flesh out some ideas for things we can do to fix some of the
> issues that came up after a recent machine failure event at FastMail.
>
> In particular, sync shear is a very real problem.  The core issue is
> something like this, either:
>
> a)
> 
> sync_log MAILBOX A
> sync_log MAILBOX A
> sync_log MAILBOX B
> 
> Underlying cause - something happened on mailbox A, then mailbox A was
> renamed to B.
>
> Result - if there is a log split between those two lines, the
> sync_client first sees just MAILBOX A, and so it just processes that
> one mailbox.  It sees:
>
> local: MAILBOX A IMAP_MAILBOX_NONEXISTENT
> remote: MAILBOX A exists
> 
> so it issues an UNMAILBOX A, then processes the second file.
> 
> In the second file, it gets:
> 
> local: MAILBOX A IMAP_MAILBOX_NONEXISTENT
> local: MAILBOX B exists
> 
> remote: MAILBOX A IMAP_MAILBOX_NONEXISTENT
> remote: MAILBOX B IMAP_MAILBOX_NONEXISTENT
> 
> So it creates B and copies all the messages again.  This is correct,
> but it's both inefficient and creates a gap where the replica doesn't
> have the messages at all.
> 
> b) there are over 1000 mailboxes, and the log file got deduplicated
>    and then run in sets, and we had this:
>
> sync_log MAILBOX Z
> sync_log MAILBOX B
> 
> (for a rename of Z to B)
> 
> local: MAILBOX B exists
> remote: MAILBOX B IMAP_MAILBOX_NONEXISTENT
> 
> We upload the entire mailbox.  Later we see both mailbox Z and mailbox
> B, and due to uniqueid duplication and the existence of mailbox B, we
> forget about mailbox Z entirely - leaving a duplicate on the server.
> Until a recent change, this led to a real mess when running reconstruct
> caused mailbox Z to get a new uniqueid, just on the end where the
> reconstruct was run.  Run it on both ends later, you could wind up
> with different uniqueids, and replication bails on that because it's
> confused!
> 
> The long term solution to all this is to replicate by uniqueid, and
> replicate the name history entirely for each folder such that you can
> calculate the delta and converge on the latest name for the folder in
> split brain.  But for now, maybe we can make this safer.
> 
> My initial thought is something like: if the folder exists at one end,
> but not at the other (either way) do a full user sync.
>
> Also, if splitting > 1000 folders in sync_client, make sure we keep
> all the folders for a user in a single batch, so don't split batches
> inside a user.
>
> We may be able to use the tombstone records we've been storing for a
> while to decide whether the lack of a folder is "it used to exist, but
> it doesn't any more" or "it never existed here" - handy for figuring
> out split brain recovery.
>
> Added complications: what about cross-user renames?  What about
> renaming users entirely?
>
> I know some of what Ken has been working on will also possibly
> interact with this, so we're looking for some simple heuristic changes
> that can make everyday situations safer while we wait for the real
> solution.
>
> Bron.
> 
> --
>   Bron Gondwana, CEO, FastMail Pty Ltd
>   br...@fastmailteam.com
> 
> 

--
  Bron Gondwana, CEO, FastMail Pty Ltd
  br...@fastmailteam.com




Re: sync_client behaviour improvement planning

2018-02-26 Thread Bron Gondwana



On Mon, 26 Feb 2018, at 10:37, Sebastian Hagedorn wrote:
> Hi,
> 
> we're only getting started with replication, so I can't contribute any
> useful comments, but your post raised two questions for me:
> 
> This only affects the "new", post-3.0 replication protocol, right? We're
> planning a migration from 2.4 to 3.0 using replication, and I'd like to be
> aware of all potential issues we might encounter.

A migration will be totally fine regardless.

I realise now the danger of posting about a really rare edge case to the
dev list in a way that implies that it's anything anyone is likely to
hit in regular usage.  It surfaced something like 3 times over tens of
thousands of mailboxes and a full disk event causing some weirdness
around retries in sync.  Cyrus handles full disk quite well these days
(after an event in 2014 where we fixed a bunch of bugs!) but it does
make some things odd.  In particular, it can cause replication to fail
for a mailbox due to disk full, but then later mailboxes in the same
list get processed OK because there was space for that operation.

In summary, don't let your disks get full!

>> Hey - here's me posting something to the public list instead of internal
>> FastMail slack.  We've been really bad at making our random ruminations
>> public, sorry.
>> Tomorrow (Tues 27th) 2pm Melbourne time, I'm going to be meeting with
>> ellie and maybe Partha in the Melbourne office with a whiteboard and a
>> screen to flesh out some ideas for things we can do to fix some of the
>> issues that came up after a recent machine failure event at FastMail.
> 
> 
>> b) there are over 1000 mailboxes, and the log file got deduplicated and
>>    then run in sets, and we had this:
>> sync_log MAILBOX Z
>> sync_log MAILBOX B
>> 
>> (for a rename of Z to B)
> 
> Where and when does deduplication happen?

Deduplication in this case is sync_action_list_add(mailbox_list,
args[1]) in sync_client.c:do_sync().
If the same item is logged twice in the same log file, then it's only
added once to the list of mailboxes later passed to do_mailboxes.
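
To make the effect concrete, here is a minimal Python sketch of that
deduplication (an illustration only, not the C code in sync_client.c).  The
function name collect_mailboxes is hypothetical, the input uses the
"sync_log MAILBOX <name>" notation from this thread rather than the real
log file format, and keeping first-seen order is an assumption.

```python
def collect_mailboxes(log_lines):
    """Return mailbox names from one log file, each listed once.

    Sketch only: accepts lines written as 'sync_log MAILBOX <name>' or
    'MAILBOX <name>' and ignores everything else.
    """
    seen = set()
    mailboxes = []
    for line in log_lines:
        fields = line.split()
        if fields[:1] == ["sync_log"]:
            fields = fields[1:]  # drop the illustrative prefix
        if len(fields) == 2 and fields[0] == "MAILBOX":
            name = fields[1]
            if name not in seen:  # logged twice in one file -> added once
                seen.add(name)
                mailboxes.append(name)
    return mailboxes


one_file = ["sync_log MAILBOX A", "sync_log MAILBOX A", "sync_log MAILBOX B"]
print(collect_mailboxes(one_file))
```

Note that the deduplication is per log file: if the log is split between
two of those lines, each file yields its own list and the mailboxes are
processed separately.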
Bron.


--
  Bron Gondwana, CEO, FastMail Pty Ltd
  br...@fastmailteam.com




Re: sync_client behaviour improvement planning

2018-02-26 Thread Sebastian Hagedorn

Hi,

we're only getting started with replication, so I can't contribute any 
useful comments, but your post raised two questions for me:


This only affects the "new", post-3.0 replication protocol, right? We're 
planning a migration from 2.4 to 3.0 using replication, and I'd like to be 
aware of all potential issues we might encounter.



> Hey - here's me posting something to the public list instead of internal
> FastMail slack.  We've been really bad at making our random ruminations
> public, sorry.
> Tomorrow (Tues 27th) 2pm Melbourne time, I'm going to be meeting with
> ellie and maybe Partha in the Melbourne office with a whiteboard and a
> screen to flesh out some ideas for things we can do to fix some of the
> issues that came up after a recent machine failure event at FastMail.





> b) there are over 1000 mailboxes, and the log file got deduplicated and
>    then run in sets, and we had this:
> sync_log MAILBOX Z
> sync_log MAILBOX B
>
> (for a rename of Z to B)


Where and when does deduplication happen?
--
Sebastian Hagedorn - Weyertal 121, Zimmer 2.02
Regionales Rechenzentrum (RRZK)
Universität zu Köln / Cologne University - Tel. +49-221-470-89578
