Re: [Dovecot] Replication plans

2007-06-05 Thread Troy Benjegerdes
On Tue, Jun 05, 2007 at 09:56:29PM +0300, Timo Sirainen wrote:
 On Tue, 2007-05-22 at 09:58 -0500, Troy Benjegerdes wrote:
  Best case, when all the nodes and the network are up, locking latency
  shouldn't be much longer than, say, twice the RTT. But what really
  matters, and causes all the nasty bugs that even single-master
  replication systems have to deal with is the *worst case* latency. So
  everything is going along fine, and then due to a surge in incoming
  spam, one of your switches starts dropping 2% of the packets, and the
  server holding a lock starts taking 50ms instead of 1ms to respond to an
  incoming packet. 
  
  Now your previous lock latency of 1ms could easily extend into seconds if
  a couple of responses to lock requests don't get through. And your 16
  node imap cluster is now 8 times slower than a single server, instead of
  8 times faster ;)
 
 If you're so worried about that, you could create another internal
 network just for replication :)

Things are worse if the internal network for replication is the one that
started having errors ;) .. Your machine is accessible to the world, but
you can't reliably communicate to get a lock

  The nasty part about this for imap is that we can't ever have a UID be
  handed out without *confirming* that it's been replicated to another
  server before sending out the packet. Otherwise you can get in the
  situation where node A sends out a new UID to a client over its public
  NIC while, in the meantime, its internal NIC melted so the update
  never got propagated. Nodes B, C, and D decide oops, node A is
  dead, we are stealing his lock; B takes over the lock and allocates
  the same UID to a different message, and now the CEO didn't get that
  notice from the SEC to save all his emails.
 
 When the servers sync up again they'll notice the duplicated UID and
 both of the emails will be assigned a new UID to fix the situation. This
 conflict handling will have to be done in any case.
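The conflict handling described above can be sketched roughly like this: on resync, any UID that both servers allocated to *different* messages is treated as a conflict and both messages get fresh UIDs, so neither copy is lost. This is an illustrative sketch, not Dovecot's actual code; all names are made up.

```python
def resolve_uid_conflicts(mailbox_a, mailbox_b, next_uid):
    """Merge two UID->message maps from servers that diverged.

    A UID present on both sides with differing messages is a conflict:
    both messages are reassigned fresh UIDs, as described in the thread.
    Returns the merged map and the new next-UID counter.
    """
    merged = {}
    for uid in sorted(set(mailbox_a) | set(mailbox_b)):
        msg_a, msg_b = mailbox_a.get(uid), mailbox_b.get(uid)
        if msg_a is not None and msg_b is not None and msg_a != msg_b:
            # Duplicated UID, different messages -> both get new UIDs.
            merged[next_uid] = msg_a; next_uid += 1
            merged[next_uid] = msg_b; next_uid += 1
        else:
            merged[uid] = msg_a if msg_a is not None else msg_b
    return merged, next_uid
```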

That sounds like a pretty clean solution, and makes a lot of the things
that make replication hard go away.


Re: [Dovecot] Replication plans

2007-05-23 Thread J . Wendland
Hi list,

 OpenLDAP uses another strategy, which is more robust, i.e. needs less 
 fragile interaction between the servers.

We have been thinking about replication for a very long time. The requirement
is to have a backup computing center in a distant location, so
replication has to work over a WAN connection (latency!) and must
be able to recover from failures. With this in mind we came to the
conclusion that the strategy OpenLDAP uses would be the best
one and would not be too difficult to implement (we
even started experiments which showed that it is feasible).
BTW, Oracle's replication mechanism (Data Guard) also works in a
similar way, i.e. by transferring the transaction logs to the backup
and replaying them there.

Cheers,
  Jörg


Re: [Dovecot] Replication plans

2007-05-22 Thread Troy Benjegerdes
  This increases communication and locking significantly. The locking alone 
  will likely be a choke point. 
 
 My plan would require the locking only when the mailbox is being updated
 and the global lock isn't already owned by the server. If you want to
 avoid different servers from constantly stealing the lock from each
 others, use different ways to make sure that the mailbox normally isn't
 modified from more than one server.
 
 I don't think this will be a big problem even if multiple servers are
 modifying the same mailbox, but it depends entirely on the extra latency
 caused by the global locking. I don't know what the latency will be
 until it can be tested, but I don't think it should be much more than
 what a simple ping would give over the same network.

Best case, when all the nodes and the network are up, locking latency
shouldn't be much longer than, say, twice the RTT. But what really
matters, and causes all the nasty bugs that even single-master
replication systems have to deal with is the *worst case* latency. So
everything is going along fine, and then due to a surge in incoming
spam, one of your switches starts dropping 2% of the packets, and the
server holding a lock starts taking 50ms instead of 1ms to respond to an
incoming packet. 

Now your previous lock latency of 1ms could easily extend into seconds if
a couple of responses to lock requests don't get through. And your 16
node imap cluster is now 8 times slower than a single server, instead of
8 times faster ;)
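A quick back-of-the-envelope model shows how this blow-up happens: if lost lock-protocol packets are retried with a timeout far above the RTT (with the usual exponential doubling), even a few consecutive losses turn a 1 ms lock into over a second. The numbers below are illustrative assumptions, not measurements.

```python
def lock_latency(rtt, rto, losses):
    """Latency of one lock round-trip that suffers `losses` consecutive
    packet losses, assuming a fixed base retransmission timeout that
    doubles on each retry (illustrative model only)."""
    delay = rtt
    for i in range(losses):
        delay += rto * (2 ** i)   # each lost packet costs one (doubled) RTO
    return delay

best = lock_latency(rtt=0.001, rto=0.2, losses=0)   # the happy 1 ms case
bad  = lock_latency(rtt=0.001, rto=0.2, losses=3)   # a few drops: ~1.4 s
```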

The nasty part about this for imap is that we can't ever have a UID be
handed out without *confirming* that it's been replicated to another
server before sending out the packet. Otherwise you can get in the
situation where node A sends out a new UID to a client over its public
NIC while, in the meantime, its internal NIC melted so the update
never got propagated. Nodes B, C, and D decide oops, node A is
dead, we are stealing his lock; B takes over the lock and allocates
the same UID to a different message, and now the CEO didn't get that
notice from the SEC to save all his emails.


Once you decide you want replication, you pretty much have to go all the
way to synchronous replication, and now you have a learning curve and
complexity issue that's going to be there whether it's dovecot
replication, or a cluster filesystem that's doing the dirty work for
you.


-- 
Troy Benjegerdes 'da hozer' [EMAIL PROTECTED]

Someone asked me why I work on this free (http://www.fsf.org/philosophy/)
software stuff and not get a real job. Charles Schulz had the best answer:

Why do musicians compose symphonies and poets write poems? They do it
because life wouldn't have any meaning for them if they didn't. That's why
I draw cartoons. It's my life. -- Charles Schulz


Re: [Dovecot] Replication plans

2007-05-21 Thread Steffen Kaiser


On Thu, 17 May 2007, Timo Sirainen wrote:

Hello,

OpenLDAP uses another strategy, which is more robust, i.e. needs less 
fragile interaction between the servers.


OpenLDAP stores every transaction into a replication log file after it has 
been processed locally. The repl file is frequently read by another daemon 
(slurpd) and forwarded to the slaves. If the forward to one particular 
slave fails, the transaction is placed into a host-specific rejection log 
file. OpenLDAP exploits the fact that any modification (update, add new, 
delete) can be expressed in command syntax; hence, the slave speaks 
the same protocol as the master.
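The replay side of that scheme can be sketched as follows: a replay agent walks the log and forwards each transaction to every slave; a transaction a slave fails to accept goes into that slave's rejection list, and later entries for that slave are queued behind it so ordering is preserved. Everything here is an illustrative stand-in, not OpenLDAP's actual implementation.

```python
def replay_log(log_entries, slaves, send):
    """Forward each logged transaction to each slave via send(slave, entry),
    which returns True on success. Failed (and subsequently queued) entries
    are collected per slave instead of blocking the other slaves."""
    rejected = {slave: [] for slave in slaves}
    for entry in log_entries:
        for slave in slaves:
            if rejected[slave]:
                # Slave already fell behind: queue to keep ordering intact.
                rejected[slave].append(entry)
            elif not send(slave, entry):
                rejected[slave].append(entry)
    return rejected
```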


The biggest advantage is that the transaction has already succeeded on the 
master and is replayed to the slaves. So when pushing the message to the 
slave, you need not fiddle with decreasing UIDs, for instance, in order to 
perform a partial sync of a known-good-state mailbox. And the transaction 
is saved in the replay log file in case the master process/host 
crashes.


I think, if the replication log is replayed quickly - e.g. by tailing the 
file - you can effectively separate the problem of non-reacting slaves and 
re-replay for slaves that come up later, and have quasi-immediate updates 
of the slaves. Also, because one replay agent per slave can be used, all 
interaction with a slave is sequential. You wrote something about avoiding 
files; what about making the repl log file a socket? Then the frontend is 
dealing with the IMAP client and forwards the request to the replayer, and 
is, therefore, not affected by possibly bad network issues to the slaves.


You cannot have OpenLDAP's advantage of using the same IMAP protocol 
for the slaves, because of some restrictions. You want to have a 100% 
replica, as I understand it; hence, the UIDs et al. need to be equal.

So you will probably need to improve the IMAP protocol by:

APPEND/STORE with UID.

The message will be spooled with the same UID on the slave. As you've 
written, it SHOULD NOT happen that the slave fails, but if the operation is 
impossible, due to some wicked out-of-sync state, the slave reports back 
and requests a full resync. The replay agent would then drop any changes 
in the transaction for the specific host and mailbox and sync the whole 
mailbox with the client, probably using something like rsync?


BTW: It would be good, if the resyncs can be initiated on admin request, 
too ;-)


For the dial-up situation you've mentioned (laptop with its own server), the 
replay agent would store any changes until the slave comes up, probably by 
contacting the master Dovecot process and issuing something like SMTP 
ETRN.


When the probability is low that the same mailbox is accessible on 
different hosts (for shared folders multiple accesses are likely), this 
method should even work well in multi-master situations. You'll have to 
run replay agents on all the servers then.


To get the issues with the UIDs right when one mailbox is in use on 
different hosts, you thought about locks. But is this necessary?


If only the UIDs are the problem, then with a method to mark a UID as 
taken throughout multiple masters, all masters will have the same UID 
level, not necessarily with the message data already associated, meaning:


master A is to APPEND a new message to mailbox M,
it sends all other masters the info: "want to take UID U".
If the UID is already taken by another master B, B replies "UID taken"; 
then the mailboxes are out-of-sync and need a full resync.
If a master B receives a request for UID U that it has sent an election 
for itself, masters A and B are ranked, e.g. by IP address, so master B 
replies either "you may take it" or "I want to take it". In the first case, 
master B re-issues its request for another UID U2 and marks UID U as 
taken.

Otherwise master B marks UID U as taken in mailbox M.

If master A got the OK for UID U, it allocates it finally, accepts 
the message from the IMAP/SMTP client, and places the message into the 
replay log file.


When a master B now gets a transaction "STORE message as UID U" for a UID 
marked as taken but with no message yet, it accepts the transaction.



doesn't make sure that messages themselves aren't lost. If the master's
hard disk failed completely and it can't be resynced anymore, the
messages saved there are lost. This could be avoided by making the
saving wait until the message is saved to slave:

 - save mail to disk, and at the same time also send it to slave
 - allocate UID(s) and tell to slave what they were, wait for mail
saved reply from slave before committing the messages to mailbox
permanently
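The two steps quoted above amount to a save that is only committed once the slave has acknowledged it. A minimal sketch, with `slave` as a stand-in for the real replication link (all names illustrative):

```python
class SyncSaveError(Exception):
    """Raised when the slave never confirmed the save."""

def synchronous_save(message, local_store, slave, next_uid):
    """Save locally and ship to the slave; commit only on the slave's ack."""
    local_store.append((next_uid, message))   # save to disk...
    if not slave.send(next_uid, message):     # ...and send to the slave
        local_store.pop()                     # no ack: don't commit locally
        raise SyncSaveError("slave did not confirm save")
    return next_uid + 1                       # committed: advance UID counter
```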


Well, this assumes that everything is functioning hyper-well.
Preserving a hard disk should not be the job of Dovecot but of the 
underlying filesystem, IMHO (aka RAID, SAN).


If you want to wait for every transaction until all slaves have given their 
OK, you'll have problems with the slave-server-on-laptop scenario.

Then you'll need to 

Re: [Dovecot] Replication plans

2007-05-21 Thread Francisco Reyes

Troy Benjegerdes writes:


But that's currently not *really* replicated. The real question I guess
is why not use a cluster/distributed/san filesystem like AFS, GFS,


Because those distributed filesystems may be more difficult to set up, more 
difficult to maintain, and may be less portable than a dovecot solution.


For example, Gluster really sounds like a great distributed filesystem, but 
it currently does not work on FreeBSD.


If the company I work for wanted to use Gluster we would need to either 
learn Linux or hire someone to set up and maintain the Linux boxes for us.



I'd suggest that multi-master replication be implemented on a
per-mailbox basis


I suggest we forget about multi-master for now. :-)
Some of us would rather see something sooner rather than later.


I like the idea of dovecot having a built-in proxy/redirect ability so a
cluster can be done with plain old round-robin DNS.


Round-robin DNS is ok for a small setup, but not good enough for larger 
setups.



most cases, if there's a dead server in the round-robin DNS pool, most
clients will retry automatically, or the user will get an error



And you will get dozens of calls of users asking what is going on, and why 
are things slow, etc. etc... and there goes half your morning/afternoon.


Even with a small TTL, Outlook is so flaky that oftentimes after a brief 
problem Outlook needs to be restarted.


Where I work we may have a 15 minute problem that takes us 2 hours of calls 
to handle.. in large part because of outlook or people just wanted to know 
what happened.. and expecting a call back with an explanation.


Failover needs to be seamless and without error.
Either have a proxy from dovecot in front or a load balancer.


Re: [Dovecot] Replication plans

2007-05-21 Thread Francisco Reyes

Timo Sirainen writes:


Then there are also people who would want to run Dovecot on their laptop
and have it synchronize with the main server whenever network connection
is available.


YES!

I had not thought of that, but that would be killer.. although that would be 
multi-master which I think would be really difficult to implement.. :-(



I was hoping to find an existing global lock manager. I think there
exist some. If not, then I'll have to spend some more time thinking
 about it. Anyway, not everyone wants to use AFS, so I can't rely on
it :)



It is not that people don't want to.. it is that sometimes they can't.
For example, where I work we are keeping a close eye on Gluster, but it 
currently does not work on FreeBSD. Some of the other distributed 
filesystems are either not as mature on FreeBSD or are nonexistent.


Re: [Dovecot] Replication plans

2007-05-21 Thread Francisco Reyes

Timo Sirainen writes:


Master keeps all the changes in memory until slave has replied that it
has committed the changes. If the memory buffer gets too large (1MB?)


Does this mean that in case of a crash all of that would be lost?
I think the cache should be smaller.


because the slave is handling the input too slowly or because it's
completely dead, the master starts writing the buffer to a file. Once
the slave is again responding the changes are read from the file and
finally the file gets deleted.


Good.


If the file gets too large (10MB?) it's deleted and slave will require a
resync.


Don't agree.
A large mailstore with Gigabytes worth of mail would benefit from having 
10MB synced... instead of re-starting from scratch.



Master always keeps track of the user/mailbox -> last transaction
sequence mapping in memory. When the slave comes back up and tells the master
its last committed sequence, this allows the master to resync only those
mailboxes that had changed.


I think a user-configurable option to decide how large the sync files can 
grow would be most flexible.
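The buffering scheme under discussion, with both thresholds made configurable as suggested, can be sketched like this: changes accumulate in memory up to a limit, then spill to a file; if the file itself would overflow, the backlog is dropped and the slave must do a full resync. A list stands in for the spill file; all names and defaults are illustrative.

```python
class ChangeBuffer:
    def __init__(self, mem_limit=1 << 20, file_limit=10 << 20):
        self.mem_limit, self.file_limit = mem_limit, file_limit
        self.mem, self.mem_size = [], 0
        self.file, self.file_size = [], 0   # stand-in for the on-disk file
        self.needs_resync = False

    def add(self, change):
        if self.needs_resync:
            return                          # backlog dropped; resync pending
        if self.mem_size + len(change) <= self.mem_limit and not self.file:
            self.mem.append(change)         # slave keeping up: buffer in RAM
            self.mem_size += len(change)
        elif self.file_size + len(change) <= self.file_limit:
            self.file.append(change)        # slave behind: spill to file
            self.file_size += len(change)
        else:
            self.mem, self.file = [], []    # too far behind: full resync
            self.needs_resync = True
```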




the whole slave. Another way would be to just mark that one user or
mailbox as dirty and try to resync it once in a while.


That sounds better.
A full resync can be very time consuming with a large and busy mailstore.
Not only does the full amount of data need to be synced, but new changes too. 


queues. The communication protocol would be binary


Because? Performance? Wouldn't that make debugging more difficult?


dovecot-replication process would need read/write access to all users'
mailboxes. So either it would run as root or it would need to have at
least group-permission rights to all mailboxes. A bit more complex
solution would be to use multiple processes each running with their own
UIDs, but I think I won't implement this yet.


For now pick the easiest approach to get this first version out.
This will allow testers to have something to stress test. Better to have 
some basics out.. get feedback.. than to try to go after a more complex 
approach; unless you believe the complex approach is the ultimate long term 
best method.
  

But it should be possible to split users into multiple slaves (still one
slave/user). The most configurable way to do this would be to have
userdb return the slave host.


Why not just have 1 slave process per slave machine?



This is the most important thing to get right, and also the most complex
one. Besides replicating mails that are being saved via Dovecot, I think
also externally saved mails should be replicated when they're first
seen. This is somewhat related to doing an initial sync to a slave.


Why not go with a pure log-replication scheme?
This way you basically have 3 processes:

1- The normal, currently existing programs, with logging added.
2- A master replication process which listens for clients requesting 
info.
3- The slave processes that request information and write it to the slave 
machines.


With this approach you can basically break it down into logical units of 
code which can be tested and debugged. Also helps when you need to worry 
about security and the level at which each component needs to work.
  

The biggest problem with saving is how to robustly handle master
crashes. If you're just pushing changes from master to slave and the
master dies, it's entirely possible that some of the new messages that
were already saved in master didn't get through to slave.


With my suggested method that would, in theory, never happen.
A message doesn't get accepted unless the log gets written (if replication 
is on).


If the master dies, when it gets restarted it should be able to continue.   


  - If save/copy is aborted, tell the slave to decrease the UID counter
by the number of aborted messages.


Are you planning to have a single slave, or did you plan to allow multiple 
slaves? If allowing multiple slaves, you will need to keep track of the 
point in the log each slave is at. An easier approach is a time-based 
setting for how long the master keeps logs.
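That time-based retention could look something like the sketch below: the master never tracks per-slave positions, it just prunes log entries older than a configurable age; a slave asking for something already pruned would get a full resync instead. Illustrative only.

```python
def prune_log(log, now, max_age):
    """log is a list of (timestamp, txn) pairs in time order. Keep only
    entries newer than now - max_age; report how many were dropped, so the
    caller knows the oldest point a slave can still catch up from."""
    cutoff = now - max_age
    kept = [(ts, txn) for ts, txn in log if ts >= cutoff]
    dropped = len(log) - len(kept)
    return kept, dropped
```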



Solution here would again be that before EXPUNGE notifications are sent
to client we'll wait for reply from slave that it had also processed the
expunge.


From all your descriptions it sounds as if you are trying to do synchronous 
replication. What I suggested is basically asynchronous replication.
I think synchronous replication is not only much more difficult, but also 
much more difficult to debug and keep in working order as the code changes.


Master/multi-slave
--

Once the master/slave is working, support for multiple slaves could be
added.


With the log-shipping method I suggested, multi-slave should not be much 
more coding to do.


In theory you could put more of the burden on the slaves to ask for 
everything from their last transaction ID onward, so the master will not 
need to know anything extra to handle multi-slaves.




After master/multi-slave is working, we're 

Re: [Dovecot] Replication plans

2007-05-21 Thread Francisco Reyes

Timo Sirainen writes:


Why not go with a pure log-replication scheme?
This way you basically have 3 processes:

1- The normal, currently existing programs, with logging added.
2- A master replication process which listens for clients requesting 
info.
3- The slave processes that request information and write it to the slave 
machines.


With this approach you can basically break it down into logical units of 
code which can be tested and debugged. Also helps when you need to worry 
about security and the level at which each component needs to work.


I'm not completely sure what you mean by these. Basically the same as
what I said, except just have imap/deliver simply send the changes
without any waiting?


I am basically suggesting to log all the changes to a log (or logs) and have 
a separate program handle passing the information on to the slaves.
 

But isn't the point of the master/slave that the slave would switch on
if the master dies?


Yes.


If you switch slave to be the new master, it doesn't
matter if the logs were written to master's disk. Sure the message could
come back when the master is again brought back


I was thinking that once a master dies, the next time it comes back it would 
be a slave and no longer master. This would, unfortunately, imply that any 
transactions that were not copied over would be lost.



I don't understand what you mean. Sure the logs could timeout at some
point (or shouldn't there be some size limits anyway?), but you'd still
need to keep track of what different slaves have seen.


I was thinking that it would be the slave's job to ask for data.
Example pseudo transactions.

Master gets transactions 1 through 100.
Slaves start from scratch and ask for transactions starting at 1.
Say one slave, let's call it A, dies at transaction 50, and another slave, 
say B, continues and goes all the way to 100.


More transactions come and now we are up to 150.
Slave B will ask for anything after 100.
When Slave A comes back it would ask for anything after 50.

The master simply gets a request for transactions after a given number, so it 
doesn't need to keep track of the status of each slave.
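The pull protocol described above is small enough to sketch directly: the master keeps an append-only transaction log and answers "give me everything after sequence N", never tracking per-slave state. Names are illustrative.

```python
class MasterLog:
    """Append-only transaction log; slaves pull by last committed sequence."""

    def __init__(self):
        self.log = []                 # transaction i is sequence i + 1

    def append(self, txn):
        self.log.append(txn)
        return len(self.log)          # sequence number of this transaction

    def after(self, seq):
        """Everything a slave still needs, given its last committed seq."""
        return list(enumerate(self.log[seq:], start=seq + 1))
```

A slave that died at sequence 50 simply asks `after(50)` when it comes back, while an up-to-date slave asking `after(len(log))` gets nothing.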



possible. It would be easier to implement much simpler master-slave
replication, but in error conditions that would almost guarantee that
some messages get lost. I want to avoid that.



If you think synchronous replication is doable.. go for it. :-)
It is a tradeoff of reliability vs speed.
Especially as the number of slaves grows, the communication will grow.


Sure. I wasn't planning on implementing multi-slave or multi-master
before the master/slave was fully working and stress testing showing
that no mails get lost even if master is killed every few seconds (and
each crash causing master/slave to switch roles randomly).


Great idea.



Re: [Dovecot] Replication plans

2007-05-18 Thread Troy Benjegerdes
On Fri, May 18, 2007 at 11:41:46AM +0900, Christian Balzer wrote:
 On Thu, 17 May 2007 19:17:25 +0300 Timo Sirainen [EMAIL PROTECTED] wrote:
 
  On Thu, 2007-05-17 at 10:04 -0500, Troy Benjegerdes wrote:
   But that's currently not *really* replicated. The real question I guess
   is why not use a cluster/distributed/san filesystem like AFS, GFS,
   Lustre, GPFS to handle the actual data, and specify that replication
   only works for maildir or other single file per message mailbox
   formats.
  
  This already works, but people don't seem to want to use it. There are
  apparently some stability, performance, complexity and whatever
  problems. And if you're planning on replicating to a remote location far
  away they're really slow. But I haven't personally tried any of them, so
  all I can say is that people are interested in getting replication that
  doesn't require clustered filesystems, so I'll build it for them. :)
  
 I for one would rather pay you for not re-inventing the wheel, but
 if people with actual access to funds are willing to pay you for this
 then I guess take the money and run is the thing to do. :-p
 
I'm going to throw out a warning that it's my feeling that replication
has ended many otherwise worthwhile projects. Once you go down that
rabbit hole, you end up finding out the hard way that you just can't
avoid the stability, performance, complexity, and whatever problems.

If you take the money and run, just be aware of the complexity and
customer expectations you are getting yourself into ;)

 Yes, all these FS based approaches currently have one or more of
 the issues Timo lists. The question of course is, will a replicated
 dovecot be less complex, slow, etc.
 For people with money, there are enterprise level replicated file
 systems and/or hardware replicated SANs (remote locations, too).
 For people w/o those funds there are the above approaches (which
 despite all their shortcomings can work, right now) and of course 
 one frequently overlooked but perfectly fitting solution, DRBD.
 For the ZFS fanbois, there are ways to make it clustered/replicated
 as well (some StorageWorks add-on or geom tricks).
 
 The point (for me) would be to not just replicate IMAP (never mind
 that most of our users use POP, me preferring not to use the 
 dovecot LDA, etc), but ALL of the services/infrastructure that make
 up a mail system. Which leads quickly back to HA/cluster/SAN/DRBD
 for me. 

I've found myself pretty much in the same all roads lead to the
filesystem situation. I don't want to replicate just imap, I want to
replicate the build directory with my source code, my email, and my MP3
files.

So maybe the right thing to do here is have dovecot do the locking and
proxying bit, and initially use librsync for the actual replication. The
rsync bit could be replaced with plugins for various filesystems.



Re: [Dovecot] Replication plans

2007-05-18 Thread Bill Boebel
On Fri, May 18, 2007 1:42 am, Troy Benjegerdes [EMAIL PROTECTED] said:

 I'm going to throw out a warning that it's my feeling that replication
 has ended many otherwise worthwhile projects. Once you go down that
 rabbit hole, you end up finding out the hard way that you just can't
 avoid the stability, performance, complexity, and whatever problems.
 ..
 I've found myself pretty much in the same all roads lead to the
 filesystem situation. I don't want to replicate just imap, I want to
 replicate the build directory with my source code, my email, and my MP3
 files.

One of the problems with the clustered file system approach seems to be that 
accessing Dovecot's index, cache and control files is slow over the network.  
For speed, ideally you want your index, cache and control on local disk... but 
still replicated to another server.

So what about tackling this replication problem from a different angle...  Make 
it Dovecot's job to replicate the index and control files between servers, and 
make it the file system's job to replicate just the mail data.  This would 
require further disconnecting the index and control files from the mail data, 
so that there is less syncing required.  i.e. remove the need to check 
directory mtimes and to compare directory listings against the index; and 
instead assume that the indexes are always correct.  Periodically you could 
still check to see if a sync is needed, but you'd do this much less frequently.

I agree that there are already great solutions available for replicated 
storage, so this would allow us to take advantage of these solutions for the 
bulk of our storage without impacting the speed of IMAP.

Bill



Re: [Dovecot] Replication plans

2007-05-18 Thread Timo Sirainen
On Fri, 2007-05-18 at 12:20 -0400, Bill Boebel wrote:
 So what about tackling this replication problem from a different
 angle...  Make it Dovecot's job to replicate the index and control
 files between servers, and make it the file system's job to replicate
 just the mail data.  This would require further disconnecting the
 index and control files from the mail data, so that there is less
 syncing required.  i.e. remove the need to check directory mtimes and
 to compare directory listings against the index; and instead assume
 that the indexes are always correct.  Periodically you could still
 check to see if a sync is needed, but you'd do this much less
 frequently.

This would practically mean that you want either cydir or dbox storage.

This kind of a hybrid replication / clustered filesystem implementation
would simplify the replication a bit, but all the difficult things
related to UID conflicts etc. will still be there. So I wouldn't mind
implementing this, but I think implementing the message content sending
via TCP socket wouldn't add much more code anymore.

The clustered filesystem could probably be used to simplify some things
though; for example, UID allocation could be done by renaming a "uid-<next
uid>" file. If the rename() succeeded, you allocated the UID; otherwise
someone else did and you'll have to find the new filename and try again.
But I'm not sure if this kind of a special-case handling would be good.
Unless of course I decide to use the same thing for non-replicated
cydir/dbox.
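The rename() trick works because renaming a file that no longer exists fails atomically: whoever successfully renames the marker "uid-N" to "uid-N+1" has allocated UID N. A sketch under the assumption that the next UID is encoded in the marker file's name (the naming is my illustration, not Dovecot's):

```python
import os

def allocate_uid(mailbox_dir):
    """Allocate the next UID by atomically renaming the uid marker file.

    If the rename fails because another process renamed the marker first,
    re-read the directory and try again with the new filename.
    """
    while True:
        marker = next(f for f in os.listdir(mailbox_dir)
                      if f.startswith("uid-"))
        uid = int(marker.split("-", 1)[1])
        try:
            os.rename(os.path.join(mailbox_dir, marker),
                      os.path.join(mailbox_dir, f"uid-{uid + 1}"))
            return uid                  # our rename won the race: UID is ours
        except FileNotFoundError:
            continue                    # someone else renamed first; retry
```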





Re: [Dovecot] Replication plans

2007-05-18 Thread Bill Boebel
On Fri, May 18, 2007 1:10 pm, Timo Sirainen [EMAIL PROTECTED] said:

 On Fri, 2007-05-18 at 12:20 -0400, Bill Boebel wrote:
 So what about tackling this replication problem from a different
 angle...  Make it Dovecot's job to replicate the index and control
 files between servers, and make it the file system's job to replicate
 just the mail data.  This would require further disconnecting the
 index and control files from the mail data, so that there is less
 syncing required.  i.e. remove the need to check directory mtimes and
 to compare directory listings against the index; and instead assume
 that the indexes are always correct.  Periodically you could still
 check to see if a sync is needed, but you'd do this much less
 frequently.
 
 This would practically mean that you want either cydir or dbox storage.
 
 This kind of a hybrid replication / clustered filesystem implementation
 would simplify the replication a bit, but all the difficult things
 related to UID conflicts etc. will still be there. So I wouldn't mind
 implementing this, but I think implementing the message content sending
 via TCP socket wouldn't add much more code anymore.
 
 The clustered filesystem could probably be used to simplify some things
 though; for example, UID allocation could be done by renaming a "uid-<next
 uid>" file. If the rename() succeeded, you allocated the UID; otherwise
 someone else did and you'll have to find the new filename and try again.
 But I'm not sure if this kind of a special-case handling would be good.
 Unless of course I decide to use the same thing for non-replicated
 cydir/dbox.

Dbox definitely sounds promising.  I'd avoid putting this "uid-<next uid>" file 
in the same location as the mail storage though, because if you can truly keep 
mail storage isolated and infrequently accessed, then you can do cool things 
like store your mail data remotely on Amazon S3 or equivalent.




Re: [Dovecot] Replication plans

2007-05-18 Thread Troy Benjegerdes
On Fri, May 18, 2007 at 12:20:13PM -0400, Bill Boebel wrote:
 On Fri, May 18, 2007 1:42 am, Troy Benjegerdes [EMAIL PROTECTED] said:
 
  I'm going to throw out a warning that it's my feeling that replication
  has ended many otherwise worthwhile projects. Once you go down that
  rabbit hole, you end up finding out the hard way that you just can't
  avoid the stability, performance, complexity, and whatever problems.
  ..
  I've found myself pretty much in the same all roads lead to the
  filesystem situation. I don't want to replicate just imap, I want to
  replicate the build directory with my source code, my email, and my MP3
  files.
 
 One of the problems with the clustered file system approach seems to be that 
 accessing Dovecot's index, cache and control files is slow over the network. 
  For speed, ideally you want your index, cache and control on local disk... 
 but still replicated to another server.


Don't assume that the network is slower than disk.. Both InfiniBand and
10 Gigabit Ethernet are about 10-20 times faster in raw bandwidth than a
single disk spindle, and around 100-1000 times lower latency if you can
get the data out of another node's RAM (10 or 100 microseconds instead
of 10 milliseconds for a disk seek). 

If what you want is speed, you want to keep the data in RAM... or at
least in the RAM-backed OS buffer cache.. If the index, cache, and
control files can be replicated to every node and still leave say, half
the memory for actual message data, you win. If the replicated data
files start pushing each other out of memory, you lose, and would be
better off with the proxy approach where each node can be responsible
for a portion of the index, cache, and control files.
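The arithmetic behind those two paragraphs is simple enough to write down: a remote-RAM fetch at roughly 100 us beats a 10 ms disk seek by about two orders of magnitude, and the win only holds while fully replicated index/cache/control data still leaves RAM for message data. All figures are illustrative assumptions.

```python
DISK_SEEK = 10e-3      # ~10 ms per disk seek
REMOTE_RAM = 100e-6    # ~100 us to fetch from another node's RAM

speedup = DISK_SEEK / REMOTE_RAM   # remote RAM vs local disk seek

def ram_left_for_mail(total_ram, index_per_node, nodes):
    """RAM left for message data if every node replicates every node's
    index/cache/control files (the 'you win / you lose' tipping point)."""
    return total_ram - index_per_node * nodes
```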

For what it's worth, AFS 'replicates' the file data to a local
disk cache.. Linux NFS with cachefs will also support a local
disk-cache backed network filesystem. Where AFS (and probably
nfs+cachefs) falls down is when the files (or directories) are changing
a lot and you have to go back to the server all the time to fetch a new
version. So maildir is a big win, except when a new message gets
delivered and the clients all have to go fetch a new directory list from
the fileserver.


 So what about tackling this replication problem from a different angle...  
 Make it Dovecot's job to replicate the index and control files between 
 servers, and make it the file system's job to replicate just the mail data.  
 This would require further disconnecting the index and control files from the 
 mail data, so that there is less syncing required.  i.e. remove the need to 
 check directory mtimes and to compare directory listings against the index; 
 and instead assume that the indexes are always correct.  Periodically you 
 could still check to see if a sync is needed, but you'd do this much less 
 frequently.
 
 I agree that there are already great solutions available for replicated 
 storage, so this would allow us to take advantage of these solutions for the 
 bulk of our storage without impacting the speed of IMAP.


I suppose that to really be able to reduce the mtime lookups and
syncing, you'd probably need to use dbox so that there isn't the
possibility of some other program accessing the maildirs.


Re: [Dovecot] Replication plans

2007-05-17 Thread Troy Benjegerdes
My first reaction is that I've already got replication by running
dovecot on AFS ;)

But that's currently not *really* replicated. The real question I guess
is why not use a cluster/distributed/san filesystem like AFS, GFS,
Lustre, GPFS to handle the actual data, and specify that replication
only works for maildir or other single file per message mailbox formats.

This puts about half the replication problem on the filesystem, and I
would hope keeps the dovecot code a lot simpler. The hard part for
dovecot is understanding the locking semantics of various filesystems,
and doing locking in a way that doesn't totally trash performance.

I also tend to have imap clients open on multiple machines, so the
assumption that a user's mailbox will only be accessed from 1 IP is
probably a bad one.

I'd suggest that multi-master replication be implemented on a
per-mailbox basis, so that for any mailbox, the first dovecot instance
that gets a connection for that mailbox checks with a 'dovelock' process
that can either use a shared filesystem as a backend, or implement its
own TCP-based lock manager. It would then get a master lock on that
mailbox.
If a second dovecot gets a connection to a mailbox with a master lock,
it would act as a proxy and redirect all writes to the master. In the
case of a shared filesystem, all reads can still come straight from the
local filesystem. In the case of AFS, this might hit the directory
lookups kind of hard, but I think it would still be a big win because of
local caching of all the maildir files, which for the most part are
read-only.
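The first-connection-wins scheme above can be sketched as follows. This is a toy model, not a real Dovecot or dovelock API -- the names (LockManager, acquire, holder) are illustrative, and a real backend would be a shared filesystem or a TCP lock service rather than an in-memory dict:

```python
# Toy sketch of the per-mailbox master-lock decision: the first server
# to reach 'dovelock' becomes master for that mailbox; later arrivals
# learn who the master is and proxy their writes to it.

class LockManager:
    """In-memory stand-in for a shared-FS or TCP-based lock backend."""
    def __init__(self):
        self.masters = {}          # mailbox -> server currently holding the lock

    def acquire(self, mailbox, server):
        # setdefault is atomic here; a real backend needs an atomic
        # compare-and-set (e.g. O_EXCL lock file or a lock RPC).
        holder = self.masters.setdefault(mailbox, server)
        return holder == server    # True: we are master; False: proxy writes

    def holder(self, mailbox):
        return self.masters.get(mailbox)

lm = LockManager()
print(lm.acquire("troy/INBOX", "node-a"))   # node-a becomes master: True
print(lm.acquire("troy/INBOX", "node-b"))   # node-b must proxy: False
print(lm.holder("troy/INBOX"))              # node-a
```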

In the event that a master crashes or stops responding, the dovelock
process would need a mechanism to revoke a master lock.. this might
require some filesystem-specific glue to make sure that once a lock is
revoked, the dovecot process that had its lock revoked can't write to
the filesystem anymore. I can think of at least one way to do this with
AFS by having the AFS server invalidate authentication tokens for the
process/server that the lock was revoked from.

Once the lock is revoked, the remaining process can grab it and go on
happily with life. Once the dead/crashed process comes back, it just has
to have its filesystem go out and update to the latest files.
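The generic form of that "revoked holder can't write anymore" requirement is a fencing token: every lock grant carries a generation number, and the storage layer rejects writes that present a stale one. Invalidating AFS authentication tokens is one concrete mechanism for the same thing; the sketch below (all names hypothetical) shows the idea in the abstract:

```python
# Fencing-token sketch: each (re)grant of the master lock bumps an epoch,
# and writes must present the epoch they were granted. A holder whose
# lock was revoked and regranted is fenced off even if it thinks it is
# still master.

class FencedStore:
    def __init__(self):
        self.epoch = 0             # bumped on every lock (re)grant

    def grant_lock(self):
        self.epoch += 1
        return self.epoch          # fencing token the holder must present

    def write(self, token, data):
        if token != self.epoch:    # stale token: lock was revoked + regranted
            raise PermissionError("lock revoked; write fenced off")
        return f"wrote {data}"

store = FencedStore()
t_old = store.grant_lock()         # original master's token
t_new = store.grant_lock()         # lock revoked, regranted to another node
print(store.write(t_new, "uidlist"))   # new holder succeeds
try:
    store.write(t_old, "uidlist")      # old holder is fenced off
except PermissionError as e:
    print(e)
```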

I like the idea of dovecot having a built-in proxy/redirect ability so a
cluster can be done with plain old round-robin DNS. I also think that in
most cases, if there's a dead server in the round-robin DNS pool, most
clients will retry automatically, or the user will get an error, click
okay, and try again. Having designated proxy or redirectors in the
middle just makes things complicated and hard to debug.

On Thu, May 17, 2007 at 04:45:21PM +0300, Timo Sirainen wrote:
 Several companies have been interested in getting replication support
 for Dovecot. It looks like I could begin implementing it in a few months
 after some other changes. So here's how I'm planning on doing it:
 
 What needs to be replicated:
 
  - Saving new mails (IMAP, deliver)
  - Copying existing mails
  - Expunges
  - Flag and keyword changes
  - Mailbox creations, deletions and renames
  - Subscription list
  - IMAP extension changes, such as ACLs
 
 I'll first talk about only the master/slave configuration, and later
 about master/multi-slave and multi-master.
 
 Comments?
 
 Basic design
 
 
 Since the whole point of master-slave replication would be to get a
 reliable service, I'll want to make the replication as reliable as
 possible. It would be easier to implement much simpler master-slave
 replication, but in error conditions that would almost guarantee that
 some messages get lost. I want to avoid that.
 
 Both slave and master would be running a dovecot-replication process.
 They would talk to each other via a TCP connection. The master would be
 feeding changes (transactions) to the slave, and the slave would keep
 replying how far it has committed the changes. So there would be an
 increasing sequence number sent with each transaction.
 
 The master keeps all the changes in memory until the slave has replied
 that it has committed them. If the memory buffer gets too large (1MB?)
 because the slave is handling the input too slowly, or because it's
 completely dead, the master starts writing the buffer to a file. Once
 the slave is responding again, the changes are read from the file and
 finally the file gets deleted.
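That memory-then-file buffering could be sketched roughly like this -- a hypothetical structure, not Dovecot's actual implementation, with the 1MB/10MB thresholds taken from the paragraph above and the spill file modelled as a byte counter:

```python
# Sketch of the master-side change buffer: changes queue in memory with
# increasing sequence numbers, spill to a file past ~1 MB, and are
# abandoned entirely (forcing a full resync) past ~10 MB.

MEM_LIMIT = 1 * 1024 * 1024        # ~1 MB before spilling to disk
FILE_LIMIT = 10 * 1024 * 1024      # ~10 MB before giving up -> resync

class ChangeBuffer:
    def __init__(self):
        self.seq = 0
        self.mem = []              # (seq, payload) pairs not yet acked
        self.mem_size = 0
        self.spilled_size = 0      # bytes pushed to the spill file
        self.needs_resync = False

    def add(self, payload: bytes):
        self.seq += 1
        if self.needs_resync:
            return                 # slave must resync; stop queueing
        self.mem.append((self.seq, payload))
        self.mem_size += len(payload)
        if self.mem_size > MEM_LIMIT:
            self.spilled_size += self.mem_size   # "write buffer to file"
            self.mem, self.mem_size = [], 0
            if self.spilled_size > FILE_LIMIT:
                self.needs_resync = True         # delete file, require resync

    def ack(self, committed_seq: int):
        # Slave reports how far it has committed; drop acked changes.
        self.mem = [(s, p) for s, p in self.mem if s > committed_seq]
        self.mem_size = sum(len(p) for _, p in self.mem)
```

Keying everything by the sequence number is what lets a reconnecting slave resume with nothing more than its last committed sequence.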
 
 If the file gets too large (10MB?) it's deleted and the slave will
 require a resync. The master always keeps an in-memory mapping of
 user/mailbox -> last transaction sequence. When the slave comes back up
 and tells the master its last committed sequence, this allows the
 master to resync only those mailboxes that have changed. If this
 mapping doesn't exist (master was restarted), all the users' mailboxes
 need to be resynced.
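That resync decision is simple enough to sketch directly (hypothetical names; in reality the "map lost" case would be detected from the master's restart, not passed in):

```python
# Sketch of resync selection: the master's in-memory map of
# mailbox -> last transaction sequence lets a reconnecting slave resync
# only the mailboxes changed after its last committed sequence. If the
# map was lost (master restart), everything must be resynced.

def mailboxes_to_resync(last_seq_by_mailbox, slave_committed_seq):
    if last_seq_by_mailbox is None:        # master restarted, mapping lost
        return "ALL"
    return sorted(mb for mb, seq in last_seq_by_mailbox.items()
                  if seq > slave_committed_seq)

state = {"troy/INBOX": 120, "troy/Sent": 95, "troy/Spam": 130}
print(mailboxes_to_resync(state, 100))   # only mailboxes past seq 100
print(mailboxes_to_resync(None, 100))    # mapping lost -> full resync
```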
 
 The resyncing will also have to deal with new mailboxes, ACL changes and
 such. I'm not completely sure yet how the resyncing will work. Probably
 the only important thing here is that it 

Re: [Dovecot] Replication plans

2007-05-17 Thread Aredridel
Jonathan wrote:
 Hi Timo,

 MySQL gets around the problem of multiple masters allocating the
 same primary key, by giving each server its own address range
 (e.g. first machine uses 1,5,9,13 next one uses 2,6,10,14,...).
 Would this work for UIDs?
UIDs have to be sequential, which causes some problems.
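The offset scheme from the question is easy to sketch, and the sketch also shows where it rubs against IMAP: within a mailbox, UIDs must be handed out in ascending order, but two masters allocating from interleaved ranges can give a later message a lower UID than an earlier message got elsewhere. (The allocator below is illustrative, modelled on MySQL's auto_increment_increment/auto_increment_offset idea.)

```python
# MySQL-style interleaved UID allocation: each of N masters draws from
# its own residue class (server 1: 1,3,5,...; server 2: 2,4,6,...).
# The IMAP problem: UID order must match delivery order per mailbox,
# but independent masters can violate that once their streams merge.

def make_allocator(offset, n_servers):
    uid = offset - n_servers
    def next_uid():
        nonlocal uid
        uid += n_servers
        return uid
    return next_uid

a = make_allocator(1, 2)           # server A allocates 1, 3, 5, ...
b = make_allocator(2, 2)           # server B allocates 2, 4, 6, ...

print(a(), a(), a())               # A delivers three messages first: 1 3 5
print(b())                         # B then delivers one message: UID 2 < 5
```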


Re: [Dovecot] Replication plans

2007-05-17 Thread Timo Sirainen
On Thu, 2007-05-17 at 10:04 -0500, Troy Benjegerdes wrote:
 But that's currently not *really* replicated. The real question I guess
 is why not use a cluster/distributed/san filesystem like AFS, GFS,
 Lustre, GPFS to handle the actual data, and specify that replication
 only works for maildir or other single file per message mailbox formats.

This already works, but people don't seem to want to use it. There are
apparently some stability, performance, complexity and whatever
problems. And if you're planning on replicating to a remote location far
away they're really slow. But I haven't personally tried any of them, so
all I can say is that people are interested in getting replication that
doesn't require clustered filesystems, so I'll build it for them. :)

Then there are also these special users, such as me. :) I'm now running
my own Dovecot server on the same server that receives my emails. I want
to use my laptop from different locations, so I want to keep that
working. But then again my desktop computer is faster than the remote
server, so I'd want to run another Dovecot in the desktop which
replicates with the remote server.

Then there are also people who would want to run Dovecot on their laptop
and have it synchronize with the main server whenever network connection
is available.

 I also tend to have imap clients open on multiple machines, so the
 assumption that a user's mailbox will only be accessed from 1 IP is
 probably a bad one.

Yes, I know. But usually there's only one client that's actively
modifying the mailbox. The readers don't need global locking because
they're not changing anything. Except \Recent flag updates..

 If a second dovecot gets a connection to a mailbox with a master lock,
 it would act as a proxy and redirect all writes to the master.

Actually this was my original plan, but then I decided that lock
transfer is probably more efficient than proxying. But I'll add support
for proxying in any case for accessing shared mailboxes in other
servers, so if you prefer proxying it'll be easy to support too.

 In the event that a master crashes or stops responding, the dovelock
 process would need a mechanism to revoke a master lock.. this might
 require some filesystem-specific glue to make sure that once a lock is
 revoked the dovecot process that had its lock revoked can't write to
 the filesystem anymore. I can think of at least one way to do this with
 AFS by having the AFS server invalidate authentication tokens for the
 process/server that the lock was revoked from.

I was hoping to find an existing global lock manager. I think there
exist some. If not, then I'll have to spend some more time thinking
about it. Anyway, not everyone wants to use AFS, so I can't rely on
it :)


