Tom Brown wrote:
> Hi
> 
> I need to design and then build a clustered setup that is scalable that 
> will distribute our MTA's across 2 of our datacentres. I have 4 boxes so 
> there will be 2 in each location, more should be able to be added later 
> if required. I will be configuring this in a master/standby config 
> rather than balancing the load between them. I am thinking about running 
> a linux ha cluster on these boxes and just treating all 4 as 2 sets of 2.
> 
> I guess my questions are does exim play nicely in a linux ha type 
> situation and if not what other ways can be employed to maintain a ha 
> cluster of mta's ?
> 
> thanks
> 

I cannot answer the Linux HA part. (FreeBSD here)

But, w/r '...other ways..' - Exim is *golden* - especially w/r managing ssl/tls
certs and such.

The concept of hot & standby is sound, with or without an actual cluster.

When co-located and in the same IP block, it is dead-easy to manage two
'ordinary' boxen w/r failover & restoral, so a formal cluster has just not been
an issue in our camp.

- Syncing message store is the only real challenge, and that is not a 
show-stopper.

- Conventional secondary and subsequent MX are not 100% predictable as to where
inbound traffic may end up, and getting it to where it may be read by pop or
imap w/o need for the users to alter MUA settings can increase complexity,
raise box-count, and add latency. We have chosen to publish just one MX.
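Publishing a single MX boils down to a one-line zone entry. A hypothetical sketch (domain names and IP are invented; the A record points at the movable 'public' IP):

```
; zone file for example.org -- hypothetical
example.org.       IN  MX  10  mail.example.org.   ; the one and only MX
mail.example.org.  IN  A   198.51.100.21           ; 'public' IP, movable between boxes
```

With no secondary MX, inbound mail simply queues at the senders until whichever box holds the takeover IP answers.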

- it is faster and 'cleaner' to repoint BOTH smtp and pop/imap to the standby
by means of IP-takeover than by DNS changes. No MUA changes required.

- IMNSHO, maintenance of a 'prime' and 'secondary' is less work, especially
when each is really a 'prime' that can carry double and is in day-to-day
service, so you know it has not gone off - or out-of-date - while sitting on
standby.

Our approach:

Each of two 'heavy' 2U servers (Tyan MB, dual Gig-E, Core-Duo CPU, 4 GB RAM,
triple RAID1 arrays) has an 'always mine' frontside IP - primarily for ssh
access.

Each also has a 'public' IP which may be downed and taken over by the other box.
This is where the DNS points each 'set' of domains.

- These two (or more) IP are aliased onto the same 'external' NIC.
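On FreeBSD (as here), that aliasing can live in rc.conf so it survives a reboot. A sketch with made-up interface name and addresses:

```
# /etc/rc.conf on one server -- hypothetical NIC name and IPs
ifconfig_em0="inet 198.51.100.11 netmask 255.255.255.0"          # 'always mine' frontside IP
ifconfig_em0_alias0="inet 198.51.100.21 netmask 255.255.255.255" # 'public' IP for this set of domains
```

The 'public' alias is the one that gets downed and re-raised on the other box during failover.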

- On another NIC, each has an 'internal' IP on a backside LAN. Primary use is 
data exchange & local storage, but it also serves to ssh from another box 
if/as/when both frontside IP are wanted offline.

- *Normally* each server handles its own separate set of virtual domains (per
what the DNS points to), ergo nothing is really 'hot' or 'standby' - just two
lightly-loaded servers, each with more than enough reserve to do the entire job
of both.

The config's are identical, and both servers have each other's certs available.

The virtual user DB (PostgreSQL in this case, but it need not be so) is also 
identical, i.e. - each server has all the data and storage structures it needs 
to do BOTH sets of domains and virtual users for smtp and IMAP.

The 'master' DB is on a third, 1U Via C3 single-RAID1 box, which does not have
to be on the same site (though ours are). Draws about 12W or less and lasts a
long, long time.

Changes to the user DB may be made here. If it is hors de combat, traffic is
not significantly affected, other than as to new users or spam filter
preferences.

Manual changes may be made directly to the two main boxen DB if there is a long 
outage.
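Keeping the two main boxen in step with the master DB can be as simple as a periodic dump-and-restore, assuming plain PostgreSQL tooling (hosts and DB name below are invented; trigger- or log-based replication would also do the job):

```
# run on the 'master' DB box; pushes a fresh copy to both main servers
pg_dump -Fc vmail | ssh serverA-backside 'pg_restore --clean -d vmail'
pg_dump -Fc vmail | ssh serverB-backside 'pg_restore --clean -d vmail'
```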

Day-to-day syncing:

For light loading, periodic rsync may be 'good enough' to keep bothway message
stores reasonably current. NB: Our users ordinarily also have local sync'ed
IMAP copies, so even if the mailstore on the servers is not current, they will
still be able to refer to older messages.
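A crontab fragment for that periodic rsync might look like this (paths, peer hostname, and interval are made up for illustration):

```
# /etc/crontab -- mirror our half of the mailstore to the peer every 15 minutes
*/15  *  *  *  *  root  rsync -a --delete /var/mail/virtual/ peer-backside:/var/mail/virtual-peer/
```

Running it over the backside LAN keeps the sync traffic off the public interfaces.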

Shared external NAS with RAID storage can reduce that need, but becomes a 
single-point-of-failure.

Failover:

- alias the 'public' IP of the offline box to the survivor. Make sure the 
offline box does not come back on the net until you are ready for it.

Recovery:

- drop that alias when the offline box is ready to go back to work. It may be
tested on the 'private' IP before the 'public' IP is turned back on.
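In command terms, the failover and restoral are each a one-liner (FreeBSD ifconfig syntax; the interface name and address are invented):

```
# failover: survivor takes over the failed box's 'public' IP
ifconfig em0 alias 198.51.100.22 netmask 255.255.255.255

# recovery: drop the alias once the repaired box is ready to serve again
ifconfig em0 -alias 198.51.100.22
```

Upstream arp caches usually catch up within seconds; a gratuitous ARP from the survivor can speed that along.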

(we allow ip_literal for postmaster and such ...)

All else is already in-place.

Given the reliability of modern hardware and RAID (a hard failure about once
every 3 to 5 years), this seems to make better use of the resources before they
go obsolete while on standby, and has allowed us to cut our server-count and
UPS power budget roughly in half.

One of the major driving factors was to not have any significant interruption
in IMAP access (we do not offer pop).

IOW - transparency, and de facto redundancy, but with minimal 'idle' investment.

I don't know if a Linux HA cluster would make this simpler - or more complex - 
that probably depends on the expertise of the implementor. But at least I can 
say that a cluster is not mandatory [1].

HTH,

Bill

[1] I've 'presumed' a Linux environment that need not be taken offline for
application of upgrades/patches, i.e. - is normally run through a 45-60 second
reboot only after a no more than quarterly or bi-annual 'make buildworld/kernel'
cycle - or whatever the Linux equivalent is.

Exim and other upgrades and patches are done without going offline, the
listener daemon re-hupped in a few seconds.
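The re-hup itself, assuming a stock Exim daemon and its usual pid file in the spool directory (paths differ per install):

```
# after installing the new exim binary, make the running daemon re-exec itself
kill -HUP `cat /var/spool/exim/exim-daemon.pid`
```

On SIGHUP the listening daemon re-execs, picking up the new binary and configuration without dropping the listener for more than an instant.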


-- 
## List details at http://www.exim.org/mailman/listinfo/exim-users 
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://www.exim.org/eximwiki/
