Re: Cyrus crashed on redundant platform - need better availability?

2004-09-16 Thread Paul Dekkers
Hi,
Ken Murchison wrote:
I wouldn't hold out hope of anything being available in some months.
I wrote my replication code two years ago, and submitted it to Rob 
and Ken about this time last year. Neither I nor they have put any 
significant work into the code since then. As I indicated in my 
previous message, we all have other priorities right now.
I can imagine, but I hoped that priorities would change a bit with 
the amount of users that repeatedly showed
This link appears dead.  All I get is To clipboard.
Oops. There was never supposed to be a link :-)
interest in this feature and with the money we are willing to put in :-|
I'm willing to work on it if there is money available.  You are the 
only one that has said that you would commit money.  Where are the 
rest of the folks?  Based on the number of people that stepped up to 
pay for virtdomains support (zero), I'm guessing there are fewer out 
there willing to spend money than you think.  But I could be wrong.
I'm happy to see that there are indeed others interested in this ;-)
Other than the altnamespace project ($5000) that I did for a (unnamed) 
company in Texas, Jeremy Howard at Fastmail is the only one who has 
consistently paid for features.  I'll let him disclose what he has 
spent, if he chooses to, but it's safe to say that it's been more than 
just pizza and beer.
I expected more than pizza and beer, so that's no surprise :-)
I'd have to look at David's patch again and discuss things with CMU to 
get a good time estimate, but I'm guessing that a project like this 
would cost a few thousand dollars.
Ok, I'll start a discussion with our management based on your latest 
estimate ($3000-$5000) and I'll let you know about the results. (It might 
take a while, certainly not this week. If you have more details, 
for instance a time estimate, let me know.)

Bye,
Paul

---
Cyrus Home Page: http://asg.web.cmu.edu/cyrus
Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: Cyrus crashed on redundant platform - need better availability?

2004-09-15 Thread Sebastian Hagedorn
Hi,
--On Friday, 10 September 2004 16:27 +0200 Paul Dekkers 
[EMAIL PROTECTED] wrote:

Right, works fine for us for the most part. Hasn't always been like
that, but the most recent kernel updates by Red Hat have improved
matters a lot.
What did the kernel improve?
memory management for the most part. With 8 GB of RAM and lots of it free 
there were previously situations where either the cache grew too large, 
causing the machine to become extremely slow, or where forks failed (even 
though there were oodles of free RAM). Both seem to have been resolved in 
2.4.9-e.49enterprise.

You are not using a clustered filesystem,
right?
No.
Although many on the list claim that this (having 2 boxes with 1
disk-array) is a nice way for redundancy I'm in doubt now if this is
true.
It's good but not perfect. We recently installed a huge SAN and are
now in the process of moving over the mail data to reside there.
Fibrechannel seems to be much more error tolerant than SCSI.
Hmm, I don't expect the problems to be SCSI-related. Maybe it has to do
with GEOM and SMP in FreeBSD 5.2.1, but not the SCSI-bus itself. (There
are two separate controllers for both machines, they never see each other
on the same SCSI bus...)
That's not what I was talking about. We have a similar setup, yet still 
there were instances when Red Hat's cluster software failed to write to the 
shared storage. I guess this was caused by the slow-downs connected to the 
memory management, but Red Hat support indicated that shared storage 
connected via FibreChannel would not have been as susceptible to these 
problems.

--On Friday, 10 September 2004 21:36 +0200 Jure Pečar 
[EMAIL PROTECTED] wrote:

The kernel that shipped with RedHat AS 2.1 was useless for most of the
tasks i tried it with. About three revisions later it became somewhat
more useful for non-oracle types of use, but i've rolled my own and am
not following the state of it now.
That's fine if you don't have to rely on commercial support. Our management 
decided to go the supported path all the way. That doesn't leave you many 
options. I have to say that when it works, the cluster software works 
extremely well. It's just that it hasn't always worked in the past ... ;-)

I haven't had problems with the fiber itself, i've only had lots of fun
with the firmware on the disks themselves and some with the qlogic
drivers.
We've had our share of problems with those as well, but I hear that Red Hat 
AS 3.0 ships with QLogic drivers that work out of the box.

Cheers, Sebastian Hagedorn
--
Sebastian Hagedorn M.A. - RZKR-R1 (Gebäude 52), Zimmer 18
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
Universität zu Köln / Cologne University - Tel. +49-221-478-5587



Re: Cyrus crashed on redundant platform - need better availability?

2004-09-15 Thread Paul Dekkers
Hi,
Sebastian Hagedorn wrote:
You are not using a clustered filesystem,
right?
No.
I can imagine that would be one of the advantages of RH's clustering, 
since you don't have to mount a filesystem in that case for a machine 
that just crashed - it would save time...
But I suppose RH's cluster manager takes care of mounting the partitions 
and checking them if there are any errors.

It's good but not perfect. We recently installed a huge SAN and are
now in the process of moving over the mail data to reside there.
Fibrechannel seems to be much more error tolerant than SCSI.

Were you working with a multi-initiator environment (as RH calls it) 
or single initiator (e.g. with 2 machines on exactly the same SCSI 
bus, or two separate interfaces on your array's SCSI controller)?
I think with a multi-initiator environment (as we have it) there is a 
very limited chance of failures.

Hmm, I don't expect the problems to be SCSI-related. Maybe it has to 
do...
That's not what I was talking about. We have a similar setup, yet 
still there were instances when Red Hat's cluster software failed to 
write to the shared storage. I guess this was caused by the slow-downs 
connected to the memory management, but Red Hat support indicated that 
shared storage connected via FibreChannel would not have been as 
susceptible to these problems.
Do you think using RH's cluster software is a valuable consideration for 
this kind of clustering setup? Using FreeBSD there are not that many 
clustering solutions for now, and if it's advisable to at least consider 
using RH here (although I have no experience with RH) we can certainly 
look at it. (Any idea how fast RH would recover services?)

On the other hand, if there is an application level redundancy on its 
way, it doesn't really matter on what platform the machine runs, so it 
would still make me happier and even with FreeBSD. And I would rather 
put my money there. Even if it means we'll have to wait for some months, 
we would do that and take the risk of running with a less automatic 
failover situation, with a worst-case downtime of 30 mins (or 
2 mins regularly with sync-mounted filesystems now).

The kernel that shipped with RedHat AS 2.1 was useless for most of the
tasks i tried it with. About three revisions later it became somewhat
more useful for non-oracle types of use, but i've rolled my own and am
not following the state of it now.
That's fine if you don't have to rely on commercial support. Our 
management decided to go the supported path all the way. That doesn't 
leave you many options. I have to say that when it works, the cluster 
software works extremely well. It's just that it hasn't always worked 
in the past ... ;-)
That's a plus for RH (ES|AS) 3
Paul


Re: Cyrus crashed on redundant platform - need better availability?

2004-09-15 Thread David Carter
On Wed, 15 Sep 2004, Paul Dekkers wrote:
On the other hand, if there is an application level redundancy on its 
way, it doesn't really matter on what platform the machine runs, so it 
would still make me happier and even with FreeBSD. And I would rather 
put my money there. Even if it means we'll have to wait for some months,
I wouldn't hold out hope of anything being available in some months.
I wrote my replication code two years ago, and submitted it to Rob and Ken 
about this time last year. Neither I nor they have put any significant work 
into the code since then. As I indicated in my previous message, we all 
have other priorities right now.

--
David Carter Email: [EMAIL PROTECTED]
University Computing Service,Phone: (01223) 334502
New Museums Site, Pembroke Street,   Fax:   (01223) 334679
Cambridge UK. CB2 3QH.


Re: Cyrus crashed on redundant platform - need better availability?

2004-09-15 Thread Paul Dekkers
David Carter wrote:
On Wed, 15 Sep 2004, Paul Dekkers wrote:
On the other hand, if there is an application level redundancy on its 
way, it doesn't really matter on what platform the machine runs, so 
it would still make me happier and even with FreeBSD. And I would 
rather put my money there. Even if it means we'll have to wait for 
some months,
I wouldn't hold out hope of anything being available in some months.
I wrote my replication code two years ago, and submitted it to Rob and 
Ken about this time last year. Neither I nor they have put any 
significant work into the code since then. As I indicated in my 
previous message, we all have other priorities right now.
I can imagine, but I hoped that priorities would change a bit with the 
amount of users that repeatedly 
http://www.interglot.com/toclipboard.php?b=1d=2t=herhaaldelijks=herhaaldelijkw=repeatedly showed 
interest in this feature and with the money we are willing to put in :-|

Paul


Re: Cyrus crashed on redundant platform - need better availability?

2004-09-15 Thread Jure Pečar
On Wed, 15 Sep 2004 13:38:43 +0200
Paul Dekkers [EMAIL PROTECTED] wrote:

 But I suppose RH's cluster manager takes care of mounting the partitions 
 and checking them if there are any errors.

Not really, at least not by itself. See
http://people.redhat.com/jrfuller/cms/ for detailed documentation of what is
included with RH AS 2.1 (it's some $500 extra for AS 3).
I had to write some pretty paranoid scripts that take care of assembling
software raids, checking the fs and mounting it while keeping an eye on the
other machine to prevent problems.
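
A minimal sketch of the takeover sequence described above (assemble the software RAID, check the filesystem, mount it, and keep checking the other machine). Everything here is an assumption for illustration: the function names, the use of `mdadm` for assembly, and the injected `peer_is_active` and `run` callables are hypothetical and not taken from the scripts mentioned in this message. It does not undo earlier steps if a later one fails.

```python
from typing import Callable, List, Sequence, Tuple

Command = List[str]
Step = Tuple[Command, Tuple[int, ...]]  # (command, acceptable exit codes)


def takeover_steps(md_device: str, members: Sequence[str], mount_point: str) -> List[Step]:
    """Return the ordered steps for taking over the shared storage."""
    return [
        # Assemble the software RAID from its member devices.
        (["mdadm", "--assemble", md_device, *members], (0,)),
        # Check the filesystem; fsck exits 1 when it fixed errors, which is acceptable.
        (["fsck", "-p", md_device], (0, 1)),
        # Mount only after the check has passed.
        (["mount", md_device, mount_point], (0,)),
    ]


def take_over(steps: Sequence[Step],
              peer_is_active: Callable[[], bool],
              run: Callable[[Command], int]) -> bool:
    """Run each step, re-checking the peer before every one (the paranoid part)."""
    for command, ok_codes in steps:
        if peer_is_active():
            return False  # never touch the storage while the other node has it
        if run(command) not in ok_codes:
            return False  # stop at the first failing step
    return True
```

In real use `run` could be `subprocess.call`, and `peer_is_active` would be whatever check the site trusts to tell that the other node has released the disks.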

Of course all this would be much easier with some kind of clustered fs, but
clustered fs brings a new problem: locking. Almost all i've seen so far have
an external 'locking manager' on a separate box, which brings ethernet
latency into every lock operation, which i'm sure is very noticeable in
lock-heavy usage patterns such as mail. But this is just my feeling, i haven't
yet benchmarked any :)
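
The latency concern can be put as back-of-envelope arithmetic: if every lock operation waits for one network round trip to the lock manager, the added time grows linearly with the number of lock operations, and a single serialized client cannot exceed one lock per round trip. The sketch below is only that arithmetic; the function names are hypothetical and any numbers fed to it are illustrative inputs, not measurements.

```python
def added_lock_latency_ms(lock_ops: int, round_trip_ms: float) -> float:
    """Extra wall-clock time when each lock operation waits one round trip."""
    return lock_ops * round_trip_ms


def max_serial_lock_ops_per_second(round_trip_ms: float) -> float:
    """Upper bound on lock operations per second for one serialized client."""
    return 1000.0 / round_trip_ms
```

For example, `added_lock_latency_ms(10, 0.5)` gives 5.0 ms of added waiting for ten lock operations at an assumed 0.5 ms round trip.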

 Do you think using RH's cluster software is a valuable consideration for 
 this kind of clustering setup? Using FreeBSD there are not that many 
 clustering solutions for now, and if it's advisable to at least consider 
 using RH here (although I have no experience with RH) we can certainly 
 look at it. (Any idea how fast RH would recover services?)

This RH cluster software is nothing fancy; i'm sure equivalents exist for
BSDs. See documentation link above. Actually it is just Kimberlite
(http://oss.missioncriticallinux.com/projects/kimberlite/), sold with RedHat
support.
Speed of recovery is almost completely out of the cluster's control. The
only thing that matters for the cluster is what your cyrus init script
returns when called with the 'status' parameter. Everything else is up to your
init scripts.
Of course, if one box dies completely, the other takes over within a
configurable time.
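
Since the cluster only looks at what the 'status' call returns, the check behind it is worth getting right. A minimal sketch of such a check, assuming the LSB convention for status exit codes (0 = running, 1 = dead but pid file exists, 3 = not running) and a pid file written by the service; the function name and the pid-file approach are assumptions for illustration, not what any particular Cyrus init script does.

```python
import os

# Exit codes for the "status" action, following the LSB convention.
RUNNING = 0
DEAD_PID_FILE_EXISTS = 1
NOT_RUNNING = 3


def service_status(pid_file: str) -> int:
    """Return an LSB-style status code based on a pid file."""
    try:
        with open(pid_file) as fh:
            pid = int(fh.read().strip())
    except FileNotFoundError:
        return NOT_RUNNING
    except ValueError:
        return DEAD_PID_FILE_EXISTS  # pid file present but unreadable content
    if pid <= 0:
        return DEAD_PID_FILE_EXISTS  # never signal process groups by accident
    try:
        os.kill(pid, 0)  # signal 0 only checks that the process exists
    except ProcessLookupError:
        return DEAD_PID_FILE_EXISTS
    except PermissionError:
        return RUNNING  # process exists but belongs to another user
    return RUNNING
```

A process that exists is not necessarily a service that answers, so a stricter check might also try a real IMAP connection before reporting success.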


-- 

Jure Pečar


Re: Cyrus crashed on redundant platform - need better availability?

2004-09-15 Thread Sebastian Hagedorn
Hi,
--On Wednesday, 15 September 2004 13:38 +0200 Paul Dekkers 
[EMAIL PROTECTED] wrote:

You are not using a clustered filesystem,
right?
No.
I can imagine that would be one of the advantages of RH's clustering,
since you don't have to mount a filesystem in that case for a machine
that just crashed - it would save time...
I'm not sure if Red Hat even supports a clustered FS at this time. It 
certainly didn't when we set up the system more than two years ago.

But I suppose RH's cluster manager takes care of mounting the partitions
and checking them if there are any errors.
Right. The unmounting/mounting of partitions usually works fine, but there 
have been problems at times. The worst one was causing alternating crashes 
of both nodes:

sd(8,73)): ext3_free_blocks: Freeing blocks not in datazone - block = 
225139276, count = 1
EXT3-fs error (device sd(8,73)): ext3_free_blocks: Freeing blocks not in 
datazone - block = 1919637002, count = 1
EXT3-fs error (device sd(8,73)): ext3_free_blocks: Freeing blocks not in 
datazone - block = 894788200, count = 1
EXT3-fs error (device sd(8,73)): ext3_free_blocks: Freeing blocks not in 
datazone - block = 1883792719, count = 1
EXT3-fs error (device sd(8,73)): ext3_free_blocks: Freeing blocks not in 
datazone - block = 1347113037, count = 1
EXT3-fs error (device sd(8,73)): ext3_free_blocks: Freeing blocks not in 
datazone - block = 829312330, count = 1
EXT3-fs error (device sd(8,73)): ext3_free_blocks: Freeing blocks not in 
datazone - block = 893538370, count = 1
EXT3-fs error (device sd(8,73)): ext3_free_blocks: Freeing blocks not in 
datazone - block = 1450341715, count = 1
EXT3-fs error (device sd(8,73)): ext3_free_blocks: Freeing blocks not in 
datazone - block = 909390198, count = 1
EXT3-fs error (device sd(8,73)): ext3_free_blocks: Freeing blocks not in 
datazone - block = 1366706293, count = 1
EXT3-fs error (device sd(8,73)): ext3_free_blocks: Freeing blocks not in 
datazone - block = 846548333, count = 1
EXT3-fs error (device sd(8,73)): ext3_free_blocks: Freeing blocks not in 
datazone - block = 1630746450, count = 1
EXT3-fs error (device sd(8,73)): ext3_free_blocks: Freeing blocks not in 
datazone - block = 860649837, count = 1
EXT3-fs error (device sd(8,73)

leading to this:
Assertion failure in journal_forget_Rsmp_094dfde7() at transaction.c:1226: 
!jh->b_committed_data
[ cut here ]
kernel BUG at transaction.c:1226!
invalid operand: 
Kernel 2.4.9-e.38enterprise
CPU:3
EIP:0010:[f885b636]Not tainted
EFLAGS: 00010282
EIP is at journal_forget_Rsmp_094dfde7 [jbd] 0xd6
eax: 0025   ebx: ce6e8c10   ecx: c02f7f84   edx: 0008dad9
esi: cd95f3e0   edi: cd7a3094   ebp: cd7a3000   esp: cb947d70
ds: 0018   es: 0018   ss: 0018
Process ctl_cyrusdb (pid: 4500, stackpage=cb947000)
Stack: f8863f30 04ca e7b08b20 cd95f3e0 cd7a3000 000b cd95f3e0 
f885ee69
  ce14ac40 cd95f3e0 cd95f3e0 cd95f3e0 cab35900 ce14ac40 f886bc8c 
ce14ac40
  0002 cd95f3e0 cd95f3e0 cd6ad000 cd6ae000 cdd93000 cd95f3e0 
0002
Call Trace: [f8863f30] .LC7 [jbd] 0x0 (0xcb947d70)
[f885ee69] journal_revoke_Rsmp_56fa5ece [jbd] 0xf9 (0xcb947d8c)
[f886bc8c] ext3_forget [ext3] 0x7c (0xcb947da8)
[f886df3a] ext3_free_branches [ext3] 0xda (0xcb947dd8)
[f886df2c] ext3_free_branches [ext3] 0xcc (0xcb947e30)
[f886e2ec] ext3_truncate [ext3] 0x2bc (0xcb947e74)
[f885a285] start_this_handle [jbd] 0x125 (0xcb947eac)
[f885a38f] journal_start_Rsmp_ec53be73 [jbd] 0xbf (0xcb947ec4)
[f886bd5e] start_transaction [ext3] 0x4e (0xcb947ee4)
[f886bee7] ext3_delete_inode [ext3] 0xe7 (0xcb947f08)
[f887a080] ext3_sops [ext3] 0x0 (0xcb947f28)
[c015dd1c] iput_free [kernel] 0x14c (0xcb947f2c)
[f886f9c3] ext3_lookup [ext3] 0x73 (0xcb947f40)
[c015addb] dentry_iput [kernel] 0x4b (0xcb947f50)
[c01541ab] vfs_unlink [kernel] 0x1eb (0xcb947f60)
[c0152c41] lookup_hash [kernel] 0x91 (0xcb947f6c)
[c015427a] sys_unlink [kernel] 0x9a (0xcb947f88)
[c01181c0] do_page_fault [kernel] 0x0 (0xcb947fb0)
[c01073e3] system_call [kernel] 0x33 (0xcb947fc0)

Code: 0f 0b 59 58 53 e8 40 03 00 00 8b 43 24 c7 43 14 00 00 00 00
0Kernel panic: not continuing
I had to intercept the boot process manually before the cluster software 
starts and fsck the partition. Not good. But this problem has been fixed in 
a kernel update.
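
When a log like the one above scrolls by, a quick tally of the affected devices and blocks can help decide which partition to fsck. A minimal sketch, assuming the message layout shown in this message (including the line wrapping introduced by mail); the function names are hypothetical, and incomplete entries are simply skipped.

```python
import re
from collections import Counter
from typing import List, Tuple

# Matches one complete "EXT3-fs error" entry after whitespace has been normalized.
EXT3_ERROR = re.compile(
    r"EXT3-fs error \(device (?P<device>\w+\(\d+,\d+\))\): "
    r"(?P<function>\w+): (?P<message>[^:]*?) - block = (?P<block>\d+), "
    r"count = (?P<count>\d+)"
)


def parse_ext3_errors(raw_log: str) -> List[Tuple[str, str, int]]:
    """Return (device, function, block) for every complete error entry."""
    text = " ".join(raw_log.split())  # undo line wrapping
    return [
        (m["device"], m["function"], int(m["block"]))
        for m in EXT3_ERROR.finditer(text)
    ]


def errors_per_device(raw_log: str) -> Counter:
    """Count complete error entries per device."""
    return Counter(device for device, _, _ in parse_ext3_errors(raw_log))
```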

It's good but not perfect. We recently installed a huge SAN and are
now in the process of moving over the mail data to reside there.
Fibrechannel seems to be much more error tolerant than SCSI.

Were you working with a multi-initiator environment (as RH calls it) or
single initiator (e.g. with 2 machines on exactly the same SCSI bus, or
two separate interfaces on your array's SCSI controller)?
I think with a multi-initiator environment (as we have it) there is a very
limited chance of failures.
I'm not sure about the terminology, but we have two separate SCSI busses on 
the RAID, one for each host. I thought that was single initiator? The 
problem that regularly occurred is the 

Re: Cyrus crashed on redundant platform - need better availability?

2004-09-15 Thread Simon Matter
 Hi,

 --On Wednesday, 15 September 2004 13:38 +0200 Paul Dekkers
 [EMAIL PROTECTED] wrote:

 You are not using a clustered filesystem,
 right?

 No.

 I can imagine that would be one of the advantages of RH's clustering,
 since you don't have to mount a filesystem in that case for a machine
 that just crashed - it would save time...

 I'm not sure if Red Hat even supports a clustered FS at this time. It
 certainly didn't when we set up the system more than two years ago.

I think that's exactly why they bought Sistina with GFS - and GPL'd it.
Does anybody know how it works with cyrus-imapd?

Simon




Re: Cyrus crashed on redundant platform - need better availability?

2004-09-15 Thread Ken Murchison
Paul Dekkers wrote:
David Carter wrote:
On Wed, 15 Sep 2004, Paul Dekkers wrote:
On the other hand, if there is an application level redundancy on its 
way, it doesn't really matter on what platform the machine runs, so 
it would still make me happier and even with FreeBSD. And I would 
rather put my money there. Even if it means we'll have to wait for 
some months,

I wouldn't hold out hope of anything being available in some months.
I wrote my replication code two years ago, and submitted it to Rob and 
Ken about this time last year. Neither I nor they have put any 
significant work into the code since then. As I indicated in my 
previous message, we all have other priorities right now.

I can imagine, but I hoped that priorities would change a bit with the 
amount of users that repeatedly 
http://www.interglot.com/toclipboard.php?b=1d=2t=herhaaldelijks=herhaaldelijkw=repeatedly showed 
This link appears dead.  All I get is To clipboard.

interest in this feature and with the money we are willing to put in :-|
I'm willing to work on it if there is money available.  You are the only 
one that has said that you would commit money.  Where are the rest of 
the folks?  Based on the number of people that stepped up to pay for 
virtdomains support (zero), I'm guessing there are fewer out there 
willing to spend money than you think.  But I could be wrong.

Other than the altnamespace project ($5000) that I did for a (unnamed) 
company in Texas, Jeremy Howard at Fastmail is the only one who has 
consistently paid for features.  I'll let him disclose what he has 
spent, if he chooses to, but it's safe to say that it's been more than 
just pizza and beer.

I'd have to look at David's patch again and discuss things with CMU to 
get a good time estimate, but I'm guessing that a project like this 
would cost a few thousand dollars.

--
Kenneth Murchison Oceana Matrix Ltd.
Software Engineer 21 Princeton Place
716-662-8973 x26  Orchard Park, NY 14127
--PGP Public Key--http://www.oceana.com/~ken/ksm.pgp


Re: Cyrus crashed on redundant platform - need better availability?

2004-09-15 Thread David Lang
Also take a look at the heartbeat package at linux-ha.org. This works on 
Linux, *BSD, and Solaris (there were people working on an AIX port, but 
they apparently dropped it shortly before finishing).

David Lang
--
There are two ways of constructing a software design. One way is to make it so simple 
that there are obviously no deficiencies. And the other way is to make it so 
complicated that there are no obvious deficiencies.
 -- C.A.R. Hoare


Re: Cyrus crashed on redundant platform - need better availability?

2004-09-15 Thread David Lang
how much are you asking for?
David Lang
--
There are two ways of constructing a software design. One way is to make it so simple 
that there are obviously no deficiencies. And the other way is to make it so 
complicated that there are no obvious deficiencies.
 -- C.A.R. Hoare


Re: Cyrus crashed on redundant platform - need better availability?

2004-09-15 Thread Simon Matter
 Paul Dekkers wrote:

 David Carter wrote:

 On Wed, 15 Sep 2004, Paul Dekkers wrote:

 On the other hand, if there is an application level redundancy on its
 way, it doesn't really matter on what platform the machine runs, so
 it would still make me happier and even with FreeBSD. And I would
 rather put my money there. Even if it means we'll have to wait for
 some months,


 I wouldn't hold out hope of anything being available in some months.

 I wrote my replication code two years ago, and submitted it to Rob and
 Ken about this time last year. Neither I nor they have put any
 significant work into the code since then. As I indicated in my
 previous message, we all have other priorities right now.


 I can imagine, but I hoped that priorities would change a bit with the
 amount of users that repeatedly
 http://www.interglot.com/toclipboard.php?b=1d=2t=herhaaldelijks=herhaaldelijkw=repeatedly showed

 This link appears dead.  All I get is To clipboard.


 interest in this feature and with the money we are willing to put in :-|

 I'm willing to work on it if there is money available.  You are the only
 one that has said that you would commit money.  Where are the rest of
 the folks?  Based on the number of people that stepped up to pay for
 virtdomains support (zero), I'm guessing there are fewer out there
 willing to spend money than you think.  But I could be wrong.

 Other than the altnamespace project ($5000) that I did for a (unnamed)
 company in Texas, Jeremy Howard at Fastmail is the only one who has
 consistently paid for features.  I'll let him disclose what he has
 spent, if he chooses to, but it's safe to say that it's been more than
 just pizza and beer.

 I'd have to look at David's patch again and discuss things with CMU to
 get a good time estimate, but I'm guessing that a project like this
 would cost a few thousand dollars.

We are very interested in replicated shared folders. We have different
cyrus-imapd servers in different countries and would like to have common
shared folders. If this could also be implemented I'm sure we would be able to
help sponsor it.
There are also a number of commercial vendors of cyrus-imapd based
solutions who should be very interested in application level replication.

Simon




Re: Cyrus crashed on redundant platform - need better availability?

2004-09-15 Thread Ken Murchison
Simon Matter wrote:
Hi,
--On Wednesday, 15 September 2004 13:38 +0200 Paul Dekkers
[EMAIL PROTECTED] wrote:

You are not using a clustered filesystem,
right?
No.
I can imagine that would be one of the advantages of RH's clustering,
since you don't have to mount a filesystem in that case for a machine
that just crashed - it would save time...
I'm not sure if Red Hat even supports a clustered FS at this time. It
certainly didn't when we set up the system more than two years ago.

I think that's exactly why they bought Sistina with GFS - and GPL'd it.
Does anybody know how it works with cyrus-imapd?
If you are interested in using a shared filesystem on a SAN for server 
redundancy, then you could try using a replicated Murder (Cyrus 2.3). 
Such a config is running at a local University using 4 Sun servers and 
QFS (Sun's SAN filesystem) on a Hitachi fibre array.

I haven't tested this with GFS, but if it has correct file locking and 
memory mapping support, then it might work.
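
A rough first check of those two requirements on a candidate filesystem could look like the sketch below. It is an assumption-laden probe, not a Cyrus tool: the function name is hypothetical, it runs on a single host only, and passing it says nothing about whether locks and mapped writes stay coherent between cluster nodes, which is the harder question for a shared filesystem.

```python
import fcntl
import mmap
import os
import tempfile


def probe_locking_and_mmap(directory: str) -> dict:
    """Try an fcntl lock and a shared mmap write on a scratch file in `directory`."""
    results = {"fcntl_lock": False, "mmap_write_visible": False}
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, b"\0" * mmap.PAGESIZE)  # give the file one page to map
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # exclusive, non-blocking
            fcntl.lockf(fd, fcntl.LOCK_UN)
            results["fcntl_lock"] = True
        except OSError:
            pass  # locking not supported or failed on this filesystem
        try:
            with mmap.mmap(fd, mmap.PAGESIZE) as mapped:
                mapped[:5] = b"cyrus"
                mapped.flush()
            # The write made through the mapping should be visible to a normal read.
            results["mmap_write_visible"] = os.pread(fd, 5, 0) == b"cyrus"
        except (OSError, ValueError):
            pass  # mapping not supported or failed on this filesystem
    finally:
        os.close(fd)
        os.unlink(path)
    return results
```

Calling it with a directory on the filesystem under test returns a small dictionary with one boolean per check.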

I'm fairly confident that SGI's XFS would work, although I haven't tried it.
--
Kenneth Murchison Oceana Matrix Ltd.
Software Engineer 21 Princeton Place
716-662-8973 x26  Orchard Park, NY 14127
--PGP Public Key--http://www.oceana.com/~ken/ksm.pgp


Re: Cyrus crashed on redundant platform - need better availability?

2004-09-15 Thread Ken Murchison
David Lang wrote:
how much are you asking for?
Since this is probably as complex as altnamespace, if not more so, I'd say 
somewhere between $3000 and $5000 as an initial estimate.  That's 30-50 
hours at a fairly cheap rate.
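
The rate implied by this estimate can be worked out directly. A tiny sketch, assuming the low end of the cost range pairs with the low end of the hour range and likewise for the high end; the function name is illustrative only.

```python
def implied_hourly_rate(cost_usd: float, hours: float) -> float:
    """Hourly rate implied by a fixed cost spread over a number of hours."""
    return cost_usd / hours


low_end = implied_hourly_rate(3000, 30)   # 100.0 dollars per hour
high_end = implied_hourly_rate(5000, 50)  # 100.0 dollars per hour
```

Under that pairing both ends of the range come out at the same rate of 100 dollars per hour.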

If people want to start pledging their support, perhaps enough 
incentive can be pooled.  If people don't feel comfortable doing this 
in public, then feel free to send me a private email.


On Wed, 15 Sep 2004, Ken Murchison wrote:
Date: Wed, 15 Sep 2004 11:44:45 -0400
From: Ken Murchison [EMAIL PROTECTED]
To: Paul Dekkers [EMAIL PROTECTED]
Cc: David Carter [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: Re: Cyrus crashed on redundant platform - need better 
availability?

Paul Dekkers wrote:
David Carter wrote:
On Wed, 15 Sep 2004, Paul Dekkers wrote:
On the other hand, if there is a application level redundancy on 
its way, it doesn't really matter on what platform the machine 
runs, so it would still make me happier and even with FreeBSD. And 
I would rather put my money there. Even if it means we'll have to 
wait for some months,

I wouldn't hold out hope of anything being available in some months.
I wrote my replication code two years ago, and submitted it to Rob 
and Ken about this time last year. Neither I nor they have put any 
significant work into the code since then. As I indicated in my 
previous message, we all have other priorities right now.

I can imagine, but I hoped that priorities would change a bit with 
the number of users that repeatedly 
http://www.interglot.com/toclipboard.php?b=1d=2t=herhaaldelijks=herhaaldelijkw=repeatedly showed 

This link appears dead.  All I get is "To clipboard".

interest in this feature and with the money we are willing to put in :-|

I'm willing to work on it if there is money available.  You are the 
only one who has said that you would commit money.  Where are the 
rest of the folks? Based on the number of people that stepped up to 
pay for virtdomains support (zero), I'm guessing there are fewer out 
there willing to spend money than you think.  But I could be wrong.

Other than the altnamespace project ($5000) that I did for a (unnamed) 
company in Texas, Jeremy Howard at Fastmail is the only one who has 
consistently paid for features.  I'll let him disclose what he has 
spent, if he chooses to, but it's safe to say that it's been more than 
just pizza and beer.

I'd have to look at David's patch again and discuss things with CMU to 
get a good time estimate, but I'm guessing that a project like this 
would cost a few thousand dollars.



--
Kenneth Murchison Oceana Matrix Ltd.
Software Engineer 21 Princeton Place
716-662-8973 x26  Orchard Park, NY 14127
--PGP Public Key--http://www.oceana.com/~ken/ksm.pgp


Re: Cyrus crashed on redundant platform - need better availability?

2004-09-15 Thread Gary Mills
On Wed, Sep 15, 2004 at 02:07:08PM -0400, Ken Murchison wrote:
 David Lang wrote:
 
 how much are you asking for?
 
 Since this is probably at least as complex as altnamespace, I'd say 
 somewhere between $3000 and $5000 as an initial estimate.  That's 30-50 
 hours at a fairly cheap rate.
 
 If people want to start pledging their support, perhaps enough 
 incentive can be pooled.  If people don't feel comfortable doing this 
 in public, then feel free to send me a private email.

I'm certainly interested in adding some redundancy to our Cyrus
installation.  I'm about to upgrade the hardware to a single Sun
V480 with 4 1200 MHz CPUs and 16 gigs of memory.  The two internal
disks will be mirrored, and contain only the OS files.  Everything
else will be on external RAID arrays.  The next expansion should
add more IMAP storage and provide redundancy in the case of software
or equipment failure.  I'm aware of Murder, but I'm not sure that
it's the best solution for us.

I don't control the funding, but I can recommend something.

-- 
-Gary Mills--Unix Support--U of M Academic Computing and Networking-


Paying for developers? (was: Re: Cyrus crashed on redundant platform - need better availability?)

2004-09-14 Thread Attila Nagy
Jure Pečar wrote:
I still think that it would be best to have two filesystems instead of 
one, so with mirroring on application level (cyrus)... :-)
I'd rather see murder store a message on two separate machines ... Actually
to have duplicated mailboxes in sync over a pool of backend machines, with
murder taking care of backlogs when one of them would go down.
So many users cried for this feature (to provide not just horizontal 
scalability with murder, but to have redundant backends which can hold 
each others replicas too) that I wonder: if it's so important to us, the 
cyrus users, why don't we collect some money and pass it to the developers?

Maybe it could help to make the implementation real, and the developers 
have already demonstrated that they can design and code such things.

--
Attila Nagy   e-mail: [EMAIL PROTECTED]
Free Software Network (FSN.HU)   phone @work: +361 371 3536
ISOs: http://www.fsn.hu/?f=download   cell.: +3630 306 6758


Re: Paying for developers? (was: Re: Cyrus crashed on redundant platform - need better availability?)

2004-09-14 Thread David G Mcmurtrie
On Tue, 14 Sep 2004, Attila Nagy wrote:

 So many users cried for this feature (to provide not just horizontal
 scalability with murder, but to have redundant backends which can hold
 each others replicas too) that I wonder: if it's so important to us, the
 cyrus users, why don't we collect some money and pass it to the developers?

I wasn't following this entire thread, but if I'm not mistaken David
Carter from the University of Cambridge already implemented what you're
looking for:

http://www-uxsup.csx.cam.ac.uk/~dpc22/cyrus/replication.html

Maybe if you collect some money you could send it to him :)

Thanks,

Dave

PGP/GPG Key:  http://www.pitt.edu/~dgm/gpgkey.asc.txt


Re: Cyrus crashed on redundant platform - need better availability?

2004-09-13 Thread Luca Olivetti
Paul Dekkers wrote:

I'm not sure why the box crashed; there was nothing in the logs, there 
was nothing on the screen when we came there, it just booted up again. 
Of course I'm interested if anyone has any thoughts on this.
Maybe it has nothing to do with your problem, but there is a timing 
issue with some intel xeon and p4 processors. Look at this HP advisory:

http://tinyurl.com/63dxe
Even though it says that no field issues have been identified, I've 
experienced real random lock-ups before updating the BIOS.
Check whether a BIOS update is available from Dell.

Bye
--
Luca Olivetti
Wetron Automatización S.A. http://www.wetron.es/
Tel. +34 93 5883004  Fax +34 93 5883007


Re: Cyrus crashed on redundant platform - need better availability?

2004-09-11 Thread Paul Dekkers
Hi,
Michael Loftis wrote:
The theory only translates if you're using a JOURNALED file system. 
Linux ext3, reiserfs, AIX JFS and Sun/others Veritas are all examples 
of this. AFAIK FreeBSD hasn't any journalling file systems, 
Hmm, some say that using softupdates removes the need for a journaling 
filesystem (see for instance http://kerneltrap.org/node.php?id=6). It's all a bit 
different with FreeBSD :-)

That said, the machine shouldn't have crashed in the first place, but 
you are running 5.x which is clearly labeled as *NOT* production (4.10 
for that)... All of my production boxen are 4.x based (of the FreeBSD 
herd)
You are right - 5.x is not stable yet, but 5.3 is very close to it. 
Since 5.3 is coming we thought it would be easier to install 5.2.1 
and upgrade to 5.3 rather than use 4.10 and upgrade that... And we might 
indeed be facing a 5.2.1 bug, that's why I mentioned the SMP, 3G and 
GEOM things, but it might as well be something else in 5.x.
Something else that was a vote for 5.x was the filesystem; 4.10 does not 
have UFS2.

Apart from solving the issues we have with the machine I think we'd 
really look at the options for having redundancy in the application, as 
sketched in the "High availability ... again" thread :-) Maybe we'd 
install 5.3-BETA on the platform (I'll discuss it with another FreeBSD 
expert here :-))

Jure Pečar wrote:
The only high availability i see here is the google way. Cyrus is
offering you that with the 'murder' component.
 

That's not really availability, but distributed risk.
 

Exactly ... with murder taking care of keeping duplicated mailboxes in sync
over a pool of backend machines (as i mentioned in the other mail), this
would be perfect for all of us, i guess.
 

Well, I don't know if it needs to be murder or some extension in the 
storage or something, but that's roughly my idea indeed; synchronising 
two (or more? that's harder maybe) servers, just like doing an imapsync 
or rsync, but then... well, better! :-) (And without losing states and 
so forth.)

The SPOF of the SCSI controller in the RAID box I'm willing to accept, 
but the filesystem is a bit harder.

I'm curious what cyrus developers think of this, and I'm interested in 
what we can do to help.

Paul


Re: Cyrus crashed on redundant platform - need better availability?

2004-09-11 Thread David Lang
On Fri, 10 Sep 2004, Michael Loftis wrote:
Date: Fri, 10 Sep 2004 13:15:05 -0600
From: Michael Loftis [EMAIL PROTECTED]
To: Paul Dekkers [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: Re: Cyrus crashed on redundant platform - need better availability?
The theory only translates if you're using a JOURNALED file system.  Linux 
ext3, reiserfs, AIX JFS and Sun/others Veritas are all examples of this. 
AFAIK FreeBSD hasn't any journalling file systems, I could be wrong though 
since I haven't really looked for one (my FreeBSD boxes just run...and 
run...and run...)  That said, the machine shouldn't have crashed in the 
first place, but you are running 5.x which is clearly labeled as *NOT* 
production (4.10 for that)...  All of my production boxen are 4.x based (of 
the FreeBSD herd)

However, even a journaled filesystem won't protect you completely from 
corruption. Even the filesystems you list can lose data when there is a 
crash, and if one system goes haywire and starts scribbling on the shared 
disk it will trash any filesystem.

David Lang

--On Friday, September 10, 2004 13:24 +0200 Paul Dekkers 
[EMAIL PROTECTED] wrote:

Hi,
We're implementing a new mail platform running on two Dell 2650-servers (2
Xeon CPUs at 3 GHz each, HTT and 3 GB of memory) and with a disk array
of 4 TB connected with an Adaptec 39160 SCSI controller for storage. We
installed FreeBSD 5.2.1 on it, and - of course - cyrus 2.2.8 (from the
ports) as IMAP server. Our MTA is postfix.
There are two machines for redundancy. If one fails, the other one should
take over: mount the disks from the array, and move on.
Unfortunately, the primary server crashed twice already. The first time it
did while synchronising two IMAP-spools from the old server to the new
one. There was not much data on it back then. The second time was worse,
around 10 GB of mail was stored on the disks. We discovered that the fsck
took about 30 minutes, so although we have two machines for redundancy it
still takes quite some time before the mail is available again. (And we
still have about 90 GB of mail to migrate, so when all users are migrated
it will take much longer.)
I mounted the filesystems synchronous now: although it slows down the
system I hope it speeds up the fsck a bit when there is another crash.
The second crash was while removing a lot of mailboxes (dm) while some of
them were removed at the same time using a webmail app (squirrelmail).
I'm not sure why the box crashed; there was nothing in the logs, there
was nothing on the screen when we came there, it just booted up again. Of
course I'm interested if anyone has any thoughts on this.
Although many on the list claim that this (having 2 boxes with 1
disk-array) is a nice way for redundancy I'm in doubt now if this is
true. It still takes 30 mins before everything is back again! It seems to
me that if there was a live version of cyrus available with a
synchronised mail-spool, there would be no outage noticeable for users
(except in losing a connection maybe). Am I right?
Maybe it's time to continue the "High availability ... again"
discussion we had a while ago. If the cyrus developers are able to
implement this with some funding there are still some questions left for
me: how much time would it take before a stable solution is ready? How
much funding is expected? I still have to talk to management about this,
but I would really support this development and I'm certainly willing to
convince some managers.
Regards,
Paul

--
Undocumented Features quote of the moment...
It's not the one bullet with your name on it that you
have to worry about; it's the twenty thousand-odd rounds
labeled `occupant.'
 --Murphy's Laws of Combat
--
There are two ways of constructing a software design. One way is to make it so simple 
that there are obviously no deficiencies. And the other way is to make it so 
complicated that there are no obvious deficiencies.
 -- C.A.R. Hoare


Re: Cyrus crashed on redundant platform - need better availability?

2004-09-11 Thread Michael Loftis
BTW -- if you want Stable (in case you didn't understand that from my 
previous mail) go back to FreeBSD 4.x (say 4.10-STABLE or -SECURE) -- 
you've probably run into a platform bug, not a bug in Cyrus, since the 
whole machine went.



Re: Cyrus crashed on redundant platform - need better availability?

2004-09-11 Thread Michael Loftis

--On Friday, September 10, 2004 16:27 +0200 Paul Dekkers 
[EMAIL PROTECTED] wrote:

What did the kernel improve? You are not using a clustered filesystem,
right?
RH kernels tend to come up with bugs that no one else sees FYI (this is why 
my employer is switching to Debian...)

Well, it's UFS2 with softupdates, so yes. I'm afraid the journal was
damaged in my case, there were several complaints while doing the fsck
about softupdate inconsistencies. (The server crashed once more but since
I mounted with -o sync now the fsck was much faster. I'll keep it that
way for now until we know what's really wrong - it was again with a
large mail-folder synchronisation...)
FWIW I can't call soft updates a journal.  9/10 times when I have had a 
crash, the soft updates journal either was corrupt, inconsistent, or made 
things worse.  When running with soft updates many times I'd lose many days' 
worth of mail on a restart.

Hmm, I don't expect the problems to be SCSI-related. Maybe it has to do
with GEOM and SMP in FreeBSD 5.2.1, but not the SCSI-bus itself. (There
are two separate controllers for both machines, they never see each other
on the same SCSI bus...)
Probably not, more likely something funkish in FBSD 5.2.1
I still think that it would be best to have two filesystems instead of
one, so with mirroring on application level (cyrus)... :-)
I tend to agree


Re: Cyrus crashed on redundant platform - need better availability?

2004-09-11 Thread Jure Pečar
On Fri, 10 Sep 2004 16:27:40 +0200
Paul Dekkers [EMAIL PROTECTED] wrote:
 
 Sebastian Hagedorn wrote:

  Right, works fine for us for the most part. Hasn't always been like 
  that, but the most recent kernel updates by Red Hat have improved 
  matters a lot.
 
 What did the kernel improve? You are not using a clustered filesystem, 
 right?

The kernel that shipped with RedHat AS 2.1 was useless for most of the tasks
i tried it with. About three revisions later it became somewhat more useful
for non-oracle types of use, but i've rolled my own and am not following the
state of it now.

  It's good but not perfect. We recently installed a huge SAN and are 
  now in the process of moving over the mail data to reside there. 
  Fibrechannel seems to be much more error tolerant than SCSI.

I haven't had problems with the fiber itself, i've only had lots of fun with
the firmware on the disks themselves and some with the qlogic drivers. 

 I still think that it would be best to have two filesystems instead of 
 one, so with mirroring on application level (cyrus)... :-)

I'd rather see murder store a message on two separate machines ... Actually
to have duplicated mailboxes in sync over a pool of backend machines, with
murder taking care of backlogs when one of them would go down.
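
As a rough illustration of the idea above (every message written to two backends, with a backlog replayed once a failed backend comes back), here is a minimal Python sketch. It is not how Cyrus Murder works and it is not David Carter's replication code. `Backend`, `ReplicatingFrontend`, `deliver` and `drain` are hypothetical names invented for this example, and the backlog lives only in memory.

```python
from collections import deque


class Backend:
    """Hypothetical stand-in for one mail-store backend."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.messages = []

    def store(self, message):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.messages.append(message)


class ReplicatingFrontend:
    """Writes each message to every backend and queues what a backend missed."""

    def __init__(self, backends):
        self.backends = backends
        self.backlog = {b.name: deque() for b in backends}

    def deliver(self, message):
        """Return how many backends stored the message immediately."""
        stored = 0
        for b in self.backends:
            # Preserve ordering: while a backlog exists, queue behind it.
            if self.backlog[b.name]:
                self.backlog[b.name].append(message)
                continue
            try:
                b.store(message)
                stored += 1
            except ConnectionError:
                self.backlog[b.name].append(message)
        return stored

    def drain(self, backend):
        """Replay queued messages once a backend is reachable again."""
        queue = self.backlog[backend.name]
        while queue:
            backend.store(queue[0])  # raises if the backend is still down
            queue.popleft()          # drop only after a successful store


a, b = Backend("a"), Backend("b")
frontend = ReplicatingFrontend([a, b])
frontend.deliver("m1")
b.up = False              # simulate a backend crash
frontend.deliver("m2")
frontend.deliver("m3")
b.up = True               # backend returns
frontend.drain(b)
print(a.messages == b.messages)
```

A real design would also have to persist the backlog, replicate flags and deletions as well as deliveries, and decide what to do when no backend accepts a message; none of that is attempted here.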


-- 

Jure Pečar


Re: Cyrus crashed on redundant platform - need better availability?

2004-09-11 Thread Jure Pečar
On Fri, 10 Sep 2004 16:32:33 +0200
Paul Dekkers [EMAIL PROTECTED] wrote:

 Hmm, then your fscks will run faster/with less problems, but there is 
 still outage that you can prevent if there is failover in another way 
 and availability/replication on the application level.
 If there are replicated spools it doesn't matter if the fsck takes long 
 or not... although there will be a backlog of course.

Yes, but right now there are no replicated spools on the app level so i'm
doing the best i can as a sysadmin :)

 Is it possible to have an fsck running on one partition and have cyrus 
 started already (so part of the mail-store, e.g. archives, is not 
 available yet?)

Not that i know ... i guess cyrus would be spewing lots of i/o errors back
at you for the mailboxes that are on that fscking partition ;)

 The only high availability i see here is the google way. Cyrus is
 offering you that with the 'murder' component.
 
 That's not really availability, but distributed risk.

Exactly ... with murder taking care of keeping duplicated mailboxes in sync
over a pool of backend machines (as i mentioned in the other mail), this
would be perfect for all of us, i guess.

 BTW, you're mentioning FreeBSD ... doesn't it have some sort of
 background fsck while the filesystem is mounted rw? 
 
 It can, but I'm not sure if that's what I prefer. I'm not sure how 
 mature it is with FreeBSD, and I prefer to have mail integrity over a 
 quick restore.

I can't speak about maturity of a certain FreeBSD component as i'm a linux
guy, but what i hear it should just work.

-- 

Jure Pečar


Re: Cyrus crashed on redundant platform - need better availability?

2004-09-11 Thread Michael Loftis
The theory only translates if you're using a JOURNALED file system.  Linux 
ext3, reiserfs, AIX JFS and Sun/others Veritas are all examples of this. 
AFAIK FreeBSD hasn't any journalling file systems, I could be wrong though 
since I haven't really looked for one (my FreeBSD boxes just run...and 
run...and run...)  That said, the machine shouldn't have crashed in the 
first place, but you are running 5.x which is clearly labeled as *NOT* 
production (4.10 for that)...  All of my production boxen are 4.x based (of 
the FreeBSD herd)


--On Friday, September 10, 2004 13:24 +0200 Paul Dekkers 
[EMAIL PROTECTED] wrote:

Hi,
We're implementing a new mail platform running on two Dell 2650-servers (2
Xeon CPUs at 3 GHz each, HTT and 3 GB of memory) and with a disk array
of 4 TB connected with an Adaptec 39160 SCSI controller for storage. We
installed FreeBSD 5.2.1 on it, and - of course - cyrus 2.2.8 (from the
ports) as IMAP server. Our MTA is postfix.
There are two machines for redundancy. If one fails, the other one should
take over: mount the disks from the array, and move on.
Unfortunately, the primary server crashed twice already. The first time it
did while synchronising two IMAP-spools from the old server to the new
one. There was not much data on it back then. The second time was worse,
around 10 GB of mail was stored on the disks. We discovered that the fsck
took about 30 minutes, so although we have two machines for redundancy it
still takes quite some time before the mail is available again. (And we
still have about 90 GB of mail to migrate, so when all users are migrated
it will take much longer.)
I mounted the filesystems synchronous now: although it slows down the
system I hope it speeds up the fsck a bit when there is another crash.
The second crash was while removing a lot of mailboxes (dm) while some of
them were removed at the same time using a webmail app (squirrelmail).
I'm not sure why the box crashed; there was nothing in the logs, there
was nothing on the screen when we came there, it just booted up again. Of
course I'm interested if anyone has any thoughts on this.
Although many on the list claim that this (having 2 boxes with 1
disk-array) is a nice way for redundancy I'm in doubt now if this is
true. It still takes 30 mins before everything is back again! It seems to
me that if there was a live version of cyrus available with a
synchronised mail-spool, there would be no outage noticeable for users
(except in losing a connection maybe). Am I right?
Maybe it's time to continue the "High availability ... again"
discussion we had a while ago. If the cyrus developers are able to
implement this with some funding there are still some questions left for
me: how much time would it take before a stable solution is ready? How
much funding is expected? I still have to talk to management about this,
but I would really support this development and I'm certainly willing to
convince some managers.
Regards,
Paul

--
Undocumented Features quote of the moment...
It's not the one bullet with your name on it that you
have to worry about; it's the twenty thousand-odd rounds
labeled `occupant.'
  --Murphy's Laws of Combat


Re: Cyrus crashed on redundant platform - need better availability?

2004-09-10 Thread Jure Pečar
On Fri, 10 Sep 2004 13:24:42 +0200
Paul Dekkers [EMAIL PROTECTED] wrote:

 Although many on the list claim that this (having 2 boxes with 1 
 disk-array) is a nice way for redundancy I'm in doubt now if this is 
 true. It still takes 30 mins before everything is back again! It seems 
 to me that if there was a live version of cyrus available with a 
 synchronised mail-spool, that there was no outage noticeable for users 
 (except in losing a connection maybe). Am I right?

Having 2 boxes with one disk array leaves you with a single point of failure
that you wouldn't think of immediately: filesystem. I learned that the hard
way. 
I'm planning to 'redesign' our storage: instead of one big volume that fscks
for hours, i'm going to split it into many mirrors and use them as cyrus
partitions. This way they could all fsck in parallel. I'm going to lose the
'single instance store' capability, but that's a tradeoff that i'm willing to
take.
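
To make the trade-off above concrete: with one big volume the whole check runs as a single job, while independent partitions checked at the same time finish when the slowest one does. The Python sketch below only models that arithmetic. The per-partition minutes are invented numbers, not measurements, and it assumes the partitions sit on separate mirrors that do not compete for I/O, and that checking one big volume costs about as much as checking its pieces back to back.

```python
def serial_fsck_minutes(partition_minutes):
    """One big volume, modelled as every piece checked back to back."""
    return sum(partition_minutes)


def parallel_fsck_minutes(partition_minutes):
    """Independent partitions checked concurrently: the slowest one dominates."""
    return max(partition_minutes, default=0)


# Hypothetical per-partition fsck times in minutes (illustrative only).
partitions = [40, 35, 45, 30]
print(serial_fsck_minutes(partitions))    # 150
print(parallel_fsck_minutes(partitions))  # 45
```

The gain shrinks if the partitions share spindles or a controller, since concurrent checks then slow each other down.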

It happened to me at least once that the machine that crashed corrupted the
filesystem in a way that the machine that took over also crashed within
hours...
 
 Maybe it's time to continue the "High availability ... again" 
 discussion we had a while ago. If the cyrus developers are able 
 to implement this with some funding there are still some questions left 
 for me: how much time would it take before a stable solution is ready? 
 How much funding is expected? I still have to talk to management about 
 this, but I would really support this development and I'm certainly 
 willing to convince some managers.

The only high availability i see here is the google way. Cyrus is offering
you that with the 'murder' component.


BTW, you're mentioning FreeBSD ... doesn't it have some sort of background
fsck while the filesystem is mounted rw? 


-- 

Jure Pečar


Re: Cyrus crashed on redundant platform - need better availability?

2004-09-10 Thread Sebastian Hagedorn
Hi,
--On Freitag, 10. September 2004 13:24 Uhr +0200 Paul Dekkers 
[EMAIL PROTECTED] wrote:

We're implementing a new mail platform running on two Dell 2650-servers (2
Xeon CPUs at 3 GHz each, HTT and 3 GB of memory) and with a disk array
of 4 TB connected with an Adaptec 39160 SCSI controller for storage. We
installed FreeBSD 5.2.1 on it, and - of course - cyrus 2.2.8 (from the
ports) as IMAP server. Our MTA is postfix.
That's similar to our setup, but we are currently running Red Hat Advanced 
Server 2.1, Cyrus 2.1.16 and sendmail.

There are two machines for redundancy. If one fails, the other one should
take over: mount the disks from the array, and move on.
Right, works fine for us for the most part. Hasn't always been like that, 
but the most recent kernel updates by Red Hat have improved matters a lot.

Unfortunately, the primary server crashed twice already. The first time it
did while synchronising two IMAP-spools from the old server to the new
one. There was not much data on it back then. The second time was worse,
around 10 GB of mail was stored on the disks. We discovered that the fsck
took about 30 minutes,
Isn't your filesystem journaled? We use ext3 for ours. There *have* been a 
few occasions where the journal had been damaged as well (forcing us to run 
fsck), but those have been few and far between. In all other instances the 
failover is nearly instantaneous.
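
For a feel of what the fsck figures quoted in this thread could mean after the migration (about 30 minutes with roughly 10 GB on disk, and about 90 GB still to move), here is a back-of-the-envelope Python sketch. Linear scaling with the amount of stored mail is purely an assumption made for illustration: fsck time tends to depend on the number of files and the size of the filesystem rather than on bytes alone, so the result is a rough guess, not a prediction. The function name `estimate_fsck_minutes` is made up for this example.

```python
def estimate_fsck_minutes(observed_minutes, observed_gb, target_gb):
    """Scale an observed fsck time linearly with the amount of stored mail.

    Assumption (not from the thread): fsck time grows in proportion to data size.
    """
    if observed_gb <= 0:
        raise ValueError("observed_gb must be positive")
    return observed_minutes * target_gb / observed_gb


# About 30 minutes at about 10 GB; about 10 GB + 90 GB = 100 GB after migration.
print(estimate_fsck_minutes(30, 10, 100))  # 300.0
```

Under that assumption a full check would take on the order of five hours, which is the kind of outage a journaled filesystem or application-level replication is meant to avoid.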

Although many on the list claim that this (having 2 boxes with 1
disk-array) is a nice way for redundancy I'm in doubt now if this is
true.
It's good but not perfect. We recently installed a huge SAN and are now in 
the process of moving over the mail data to reside there. Fibrechannel 
seems to be much more error tolerant than SCSI.

Cheers, Sebastian Hagedorn
--
Sebastian Hagedorn M.A. - RZKR-R1 (Gebäude 52), Zimmer 18
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
Universität zu Köln / Cologne University - Tel. +49-221-478-5587



Re: Cyrus crashed on redundant platform - need better availability?

2004-09-10 Thread Paul Dekkers
Hi,
Sebastian Hagedorn wrote:
There are two machines for redundancy. If one fails, the other one should
take over: mount the disks from the array, and move on.
Right, works fine for us for the most part. Hasn't always been like 
that, but the most recent kernel updates by Red Hat have improved 
matters a lot.
What did the kernel improve? You are not using a clustered filesystem, 
right?

Unfortunately, the primary server crashed twice already. The first time it
did while synchronising two IMAP-spools from the old server to the new
one. There was not much data on it back then. The second time was worse,
around 10 GB of mail was stored on the disks. We discovered that the fsck
took about 30 minutes,
Isn't your filesystem journaled? We use ext3 for ours. There *have* 
been a few occasions where the journal had been damaged as well 
(forcing us to run fsck), but those have been few and far between. In 
all other instances the failover is nearly instantaneous.
Well, it's UFS2 with softupdates, so yes. I'm afraid the journal was 
damaged in my case, there were several complaints while doing the fsck 
about softupdate inconsistencies. (The server crashed once more but 
since I mounted with -o sync now the fsck was much faster. I'll keep it 
that way for now until we know what's really wrong - it was again with 
a large mail-folder synchronisation...)

Although many on the list claim that this (having 2 boxes with 1
disk-array) is a nice way for redundancy I'm in doubt now if this is
true.
It's good but not perfect. We recently installed a huge SAN and are 
now in the process of moving over the mail data to reside there. 
Fibrechannel seems to be much more error tolerant than SCSI.
Hmm, I don't expect the problems to be SCSI-related. Maybe it has to do 
with GEOM and SMP in FreeBSD 5.2.1, but not the SCSI-bus itself. (There 
are two separate controllers for both machines, they never see each 
other on the same SCSI bus...)

I still think that it would be best to have two filesystems instead of 
one, so with mirroring on application level (cyrus)... :-)

Paul


Re: Cyrus crashed on redundant platform - need better availability?

2004-09-10 Thread Paul Dekkers
Jure Pečar wrote:
Although many on the list claim that this (having 2 boxes with 1 
disk-array) is a nice way for redundancy I'm in doubt now if this is 
true. It still takes 30 mins before everything is back again! It seems 
to me that if there was a live version of cyrus available with a 
synchronised mail-spool, that there was no outage noticeable for users 
(except in losing a connection maybe). Am I right?
   

Having 2 boxes with one disk array leaves you with a single point of failure
that you wouldn't think of immediately: filesystem. I learned that the hard
way.
 

Yes, I agree.
I'm planning to 'redesign' our storage: instead of one big volume that fscks
for hours, i'm going to split it into many mirrors and use them as cyrus
partitions. This way they could all fsck in parallel. I'm going to lose the
'single instance store' capability, but that's a tradeoff that i'm willing to
take.
 

Hmm, then your fscks will run faster/with less problems, but there is 
still outage that you can prevent if there is failover in another way 
and availability/replication on the application level.
If there are replicated spools it doesn't matter if the fsck takes long 
or not... although there will be a backlog of course.

Is it possible to have an fsck running on one partition and have cyrus 
started already (so part of the mail-store, e.g. archives, is not 
available yet?)

It happened to me at least once that the machine that crashed corrupted the
filesystem in a way that the machine that took over also crashed within
hours... 
 

Maybe it's time to continue the "High availability ... again" 
discussion we had a while ago. If the cyrus developers are able 
to implement this with some funding there are still some questions left 
for me: how much time would it take before a stable solution is ready? 
How much funding is expected? I still have to talk to management about 
this, but I would really support this development and I'm certainly 
willing to convince some managers.
   

The only high availability i see here is the google way. Cyrus is offering
you that with the 'murder' component.
 

That's not really availability, but distributed risk.
BTW, you're mentioning FreeBSD ... doesn't it have some sort of background
fsck while the filesystem is mounted rw? 
 

It can, but I'm not sure if that's what I prefer. I'm not sure how 
mature it is with FreeBSD, and I prefer to have mail integrity over a 
quick restore.

Paul
