Re: Another 2.4 upgrade horror story
On Sep 25, 2012, at 11:57 AM, Deniss cy...@sad.lv wrote:
On 25.09.2012 15:28, Eric Luyten wrote:
On Tue, September 25, 2012 2:01 pm, Sebastian Hagedorn wrote: [...]

migration process from 2.3 to 2.4 took ~ one year for our installation. we converted ~200 TB of user data. first step we did - spread data on many nodes using cyrus replication. next we started converting nodes one by one on weekend nights to minimize the IO load generated by users. in fact cyrus reads all data from disk to generate the new indexes, so the conversion is limited mainly by disk IO, while CPU is pretty cheap nowadays. we got a rate of around 500 GB in 8 hours for a forced reindex at 100% disk load. we started the forced reindex with the most active users, meanwhile allowing users to log in and trigger the reindex of their own mailboxes.

Sorry for hijacking this thread, but I'm curious as to the preferred method of forcing a reindex on a mailbox? I know it triggers when a user logs in and accesses the mailbox. I would like to divide up users and perform the reindex in chunks.

Thanks, Bryan

--- Bryan D. Hill UCSD Physics Computing Facility CTBP Systems Support 9500 Gilman Dr. # 0319 La Jolla, CA 92093 +1-858-534-5538 bh...@ucsd.edu AIM: pozvibesd Web: http://www.physics.ucsd.edu/pcf

Cyrus Home Page: http://www.cyrusimap.org/ List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/ To Unsubscribe: https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
Re: Another 2.4 upgrade horror story
On Sun, September 30, 2012 6:47 pm, Bryan Hill wrote: ... Sorry for hijacking this thread, but I'm curious as to the preferred method of forcing a reindex on a mailbox? I know it triggers when a user logs in and accesses the mailbox. I would like to divide up users and perform the reindex in chunks.

Bryan,

We found out that a Cyrus quota fix (quota -f ...) only regenerates metadata for old-format mailboxes, whereas reconstruct -r does them all, including the already-converted ones.

Cheers, Eric Luyten, Computing Centre VUB/ULB.

Cyrus Home Page: http://www.cyrusimap.org/ List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/ To Unsubscribe: https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
Re: Another 2.4 upgrade horror story
On Sun, Sep 30, 2012, at 09:46 PM, Eric Luyten wrote:
On Sun, September 30, 2012 6:47 pm, Bryan Hill wrote: ... Sorry for hijacking this thread, but I'm curious as to the preferred method of forcing a reindex on a mailbox? I know it triggers when a user logs in and accesses the mailbox. I would like to divide up users and perform the reindex in chunks.
Bryan, We found out that a Cyrus quota fix (quota -f ...) only regenerates metadata for old-format mailboxes, whereas reconstruct -r does them all, including the already-converted ones.

cyr_expire touches all the mailboxes in order...

Bron.

-- Bron Gondwana br...@fastmail.fm

Cyrus Home Page: http://www.cyrusimap.org/ List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/ To Unsubscribe: https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
Re: Another 2.4 upgrade horror story
On Sep 30, 2012, at 1:36 PM, Bron Gondwana br...@fastmail.fm wrote: [...] cyr_expire touches all the mailboxes in order...

Ah, nice. I'll look at this too. Thanks Bron!

Thanks, Bryan

--- Bryan D. Hill UCSD Physics Computing Facility CTBP Systems Support 9500 Gilman Dr. # 0319 La Jolla, CA 92093 +1-858-534-5538 bh...@ucsd.edu AIM: pozvibesd Web: http://www.physics.ucsd.edu/pcf

Cyrus Home Page: http://www.cyrusimap.org/ List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/ To Unsubscribe: https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
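[Illustrative sketch, not from the thread: Eric and Bron name the triggers above (quota -f, reconstruct -r, cyr_expire), and Bryan wants to run the upgrade in chunks. One way is to split a user list and run reconstruct -r over each chunk in parallel. The user-list file, the path to reconstruct, the mailbox naming and the round-robin split are all assumptions; adjust to your installation and test on a copy first.]

#!/usr/bin/perl
# upgrade-chunk.pl - force index upgrades for one chunk of users.
# Assumes a plain-text file with one userid per line and the Cyrus tools
# under /usr/lib/cyrus-imapd (both assumptions, adjust to your site).
use strict;
use warnings;

my ($listfile, $chunk, $nchunks) = @ARGV;
die "usage: $0 userlist chunk-number total-chunks\n" unless $nchunks;

open my $fh, '<', $listfile or die "cannot open $listfile: $!";
chomp(my @users = <$fh>);
close $fh;

my $n = 0;
for my $user (grep { length } @users) {
    # round-robin split so several copies of this script can run side by side
    next unless ($n++ % $nchunks) == ($chunk - 1);
    print scalar(localtime), " reconstructing user/$user\n";
    # reconstruct -r recurses into the user's sub-mailboxes; touching the
    # mailbox rewrites its index, which is what forces the 2.4 upgrade.
    # Mailbox naming (user/foo vs user.foo) depends on your hierarchy separator.
    system('/usr/lib/cyrus-imapd/reconstruct', '-r', "user/$user") == 0
        or warn "reconstruct failed for $user: $?\n";
}

Run as the cyrus user, e.g. four copies with chunk numbers 1..4 and total 4. Whether quota -f, reconstruct -r or cyr_expire is the cheapest trigger for your version is worth measuring on a test mailbox before committing a whole backend.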
Re: Another 2.4 upgrade horror story
Hi Eric,

--On 25. September 2012 14:28:03 +0200 Eric Luyten eric.luy...@vub.ac.be wrote: Thank you for sharing your experiences. As a site willing/needing to upgrade from 2.3.16 to 2.4.X this fall, we are interested in learning about your storage backend characteristics. What read/write IOPS rates were you registering before/during/after your upgrade process? I'd understand your reluctance to share this information in a public forum. No offence taken whatsoever!

no problem, it just took me a while to gather the information. Our backends are IBM DS4300s. Some of the disks are 73 GB 15k RPM Fibre Channel disks (RAID 5), others are 146 GB 10k RPM Fibre Channel disks. The SAN controllers are IBM SVCs (model 2145 8F4). The load is balanced (though not evenly) over four disk controllers. According to our storage guy we see 600 IOPS on average, with peaks up to 2,000-3,000 under normal circumstances. During the migration we saw 10,000 IOPS per controller, 40,000 in total.

Hope this helps
Sebastian

-- .:.Sebastian Hagedorn - RZKR-W (Gebäude 133), Zimmer 2.02.:. .:.Regionales Rechenzentrum (RRZK).:. .:.Universität zu Köln / Cologne University - ✆ +49-221-470-89578.:.

Cyrus Home Page: http://www.cyrusimap.org/ List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/ To Unsubscribe: https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
Re: Another 2.4 upgrade horror story
On Wed, Sep 26, 2012, at 01:25 AM, Wolfgang Breyha wrote:
On 2012-09-25 19:05, Simon Beale wrote: The only gotcha I experienced was that I forgot that cyrus was configured to hardlink mail, which of course was no longer the case after each mailbox was migrated, so my disk usage exploded. (But easily fixed/restored once identified).
What did you use for restoring the hardlinks? freedup as well? I'm asking because I found a bug in freedup causing data loss. I already sent a patch fixing it to the author of freedup last November, but he didn't release a new version yet.

I have a script for doing it - though only within a single user... This is the core of the link logic:

print "fixing up files for $guid ($srcname)\n";
foreach my $file (@others) {
    my $tmpfile = $file . "tmp";
    print "link error $tmpfile\n" unless link($srcname, $tmpfile);
    chown($uid, $gid, $tmpfile);
    chmod(0600, $tmpfile);
    print "rename error $file\n" unless rename($tmpfile, $file);
}

I suspect your fixup is similar :)

Bron.

-- Bron Gondwana br...@fastmail.fm

Cyrus Home Page: http://www.cyrusimap.org/ List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/ To Unsubscribe: https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
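[For context around Bron's loop above: it assumes you already know, for each message GUID, one surviving copy ($srcname) and the redundant copies (@others). A hedged sketch of building those sets by grouping one user's spool files by SHA-1 of their contents follows. Whether your installation's GUIDs are plain SHA-1 of the raw message is an assumption here, and the spool path is just an example; try it on a copy of the spool, not the live one.]

#!/usr/bin/perl
# group-by-digest.pl - find identical message files under one user's spool
# so they can be handed to a relink loop like the one quoted above.
use strict;
use warnings;
use File::Find;
use Digest::SHA;

my $spool = shift or die "usage: $0 /var/spool/imap/user/example\n";  # path is an example

my %bydigest;    # digest => [ paths with identical content ]
find({ no_chdir => 1, wanted => sub {
    return unless -f $_ && m{/\d+\.$};     # Cyrus message files are named "123."
    my $sha = Digest::SHA->new(1);         # SHA-1
    $sha->addfile($_);
    push @{ $bydigest{ $sha->hexdigest } }, $_;
} }, $spool);

for my $guid (sort keys %bydigest) {
    my ($srcname, @others) = @{ $bydigest{$guid} };
    next unless @others;                   # unique file, nothing to relink
    print "$guid: keeping $srcname, ", scalar(@others), " duplicate(s) to relink\n";
    # ... feed $guid, $srcname and @others to the link/rename loop quoted above ...
}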
Re: Another 2.4 upgrade horror story
Hi!

On Tue, Sep 25, 2012 at 05:55:27PM +0200, Wolfgang Breyha wrote: Time was not the limiting factor for us. Availability and safety of our mailboxes was. Nobody can guarantee you that the migration works out flawlessly. If moving one mailbox fails I have trouble with exactly one mailbox. If reconstruction of one mailbox fails while migrating hard from 2.3 to 2.4 your system is down even longer. And we had a couple of problematic mailboxes, as partly documented on this mailing list ;-)

Um. Isn't it still only that one mailbox that couldn't be reconstructed that's inaccessible, with the system as a whole up and running and everything else accessible and working? Or am I missing something? I mean, each mailbox gets reconstructed separately. And by having tested the migration, we had reason to believe most if not all of the mailboxes would get reconstructed OK, so we didn't expect that any significant number of users would see any trouble except maybe slowness for a short while.

As it happened, we didn't have any trouble with the reconstruction of any mailboxes, the process went flawlessly - it was just slow. Also, it did cause observable slowness only because at first, I hadn't reconstructed the mailboxes systematically but had trusted the system to do it on demand. After the business day and daily peak usage were over, I set up such a number of reconstruction processes that the system didn't choke on them, and everything (~60k mailboxes) was reconstructed before the next morning.

In detail I
*) check for active sessions on the mailbox
*) lock the mailbox by denying access via userdeny.db
*) move the mailbox
*) unlock it

Yeah, that process is identical to the one we use to move mailboxes between backends to balance their disk usage. (The mailbox-moving part of the process, that is; the algorithm to decide how many mailboxes to move from where to where is another matter.)

--Janne

-- Janne Peltonen janne.pelto...@helsinki.fi PGP Key ID: 0x9CFAC88B Please consider membership of the Hospitality Club (http://www.hospitalityclub.org)

Cyrus Home Page: http://www.cyrusimap.org/ List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/ To Unsubscribe: https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
Re: Another 2.4 upgrade horror story
--On 26. September 2012 09:42:17 +0300 Janne Peltonen janne.pelto...@helsinki.fi wrote: As it happened, we didn't have any trouble with the reconstruction of any mailboxes, the process went flawlessly - it was just slow.

Same here.

Also, it did cause observable slowness only because at first, I hadn't reconstructed the mailboxes systematically but had trusted the system to do it on demand. After the business day and daily peak usage was over, I set up such a number of reconstruction processes that the system didn't choke on them, and everything (~60k mailboxes) was reconstructed before next morning.

We did it systematically from the start, but that wasn't enough.

-- .:.Sebastian Hagedorn - RZKR-W (Gebäude 133), Zimmer 2.02.:. .:.Regionales Rechenzentrum (RRZK).:. .:.Universität zu Köln / Cologne University - ✆ +49-221-470-89578.:.

Cyrus Home Page: http://www.cyrusimap.org/ List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/ To Unsubscribe: https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
Re: Re: Another 2.4 upgrade horror story
Hi,

I've got questions regarding the procedure you describe. I'm trying to wrap my head around the various possible approaches to replication and clustering.

--On 25. September 2012 21:57:49 +0300 Deniss cy...@sad.lv wrote: migration process from 2.3 to 2.4 took ~ one year for our installation. we converted ~200 TB of user data. first step we did - spread data on many nodes using cyrus replication.

The official documentation for replication seems to be this one: http://cyrusimap.web.cmu.edu/docs/cyrus-imapd/2.4.0/install-replication.php

The way I read that, replication is all or nothing. So did each of the nodes have the whole 200 TB? If not, how did you achieve that? Did you have a murder with multiple backends to begin with?

next we started converting nodes one by one on weekend nights to minimize the IO load generated by users.

How does replication work across Cyrus versions? I assume it wouldn't have been possible to create a new 2.4 replica from an existing 2.3 master?

Thanks, Sebastian

-- .:.Sebastian Hagedorn - RZKR-W (Gebäude 133), Zimmer 2.02.:. .:.Regionales Rechenzentrum (RRZK).:. .:.Universität zu Köln / Cologne University - ✆ +49-221-470-89578.:.

Cyrus Home Page: http://www.cyrusimap.org/ List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/ To Unsubscribe: https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
Re: Another 2.4 upgrade horror story
On 2012.09.26. 10:24, Sebastian Hagedorn wrote:
Hi, I've got questions regarding the procedure you describe. I'm trying to wrap my head around the various possible approaches to replication and clustering. [...]
The official documentation for replication seems to be this one: http://cyrusimap.web.cmu.edu/docs/cyrus-imapd/2.4.0/install-replication.php
The way I read that, replication is all or nothing. So did each of the nodes have the whole 200 TB? If not, how did you achieve that? Did you have a murder with multiple backends to begin with?

Our system's design allows us to seamlessly move mailboxes across cyrus backends one by one using sync_client. We have no murder. Each node had a relatively small list of mailboxes when we started the conversion on it. After the conversion we aggregate the mailboxes back.

How does replication work across Cyrus versions? I assume it wouldn't have been possible to create a new 2.4 replica from an existing 2.3 master?

it does not work

Thanks, Sebastian

Cyrus Home Page: http://www.cyrusimap.org/ List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/ To Unsubscribe: https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
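[A hedged illustration of the "move mailboxes across backends one by one using sync_client" step Deniss describes. sync_client's -S (target server) and -u (user mode) options exist in 2.3/2.4, but the replication channel setup and the final switch-over of the user are site-specific, and, as Deniss notes above, both ends need to run the same version; treat this as a sketch, not a recipe. The host name, paths and user-list file are made up for the example.]

#!/usr/bin/perl
# push-users.pl - push a list of users to another backend with sync_client,
# one at a time, so data can be spread over several nodes before each node
# is converted. Both backends are assumed to run the same Cyrus version.
use strict;
use warnings;

my $target   = 'spare-backend.example.org';   # hypothetical replica host
my $listfile = 'users-to-move.txt';           # one userid per line

open my $fh, '<', $listfile or die "cannot open $listfile: $!";
chomp(my @users = <$fh>);
close $fh;

for my $user (grep { length } @users) {
    print scalar(localtime), " pushing $user to $target\n";
    # -v verbose, -S replica to talk to, -u treat remaining arguments as userids
    my $rc = system('/usr/lib/cyrus-imapd/sync_client', '-v', '-S', $target, '-u', $user);
    if ($rc != 0) {
        warn "sync_client failed for $user (rc=$rc), leaving the user where they are\n";
        next;
    }
    # The mailbox now exists on $target; re-pointing the user at it
    # (mailboxes.db entry, proxy config, userdeny lock during the switch)
    # is deliberately left out because it differs per site.
}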
Re: Another 2.4 upgrade horror story
Simon Matter wrote, on 26.09.2012 06:58: I have not used freedup for restoring the hardlinks but I'm interested in the patch. If it's not big could you post it here?

Sure! Attached. I built an RPM based on http://pkgs.repoforge.org/freedup/freedup-1.5.3-1.rf.src.rpm and the latest source. But building based on http://www.freedup.org/freedup-1.6-2.src.rpm should work as well.

Greetings, Wolfgang

-- Wolfgang Breyha wbre...@gmx.net | http://www.blafasel.at/ Vienna University Computer Center | Austria

--- freedup-1.6/freedup.c.orig  2011-02-04 08:22:15.0 +0100
+++ freedup-1.6/freedup.c       2011-11-11 10:52:24.788733835 +0100
@@ -613,7 +613,7 @@
          */
         if( mktemp(tmpfilename) == NULL )
         {
-                perror("There is no unique temporory file name.");
+                perror("There is no unique temporary file name.");
         }
         if( dirmtime!=0 )
         {
@@ -628,7 +628,7 @@
         if( lstat(tmpfilename,&tstat) != 0 )
         {
                 /*
-                 * The errror needs not to be catched, since it is wanted
+                 * The error needs not to be catched, since it is wanted
                  * that no file exists with the target name
                  */
                 rename( bname, tmpfilename );
@@ -643,12 +643,23 @@
         }
         if( lnk( symaname, bname ) != 0 )
         {
-                perror("Linking failed.");
-        }
-        if( unlink( tmpfilename ) != 0 )
-        {
-                perror("Unlink failed.");
-        }
+                // linking failed! try to move original in place again and
+                // log that fact
+                fprintf(stderr, "Linking failed. Trying roleback: \"%s\"", bname);
+                if ( rename( tmpfilename, bname ) != 0 )
+                {
+                        // moving old file in place again failed.
+                        // at least log that -v
+                        fprintf(stderr, "unable to rename: \"%s\"", tmpfilename);
+                }
+        }
+        else
+        // unlink renamed original only of linking was successful
+        if( unlink( tmpfilename ) != 0 )
+        {
+                // unlinking failed! log that -v
+                fprintf(stderr, "Unlink failed: \"%s\"", tmpfilename);
+        }
         if( (dirmtime!=0) && (gotdirtime!=0) )
         {
                 utimecache.actime = dstat.st_atime;

Cyrus Home Page: http://www.cyrusimap.org/ List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/ To Unsubscribe: https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
Another 2.4 upgrade horror story
Hi,

about three weeks ago we upgraded our Cyrus installation from 2.3.x to 2.4.16. We were aware of the reindexing issue, so we took precautionary measures, but they didn't help a lot. We've got about 7 TB of mail data for almost 200,000 mailboxes. We did the upgrade on a Sunday and had told our users that mail access wouldn't be possible for the whole day. After the actual software upgrade we ran distributed scripts that triggered the index upgrades. We started with the largest mailboxes. The idea was that after those that took the longest had been upgraded, the rest should be OK overnight and early Monday.

However, even though our storage infrastructure was kept at 99% I/O saturation, progress was much slower than anticipated. Ultimately the server was virtually unusable for the whole Monday and parts of Tuesday. The last mailbox was finally upgraded on Thursday, although on Wednesday most things were already working normally.

I realize that some of our problems were caused by infrastructure that's not up to current standards, but nonetheless I would really urge you to never again use an upgrade mechanism like that. Give admins a chance to upgrade indexes in the background and over time.

Sebastian

-- .:.Sebastian Hagedorn - RZKR-W (Gebäude 133), Zimmer 2.02.:. .:.Regionales Rechenzentrum (RRZK).:. .:.Universität zu Köln / Cologne University - ✆ +49-221-470-89578.:.

Cyrus Home Page: http://www.cyrusimap.org/ List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/ To Unsubscribe: https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
Re: Another 2.4 upgrade horror story
On Tue, September 25, 2012 2:01 pm, Sebastian Hagedorn wrote: Hi, about three weeks ago we upgraded our Cyrus installation from 2.3.x to 2.4.16. [...] I realize that some of our problems were caused by infrastructure that's not up to current standards, but nonetheless I would really urge you to never again use an upgrade mechanism like that. Give admins a chance to upgrade indexes in the background and over time.

+1

Sebastian,

Thank you for sharing your experiences. As a site willing/needing to upgrade from 2.3.16 to 2.4.X this fall, we are interested in learning about your storage backend characteristics. What read/write IOPS rates were you registering before/during/after your upgrade process? I'd understand your reluctance to share this information in a public forum. No offence taken whatsoever!

Kind regards, Eric Luyten, Computing Centre VUB/ULB, eric.luy...@vub.ac.be

Cyrus Home Page: http://www.cyrusimap.org/ List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/ To Unsubscribe: https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
Re: Another 2.4 upgrade horror story
Can you tell us more about your storage configuration? Ben -- Ben Carter University of Pittsburgh/CSSD b...@pitt.edu 412-624-6470 Cyrus Home Page: http://www.cyrusimap.org/ List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/ To Unsubscribe: https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
Re: Another 2.4 upgrade horror story
Sebastian Hagedorn wrote, on 25.09.2012 14:01: I realize that some of our problems were caused by infrastructure that's not up to current standards, but nonetheless I would really urge you to never again use an upgrade mechanism like that. Give admins a chance to upgrade indexes in the background and over time.

There is such an upgrade path: using a murder environment and moving mailboxes between backends. We used that for our 150k-user infrastructure and had no IO headaches at all. It was a good moment to update the distribution, filesystems and hardware as well. I tested conversion speed from 2.3 to 2.4 on about 100 mailboxes and it was pretty obvious that touching all our mailboxes in one shot was clearly impossible without unreasonable downtime.

Greetings, Wolfgang

-- Wolfgang Breyha wbre...@gmx.net | http://www.blafasel.at/ Vienna University Computer Center | Austria

Cyrus Home Page: http://www.cyrusimap.org/ List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/ To Unsubscribe: https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
Re: Another 2.4 upgrade horror story
Hi!

On Tue, Sep 25, 2012 at 03:25:45PM +0200, Wolfgang Breyha wrote: There is such an upgrade path: using a murder environment and moving mailboxes between backends. We used that for our 150k-user infrastructure and had no IO headaches at all. It was a good moment to update the distribution, filesystems and hardware as well.

Could you elaborate on that? I considered that option, but seeing as moving even a couple dozen users from one backend to another using RENAME takes hours and one backend contains thousands of users, I decided to just live with the ~1 day of unbearable slowness. Or do you know of a fast way?

--Janne

-- Janne Peltonen janne.pelto...@helsinki.fi PGP Key ID: 0x9CFAC88B

Cyrus Home Page: http://www.cyrusimap.org/ List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/ To Unsubscribe: https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
Re: Another 2.4 upgrade horror story
Janne Peltonen wrote, on 25.09.2012 15:34: Could you elaborate on that? I considered that option, but seeing as moving even a couple dozen users from one backend to another using RENAME takes hours and one backend contains thousands of users, I decided to just live with the ~1 day of unbearable slowness. Or do you know of a fast way?

Time was not the limiting factor for us. Availability and safety of our mailboxes was. Nobody can guarantee you that the migration works out flawlessly. If moving one mailbox fails I have trouble with exactly one mailbox. If reconstruction of one mailbox fails while migrating hard from 2.3 to 2.4 your system is down even longer. And we had a couple of problematic mailboxes, as partly documented on this mailing list ;-)

We had 5 backend stores and have 6 now. Complete migration of our 150k users with ~22 TB took about 3 or 4 months, including building the machines, requesting storage, etc. But you can't have less headache than moving mailboxes from backend to backend.

In detail I
*) check for active sessions on the mailbox
*) lock the mailbox by denying access via userdeny.db
*) move the mailbox
*) unlock it

Greetings, Wolfgang

-- Wolfgang Breyha wbre...@gmx.net | http://www.blafasel.at/ Vienna University Computer Center | Austria

Cyrus Home Page: http://www.cyrusimap.org/ List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/ To Unsubscribe: https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
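[A minimal Perl skeleton of the four steps Wolfgang lists above. All four helpers are no-op placeholders because each is done differently per site: the session check might scan the proc directory, the lock/unlock might edit userdeny.db with your database tooling, and the move might use murder XFER, sync_client or a cross-backend rename. The structure is the point here, in particular unlocking even when the move fails so that a failure only affects that one mailbox.]

#!/usr/bin/perl
# move-user.pl - skeleton of the check/lock/move/unlock cycle described above.
use strict;
use warnings;

# Placeholder helpers; replace the bodies with your site's tooling.
sub has_active_sessions { my ($user) = @_; return 0 }  # TODO: e.g. scan the proc directory
sub deny_access         { my ($user) = @_; return 1 }  # TODO: add $user to userdeny.db
sub allow_access        { my ($user) = @_; return 1 }  # TODO: remove $user from userdeny.db
sub move_mailbox        { my ($user) = @_; return 1 }  # TODO: XFER / sync_client / rename

for my $user (@ARGV) {
    if (has_active_sessions($user)) {
        warn "$user has active sessions, skipping for now\n";
        next;
    }
    deny_access($user);
    my $ok = eval { move_mailbox($user) };   # trap failures so we always unlock
    allow_access($user);
    if ($ok) {
        print "$user moved\n";
    } else {
        warn "$user move FAILED, user stays on the old backend\n";
    }
}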
Re: Another 2.4 upgrade horror story
Hi! On Tue, Sep 25, 2012 at 03:25:45PM +0200, Wolfgang Breyha wrote: There is such an upgrade path: using a murder environment and moving mailboxes between backends. We used that for our 150k-user infrastructure and had no IO headaches at all. [...]
Could you elaborate on that? I considered that option, but seeing as moving even a couple dozen users from one backend to another using RENAME takes hours and one backend contains thousands of users, I decided to just live with the ~1 day of unbearable slowness. Or do you know of a fast way?

I did the migration by moving mailboxes between backends of 1 TB each, having scripted it to only move employees when it was 00:00 - 06:00 in their local timezone, and left the script running on each v2.3 backend for a few days. It took a few weeks in all to migrate and upgrade all our backends in turn, but no one experienced any downtime.

The only gotcha I experienced was that I forgot that cyrus was configured to hardlink mail, which of course was no longer the case after each mailbox was migrated, so my disk usage exploded. (But easily fixed/restored once identified).

It comes down to having spare backend(s) to move people on to, and time to do it patiently.

Simon

Cyrus Home Page: http://www.cyrusimap.org/ List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/ To Unsubscribe: https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
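[Simon's "only between 00:00 and 06:00 in their local timezone" rule is easy to express with the DateTime module from CPAN. A small sketch, assuming you keep an Olson timezone name per user somewhere; the %user_tz hash below is purely hypothetical.]

#!/usr/bin/perl
# in_move_window.pl - is it currently 00:00-06:00 in the user's own timezone,
# i.e. a quiet window in which their mailbox can be moved?
use strict;
use warnings;
use DateTime;

sub in_move_window {
    my ($tz) = @_;                                  # e.g. "Europe/London"
    my $hour = DateTime->now( time_zone => $tz )->hour;
    return $hour >= 0 && $hour < 6;
}

# hypothetical per-user timezone data; a real script would read this from
# the directory or wherever the site keeps it
my %user_tz = ( alice => 'Europe/Vienna', bob => 'Australia/Melbourne' );

for my $user (sort keys %user_tz) {
    printf "%s: %s\n", $user,
        in_move_window( $user_tz{$user} )
            ? 'inside the 00:00-06:00 window, move now'
            : 'outside the window, try again later';
}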
Re: Re: Another 2.4 upgrade horror story
On 25.09.2012 15:28, Eric Luyten wrote:
On Tue, September 25, 2012 2:01 pm, Sebastian Hagedorn wrote: [...]
+1
Sebastian, Thank you for sharing your experiences. As a site willing/needing to upgrade from 2.3.16 to 2.4.X this fall, we are interested in learning about your storage backend characteristics. What read/write IOPS rates were you registering before/during/after your upgrade process? I'd understand your reluctance to share this information in a public forum. No offence taken whatsoever!
Kind regards, Eric Luyten, Computing Centre VUB/ULB, eric.luy...@vub.ac.be

migration process from 2.3 to 2.4 took ~ one year for our installation. we converted ~200 TB of user data. first step we did - spread data on many nodes using cyrus replication. next we started converting nodes one by one on weekend nights to minimize the IO load generated by users. in fact cyrus reads all data from disk to generate the new indexes, so the conversion is limited mainly by disk IO, while CPU is pretty cheap nowadays. we got a rate of around 500 GB in 8 hours for a forced reindex at 100% disk load. we started the forced reindex with the most active users, meanwhile allowing users to log in and trigger the reindex of their own mailboxes.

Cyrus Home Page: http://www.cyrusimap.org/ List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/ To Unsubscribe: https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
Re: Another 2.4 upgrade horror story
On 2012-09-25 19:05, Simon Beale wrote: The only gotcha I experienced was that I forgot that cyrus was configured to hardlink mail, which of course was no longer the case after each mailbox was migrated, so my disk usage exploded. (But easily fixed/restored once identified).

What did you use for restoring the hardlinks? freedup as well?

I'm asking because I found a bug in freedup causing data loss. I already sent a patch fixing it to the author of freedup last November, but he didn't release a new version yet. In case cyr_expire is running while freedup tries to hardlink files, it is possible to lose both the source freedup wants to link to and the copy freedup still removes on error. Running cyr_expire and freedup (up to 1.6-2) together is a really bad idea.

If it's of interest I can provide my patch here, too.

Greetings, Wolfgang

-- Wolfgang Breyha wbre...@gmx.net | http://www.blafasel.at/ Vienna University Computer Center | Austria

Cyrus Home Page: http://www.cyrusimap.org/ List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/ To Unsubscribe: https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
Re: Another 2.4 upgrade horror story
On 2012-09-25 19:05, Simon Beale wrote: The only gotcha I experienced was that I forgot that cyrus was configured to hardlink mail, which of course was no longer the case after each mailbox was migrated, so my disk usage exploded. (But easily fixed/restored once identified).

What did you use for restoring the hardlinks? freedup as well? I'm asking because I found a bug in freedup causing data loss. I already sent a patch fixing it to the author of freedup last November, but he didn't release a new version yet. In case cyr_expire is running while freedup tries to hardlink files, it is possible to lose both the source freedup wants to link to and the copy freedup still removes on error. Running cyr_expire and freedup (up to 1.6-2) together is a really bad idea. If it's of interest I can provide my patch here, too.

Hi,

I have not used freedup for restoring the hardlinks but I'm interested in the patch. If it's not big could you post it here?

Thanks, Simon

Cyrus Home Page: http://www.cyrusimap.org/ List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/ To Unsubscribe: https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus