I am looking for comments on this since a final document of some sorts needs to be placed in the amavisd-new/docs/readme files to try to document what schema changes are needed for mysql partitioning. Mark says we have about 2.6.4 coming up, and maybe this should be documented. After reading my first draft of this, it all seems so simple, but it took a LONG time for me to get this right.
With the advent of mysql 5.1 and amavisd-new support for partition_tags, lookups, updates and quarantine cleanup have been greatly improved. mysql had attempted to assist quarantine cleanup by setting up foreign keys, but testing shows that mysql with foreign keys took longer to delete/cleanup records than the separate operations. Some installations have even used mysql for bayes/awl and postgres for the rest since postgres didn't have the performance problems with foreign keys. (Mark: this will REALLY speed things up, and you can save all that ram you used in postgres for disk cache) This first attempt (draft) will try to document the steps needed to A) create a new, from scratch mysql/amavisd installation that supports partitions B) give options/ maybe schema scripts that might take an existing amavisd 2.6x installation and add partitioning. the 'change' one is harder, since you also have to keep something with partition_tag 0 in it for a while, and changing primary keys is rough. you need to drop the index and add it again. A lot of suggestions and assistance on this were given by Mark Martinec, Thomas Gelf and Nathan M. First, a discussion on two different partition schemes: (see <http://www.mail-archive.com/amavis-user@lists.sourceforge.net/msg14271.html> Thomas Gelf: (linear) and Nathan M's (staggered). I selected a variation of Nathan M's schema since we allow our users to select up to 28 days of quarantine. (so I needed 5 partitions, not 4 like Nathan's more on that later) 5 partitions are less than 9 as in Thomas's, so you have less files on the disk, (9+1*4 vs 5+1*4). and access and cleanup are still instantaneous, and it doesn't matter if you change from 7 days to 14 days to 16 days to 24 days to 28 days while running. In fact, in our service, some clients quarantine viruses for 2 days, attachments for 7 and spam for 14. This is what we are currently using (note the addition of the year, I have a modified iso8601_year() routine I put into amavisd.conf to support an iso year/week. Also note that the iso year is NOT the calender year, last or first week of the calender year sometimes.) CREATE TABLE maddr ( id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT, partition_tag integer NOT NULL DEFAULT 0, email varbinary(255), domain varchar(255) NOT NULL, -- only domain part of the email address -- with subdomain fields in reverse PRIMARY KEY (id,partition_tag), UNIQUE KEY maddr_idx_tag_email (partition_tag, email) ) ENGINE=InnoDB PARTITION BY LIST (partition_tag) (PARTITION p20091 VALUES IN (200901,200906,200911,200916,200921,200926,200931,200936,200941,200946,200951), PARTITION p20092 VALUES IN (200902,200907,200912,200917,200922,200927,200932,200937,200942,200947,200952), PARTITION p20093 VALUES IN (200903,200908,200913,200918,200923,200928,200933,200938,200943,200948,200953), PARTITION p20094 VALUES IN (200904,200909,200914,200919,200924,200929,200934,200939,200944,200949), PARTITION p20095 VALUES IN (200905,200910,200915,200920,200925,200930,200935,200940,200945,200950)); The basic premise is that even with 28 days max quarantine, no single partition will ever have more than two weeks of data. Weekly cleanup is a matter of a cron or sql event entry at 11:45pm on Sunday night. I truncate the partition that contains NEXT WEEKS data. something like (if NEXT week, in 15 mins.. is in 200904,200909,200914,200919,200924,200929,200934,200939,200944,200949) ALTER TABLE maddr DROP PARTITION p20094; ALTER TABLE maddr ADD PARTITION (PARTITION p20094 VALUES IN ( 200904,200909,200914,200919,200924,200929,200934,200939,200944,200949) (mysql 5.4 is rumored to have a truncate partition) (at 11:45pm, that means that if someone was logged on from 11:45 till midnight and wanted to get email quarantined 27 days, 23 hours and 45 mins ago, he is out of luck) So, in reality, our max quarantine is 27 days 23 hours and 45 mins :-) if you NEVER allow a client to keep more than 21 days, you only need 4 partitions, 14 days, 3. Daily cleanup is really optional, and you really don't need to do anything much more than wipe out old msgs records (YMMV), so at 12:01, we run a routine that cleans up 1) DHA logs (content = '?') 2) viruses older than client setting virus days 3) old banned attachments 4) old bad headers 5) old spam (with 2 - 5 having different dates!) Our old daily cleanup did this and sometimes took a couple hours to do them all (content S, V, etc, one at a time) delete from msgs , quarantine, msgrcpt using msgs natural join quarantine natural join msgrcpt where time_num<=$spam_quarantine2 and content = 'S' and partition_tag <= '$spam_partition_tag'"; new daily only takes seconds on same server, same load delete from msgs where time_num<=$spam_quarantine2 and content = 'S' and partition_tag <= '$spam_partition_tag'"; what about orphaned maddr, msgrcpt, quarantine? don't care. they are gone in less than 6 days, and our reports did joins on msgs,maddr,msgrcpt, quarantine, etc. if the record wasn't send (ie, no sender, if no msgs record), our report didn't even try to match it to msgrcpts. Now, on to create tables, and two problems: #1,any primary index used has to be part of the partitioning, so partition_tag has to be part of it. #2, any multi-key index also has to use partition_tag as part of it. new schema needs to be something like this: (note the id,partition_tag for primary key), partition_tag, email for unique, second key. Note that I have partition_tag,mail_id in that order for msgs, msgrcpt and quarantine. also note that you MUST use at LEAST partition tag in most of these queries for the optimizer to find your records, so, doing a select * from msgs where mail_id='lkjlklkjlj' might take a long time if you don't know what date it was created (and thus partition_tag) Also note, that even with 14 days of data in each partition (as mine above) they are separated by several weeks and time_iso, time_num indexes solve that last issue. (and I clean out msgs records daily) *for maddr:* PRIMARY KEY (id,partition_tag) UNIQUE KEY maddr_idx_tag_email (partition_tag, email) *for msgs:* PRIMARY KEY (partition_tag,mail_id) *for msgrcpt:*(I don't remember why Its not a primary key, and/or why it allows duplicates) KEY (partition_tag,mail_id) *for quarantine:* PRIMARY KEY (partition_tag,mail_id,chunk_ind) COMMENTS? Suggestions? corrections? -- Michael Scheidell, CTO Phone: 561-999-5000, x 1259 > *| *SECNAP Network Security Corporation * Certified SNORT Integrator * 2008-9 Hot Company Award Winner, World Executive Alliance * Five-Star Partner Program 2009, VARBusiness * Best Anti-Spam Product 2008, Network Products Guide * King of Spam Filters, SC Magazine 2008 _________________________________________________________________________ This email has been scanned and certified safe by SpammerTrap(r). For Information please see http://www.secnap.com/products/spammertrap/ _________________________________________________________________________ ------------------------------------------------------------------------------ Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day trial. Simplify your report design, integration and deployment - and focus on what you do best, core application coding. Discover what's new with Crystal Reports now. http://p.sf.net/sfu/bobj-july _______________________________________________ AMaViS-user mailing list AMaViS-user@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/amavis-user AMaViS-FAQ:http://www.amavis.org/amavis-faq.php3 AMaViS-HowTos:http://www.amavis.org/howto/