On Montag, 4. Juni 2007 Jonathan Feally wrote: > `hash` varchar(64) NOT NULL,
Wouldn't it be better to say `sha256` varchar(64) NOT NULL, That would explain the content, and if we'd have to switch the checksum once (who knows?), the new field could be named sha1024, recalculated, and then the old one deleted. > I propose using sha256 as it will give a more unique key at 64 chars > than sha1 at 40 chars and only doing the sha256 on the first 8 or 16 > MB. I can't think of any real data files that could result in the > same size and sha256 of the first 8 MB. This could also be > configurable in the dbmail.conf to what ever size the admin choses, > and could be changed later, but would require dbmail to be shut down > and the dbmail-util run to update every hash in the database to use > the new value. It should be even enough to take the first and last 512K of a message. With intelligent hardware and zero-copy mechanisms, it could be possible to directly pass the data stream into SQL, without copying it around in memory. Then the checksum calculation can be a big performance hit. Also, by taking the last n bytes of a messages, file formats like AVI that have the header in the end get a better checksum, whenever the beginning of a video is the same and it just differs later. Another idea: We should speak with people from Razor, Pyzor and DCC. They always calculate checksums from e-mails. Maybe they have a great idea for that, so we could use that checksum for SPAM processing by them? That means we would natively support checksumming, a big advantage in SPAM fighting. > Features: > All parts of the messages can be stored uniquely. Good. > Duplicate attachments/parts - even if they were renamed - can't occur > - while the original attached file name remains intact. That's great, and should really be used. > I purposly left the boundry's in there so that original message could > be re-built in SQL into its original complete data stream for > external use or mail export. Yes, very important, too. Things like SA need that. > The is_header field can be removed as message_part=0 will always > contain the headers of the message. For SPAM processing, research and forensics, it would be better to store a header per header line, as it would be indexable. We do a lot in this field, and I'd say that this is a good feature if you want to re-scan messages for SPAM at a later time (easy example: you find out that host x.x.x.x is a zombie, so just search for mails received today from that host and rescan them). Possibly it would be good to make headers appear in a separate table: CREATE TABLE `dbmail_headerparts` ( `messagepart_idnr` bigint(20) REFERENCES dbmail_messageblks(`messagepart_idnr`), `messagepart_order` bigint(20) NOT NULL, `headername` varchar(255) NOT NULL, `headervalue` varchar(255), PRIMARY KEY (`messagepart_idnr`,`messagepart_order`), KEY `headername` (`headername`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8; Or, thinking more far, most headers will be nearly the same, because people tend to exchange e-mail with the same people regularly, so you could possibly do the same "store a headerline only once" mechanism here that we use for the messageparts then. It would be more work to insert an e-mail, but searching could be a lot faster, because the index is kept smaller and the caches would be filled better. Having millions of e-mail with a headerline like X-message-scanned-by: our-anti-spam-and-anti-virus-product v34.3.3.3.53 is not really interesting, and when we already "compress" messageblocks, why not do the same with headerlines? Most people will do SPAM processing, and if we make the first mail server to support SPAM scanning in a very nice and quick way, that could be good advertising :-) > Rebuiling of headers can be done quickly with message_part=0 This can still be done, if headerparts is a sub of message_part=0 mfg zmi -- // Michael Monnerie, Ing.BSc ----- http://it-management.at // Tel: 0676/846 914 666 .network.your.ideas. // PGP Key: "curl -s http://zmi.at/zmi.asc | gpg --import" // Fingerprint: EA39 8918 EDFF 0A68 ACFB 11B7 BA2D 060F 1C6F E6B0 // Keyserver: www.keyserver.net Key-ID: 1C6FE6B0
signature.asc
Description: This is a digitally signed message part.
_______________________________________________ DBmail mailing list [email protected] https://mailman.fastxs.nl/mailman/listinfo/dbmail
