On Montag, 4. Juni 2007 Jonathan Feally wrote:
>  `hash` varchar(64) NOT NULL,

Wouldn't it be better to say
  `sha256` varchar(64) NOT NULL,

That would explain the content, and if we'd have to switch the checksum 
once (who knows?), the new field could be named sha1024, recalculated, 
and then the old one deleted.

> I propose using sha256 as it will give a more unique key at 64 chars
> than sha1 at 40 chars and only doing the sha256 on the first 8 or 16
> MB. I can't think of any real data files that could result in the
> same size and sha256 of the first 8 MB. This could also be
> configurable in the dbmail.conf to what ever size the admin choses,
> and could be changed later, but would require dbmail to be shut down
> and the dbmail-util run to update every hash in the database to use
> the new value.

It should be even enough to take the first and last 512K of a message. 
With intelligent hardware and zero-copy mechanisms, it could be 
possible to directly pass the data stream into SQL, without copying it 
around in memory. Then the checksum calculation can be a big 
performance hit. Also, by taking the last n bytes of a messages, file 
formats like AVI that have the header in the end get a better checksum, 
whenever the beginning of a video is the same and it just differs 
later.

Another idea: We should speak with people from Razor, Pyzor and DCC. 
They always calculate checksums from e-mails. Maybe they have a great 
idea for that, so we could use that checksum for SPAM processing by 
them? That means we would natively support checksumming, a big 
advantage in SPAM fighting.     

> Features:
> All parts of the messages can be stored uniquely.

Good.

> Duplicate attachments/parts - even if they were renamed - can't occur
> - while the original attached file name remains intact.

That's great, and should really be used.

> I purposly left the boundry's in there so that original message could
> be re-built in SQL into its original complete data stream for
> external use or mail export.

Yes, very important, too. Things like SA need that.

> The is_header field can be removed as message_part=0 will always
> contain the headers of the message.

For SPAM processing, research and forensics, it would be better to store 
a header per header line, as it would be indexable. We do a lot in this 
field, and I'd say that this is a good feature if you want to re-scan 
messages for SPAM at a later time (easy example: you find out that host 
x.x.x.x is a zombie, so just search for mails received today from that 
host and rescan them). Possibly it would be good to make headers appear 
in a separate table:

CREATE TABLE `dbmail_headerparts` (
  `messagepart_idnr` bigint(20) REFERENCES 
dbmail_messageblks(`messagepart_idnr`),
  `messagepart_order` bigint(20) NOT NULL,
  `headername` varchar(255) NOT NULL,
  `headervalue` varchar(255),
  PRIMARY KEY  (`messagepart_idnr`,`messagepart_order`),
  KEY `headername` (`headername`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Or, thinking more far, most headers will be nearly the same, because 
people tend to exchange e-mail with the same people regularly, so you 
could possibly do the same "store a headerline only once" mechanism 
here that we use for the messageparts then. It would be more work to 
insert an e-mail, but searching could be a lot faster, because the 
index is kept smaller and the caches would be filled better. Having 
millions of e-mail with a headerline like
X-message-scanned-by: our-anti-spam-and-anti-virus-product v34.3.3.3.53
is not really interesting, and when we already "compress" messageblocks, 
why not do the same with headerlines?

Most people will do SPAM processing, and if we make the first mail 
server to support SPAM scanning in a very nice and quick way, that 
could be good advertising :-)

> Rebuiling of headers can be done quickly with message_part=0

This can still be done, if headerparts is a sub of message_part=0

mfg zmi
-- 
// Michael Monnerie, Ing.BSc    -----      http://it-management.at
// Tel: 0676/846 914 666                      .network.your.ideas.
// PGP Key:         "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: EA39 8918 EDFF 0A68 ACFB  11B7 BA2D 060F 1C6F E6B0
// Keyserver: www.keyserver.net                   Key-ID: 1C6FE6B0

Attachment: signature.asc
Description: This is a digitally signed message part.

_______________________________________________
DBmail mailing list
[email protected]
https://mailman.fastxs.nl/mailman/listinfo/dbmail

Reply via email to