Paul J Stevens wrote:
> Charles Marcus wrote:
>
>> There should obviously be a method for dealing with high-load
>> conditions, where DBMail stores the message initially without doing the
>> SIS work, but flags it for processing later when the load goes below a
>> pre-configured level.
>
> I don't see how that's obvious. If we move to storing unique mime-parts
> only once, the only way to store any mime-part is by calculating the
> sha1 value for such a mime-part. What you are proposing is a stepped
> insertion. That would require a full mail-spool type setup. Better to
> leave that to the MTA: simply refuse SMTP/LMTP connections when the
> load is too high.

You wouldn't really need a spool or other heavyweight machinery. Just
store all incoming email attachments in the DB as you would if you
weren't checking for duplicates. Then, at some later point, a "util"
program can come along, check all unflagged messages, see whether their
attachments are duplicates of existing ones, and if so change the data
in the table to point those messages at the "master" attachment.
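For illustration, a minimal sketch of what one run of that deferred pass
could look like. The schema here is entirely made up (a mimeparts table
holding blobs with a nullable hash column, and a partrefs table mapping
messages to parts); these are not DBMail's actual tables:

    import hashlib
    import sqlite3

    def dedup_pass(db_path):
        """One run of the hypothetical dedup utility: hash every part
        stored without SIS processing, and repoint duplicates at the
        first ("master") copy carrying the same SHA1."""
        con = sqlite3.connect(db_path)
        cur = con.cursor()
        # Parts inserted under load are left unprocessed (hash IS NULL).
        pending = cur.execute(
            "SELECT id, data FROM mimeparts WHERE hash IS NULL").fetchall()
        for part_id, data in pending:
            digest = hashlib.sha1(data).hexdigest()
            row = cur.execute(
                "SELECT id FROM mimeparts WHERE hash = ?",
                (digest,)).fetchone()
            if row:
                # Duplicate: repoint references at the master copy and
                # drop the redundant blob.
                cur.execute(
                    "UPDATE partrefs SET part_id = ? WHERE part_id = ?",
                    (row[0], part_id))
                cur.execute("DELETE FROM mimeparts WHERE id = ?",
                            (part_id,))
            else:
                # First copy seen: record its hash so it becomes the
                # master for later duplicates.
                cur.execute("UPDATE mimeparts SET hash = ? WHERE id = ?",
                            (digest, part_id))
        con.commit()
        con.close()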
That raises an interesting point, though: on deletion of a message,
you're going to have to scan the database to see whether anybody else
is still using "your" attachment, and if so leave it behind. The utils
program will also need to check for orphaned attachments.

>>> Although as an indicator I timed an md5sum on a 2.4GB file and got
>>> about 48 seconds (Pentium D ~2.8GHz or so, 15krpm SCSI HDD, Ubuntu
>>> 6.10), so at 100% CPU you can MD5 about 50MB of data per second;
>>> probably not worth the hassle of a separate run. That's not so bad.
>>> (50MB emails would, I hope, be fairly rare?)
>
> Don't count on it. Once we have this setup, using dbmail as an archive
> server is that much more attractive. People may very well start using
> it to store big files, and a lot of them!
>
> Btw, MD5 is out. If I do this, SHA1 seems much better (less chance of
> collisions), unless something better shows up on the radar.

I just did the test again using openssl, and it appears disk-limited on
my machine (which is a good sign): it pulled about 20% CPU in both the
md5 and sha1 tests, and I know this disk can sustain about 80MB/s. So
really I don't see a reason not to process it on receipt. Provided the
hash table is kept small, i.e. "md5/sha1, file name, file size, blob
id", then even with several million attachments it should be tiny, and
a worst-case full table scan would take no time at all (hopefully the
whole thing would get cached, and indexing would help too). You could
even see a performance increase on disk-bound systems, because you
won't need to write large blobs to the database.
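To make the on-receipt path concrete, a rough sketch under the same
caveat: the table names (part_hashes with the sha1/size/blob-id columns
described above plus a reference count, and a separate blobs table) are
invented for illustration, not DBMail's schema. Keeping a refcount in
the small lookup table sidesteps a full scan on every delete, while the
utils program can still sweep for true orphans:

    import hashlib

    def store_part(cur, data):
        """Store a mime part on receipt: hash it first, reuse an
        existing blob when the digest is already known, otherwise
        insert a new master. Returns the blob id."""
        digest = hashlib.sha1(data).hexdigest()
        row = cur.execute(
            "SELECT blob_id FROM part_hashes WHERE sha1 = ? AND size = ?",
            (digest, len(data))).fetchone()
        if row:
            blob_id = row[0]
        else:
            cur.execute("INSERT INTO blobs (data) VALUES (?)", (data,))
            blob_id = cur.lastrowid
            cur.execute(
                "INSERT INTO part_hashes (sha1, size, blob_id, refs) "
                "VALUES (?, ?, ?, 0)", (digest, len(data), blob_id))
        # Every message pointing at this blob bumps the refcount.
        cur.execute(
            "UPDATE part_hashes SET refs = refs + 1 WHERE blob_id = ?",
            (blob_id,))
        return blob_id

    def release_part(cur, blob_id):
        """On message deletion: drop a reference, and only remove the
        blob once nobody else points at it (the orphan problem above)."""
        cur.execute(
            "UPDATE part_hashes SET refs = refs - 1 WHERE blob_id = ?",
            (blob_id,))
        refs, = cur.execute(
            "SELECT refs FROM part_hashes WHERE blob_id = ?",
            (blob_id,)).fetchone()
        if refs <= 0:
            cur.execute("DELETE FROM part_hashes WHERE blob_id = ?",
                        (blob_id,))
            cur.execute("DELETE FROM blobs WHERE id = ?", (blob_id,))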
