On 13-04-14 22:19, KT Walrus wrote:
> DBMail already does a lot of data deduplication (headers, attachments, etc.). 
>  I’m just not clear how far this goes and whether my turning a message to a 
> list of recipients into multiple copies of the message with different To: and 
> possibly different Message-Id: affects the data de-duplication.
> 
> If I should keep the headers the same for all copies of the message to get 
> maximum data deduplication, I will.  I just prefer each recipient see the To: 
> as to only their address and not know about everyone else.
> 
> As for my “app”, it is a PHP app that uses the RoundCube Framework to provide 
> an IMAP interface to the user for accessing their mailbox and some public 
> mailboxes.  The user sends messages using SMTP and I have a milter to send 
> the message to a special outbox mailbox (in DBMail).  Then, I have a PHP cron 
> job that checks the outbox, retrieves the queued messages, preprocesses the 
> message headers, and uses dbmail-deliver to send the message to the 
> appropriate recipients.  
> 
> I have all this working quite nicely.  But, I’m trying to figure out the best 
> way to send a To: customized copy of each message to each recipient.
> 
> I need to understand how DBMail does data deduplication.
>

De-duplication is performed at two levels:

Messages are split into MIME parts. The whole RFC 2822 header is the
first part. If the body is text/plain, the whole body is a single second
and last part. If the body is multipart/* or message/rfc822, the process
is repeated for the contained message or for each of the parts that make
up the multipart. This is done recursively. Message de-construction is
unlimited, but re-construction is capped at a nesting depth of 64.

Each MIME part is stored de-duplicated in what is called single-instance
storage, keyed by a hash for faster retrieval.
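
As an illustration of the idea, a hash-keyed single-instance store could look
like the Python sketch below. The class name, the in-memory dict and the
choice of SHA-256 are all assumptions; which hash DBMail uses and how it lays
out the rows is not stated here.

```python
import hashlib


class SingleInstanceStore:
    """Toy single-instance store: identical parts share one stored copy."""

    def __init__(self):
        self._parts = {}  # hex digest -> part bytes

    def put(self, data: bytes) -> str:
        # The key is derived from the content, so storing the same bytes
        # twice returns the same key and adds nothing new.
        key = hashlib.sha256(data).hexdigest()
        self._parts.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._parts[key]

    def __len__(self) -> int:
        return len(self._parts)


store = SingleInstanceStore()
first = store.put(b"the same attachment")
second = store.put(b"the same attachment")
other = store.put(b"a different attachment")
```

Here first and second are the same key, and the store holds two parts in
total rather than three.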

Apart from the messages as a whole, the message headers are also stored
separately in two tables: the header names (to, from, subject) are stored
separately from their content, the header values. Both are stored as
unique values which are linked to each other, and to the message
instance where they occurred.
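
The name/value split can be sketched with an in-memory SQLite database in
Python. The table and column names below are simplified stand-ins chosen for
this example, not DBMail's actual schema, and the helper names intern and
store_headers are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE headername  (id INTEGER PRIMARY KEY, name  TEXT UNIQUE);
CREATE TABLE headervalue (id INTEGER PRIMARY KEY, value TEXT UNIQUE);
CREATE TABLE header (message_id INTEGER, name_id INTEGER, value_id INTEGER);
""")


def intern(table, column, text):
    """Insert text once and return its row id (table names are fixed in code)."""
    db.execute(f"INSERT OR IGNORE INTO {table} ({column}) VALUES (?)", (text,))
    row = db.execute(f"SELECT id FROM {table} WHERE {column} = ?", (text,))
    return row.fetchone()[0]


def store_headers(message_id, headers):
    """Link each (name, value) pair of one message to the unique rows."""
    for name, value in headers:
        name_id = intern("headername", "name", name.lower())
        value_id = intern("headervalue", "value", value)
        db.execute("INSERT INTO header VALUES (?, ?, ?)",
                   (message_id, name_id, value_id))


def count(table):
    return db.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# Two copies of one message that differ only in the To header.
store_headers(1, [("From", "a@example.com"), ("To", "b@example.com"),
                  ("Subject", "report")])
store_headers(2, [("From", "a@example.com"), ("To", "c@example.com"),
                  ("Subject", "report")])
```

After both calls the link table has six rows, but there are only three
distinct header names and four distinct header values.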

So if a 10MB message is delivered to one hundred users and the copies
are identical, it is fully de-duplicated and only results in a set of
rows in the messages table and, under some circumstances, the
physmessage table.

If only the To header is different, the whole RFC 2822 header is stored in
its own row in the mimeparts table, but the full body is *not* duplicated.
Whether one header, or multiple headers differ between messages is not
an issue. Any difference will lead to a separate row for the headers. Of
course, the header-names and header-values are still stored de-duplicated.
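
A small Python sketch of the effect for one hundred recipients, under the
same assumptions as any toy model: a dict keyed by SHA-256 stands in for the
part storage, the function name stored_parts is hypothetical, and a 1 MB
body is used instead of 10 MB to keep it quick.

```python
import hashlib


def stored_parts(recipients, body, customise_to):
    """Deliver one copy per recipient; return the de-duplicated parts."""
    parts = {}  # hex digest -> part bytes
    for rcpt in recipients:
        # Either every copy lists all recipients, or each copy names only
        # its own recipient in the To header.
        to = rcpt if customise_to else ", ".join(recipients)
        header = (
            "From: sender@example.com\r\n"
            f"To: {to}\r\n"
            "Subject: report\r\n"
        ).encode()
        for part in (header, body):
            parts.setdefault(hashlib.sha256(part).hexdigest(), part)
    return parts


recipients = [f"user{i}@example.com" for i in range(100)]
body = b"x" * 1_000_000  # stand-in for the large message body

shared = stored_parts(recipients, body, customise_to=False)
custom = stored_parts(recipients, body, customise_to=True)
custom_bytes = sum(len(p) for p in custom.values())
```

With identical headers the model stores two parts (one header, one body).
With a customised To it stores 101 parts: one hundred small header parts and
still a single copy of the body.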

Hope that explains it a bit.

-- 
________________________________________________________________
Paul J Stevens       pjstevns @ gmail, twitter, github, linkedin
           www.nfg.nl/i...@nfg.nl/+31.85.877.99.97
_______________________________________________
DBmail mailing list
DBmail@dbmail.org
http://mailman.fastxs.nl/cgi-bin/mailman/listinfo/dbmail
