Re: [Dbmail-dev] header storage schema changes. (W)Here we (may) go.

Paul J Stevens Sat, 1 Jan 2005 22:21:17 +0100 (CET)

Matthew T. O'Connor wrote:

Agreed, the only reason I propose the header_list table is for performance reasons. This allows the header_values table (which will be much bigger) to be searched based on an int comparison rather than text search. I think this is a serious performance boost, but if it's proven that it's not, then we don't need it.


I'm still doubtful.

header_list: ( Contains an exhaustive list of all headers from all messages in the database. )


[...]

header_values: ( Contains the values from all the headers in all the messages in the database )


[...]

This structure will make it very easy to query all the headers from a given message or find all the messages with a given header, or a given header value. It also leaves our current structure intact, which will make it easier to phase in.

It will work, but you still require a lookup on an extra table to do header searches. There has to be a hidden cost there. Also, the added db-interaction at insertion time is very significant.

What I like about the Yukatan approach is that it tries to make searches on very common headers as cheap and fast as possible. It does this by going one step beyond separate header storage: it preparses certain headers for common attributes. The in-reply-to and references headers will be used for threading. These headers contain one or more message-id header values, which are stored separately. With this approach, building message-threads can be done with fully indexed, single-table queries! Of course a union with the msgids from the opened mailbox is still required, but you can't beat such a setup wrt threading.
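The preparsing idea above can be sketched roughly as follows. This is only an illustration in Python with sqlite3, not Yukatan's or dbmail's actual schema: the table name `thread_refs`, its columns, and the helper names are all assumptions made up for the example.

```python
import re
import sqlite3

# A message-id is taken to be anything between angle brackets.
MSGID_RE = re.compile(r"<[^<>\s]+>")


def extract_msgids(value):
    """Return every <message-id> found in a raw header value."""
    return MSGID_RE.findall(value)


conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per (message, referenced message-id).
conn.execute(
    "CREATE TABLE thread_refs ("
    " physmessage_id INTEGER NOT NULL,"
    " referenced_msgid TEXT NOT NULL)"
)
conn.execute("CREATE INDEX thread_refs_msgid ON thread_refs (referenced_msgid)")


def store_references(conn, physmessage_id, headers):
    """Preparse in-reply-to and references once, at insertion time."""
    seen = set()
    for name in ("in-reply-to", "references"):
        for msgid in extract_msgids(headers.get(name, "")):
            if msgid not in seen:  # in-reply-to usually repeats a references entry
                seen.add(msgid)
                conn.execute(
                    "INSERT INTO thread_refs (physmessage_id, referenced_msgid)"
                    " VALUES (?, ?)",
                    (physmessage_id, msgid),
                )


def replies_to(conn, msgid):
    """Single-table, indexed lookup: which messages reference msgid?"""
    rows = conn.execute(
        "SELECT physmessage_id FROM thread_refs"
        " WHERE referenced_msgid = ? ORDER BY physmessage_id",
        (msgid,),
    )
    return [row[0] for row in rows]


store_references(conn, 2, {"in-reply-to": "<a@example.org>"})
store_references(conn, 3, {
    "in-reply-to": "<b@example.org>",
    "references": "<a@example.org> <b@example.org>",
})
```

The point of the sketch is that the thread lookup never touches the generic header tables; all the parsing cost is paid once when the message is stored.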

What do you think? I don't think we need to special case any headers, not even sendername or subject.

The more I think about this, the more convinced I am that we do need to treat common headers differently in the end. And your suggestion to preseed the header_names table and hardcode their ids tells me we're agreed here.

The other major thing they have on us is that they manage each MIME entity inside an email separately. This has nice advantages for searching headers of attachments etc. That would be a large change from our datamodel, but might be nice to think about someday.

We can *easily* switch to storing attachments separately. The whole messageblk approach will support this flawlessly. In fact, I can switch to storing attachments in separate blks today, and no one will notice because it will be transparent from the point of view of message retrieval.

But I don't think searching for certain attachment headers is a very common use-case at the moment. IIRC, IMAP doesn't support this. It could probably help though in building the BODYSTRUCTURE response, so let's consider this for a moment.

If I want to store mime-part headers the same as message-headers, the first thing I have to do is change the physmessage_id references in the header tables to messageblk_idnr references. Next I would have to store mime-part headers the same way as message-headers. Finally I would have to change the messageblks model


from:

block[0]: message-header
block[1]: message-body[slice1]
block[N]: message-body[sliceN]

to:

block[0]: raw message-headers
block[1]: mime-part[0]
block[N]: mime-part[N]

The only really tricky part is converting an existing dbmail storage to this modified approach. Still, way cool stuff.
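As a rough illustration of the modified block layout, here is a sketch in Python using the standard email package. It is not dbmail code (dbmail is written in C); `split_into_blocks` and `SAMPLE` are names invented for the example, and it assumes a message with LF line endings.

```python
from email import message_from_string

# Hypothetical two-part message used only for illustration.
SAMPLE = (
    "From: a@example.org\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--XX\n"
    "Content-Type: application/octet-stream\n"
    'Content-Disposition: attachment; filename="f.bin"\n'
    "\n"
    "data\n"
    "--XX--\n"
)


def split_into_blocks(raw):
    """Return [raw message-headers, mime-part[0], ..., mime-part[N]]."""
    # block[0]: everything before the first empty line, kept verbatim.
    header_text, _, _ = raw.partition("\n\n")
    blocks = [header_text]

    msg = message_from_string(raw)
    if msg.is_multipart():
        # walk() also yields the multipart containers; keep only leaf parts.
        for part in msg.walk():
            if not part.is_multipart():
                blocks.append(part.as_string())  # part headers + part body
    else:
        # A plain message has a single body block.
        blocks.append(msg.get_payload())
    return blocks


blocks = split_into_blocks(SAMPLE)
```

Each mime-part block carries its own headers, which is what would let the header tables reference a block instead of the whole physmessage.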


But that's me getting ahead of myself. Small steps.

Their model makes a lot of sense in many ways, but I still don't like some of it. They special case a handful of headers for each MIME entity, not only by having a separate copy in the entity table but also by having a separate table for many of these headers. Perhaps someday we can add this for performance reasons, but I don't think we need to. Yukatan also has a headers table much like the one I described above; they don't have the header_list table broken out the way I do, but I think that is why they need to special case a lot of the headers.

So in summary, yes the Yukatan model is nice, and has a lot of advantages over ours, but I still think we are best served by starting with the two table design I described earlier. This should be sufficiently fast and flexible that we can go a long way before we have to special case anything.

I'll keep your idea in mind as I start building test-cases for the insertion phase. But I still think that, given the choice between a convoluted 'if not select header_name then insert header_name; insert header_value;' and a simple 'insert header_name, header_value' for each header of each message inserted, the latter seems to make more sense for now.
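To make the two insertion paths concrete, here is a minimal sketch in Python with sqlite3. The table names header_list and header_values come from the proposal quoted above, but the columns are pure assumptions (the proposed schema is not reproduced in this message), and `flat_headers` is a name invented for the single-table alternative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Two-table design: names live in header_list, values point at them by id.
-- Column names here are assumptions, not the proposed schema.
CREATE TABLE header_list (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE header_values (
    physmessage_id INTEGER NOT NULL,
    header_id INTEGER NOT NULL REFERENCES header_list (id),
    value TEXT NOT NULL
);
-- Single-table alternative (hypothetical name).
CREATE TABLE flat_headers (
    physmessage_id INTEGER NOT NULL,
    name TEXT NOT NULL,
    value TEXT NOT NULL
);
""")


def insert_normalized(conn, physmessage_id, name, value):
    """'if not select header_name then insert header_name; insert header_value'"""
    name = name.lower()
    row = conn.execute(
        "SELECT id FROM header_list WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        cur = conn.execute("INSERT INTO header_list (name) VALUES (?)", (name,))
        header_id = cur.lastrowid
    else:
        header_id = row[0]
    conn.execute(
        "INSERT INTO header_values (physmessage_id, header_id, value)"
        " VALUES (?, ?, ?)",
        (physmessage_id, header_id, value),
    )


def insert_flat(conn, physmessage_id, name, value):
    """'insert header_name, header_value': one statement per header."""
    conn.execute(
        "INSERT INTO flat_headers (physmessage_id, name, value) VALUES (?, ?, ?)",
        (physmessage_id, name.lower(), value),
    )


for msg_id, name, value in [(1, "Subject", "hello"), (1, "From", "a@example.org"),
                            (2, "Subject", "re: hello")]:
    insert_normalized(conn, msg_id, name, value)
    insert_flat(conn, msg_id, name, value)
```

The normalized path costs two or three statements per header against one for the flat path; whether the int comparison at search time pays that back is exactly what test-cases would have to show.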


But I'm always open to arguments and willing to change my mind (as-if my wife 
would say :-)

Thanks for helping me think this through a little.



--
  ________________________________________________________________
  Paul Stevens                                  mailto:[EMAIL PROTECTED]
  NET FACILITIES GROUP                     PGP: finger [EMAIL PROTECTED]
  The Netherlands________________________________http://www.nfg.nl
