Re: [Dbmail-dev] [PATCH] Speed patch for subject searches

John T. Guthrie III Sat, 26 Jan 2008 22:03:15 -0800

Paul Stevens <[EMAIL PROTECTED]> wrote:
> The headervalues table is indeed too large. I've been thinking about
> that. We could store headervalues per mimepart, rather than per
> physmessage. That would not save a lot. Or we could avoid storing
> headervalues for headernames starting with X- (except those being
> fetched by imap clients). That would save probably 40-70% on headervalue
> storage. A lot of those headervalues for X- headernames are quite often
> duplicates to begin with. Try looking up how many distinct headervalues
> you have headernames like x-virus-scanned.


That actually might not be the best example.  It turns out that a timestamp
is part that headervalue.  I.e., for each message, this header will in general,
be unique.  However, the x-virus-status header does illustrate your point.
That header should normally take on one of two values, Clean or Virus.
(Although, I suppose it is possible that the virus name could be encoded
into the header value as well.)

On my test system, out of 209 headers, 159 are x-headers.

> I've created a view that provides some insight;
> 
> CREATE headername_stats AS select
>   h.id AS id,
>   h.headername AS headername,
>   count(v.id) AS frequency
>  from (
>   dbmail_headername h
>   join dbmail_headervalue v on h.id = v.headername_id
>  ) group by h.id;
> 
> A lot of headervalues can safely be deleted. So perhaps we shouldn't
> store those to begin with.

>From a purely computer science and sysadmin point of view, I think this would
make great sense.  However, I do work with several people who would be *very*
annoyed if they found out that their email system had this kind of behavior.
In general, my own design philosphy with this kind of thing would be to leave
it up to local site policy.  If this were made into a configuration option,
I would find it to more tolerable.  We can give administrators the choice
between being paranoid and faster search times....

Moreover, just looking at evolution 2.12.3, it turns out that one of its search
options is to search by mailing list.  It uses the X-BeenThere header to
accomplish this.  It also has a search type called "expression" which involves
code written in scheme.  That could mean anything.  But since the documentation
states that option is for programmers only, I'm not too worried about it.

> And perhaps we should take this even further. Consider we choose not to
> store *any* new headernames - and their values unless the headername is
> already registered in the headernames table. We pre-seed that table with
> those headernames known to be used by imap clients. As soon as a imap
> client uses a headername in a fetch or search command, we insert a new
> value in the headernames table and start storing headervalues for any
> newly injected emails.

Do most IMAP clients work by sepcifying the headers that they want, or do they
just use something like

4 fetch 1098 (BODY[HEADER] BODY[TEXT])

In which case, headers won't get added to the list very often.  Of course,
that's probably the point.

For the people like me who tend be on the paranoid side with respect to their
email, I wonder if one thing we could do is modify Paul's plan as follows:

1. We keep a list of headers (and we pre-seed this list) whose values are
kept in headervalues.
2. Any other headers get stored in extraheadername and extraheadervalue tables.
 (We could also use cruftheader as a base name as well. ;-)
3. Once a client requests a header, then that header gets moved into
 the main header tables.  I don't know if there is a nice way to move the
 data associated with that header from the extraheader tables to the main
 header tables once this happens.  If not, we would just have to start storing
 data for this header in the main header table and ignore the extraheader
 tables for this new header going forward.
4. We could then make it configurable as to whether or not something like a
 TEXT search hits the extraheader tables.
5. Something like

 5 fetch 738 (BODY[HEADER])

 will always get data from both the header and extraheader tables.

> For example, a *lot* of headervalues are for headernames like
> 'received'. But that header is hardly if ever retrieved by imap clients.

Again, the received headers are *really* useful when you need to debug mail
problems.  Moreover, they are *very* helpful in detecting phishing.  (There
are other ways as well, I know.  But a real quick look at the received headers
can make it *far* easier.)  As a sysadmin, I would be very annoyed if there
were no access to received headers in my mail.

That said, the following is quite enlightening:

dbmail=> select * from dbmail_headername where headername = 'received';
 id | headername 
----+------------
  4 | received
(1 row)

dbmail=> select count(*) from dbmail_headervalue where headername_id = 4;
  count  
---------
 2828027
(1 row)

dbmail=> select count(*) from dbmail_headervalue;
  count   
----------
 21571082
(1 row)

In other words, a full 13% of the headers are received headers.  If we were to
move all of those into their own table, I think that would provide a modest
improvement in HEADER searches.  The only searches that would require access
to this new table would be TEXT and 'HEADER RECEIVED' searches.  (The former
are probably somewhat common, but probably nowhere near the majority of
searches.)  We would still need to access the table when contructing the message
for FETCH commands as well.

John Guthrie
[EMAIL PROTECTED]
_______________________________________________
Dbmail-dev mailing list
Dbmail-dev@dbmail.org
http://twister.fastxs.net/mailman/listinfo/dbmail-dev

Re: [Dbmail-dev] [PATCH] Speed patch for subject searches

Reply via email to