Paul Stevens <[EMAIL PROTECTED]> wrote: > The headervalues table is indeed too large. I've been thinking about > that. We could store headervalues per mimepart, rather than per > physmessage. That would not save a lot. Or we could avoid storing > headervalues for headernames starting with X- (except those being > fetched by imap clients). That would save probably 40-70% on headervalue > storage. A lot of those headervalues for X- headernames are quite often > duplicates to begin with. Try looking up how many distinct headervalues > you have headernames like x-virus-scanned.
That actually might not be the best example. It turns out that a timestamp is part that headervalue. I.e., for each message, this header will in general, be unique. However, the x-virus-status header does illustrate your point. That header should normally take on one of two values, Clean or Virus. (Although, I suppose it is possible that the virus name could be encoded into the header value as well.) On my test system, out of 209 headers, 159 are x-headers. > I've created a view that provides some insight; > > CREATE headername_stats AS select > h.id AS id, > h.headername AS headername, > count(v.id) AS frequency > from ( > dbmail_headername h > join dbmail_headervalue v on h.id = v.headername_id > ) group by h.id; > > A lot of headervalues can safely be deleted. So perhaps we shouldn't > store those to begin with. >From a purely computer science and sysadmin point of view, I think this would make great sense. However, I do work with several people who would be *very* annoyed if they found out that their email system had this kind of behavior. In general, my own design philosphy with this kind of thing would be to leave it up to local site policy. If this were made into a configuration option, I would find it to more tolerable. We can give administrators the choice between being paranoid and faster search times.... Moreover, just looking at evolution 2.12.3, it turns out that one of its search options is to search by mailing list. It uses the X-BeenThere header to accomplish this. It also has a search type called "expression" which involves code written in scheme. That could mean anything. But since the documentation states that option is for programmers only, I'm not too worried about it. > And perhaps we should take this even further. Consider we choose not to > store *any* new headernames - and their values unless the headername is > already registered in the headernames table. We pre-seed that table with > those headernames known to be used by imap clients. As soon as a imap > client uses a headername in a fetch or search command, we insert a new > value in the headernames table and start storing headervalues for any > newly injected emails. Do most IMAP clients work by sepcifying the headers that they want, or do they just use something like 4 fetch 1098 (BODY[HEADER] BODY[TEXT]) In which case, headers won't get added to the list very often. Of course, that's probably the point. For the people like me who tend be on the paranoid side with respect to their email, I wonder if one thing we could do is modify Paul's plan as follows: 1. We keep a list of headers (and we pre-seed this list) whose values are kept in headervalues. 2. Any other headers get stored in extraheadername and extraheadervalue tables. (We could also use cruftheader as a base name as well. ;-) 3. Once a client requests a header, then that header gets moved into the main header tables. I don't know if there is a nice way to move the data associated with that header from the extraheader tables to the main header tables once this happens. If not, we would just have to start storing data for this header in the main header table and ignore the extraheader tables for this new header going forward. 4. We could then make it configurable as to whether or not something like a TEXT search hits the extraheader tables. 5. Something like 5 fetch 738 (BODY[HEADER]) will always get data from both the header and extraheader tables. > For example, a *lot* of headervalues are for headernames like > 'received'. But that header is hardly if ever retrieved by imap clients. Again, the received headers are *really* useful when you need to debug mail problems. Moreover, they are *very* helpful in detecting phishing. (There are other ways as well, I know. But a real quick look at the received headers can make it *far* easier.) As a sysadmin, I would be very annoyed if there were no access to received headers in my mail. That said, the following is quite enlightening: dbmail=> select * from dbmail_headername where headername = 'received'; id | headername ----+------------ 4 | received (1 row) dbmail=> select count(*) from dbmail_headervalue where headername_id = 4; count --------- 2828027 (1 row) dbmail=> select count(*) from dbmail_headervalue; count ---------- 21571082 (1 row) In other words, a full 13% of the headers are received headers. If we were to move all of those into their own table, I think that would provide a modest improvement in HEADER searches. The only searches that would require access to this new table would be TEXT and 'HEADER RECEIVED' searches. (The former are probably somewhat common, but probably nowhere near the majority of searches.) We would still need to access the table when contructing the message for FETCH commands as well. John Guthrie [EMAIL PROTECTED] _______________________________________________ Dbmail-dev mailing list Dbmail-dev@dbmail.org http://twister.fastxs.net/mailman/listinfo/dbmail-dev