Re: [Dbmail-dev] Searching IMAP faster

Paul J Stevens Tue, 10 Jul 2007 01:44:30 -0700

Aaron Stone wrote:

I wonder if anything besides full text header value searches is expensive
now that we have the header values broken out and indexed? Date ranges,
flags, and so on all seem to be rather cheap. So perhaps we can go to just
two stages: do all cheap stuff, generate a candidate list, then do all the
expensive stuff. Require two iterations of search-merge and some kind of
placeholder marker.

In the search tree there are already boolean markers (node->searched andnode-merged) because there is already some rudimentary multi-pass stuffhappening. That makes it already safe to traverse the tree multiple times.

You'd need to assign a weight factor not just to each individual search action,but als to each non-leaf node in the n-ary search tree.

Remember that each nested search (a non-leaf node containing a bunch of leafnodes that define search actions) needs to be treated as a fully independentgroup of search commands.


so:

SEARCH 1:* SINCE 1-Jan-2007 FROM "dbmail" BODY "patch"

is a single level search. Easy:

( (IST_SET) & (IST_IDATE) & (IST_HDR) & (IST_DATA_BODY) )

IST_SET doesn't require any separate database interaction. The set of msgid/uidnumbers is retrieved once, after that subsequent IST_SET actions are simple treemergers. Cost: 1 (neutral)


IST_DATE uses a well optimized date-range search on a datetime field. Cost: 1.5

IST_HDR does a full table search on the header tables. Cost: 10

IST_DATA_BODY does a full table scan using HAVING (non-indexed). Cost: 100

So total search cost for this graph would be 1*1.5*10*100: 1500

but:

SEARCH 1:* SINCE 1-Jan-2007 FROM "dbmail" OR (BODY "patch")(SUBJECT "diff")

gives:
(
 (IST_SET:1)
 &
 (IST_IDATE:1.5)
 &
 (IST_HDR:10)
 &
 (
  (IST_DATA_BODY:100)
  |
  (IST_HDR:10)
 ):1000
):15

so in this case you'd want to to the first level search first, and use theresults to qualify the second level of nesting. However, things will not alwaysbe so straightforward. Tricky stuff.

A totally different approach would be if we could somehow try to combinesearches into single queries. That would allow us to lean on the query optimizerin the backend. I'm not at all sure that can be done in an elegant (robust)manner. But I havent fully explored it either.


So again lets consider the basic example above.

SEARCH 1:* SINCE 1-Jan-2000 From "test"

produces three queries:

SELECT message_idnr FROM dbmail_messages WHERE mailbox_idnr = XXX AND status IN(0,1) ORDER BY message_idnr;

SELECT message_idnr FROM dbmail_messages m JOIN dbmail_physmessage p ONm.physmessage_id=p.id WHERE mailbox_idnr = XXX AND status IN (0,1) ANDp.internal_date > '2000-01-01 00:00:00' ORDER BY message_idnr;

SELECT message_idnr FROM dbmail_messages m JOIN dbmail_physmessage p ONm.physmessage_id=p.id JOIN dbmail_headervalue v ON v.physmessage_id=p.id JOINdbmail_headername n ON v.headername_id=n.id WHERE mailbox_idnr = XXX AND statusIN (0,1) AND headername LIKE 'from' AND headervalue LIKE '%test%' ORDER BYmessage_idnr


Could we combine these using simple sub-selects? Let's try:

SELECT message_idnr FROM dbmail_messages WHERE mailbox_idnr = XXX AND status IN(0,1)

AND message_idnr IN (
 SELECT message_idnr FROM dbmail_messages m JOIN dbmail_physmessage p ON
 m.physmessage_id=p.id WHERE mailbox_idnr = XXX AND status IN (0,1) AND
 p.internal_date > '2000-01-01 00:00:00'
)
AND message_idnr IN (
 SELECT message_idnr FROM dbmail_messages m JOIN dbmail_physmessage p ON
 m.physmessage_id=p.id JOIN dbmail_headervalue v ON v.physmessage_id=p.id JOIN
 dbmail_headername n ON v.headername_id=n.id WHERE mailbox_idnr = XXX AND status
 IN (0,1) AND headername LIKE 'from' AND headervalue LIKE '%test%'
)
 ORDER BY message_idnr;

bloody hell: it works, and very well indeed! It doesn't even require complexquery construction. Simply concatenate the queries we're using now, with someminor modifications:

1) defer the ORDER BY clause until the very end
2) join the subselects using (AND message_idnr IN

Seems to me this is the better approach after all. I'll see to it.

--
  ________________________________________________________________
  Paul Stevens                                      paul at nfg.nl
  NET FACILITIES GROUP                     GPG/PGP: 1024D/11F8CD31
  The Netherlands________________________________http://www.nfg.nl
_______________________________________________
Dbmail-dev mailing list
Dbmail-dev@dbmail.org
http://twister.fastxs.net/mailman/listinfo/dbmail-dev

Re: [Dbmail-dev] Searching IMAP faster

Reply via email to