Aaron Stone wrote:

I wonder if anything besides full text header value searches is expensive
now that we have the header values broken out and indexed? Date ranges,
flags, and so on all seem to be rather cheap. So perhaps we can go to just
two stages: do all cheap stuff, generate a candidate list, then do all the
expensive stuff. Require two iterations of search-merge and some kind of
placeholder marker.

In the search tree there are already boolean markers (node->searched and node-merged) because there is already some rudimentary multi-pass stuff happening. That makes it already safe to traverse the tree multiple times.

You'd need to assign a weight factor not just to each individual search action, but als to each non-leaf node in the n-ary search tree.

Remember that each nested search (a non-leaf node containing a bunch of leaf nodes that define search actions) needs to be treated as a fully independent group of search commands.

so:

SEARCH 1:* SINCE 1-Jan-2007 FROM "dbmail" BODY "patch"

is a single level search. Easy:

( (IST_SET) & (IST_IDATE) & (IST_HDR) & (IST_DATA_BODY) )

IST_SET doesn't require any separate database interaction. The set of msgid/uid numbers is retrieved once, after that subsequent IST_SET actions are simple tree mergers. Cost: 1 (neutral)

IST_DATE uses a well optimized date-range search on a datetime field. Cost: 1.5

IST_HDR does a full table search on the header tables. Cost: 10

IST_DATA_BODY does a full table scan using HAVING (non-indexed). Cost: 100

So total search cost for this graph would be 1*1.5*10*100: 1500

but:

SEARCH 1:* SINCE 1-Jan-2007 FROM "dbmail" OR (BODY "patch")(SUBJECT "diff")

gives:
(
 (IST_SET:1)
 &
 (IST_IDATE:1.5)
 &
 (IST_HDR:10)
 &
 (
  (IST_DATA_BODY:100)
  |
  (IST_HDR:10)
 ):1000
):15

so in this case you'd want to to the first level search first, and use the results to qualify the second level of nesting. However, things will not always be so straightforward. Tricky stuff.

A totally different approach would be if we could somehow try to combine searches into single queries. That would allow us to lean on the query optimizer in the backend. I'm not at all sure that can be done in an elegant (robust) manner. But I havent fully explored it either.

So again lets consider the basic example above.

SEARCH 1:* SINCE 1-Jan-2000 From "test"

produces three queries:

SELECT message_idnr FROM dbmail_messages WHERE mailbox_idnr = XXX AND status IN (0,1) ORDER BY message_idnr;

SELECT message_idnr FROM dbmail_messages m JOIN dbmail_physmessage p ON m.physmessage_id=p.id WHERE mailbox_idnr = XXX AND status IN (0,1) AND p.internal_date > '2000-01-01 00:00:00' ORDER BY message_idnr;

SELECT message_idnr FROM dbmail_messages m JOIN dbmail_physmessage p ON m.physmessage_id=p.id JOIN dbmail_headervalue v ON v.physmessage_id=p.id JOIN dbmail_headername n ON v.headername_id=n.id WHERE mailbox_idnr = XXX AND status IN (0,1) AND headername LIKE 'from' AND headervalue LIKE '%test%' ORDER BY message_idnr

Could we combine these using simple sub-selects? Let's try:

SELECT message_idnr FROM dbmail_messages WHERE mailbox_idnr = XXX AND status IN (0,1)
AND message_idnr IN (
 SELECT message_idnr FROM dbmail_messages m JOIN dbmail_physmessage p ON
 m.physmessage_id=p.id WHERE mailbox_idnr = XXX AND status IN (0,1) AND
 p.internal_date > '2000-01-01 00:00:00'
)
AND message_idnr IN (
 SELECT message_idnr FROM dbmail_messages m JOIN dbmail_physmessage p ON
 m.physmessage_id=p.id JOIN dbmail_headervalue v ON v.physmessage_id=p.id JOIN
 dbmail_headername n ON v.headername_id=n.id WHERE mailbox_idnr = XXX AND status
 IN (0,1) AND headername LIKE 'from' AND headervalue LIKE '%test%'
)
 ORDER BY message_idnr;

bloody hell: it works, and very well indeed! It doesn't even require complex query construction. Simply concatenate the queries we're using now, with some minor modifications:
1) defer the ORDER BY clause until the very end
2) join the subselects using (AND message_idnr IN

Seems to me this is the better approach after all. I'll see to it.

--
  ________________________________________________________________
  Paul Stevens                                      paul at nfg.nl
  NET FACILITIES GROUP                     GPG/PGP: 1024D/11F8CD31
  The Netherlands________________________________http://www.nfg.nl
_______________________________________________
Dbmail-dev mailing list
Dbmail-dev@dbmail.org
http://twister.fastxs.net/mailman/listinfo/dbmail-dev

Reply via email to