Aaron Stone wrote:
I wonder if anything besides full text header value searches is expensive
now that we have the header values broken out and indexed? Date ranges,
flags, and so on all seem to be rather cheap. So perhaps we can go to just
two stages: do all cheap stuff, generate a candidate list, then do all the
expensive stuff. Require two iterations of search-merge and some kind of
placeholder marker.
In the search tree there are already boolean markers (node->searched and
node-merged) because there is already some rudimentary multi-pass stuff
happening. That makes it already safe to traverse the tree multiple times.
You'd need to assign a weight factor not just to each individual search action,
but als to each non-leaf node in the n-ary search tree.
Remember that each nested search (a non-leaf node containing a bunch of leaf
nodes that define search actions) needs to be treated as a fully independent
group of search commands.
so:
SEARCH 1:* SINCE 1-Jan-2007 FROM "dbmail" BODY "patch"
is a single level search. Easy:
( (IST_SET) & (IST_IDATE) & (IST_HDR) & (IST_DATA_BODY) )
IST_SET doesn't require any separate database interaction. The set of msgid/uid
numbers is retrieved once, after that subsequent IST_SET actions are simple tree
mergers. Cost: 1 (neutral)
IST_DATE uses a well optimized date-range search on a datetime field. Cost: 1.5
IST_HDR does a full table search on the header tables. Cost: 10
IST_DATA_BODY does a full table scan using HAVING (non-indexed). Cost: 100
So total search cost for this graph would be 1*1.5*10*100: 1500
but:
SEARCH 1:* SINCE 1-Jan-2007 FROM "dbmail" OR (BODY "patch")(SUBJECT "diff")
gives:
(
(IST_SET:1)
&
(IST_IDATE:1.5)
&
(IST_HDR:10)
&
(
(IST_DATA_BODY:100)
|
(IST_HDR:10)
):1000
):15
so in this case you'd want to to the first level search first, and use the
results to qualify the second level of nesting. However, things will not always
be so straightforward. Tricky stuff.
A totally different approach would be if we could somehow try to combine
searches into single queries. That would allow us to lean on the query optimizer
in the backend. I'm not at all sure that can be done in an elegant (robust)
manner. But I havent fully explored it either.
So again lets consider the basic example above.
SEARCH 1:* SINCE 1-Jan-2000 From "test"
produces three queries:
SELECT message_idnr FROM dbmail_messages WHERE mailbox_idnr = XXX AND status IN
(0,1) ORDER BY message_idnr;
SELECT message_idnr FROM dbmail_messages m JOIN dbmail_physmessage p ON
m.physmessage_id=p.id WHERE mailbox_idnr = XXX AND status IN (0,1) AND
p.internal_date > '2000-01-01 00:00:00' ORDER BY message_idnr;
SELECT message_idnr FROM dbmail_messages m JOIN dbmail_physmessage p ON
m.physmessage_id=p.id JOIN dbmail_headervalue v ON v.physmessage_id=p.id JOIN
dbmail_headername n ON v.headername_id=n.id WHERE mailbox_idnr = XXX AND status
IN (0,1) AND headername LIKE 'from' AND headervalue LIKE '%test%' ORDER BY
message_idnr
Could we combine these using simple sub-selects? Let's try:
SELECT message_idnr FROM dbmail_messages WHERE mailbox_idnr = XXX AND status IN
(0,1)
AND message_idnr IN (
SELECT message_idnr FROM dbmail_messages m JOIN dbmail_physmessage p ON
m.physmessage_id=p.id WHERE mailbox_idnr = XXX AND status IN (0,1) AND
p.internal_date > '2000-01-01 00:00:00'
)
AND message_idnr IN (
SELECT message_idnr FROM dbmail_messages m JOIN dbmail_physmessage p ON
m.physmessage_id=p.id JOIN dbmail_headervalue v ON v.physmessage_id=p.id JOIN
dbmail_headername n ON v.headername_id=n.id WHERE mailbox_idnr = XXX AND status
IN (0,1) AND headername LIKE 'from' AND headervalue LIKE '%test%'
)
ORDER BY message_idnr;
bloody hell: it works, and very well indeed! It doesn't even require complex
query construction. Simply concatenate the queries we're using now, with some
minor modifications:
1) defer the ORDER BY clause until the very end
2) join the subselects using (AND message_idnr IN
Seems to me this is the better approach after all. I'll see to it.
--
________________________________________________________________
Paul Stevens paul at nfg.nl
NET FACILITIES GROUP GPG/PGP: 1024D/11F8CD31
The Netherlands________________________________http://www.nfg.nl
_______________________________________________
Dbmail-dev mailing list
Dbmail-dev@dbmail.org
http://twister.fastxs.net/mailman/listinfo/dbmail-dev