Here is the analysis of the query:
---------
explain analyze SELECT distinct ip FROM rawdata WHERE sid='2';
<puts_on_database_consulting_hat/>
Ah, the hat is XHTML 1.0 Strict, I see ;-)
Err... XML strict, but close enough. :)
This never came up in our discussions of whether to use one or two
tables for fastheaders (one table in key/value pairs, or two tables in
an id/key + keyid/value relationship)... but it seems to me that we'd
have a problem if 10% of the keys were From, another 10% To, then
Subject, Date, Received... and then the last 10% a bunch of random
weird headers, like from spam checkers or whatever else. (Obviously
these are example figures; there are many more common headers.)
Trying to search for all keys in a given mailbox with such-and-such
criteria might also trigger a sequential scan of death, no?
Absolutely!!! But there are a few tricks worth implementing and a few
things to keep in mind:
1) PostgreSQL's planner isn't stupid and knows how to work with large
data sets very well.
2) PostgreSQL supports partial indexes (ie: CREATE INDEX tbl_col_idx ON
tbl (col) WHERE bar = 123;) and its planner makes active use of this
information.
3) The full set of distinct email header names would likely fit on a
single disk page, which makes headers an ideal candidate for being
stuffed into their own table and referenced by ID. Pg would then do an
index scan on the header table, grab the id, and use the partial index
support to selectively grab certain headers insanely fast (see the
sketch after this list).
4) There's a table partitioning trick using inheritance that can be
used too, but I'll save that rabbit for later.
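To make (2) and (3) concrete, here's a rough sketch. The table and
column names are mine, not dbmail's, and it assumes header_id 1 got
assigned to the From header:

    -- The set of distinct header names is tiny and lives in its own table.
    CREATE TABLE header_names (
        header_id   SERIAL PRIMARY KEY,
        header_name TEXT NOT NULL UNIQUE
    );

    -- One row per header per message, referencing the name by id.
    CREATE TABLE header_values (
        message_id   INTEGER NOT NULL,
        header_id    INTEGER NOT NULL REFERENCES header_names,
        header_value TEXT NOT NULL
    );

    -- Partial index covering only From headers (header_id 1 here), so
    -- lookups on them never touch the rest of the key/value heap.
    CREATE INDEX header_values_from_idx ON header_values (message_id)
        WHERE header_id = 1;

A query like

    SELECT header_value FROM header_values
     WHERE message_id = 42 AND header_id = 1;

then hits the small partial index instead of sequentially scanning
every key/value pair in the mailbox.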
The rest of this is a bit insinuatory, but it underscores a personal
frustration with many open source applications: it drives me bonkers
(ok, I'm being polite... way worse than bonkers) that people willingly
fall into the MySQL trap.
Basically, there is no excuse for any part of dbmail to be slow,
regardless of how small or big the data set in dbmail is... the only
problem is that dbmail chooses to retain compatibility w/ some other
database (*cough*MySQL*cough*) and as a result essentially lobotomizes
any efficient or performance-friendly data management scheme. :( The
case above is a good example of how something could be made fast, but
hasn't been, because the developers are coming from a strict SQL world
that doesn't think in terms of the database doing work for the
developer (ie, stored procedures/triggers, etc.). Databases are fast
if you let them do what they're best at... and can be a dog to work
with if a developer's ability to use them effectively is stripped away
in the name of cross-database compatibility.
I'd also like to state that database compatibility can be achieved,
but it has to be done at a higher level than it is currently. Right
now dbmail is compensating for a certain RDBMS's lack of
functionality, and performance greatly suffers as a result. If the
abstraction were pushed higher, such that the data manipulation work
dbmail does now would only be done in the MySQL case, dbmail could be
made much more efficient. For example: sending data to a client
(dbmail) so the client can rifle through the data in order to make a
decision, then going back to the database to get the final answer to
send to the mail client... *puke* Fundamentally non-optimal, but a
necessary approach in a MySQL world.
Abstracting to the point that the logic would allow the use of
prepared statements, stored procedures, triggers, or even data caching
would be a good thing and would prevent *many* round trips from dbmail
to the database. In an ideal world for PostgreSQL users, dbmail would
simply translate IMAP or POP3 into SQL function calls and the database
would handle everything else (ie, SELECT get_msg(...), SELECT
insert_msg(${username}, ${message}), etc.).
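As a minimal sketch of what that could look like (the messages table,
its columns, and the sequence name are all made up for illustration,
not dbmail's actual schema):

    CREATE FUNCTION insert_msg(username TEXT, message TEXT)
    RETURNS INTEGER AS $$
    BEGIN
        -- Parsing, quota checks, header filing... all of it happens
        -- here, server-side, in a single round trip.
        INSERT INTO messages (owner, body) VALUES (username, message);
        RETURN currval('messages_id_seq');
    END;
    $$ LANGUAGE plpgsql;

dbmail's whole delivery path then collapses to:

    SELECT insert_msg('some_user', '...raw message...');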
Remember, the DBMS in RDBMS stands for database management system. Let
the database manage the data. Right now dbmail is managing the data
because MySQL can't. -sc
--
Sean Chittenden