Here is the analysis of the query:
---------
explain analyze SELECT distinct ip FROM rawdata WHERE sid='2';

<puts_on_database_consulting_hat/>

Ah, the hat is XHTML 1.0 Strict, I see ;-)

Err...  XML strict, but close enough.  :)

This never came up in our discussions of whether to use one or two tables for fastheaders (one table with key/value pairs, or two tables in an id/key + keyid/value relationship)... but it seems to me that we'd have a problem if 10% of the keys were From, another 10% To, then Subject, Date, Received... and the last 10% a bunch of random weird headers from spam checkers or whatever else. (Obviously these are example figures; there are many more common headers.) Trying to search for all keys in a given mailbox matching such and such criteria might also trigger a sequential scan of death, no?

Absolutely!!! But there are a few tricks worth implementing and a few things to keep in mind:

1) PostgreSQL's planner isn't stupid and knows how to work with large data sets very well.

2) PostgreSQL supports partial indexes (ie: CREATE INDEX tbl_col_idx ON tbl (col) WHERE bar = 123; CREATE INDEX ...) and its planner makes active use of this information.

3) The full set of email header names would likely fit on a single disk page, which makes it an ideal candidate for its own table, with an ID referencing the header table. Pg would then do an index scan on the header table, grab the id, and use the partial index support to selectively grab certain headers insanely fast (see the sketch after this list).

4) There's a table partitioning trick using inheritance that can be used too, but I'll save that rabbit for later.
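To make (2) and (3) concrete, here's a minimal sketch of what I mean; the table and column names (headername, headervalue, message_id) are illustrative, not dbmail's actual schema:

    -- Header names are few enough to live in their own tiny table.
    CREATE TABLE headername (
        id   SERIAL PRIMARY KEY,
        name TEXT NOT NULL UNIQUE      -- 'From', 'To', 'Subject', ...
    );

    CREATE TABLE headervalue (
        message_id    BIGINT  NOT NULL,
        headername_id INTEGER NOT NULL REFERENCES headername (id),
        value         TEXT    NOT NULL
    );

    -- One partial index per hot header; the planner picks it up
    -- automatically whenever the WHERE clause matches.  Assume the
    -- 'From' header got id 1.
    CREATE INDEX headervalue_from_idx
        ON headervalue (message_id)
        WHERE headername_id = 1;

    -- Hits the partial index instead of seq scanning every header row:
    SELECT value
      FROM headervalue
     WHERE headername_id = 1
       AND message_id = 12345;

The weird one-off headers simply don't get a partial index and fall back to whatever general index you give them, so they can't drag the common cases down.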

The rest is a bit insinuatory, but it underscores a personal frustration with many open source applications: it drives me bonkers (ok, I'm being polite... way worse than bonkers) that people willingly fall into the MySQL trap.

Basically, there is no excuse for any part of dbmail to be slow, regardless of how small or big the data set in dbmail is... the only problem is that dbmail chooses to retain compatibility with some other database (*cough*MySQL*cough*) and as a result essentially lobotomizes any efficient or performance-friendly data management scheme. :( The case above is a good example of how something can be made fast, but hasn't been, because they're coming from a strict SQL world that doesn't think in terms of the database doing work for the developer (ie, stored procedures, triggers, etc.). Databases are fast if you let them do what they're best at... and a dog to work with if a developer's ability to use them effectively is stripped away in the name of cross-database compatibility.

I'd like to state also that database compatibility can be achieved, but it has to be done at a higher level than it is currently. Right now dbmail is compensating for a certain RDBMS's lack of functionality, and performance greatly suffers as a result. If the abstraction were pushed higher, so that the data manipulation work dbmail does now were only done in the MySQL case, dbmail could be made much more efficient.

For example, sending data to a client (dbmail) so the client can rifle through it in order to make a decision, then going back to the database to get the final answer to send to the mail client... *puke* Fundamentally non-optimal, but a necessary approach in a MySQL world. Abstracting to the point that the logic could use prepared statements, stored procedures, triggers, or even data caching would be a good thing and would save *many* round trips between dbmail and the database. In an ideal world for PostgreSQL users, dbmail would simply translate IMAP or POP3 into SQL function calls and the database would handle everything else (ie, SELECT get_msg(...), SELECT insert_msg(${username}, ${message}), etc.).
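An insert_msg() along these lines would collapse the whole delivery exchange into a single round trip. This is only a sketch against an assumed schema (users, mailboxes, and messages are made-up names here, not dbmail's real tables):

    CREATE OR REPLACE FUNCTION insert_msg(p_username TEXT, p_message TEXT)
    RETURNS BIGINT AS $$
    DECLARE
        v_mailbox_id BIGINT;
        v_message_id BIGINT;
    BEGIN
        -- Resolve the user's INBOX inside the database: no extra
        -- client round trip just to look up an id.
        SELECT m.mailbox_id INTO STRICT v_mailbox_id
          FROM mailboxes m
          JOIN users u ON u.user_id = m.owner_id
         WHERE u.username = p_username
           AND m.name = 'INBOX';

        -- Store the message and hand its id back in the same call.
        INSERT INTO messages (mailbox_id, rawdata)
        VALUES (v_mailbox_id, p_message)
        RETURNING message_id INTO v_message_id;

        RETURN v_message_id;
    END;
    $$ LANGUAGE plpgsql;

    -- dbmail's entire delivery path then becomes:
    --   SELECT insert_msg('alice', '...raw message...');

Quota checks, flag updates, and anything else that has to happen atomically can then live in the function next to the data, instead of being reimplemented in dbmail.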

Remember, the DBMS in RDBMS stands for database management system. Let the database manage the data. Right now dbmail is managing data because MySQL can't. -sc

--
Sean Chittenden
