Re: [Dbmail-dev] Speed-up proposals

Mikhail Ramendik Sun, 24 Oct 2004 10:40:52 +0200 (CEST)

Paul J Stevens wrote:

> > Note: the code is hard to read and I could find no documentation on the
> > database, so I may be wrong here. If so please correct me...
> 
> hard to read... understatement at large. That code is a total bitch. A
> messy heap of shit which makes 
> overcooked spaghetti look like soldiers on parade. I actually read that
> code. Not once, but many times. No 
> kidding, and not funny so stop laughing.


My grand idea would be to rewrite it - in Python. Python has a g-r-e-a-t
MIME parser sitting right in the standard libraries. And as for CPU
usage, just shift everything we can to the SQL engine.
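
For illustration, here is a minimal sketch of the standard-library parsing
mentioned above. It uses Python's `email` package; the sample message and the
helper name `parse_headers` are made up for this example and are not part of
dbmail.

```python
from email import message_from_string

# Made-up sample message, used only for this illustration.
RAW_MESSAGE = """\
From: alice@example.org
To: bob@example.org
Subject: Speed-up proposals
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"

Hello Bob.
"""


def parse_headers(raw):
    """Return the (name, value) header pairs of a raw message, in order."""
    msg = message_from_string(raw)
    return msg.items()


headers = parse_headers(RAW_MESSAGE)
body = message_from_string(RAW_MESSAGE).get_payload()

for name, value in headers:
    print("%s: %s" % (name, value))
```

The parser keeps the headers in their original order, which matters if they
are later stored and have to be reproduced as received.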

> > (2) Create an index for the 'messageblk' field in dbmail_messageblks. At
> > least MySQL allows this (can anyone tell me if Postgres does?). Then, on
> > IMAP header field searches, do not load/parse/check; instead, create a
> > regular expression  and do a SELECT with it, selecting only header
> > blocks. (MySQL specific again, Postgres comments welcome).
> 
> well, the messageblks can become largish, and mysql also has some
> limit here. Too bad innodb doesn't support 
> full-text searches.

As far as I understand, it does allow text searches with regular
expressions, just without using indexes. Slow, but not as slow as the
full read-and-parse loop.

Or am I wrong here?
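
A rough sketch of the regular-expression SELECT on header blocks discussed
above. Several things here are assumptions: SQLite (through Python's `sqlite3`
module) stands in for MySQL, the `messageblk_idnr` column name is guessed, the
`is_header` field is assumed to be filled in, and `search_header` is a
hypothetical helper. SQLite has no built-in REGEXP implementation, so one is
registered from Python; MySQL provides its own REGEXP operator.

```python
import re
import sqlite3


def regexp(pattern, value):
    """Back SQLite's 'value REGEXP pattern' with Python's re module."""
    if value is None:
        return 0
    flags = re.IGNORECASE | re.MULTILINE
    return int(re.search(pattern, value, flags) is not None)


conn = sqlite3.connect(":memory:")
conn.create_function("regexp", 2, regexp)

# Simplified stand-in for dbmail_messageblks (column set is an assumption).
conn.execute(
    "CREATE TABLE dbmail_messageblks ("
    " messageblk_idnr INTEGER PRIMARY KEY,"
    " is_header INTEGER NOT NULL,"
    " messageblk TEXT)"
)
conn.executemany(
    "INSERT INTO dbmail_messageblks (is_header, messageblk) VALUES (?, ?)",
    [
        (1, "From: alice@example.org\nSubject: Speed-up proposals\n"),
        (0, "Subject: this line sits in a body block, not a header\n"),
        (1, "From: bob@example.org\nSubject: Unrelated\n"),
    ],
)


def search_header(conn, field, text):
    """Return ids of header blocks where header `field` contains `text`."""
    # '^' anchors at a line start (MULTILINE); '.' does not cross newlines,
    # so the match stays within a single header line.
    pattern = r"^%s:.*%s" % (re.escape(field), re.escape(text))
    rows = conn.execute(
        "SELECT messageblk_idnr FROM dbmail_messageblks"
        " WHERE is_header = 1 AND messageblk REGEXP ?"
        " ORDER BY messageblk_idnr",
        (pattern,),
    )
    return [row[0] for row in rows]


matches = search_header(conn, "Subject", "speed-up")
```

This still scans every header block, so it is not an indexed search, and the
simple pattern ignores folded (multi-line) header values.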

> I wonder what kind of index we get when we add one to the messageblk
> field... I should try that some time. But 
> not on my main development machine :-)

It will faithfully index the first whatever-number-you-state of
characters.

On MyISAM it can give you a full text index.

To reduce the index size one could move the header messageblks to a
separate table. And, make that table (and it alone) MyISAM, and go for
full text indexes, eh? It's a hack, true. 

I also have an idea that would not be a hack, in full accordance with
database theory: a Header_Fields table for header field names,
Header_Values for header field values (referencing Header_Fields), and
Message_Headers for referencing Header_Values for every message (with a
sort field too). From these, the headers for a message can be recovered
exactly as they were - and at the same time every header search and
fetch is a quick, fully indexed query.
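
One possible shape for the three tables described above, sketched with
Python's `sqlite3` module. Only the table names come from the proposal; every
column name, the index, and the helper functions are assumptions made for
illustration, and SQLite stands in for MySQL or Postgres.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Header_Fields (
    field_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE
);
CREATE TABLE Header_Values (
    value_id INTEGER PRIMARY KEY,
    field_id INTEGER NOT NULL REFERENCES Header_Fields(field_id),
    value    TEXT NOT NULL,
    UNIQUE (field_id, value)
);
CREATE TABLE Message_Headers (
    message_id INTEGER NOT NULL,
    sort_order INTEGER NOT NULL,  -- the "sort field": position in the message
    value_id   INTEGER NOT NULL REFERENCES Header_Values(value_id),
    PRIMARY KEY (message_id, sort_order)
);
CREATE INDEX idx_message_headers_value ON Message_Headers(value_id);
"""


def store_headers(conn, message_id, headers):
    """Store an ordered list of (name, value) pairs for one message."""
    for order, (name, value) in enumerate(headers):
        conn.execute(
            "INSERT OR IGNORE INTO Header_Fields (name) VALUES (?)", (name,))
        field_id = conn.execute(
            "SELECT field_id FROM Header_Fields WHERE name = ?",
            (name,)).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO Header_Values (field_id, value)"
            " VALUES (?, ?)", (field_id, value))
        value_id = conn.execute(
            "SELECT value_id FROM Header_Values"
            " WHERE field_id = ? AND value = ?",
            (field_id, value)).fetchone()[0]
        conn.execute(
            "INSERT INTO Message_Headers (message_id, sort_order, value_id)"
            " VALUES (?, ?, ?)", (message_id, order, value_id))


def fetch_headers(conn, message_id):
    """Recover the headers of one message in their original order."""
    return conn.execute(
        "SELECT f.name, v.value FROM Message_Headers m"
        " JOIN Header_Values v ON v.value_id = m.value_id"
        " JOIN Header_Fields f ON f.field_id = v.field_id"
        " WHERE m.message_id = ? ORDER BY m.sort_order",
        (message_id,)).fetchall()


def find_messages(conn, name, value):
    """Return ids of messages that carry exactly this header value."""
    rows = conn.execute(
        "SELECT DISTINCT m.message_id FROM Message_Headers m"
        " JOIN Header_Values v ON v.value_id = m.value_id"
        " JOIN Header_Fields f ON f.field_id = v.field_id"
        " WHERE f.name = ? AND v.value = ? ORDER BY m.message_id",
        (name, value))
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
sample = [("From", "alice@example.org"), ("Subject", "Speed-up proposals")]
store_headers(conn, 1, sample)
store_headers(conn, 2, [("From", "alice@example.org"), ("Subject", "Other")])
```

Identical values (such as the same From address on many messages) are stored
once and shared, and `fetch_headers` returns the pairs in stored order.
Reproducing the original bytes exactly would also need the raw line folding
and whitespace to be kept, which this sketch leaves out.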

The only problem is - I'm afraid this idea won't fit into the existing
code. The coding here is somewhat complicated. Well, perhaps someone who
knows the code well can do it. If you want I can design the queries, but
I won't touch the fetch and search code in dbmail with a flag pole.

> > This will probably result in a *dramatic* speedup, at the cost of some
> > coding complexity. 
> 
> This would actually simplify a lot of code.

Well, I just tend to think that regular expressions = complexity. And
the "hacky" variant relies on them heavily. But on the other hand, the
parse-and-check loop may represent even more complexity. 

> > BTW, why is the "is_header" field unused in dbmail_messageblks? Or at
> > least I found no place in the code that would use it. Is it redundant?
> 
> Forward compatibility. I did commit a patch to cvs-head to start using
> this field for message insertions, and 
> have posted a bash/awk script that will fill this field for existing
> messageblk rows.

Posted where?

> > P.S. To be very honest, I did not like the coding style in dbmail. The C
> > language for this task would not be my choice, but never mind that.
> > Functions that are hundreds of lines long, with only minimal comments,
> > are much more problematic.
> 
> The original author was probably a very smart guy :-) but a not so
> very experienced programmer.

Agreed. I actually recognized the coding style - I wrote that way in
high school. (Only in Pascal, so it was a little bit more readable).

> I actually volunteered to work on that code last summer :-/ and I've
> already begun splitting up some of those 
> functions. The recent addition of struct ImapSession in
> dbmail-imapsession.c is phase one of that refactoring. 
> The next phase (splitting up and cleaning _ic_fetch) is well underway
> and currently being tested. Still much 
> remains to be fixed, simplified, cleaned-up, etc, etc.

Well, I'm not so sure this can work, but then I might simply be
frightened by the current code. And influenced by Eric Raymond quoting
Fred Brooks - "Plan to throw one away; you will, anyhow".

Besides, my Real Grand Idea would be to use dbmail for local storage,
ultimately building a direct interface to it into at least one mail
reading program. (It's not as crazy as it sounds, when one has nearly a
gigabyte of mail lying around - like I do.)

And that would require a *clean* separation of functions, just Not
Present in existing code. For example, my idea would involve a search
and a fetch defined as functions, and separate IMAP parsing functions
calling them.
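
A small sketch of that separation, under stated assumptions: `MemoryStore` and
`handle_command` are hypothetical names, the store keeps headers in a plain
dictionary instead of a database, and the parser understands only two
simplified command forms rather than real IMAP syntax in full.

```python
import shlex


class MemoryStore:
    """Hypothetical storage layer; it knows nothing about IMAP."""

    def __init__(self):
        self._headers = {}  # message id -> list of (name, value) pairs

    def add(self, message_id, headers):
        self._headers[message_id] = list(headers)

    def search(self, field, text):
        """Return sorted ids of messages whose `field` header contains `text`."""
        field, text = field.lower(), text.lower()
        return sorted(
            mid for mid, headers in self._headers.items()
            if any(n.lower() == field and text in v.lower()
                   for n, v in headers)
        )

    def fetch(self, message_id):
        """Return the stored headers of one message."""
        return list(self._headers[message_id])


def handle_command(store, line):
    """Parse a tiny command subset and delegate the work to the store.

    Understands only 'TAG SEARCH HEADER <field> <text>' and
    'TAG FETCH <id> ...'; anything after the id in FETCH is ignored.
    """
    parts = shlex.split(line)
    tag, command, args = parts[0], parts[1].upper(), parts[2:]
    if command == "SEARCH" and len(args) == 3 and args[0].upper() == "HEADER":
        return tag, store.search(args[1], args[2])
    if command == "FETCH" and args:
        return tag, store.fetch(int(args[0]))
    raise ValueError("unsupported command: %r" % line)


store = MemoryStore()
store.add(1, [("From", "alice@example.org"),
              ("Subject", "Speed-up proposals")])
store.add(2, [("From", "bob@example.org"), ("Subject", "Unrelated")])

search_result = handle_command(store, 'A001 SEARCH HEADER Subject "speed-up"')
fetch_result = handle_command(store, "A002 FETCH 2 (BODY[HEADER])")
```

Because `search` and `fetch` take plain arguments, a mail reader could call
them directly without going through any protocol parsing at all.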

> > Perl or Python (I prefer Python)  with their ready-made RFC822 parsing,
> > along with some DB-expert friends, would help me write an alternative
> > database store quickly. But I really am not up to implementing protocols
> > (SMTP, LMTP, POP3, IMAP). If any Python guru here would go for that, we
> > could try a rewrite :)
> 
> I actually started out working on dbmail while studying the twisted
> framework which has finished imap and pop3 
> implementations. Should be quite easy to write sql-based storage
> engines for those interfaces.

Well, then there could be a compromise proposal. Dump-and-rewrite
storage while refactoring imap and pop3, and introduce a clean interface
between them along the way. Even existing _ic_fetch (yuck!) code could be
used for parsing arguments - and then passing them to a well defined
interface.

> But there are some advantages to refactoring the current code. And c
> is just another language to write code 
> in. With glib for datatypes, and gmime for message-parsing that code
> could actually become more fun not only 
> to write, but better yet, to read again later.

Well, if you do use gmime for message parsing, then probably yes. C has
deficiencies in string handling, but one can get around them. 

> With the low level of output from Ilja recently, I am becoming
> somewhat concerned that maybe IC&S have decided 
> to sink their investment in dbmail.

This may be THE critical question. If Ilja goes on with this code, then
it's worth refactoring (especially since a rewrite might mean losing
that).

> Should I find myself alone in actually working on the current codebase
> I will re-evaluate dbmail's viability, 
> and may indeed decide to either go for a complete rewrite in python or
> move on to other projects.

If you go for Python, you're not alone.

I'm actually not really a programmer, I'm a technical writer who can
also program. But this has advantages. If a team for a rewrite shows up,
then before any coding (except table definitions and example queries in
SQL) I will start with clear documentation on both the database and the
interfaces. Might work wonders for clean code. 

And when that is complete, I can do some coding on database storage. But
protocol implementation is a greater question. At least POP3 and IMAP.
Scratch that - at least IMAP, we could live without POP3 for some time.

Yours, Mikhail Ramendik
