Il 12/01/2010 10:30, Magnus Hagander ha scritto:
The problem is usually with strange looking emails with 15 different
MIME types. If we can figure out the proper way to render that, the
rest really is just a SMOP.

Yeah, I was expecting some, but all the message I've looked at seemed to be working ok.

(BTW, for something to actually be used In Production (TM), we want
something that uses one of our existing frameworks. So don't go
overboard in code-wise implementations on something else - proof of
concept on something else is always ok, of course)

OK, that's something I didn't know, even though I expected some kind of limitations. Could you please elaborate a bit more (i.e. where to find info)?

Having played with it, here's my feedback about AOX:

pros:
- seemed to be working reliably;
- does most of the dirty job of parsing emails, splitting parts, etc
- highly normalized schema
- thread support (partial?)

cons:
- directly publishing the live email feed might not be desirable
- queries might end up being a bit complicate for simple tasks
- might be not easy to add additional processing in the workflow


So just to put this into perspective and give anyone paying attention
an idea of the pain that lies ahead should they decide to work on
this:

- We need to import the old archives (of which there are hundreds of
thousands of messages, the first few years of which have, umm, minimal
headers.

Anyone having a local copy of this in his mailboxes? At some point there
were some NNTP gateway, so maybe there's a copy this way.

We have MBOX files.

IIRC, aox has an import function that can read MBOX files. The
interesting thing is what happens with the really old files that don't
have complete headers.

I don't think you can trust the NNTP gateway now or in the past,
messages are sometimes lost there. The mbox files are as complete as
anything we'll ever get.

Importing the whole pgsql-www archive with a perl script that bounces messages via SMTP took about 30m. Maybe there's even a way to skip SMTP, I haven't looked into it that much.

- We need to generate thread indexes

We have CTEs :)

Right. We still need the threading information, so we have something
to use our CTEs on :-)

But I assume that AOX already does this?

there are thread related tables and they seem to get filled when a SORT IMAP command is issued, however I haven't found a way to get the hierarchy out of them.

What that means is that we'd need some kind of post processing to populate a thread hierarchy.

If there isn't a fully usable thread hierarchy I was more thinking to ltree, mainly because I've successfully used it in past and I haven't had enough time yet to look at CTEs. But if performance is comparable I don't see a reason why we shouldn't use them.

- We need to re-generate the original URLs for backwards compatibility

I guess the message-id one ain't the tricky one... and it should be
possible to fill a relation table like
  monharc_compat(message_id, list, year, month, message_number);

Yeah. It's not so hard, you can just screen-scrape the current
archives the same way the search server does.

Definitely an easy enough task.

With all that said, I can't promise anything as it all depends on how much spare time I have, but I can proceed with the evaluation if you think it's useful. I have a feeling that AOX is not truly the right tool for the job, but we might be able to customise it to suit our needs. Are there any other requirements that weren't specified?


Cheers
--
Matteo Beccati

Development & Consulting - http://www.beccati.com/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to