Hi all,
I believe that Roy has mentioned to some of you that I've been
working on a module that will process mbox archives and display
them in a nice format on the web with some other cool features.
Well, I think that we are at a stage where we would like some feedback
from the Apache community. It has progressed to the point where I think it is
stable and feature-complete. Everyone I have shown it to so far has
given positive feedback. Now, for the real critics...
You may see mod_mbox in action at:
http://www.apachelabs.org/
I currently have the entire new-httpd and apr-dev archives on there.
Note that this month's archive of both these lists is from a few days
ago.
I also have ht://Dig running which should allow searching of the
archives. Please feel free to hammer the box. I'm not exactly
sure how efficient ht://Dig is, but it seems to work reasonably
well (the search databases are too large for my taste, though).
The current snapshot of the mod_mbox code is on the website. mod_mbox
is an Apache-2.0 module. The indexing programs use only APR. Note
that I do not currently have access to Win32 platforms - it may not
compile on there, but I doubt that there is anything too platform
specific - it is all based on APR. I have tested this on Linux,
FreeBSD, and Solaris.
You take your mbox file and generate the index (see the provided
generate_index.c file). This creates all of the DBMs necessary for
mod_mbox. Simply add "AddHandler mbox-file .mbox" to your httpd.conf
(or use another mechanism that achieves the same goal of setting the
handler to either mbox-file or mbox-handler) and you are up with mod_mbox.
Due to the current build system, it is not particularly
straightforward to build an external module with dependent objects.
I have tried to include enough "hints" in the tarball to provide
guidelines as to building mod_mbox from the source. I don't intend
for what is on apachelabs.org to be a "release," but rather a
"snapshot."
mod_mbox has the advantage over MHonArc in that it will only index
the mbox file when you explicitly tell it to (use the generate_index
program) rather than when a new message is delivered. Here at eBuilt,
we've had to alter our internal mailing-list archival strategy to
compensate for the fact that MHonArc cannot handle large lists
well. Ideally, mod_mbox scales better. generate_index on a 750MB
mbox file takes about two or three minutes (Sun U5/360). The only
storage explicitly required for mod_mbox is the DBMs. And, with
such a high-traffic list, you can run the index a few times a day
rather than when each new message is delivered.
I do believe that Roy intends to check mod_mbox into the httpd-2.0
and apr-util trees so that it becomes part of the standard Apache
distribution. Since I don't have commit access, please don't discuss
the merits of mod_mbox's inclusion with me (I'm biased anyway). =-)
I do think a lot of sites would find this incredibly useful - in my
opinion, apache.org is number one on this list.
Note that we intend to convert parts of the display logic to filters,
but that really shouldn't affect the majority of the mbox code and what
it displays (just how). I think this is a good time to gather feedback on
what we have so far.
Now, to provide an overview of the mod_mbox module (functionally and
architecturally):
There are two real components to mod_mbox. The first is mod_mbox.c
which is the actual Apache module. Currently, there is not much to
this file - it is basically a wrapper around the other files. This
file handles the displaying of the actual message. mod_mbox is
intended to be a handler ("mbox-file" and "mbox-handler") and
produces a "virtual namespace" that the user can browse.
There are two main URIs of interest for each mbox:
http://foo.example.com/your.mbox/index.html
http://foo.example.com/your.mbox/threads.html
The default index is sorted by date, and the threading index is
sorted by date as well. (I'll explain how the threading works
later.) The indexes provide links based on the message-id into the
mbox file of the format:
http://foo.example.com/your.mbox/message-id
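
To make the namespace concrete, here is a rough Python sketch of how the
part of the URI after the mbox file name could be classified. This is not
the mod_mbox code (that is C); the route() function, its return values,
and the choice to treat an empty path as the date index are assumptions
made for illustration only.

```python
from urllib.parse import unquote


def route(path_info):
    """Classify the part of the URI that follows the mbox file name.

    Returns a (view, message_id) pair. The view names are hypothetical.
    """
    name = path_info.lstrip("/")
    # Assumption: an empty path falls back to the date-sorted index.
    if name in ("", "index.html"):
        return ("index", None)
    if name == "threads.html":
        return ("threads", None)
    # Anything else is treated as a URL-escaped Message-ID.
    return ("message", unquote(name))


print(route("/index.html"))
print(route("/threads.html"))
print(route("/%3C200107010000.AA123%40example.com%3E"))
```

The message-id in the last call is made up; a real one would come from
the Message-ID header of a message in the archive.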
All of the other files constitute the core of the mbox functionality
(parsing, threading, sorting, etc.). My intention is that these could
be placed within apr-util. mod_mbox uses DBMs to "cache" all of the
relevant information about the mbox (date, subject, from, references,
offset within the original file, etc.). This makes the display of the
index and retrieval of a message fairly efficient while retaining the
original archive.
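
The idea of caching headers plus a file offset can be sketched in a few
lines of Python. This is only an illustration under some loud
assumptions: a plain dict stands in for the DBM, the record field names
are invented, and a message boundary is taken to be any line starting
with "From " (real mbox files need more care than that). It is not how
generate_index.c is written.

```python
import email
import re

# Assumption: every line beginning with "From " starts a new message.
FROM_LINE = re.compile(rb"^From ", re.MULTILINE)


def build_index(mbox_bytes):
    """Map Message-ID -> cached header fields plus byte offset in the mbox."""
    starts = [m.start() for m in FROM_LINE.finditer(mbox_bytes)]
    ends = starts[1:] + [len(mbox_bytes)]
    index = {}  # stand-in for the DBM
    for begin, end in zip(starts, ends):
        # Skip the "From " separator line, then parse the message headers.
        header_start = mbox_bytes.index(b"\n", begin) + 1
        msg = email.message_from_bytes(mbox_bytes[header_start:end])
        index[msg["Message-ID"]] = {
            "offset": begin,
            "length": end - begin,
            "date": msg["Date"],
            "subject": msg["Subject"],
            "from": msg["From"],
            "references": (msg["References"] or "").split(),
        }
    return index


def fetch(mbox_bytes, index, message_id):
    """Return the raw message by seeking to its cached offset."""
    rec = index[message_id]
    return mbox_bytes[rec["offset"]:rec["offset"] + rec["length"]]


SAMPLE = (
    b"From alice@example.com Mon Jul  2 10:00:00 2001\n"
    b"Message-ID: <1@example.com>\n"
    b"From: alice@example.com\n"
    b"Subject: mod_mbox feedback\n"
    b"Date: Mon, 2 Jul 2001 10:00:00 -0700\n"
    b"\n"
    b"First message.\n"
    b"\n"
    b"From bob@example.com Mon Jul  2 11:00:00 2001\n"
    b"Message-ID: <2@example.com>\n"
    b"References: <1@example.com>\n"
    b"From: bob@example.com\n"
    b"Subject: Re: mod_mbox feedback\n"
    b"Date: Mon, 2 Jul 2001 11:00:00 -0700\n"
    b"\n"
    b"A reply.\n"
)

INDEX = build_index(SAMPLE)
print(INDEX["<2@example.com>"]["subject"])
```

Because only the offset and a handful of headers are stored, the index
page can be built from the cache alone, and the original mbox file is
read only when a single message is requested.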
Note that I have only tried it with the SDBM included in apr-util - I
imagine that it'd work with Sleepycat DB and GDBM (apr-dbm has hooks
for these, but part of this project was to test out the
httpd/apr/apr-util code).
The other key functionality is the threading algorithm. I based
my threading implementation on Jamie Zawinski's mail threading
algorithms (he wrote the original versions of Netscape Mail - see
http://www.jwz.org/doc/threading.html). His key point was not to
store the threading tree in the database, but generate the tree on
the fly. It has proved to be very efficient and highly accurate.
Note that I did not use any of his code - I only used his description
of the algorithm. This portion of the code is quite complex (although
I wrote it in a span of 24 hours). I have managed to test it with
threads I know (with our internal mailing lists) and it seems reasonably
accurate. Subtle bugs may still exist. If you find one, any help
tracking it down would be greatly appreciated.
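
For anyone who has not read the jwz write-up, the core of "generate the
tree on the fly" looks roughly like the Python sketch below. It is a
simplified illustration, not the mod_mbox implementation: it links
messages by their References chains only, and leaves out the pruning of
empty containers and the grouping by subject that the full description
covers. The Container class, the dict layout of a message, and the
integer dates are all made up for the example.

```python
class Container:
    """A node in the thread tree; message stays None if only referenced."""

    def __init__(self, msg_id):
        self.msg_id = msg_id
        self.message = None
        self.parent = None
        self.children = []


def is_ancestor(candidate, node):
    """True if candidate is node itself or one of node's ancestors."""
    while node is not None:
        if node is candidate:
            return True
        node = node.parent
    return False


def link(parent, child):
    """Attach child under parent unless it has a parent or would form a loop."""
    if child.parent is not None or is_ancestor(child, parent):
        return
    child.parent = parent
    parent.children.append(child)


def build_threads(messages):
    """messages: dicts with 'id', 'references' (oldest first) and 'date'."""
    table = {}

    def get(msg_id):
        if msg_id not in table:
            table[msg_id] = Container(msg_id)
        return table[msg_id]

    for msg in messages:
        node = get(msg["id"])
        node.message = msg
        chain = [get(ref) for ref in msg["references"]]
        # Each referenced message is the parent of the one after it.
        for parent, child in zip(chain, chain[1:]):
            link(parent, child)
        if chain:
            # The message's own References win over an earlier guess.
            if node.parent is not None:
                node.parent.children.remove(node)
                node.parent = None
            link(chain[-1], node)
    return [c for c in table.values() if c.parent is None]


def flatten(nodes, depth=0):
    """Yield (depth, msg_id) pairs, siblings ordered by date."""
    def by_date(c):
        return c.message["date"] if c.message else 0

    for node in sorted(nodes, key=by_date):
        yield depth, node.msg_id
        yield from flatten(node.children, depth + 1)


MESSAGES = [
    {"id": "<c>", "references": ["<a>", "<b>"], "date": 3},
    {"id": "<a>", "references": [], "date": 1},
    {"id": "<b>", "references": ["<a>"], "date": 2},
    {"id": "<d>", "references": ["<a>"], "date": 4},
]

for depth, msg_id in flatten(build_threads(MESSAGES)):
    print("  " * depth + msg_id)
```

Note that the messages arrive out of order here (the reply <c> comes
before the message it answers) and the tree still comes out right,
since nothing about the tree has to be stored ahead of time.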
For the rest of the implementation details, please see the source code.
Open source is nice that way.
I look forward to hearing any comments or suggestions y'all might have.
Thanks in advance,
Justin Erenkrantz
[EMAIL PROTECTED]