Thank you all for the comments; they have been really helpful in coming up with a more concrete plan. I'll reply to this mail, which is a bit more detailed.

Danny Angus wrote:
Hi,

The mail is consciously written out in a single field to avoid the overhead of unnecessary processing
...
On the whole the only action James takes is to re-create a Mail Object from this data.
Reading this, I came to think that actually the indexing could/in some cases should be done on a separate thread. This means that the mail could still be stored in the fastest way possible, and indexed later.
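
A minimal sketch of the "store now, index later" idea, assuming a single background worker thread. Every name here (DeferredIndexer, store, index) is hypothetical and not part of James or Lucene; the maps only stand in for the repository and the index.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: none of these names exist in James or Lucene.
final class DeferredIndexer {
    // Stand-in for the repository: raw message text keyed by mail name.
    private final Map<String, String> stored = new ConcurrentHashMap<>();
    // Stand-in for the index: keys whose text has been indexed.
    private final Set<String> indexed = ConcurrentHashMap.newKeySet();
    // One worker thread, so indexing never runs on the storing thread.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // Store first (the fast path), then queue the slower indexing work.
    void store(String key, String rawMessage) {
        stored.put(key, rawMessage);
        worker.execute(() -> index(key));
    }

    // Placeholder for the real indexing call (e.g. handing the text to Lucene).
    private void index(String key) {
        if (stored.containsKey(key)) {
            indexed.add(key);
        }
    }

    boolean isStored(String key) { return stored.containsKey(key); }

    boolean isIndexed(String key) { return indexed.contains(key); }

    // Stop accepting work and wait for the queued indexing to finish.
    boolean shutdownAndWait(long seconds) {
        worker.shutdown();
        try {
            return worker.awaitTermination(seconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The mail is visible in the store as soon as store() returns; it only becomes searchable once the worker has caught up, which is the trade-off of indexing off the delivery path.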

On the other hand you might also use Lucene.
So I looked up Lucene... but more on this below.

As far as splitting the message up is concerned: with MIME supporting a nested structure, it is not really practical to attempt to flatten this out. It would require a relational data structure and recursive processing, which are too expensive in performance without some good justification. Extracting common headers would be less expensive, but simply because James itself has no need of it, we haven't provided it.
Ok, so this means, reinforcing the first comment, that an enhanced DB repo could store headers. This is still not enough for a full search capability, so we would still need to index those with Lucene, so that a mixed headers+message_body search can be done.
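
To show why extracting common headers is the cheap part, here is a rough sketch that reads only the top-level header block of a raw message and never descends into nested MIME parts. HeaderExtractor and topLevelHeaders are hypothetical names, and the parsing is simplified (folded lines are unfolded, repeated headers are joined with a comma); a real implementation would more likely go through JavaMail.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of James: simplified top-level header parsing.
final class HeaderExtractor {
    // Returns the headers (lower-cased names) found before the first blank line.
    // Nested MIME parts are deliberately left alone.
    static Map<String, String> topLevelHeaders(String rawMessage) {
        Map<String, String> headers = new LinkedHashMap<>();
        String current = null;
        for (String line : rawMessage.split("\r?\n", -1)) {
            if (line.isEmpty()) {
                break; // a blank line ends the header block
            }
            char first = line.charAt(0);
            if (first == ' ' || first == '\t') {
                // Folded continuation line: append to the previous header, if any.
                if (current != null) {
                    headers.put(current, headers.get(current) + " " + line.trim());
                }
                continue;
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                current = null; // malformed line: skip it
                continue;
            }
            current = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            // Repeated headers (e.g. Received) are joined rather than overwritten.
            headers.merge(current, value, (a, b) -> a + ", " + b);
        }
        return headers;
    }
}
```

The resulting map could be written to extra columns by an enhanced DB repository, or handed to the indexer as separate fields next to the body text.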

There is a third, much more time consuming route which you may or may not like, which is to create your own repository class.
(which we'd love to see if you wanted to contribute it)
Why? What would be the need, given that a Lucene index could do it all?

The architecture is intended to be extensible in this respect, but in practice it may not be as straightforward as it seems.

Extend org.apache.james.mailrepository.JDBCMailRepository and override store() and retrieve() to support a different data structure.
As I hinted, this class isn't well suited to having these methods overridden; it could use some refactoring, and you may well need to cut'n'paste some of the code to get connections from James' pool and such like, but it would probably give you a more complete solution than just using TEXT.

Be warned, however, that repositories are not (yet?) part of the Mailet API, they're internal to James, and future changes to configuration and more complex repository interfaces may break your repository without warning.
(It's one of my goals for v3 to add repository specifications to the Mailet API, but this might not help repository developers, only mailet developers.)

To use your implementation you would then edit config.xml and either replace the classname, or create a whole new protocol for your repository, and specify it as normal for your archive mailet using a URL.
This is the node you're looking for:
<repository class="org.apache.james.mailrepository.JDBCMailRepository">
  <protocols>
    <protocol>db</protocol>
  </protocols>
  <types>
    <type>MAIL</type>
  </types>
  <config>
    <sqlFile>file://conf/sqlResources.xml</sqlFile>
  </config>
</repository>

I've taken a deep look at the JDBCMailRepository, and it can be done, although, as you seem to imply, in a hackish cut'n'paste way.

My final decision is to make a mailet that indexes the passing mail with Lucene, so that the Lucene index can be used to search the mails in full.
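
To make the intended search behaviour concrete, here is a toy in-memory inverted index with named fields. It is only a stand-in for what Lucene would provide and does not use the Lucene API; ToyMailIndex, add and search are hypothetical names, and the tokenizer simply splits on non-alphanumeric characters.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy stand-in for a Lucene index; hypothetical names, not the Lucene API.
final class ToyMailIndex {
    // "field:term" -> keys of the mails containing that term in that field.
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Index one field (e.g. "subject" or "body") of one mail.
    void add(String mailKey, String field, String text) {
        String prefix = field.toLowerCase(Locale.ROOT) + ":";
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(prefix + term, k -> new HashSet<>()).add(mailKey);
            }
        }
    }

    // AND query: every "field:term" clause must match the same mail.
    Set<String> search(String... clauses) {
        Set<String> result = null;
        for (String clause : clauses) {
            Set<String> hits = postings.getOrDefault(
                    clause.toLowerCase(Locale.ROOT), Collections.<String>emptySet());
            if (result == null) {
                result = new TreeSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? new TreeSet<String>() : result;
    }
}
```

A mailet would call add() once per header of interest and once for the body of each passing mail; a query such as search("subject:lucene", "body:mailet") then mixes header and body terms in a single lookup.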

Later, when it's all working, we can decouple the indexing and make it asynch, so that it triggers at defined times and indexes new mails.

What does this look like?

--
Nicola Ken Barozzi [EMAIL PROTECTED]
- verba volant, scripta manent -
(discussions get forgotten, just code remains)