We're planning to archive email over many years and have been looking at either
using a DB to store mail metadata with Lucene for the indexed mail data, or
Lucene on its own, with email data and structure stored as XML and the raw
message stored in the file system.
For some customers, the volumes are likely to be well over 1 billion mails over
10 years, so some partitioning of data is needed. At the moment our thinking
is moving away from DB + Lucene towards just Lucene, along with a file system
representation of the complete message. All searches will run against the index,
and the XML mail metadata is then loaded from the file system.
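
To make the partitioning idea concrete, here is a minimal sketch, assuming one
index partition per customer per calendar month. That scheme, and the names
partition_for and partitions_for_range, are only illustrations and not
something we have settled on. At 1 billion mails over 10 years, a monthly
split would average roughly 8 million mails per partition.

```python
from datetime import date


def partition_for(customer_id: str, received: date) -> str:
    """Name of the (hypothetical) partition holding a mail received on this date."""
    return f"{customer_id}/{received.year:04d}-{received.month:02d}"


def partitions_for_range(customer_id: str, start: date, end: date) -> list[str]:
    """All monthly partitions a search over [start, end] would have to consult."""
    names = []
    year, month = start.year, start.month
    # Walk month by month until we pass the month containing `end`.
    while (year, month) <= (end.year, end.month):
        names.append(f"{customer_id}/{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names
```

A search restricted to a date range would then only open the partitions in
that range, and a bulk delete of old mail could drop whole partitions.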
The archive is read-only apart from bulk deletes, but one of the requirements is
for users to be able to label their own mail. Given that a Lucene Document
cannot be updated in place, I have thought about having a separate Lucene index
that has just the 3 terms (or some combination of) userId + mailId + label.
That of course would mean joining searches across the main mail data index and
the label index.
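
The join I have in mind looks roughly like the sketch below. It is plain
Python with an in-memory dictionary standing in for the label index, not
Lucene API code; LabelIndex and join_hits are hypothetical names used only to
show the shape of the lookup.

```python
from collections import defaultdict


class LabelIndex:
    """In-memory stand-in for a separate index of userId + mailId + label."""

    def __init__(self):
        # (user_id, label) -> set of mail ids carrying that label for that user
        self._by_user_label = defaultdict(set)

    def add(self, user_id, mail_id, label):
        self._by_user_label[(user_id, label)].add(mail_id)

    def remove(self, user_id, mail_id, label):
        self._by_user_label[(user_id, label)].discard(mail_id)

    def mail_ids(self, user_id, label):
        return frozenset(self._by_user_label.get((user_id, label), ()))


def join_hits(main_hits, labelled_ids):
    """Keep only the main-index hits that carry the label, preserving rank order."""
    return [mail_id for mail_id in main_hits if mail_id in labelled_ids]


labels = LabelIndex()
labels.add("u1", "m2", "todo")
labels.add("u1", "m9", "todo")
labels.add("u2", "m1", "todo")

# Hits from the main mail index, in ranked order (made-up ids).
main_hits = ["m1", "m2", "m3"]
todo_hits = join_hits(main_hits, labels.mail_ids("u1", "todo"))
```

The main index never changes when a label is added or removed; only the small
label index does. The cost is that every labelled search needs two lookups
plus the intersection.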
Does anyone have any experience of using Lucene this way, and is it a realistic
option for avoiding the DB altogether? I'd rather have the headache of scaling
just Lucene, which is a simple beast, than the whole bundle of 'stuff' that
comes with the database as well.
Antony