fulin tang wrote:
We are going to add full-text search for our mailbox service .

The problem is we have more than 1 PB mails there , and obviously we
don't want to add another PB storage for search service , so we hope
the index data will be small enough for storage while the search keeps
fast .

The lucky is that every user just search with mails of their own , so
we can split the data into a lot of indexes instead of keeping them in
a big one .
If it is going to be sharded by the 'To' or 'Cc' list - then potentially the mail information is going to be duplicated proportional to the number of people in an email thread. Selecting some other dimension like time, for sharding might be useful to begin with.
So, after all these concerns ,  the question is , is lucene a good
choice for this ? or which is the right way to do this ? Does anyone
have done this  before ?

With PB of storage - check out solr sharding / katta for prior work in this arena.

All opinions and comments are welcome !

fulin




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Reply via email to