Hi all,

This may be a bit rambling, but let's see how it goes.  I'm not a Lucene or Solr
guru by any means; I have been prototyping with Solr and working out how all
the pieces and parts fit together.

We are migrating our current document storage infrastructure to a decent-sized
Solr cluster, using 1.3-snapshots right now.  Eventually this will be in the
billion+ document range, with about 1M new documents added per day.

Our main sticking point right now is that a significant number of our documents
will be updated, at least once, but possibly more than once.  The volatility of
a document decreases over time.

With this in mind, we've been considering using a cascading series of shard
clusters.  That is :

 1) A cluster of shards holding recent data (the most recent week or two).
    These are smaller indexes that take little time to commit updates and
    optimise, since this cluster will hold the most volatile documents.

 2) Following that, another cluster of shards that holds relatively recent
    (3-6 months?), but not super volatile, documents.  These are items that
    could potentially receive updates, but generally will not.

 3) A final set of 'archive' shards holding the final resting place for
    documents.  These would not receive updates.  These would be online for
    searching and analysis "forever".
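
As a rough sketch of the three tiers above: the cut-offs of 14 and 180 days are
assumptions taken from the upper ends of the ranges mentioned, the tier names
are made up for illustration, and age is assumed to be measured from a
per-document date.  The second function only multiplies out the 1M/day figure
to show roughly how many documents each of the first two tiers would hold once
things settle.

```python
from datetime import date

# Assumed cut-offs, not settled values.
HOT_DAYS = 14     # "most recent week or two"
WARM_DAYS = 180   # "3-6 months"


def tier_for(doc_date: date, today: date) -> str:
    """Return the (hypothetical) tier name a document of this age belongs to."""
    age_days = (today - doc_date).days
    if age_days < HOT_DAYS:
        return "hot"
    if age_days < WARM_DAYS:
        return "warm"
    return "archive"


def steady_state_docs(docs_per_day: int) -> dict:
    """Approximate document counts in the hot and warm tiers at steady state."""
    return {
        "hot": docs_per_day * HOT_DAYS,
        "warm": docs_per_day * (WARM_DAYS - HOT_DAYS),
    }
```

With 1M new documents per day and these cut-offs, that works out to about 14M
documents in the first tier and 166M in the second; everything older lands in
the archive shards.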

We are not sure if this is the best way to go, but it is the approach we are
leaning toward right now.  I would like some feedback from the folks here if you
think that is a reasonable approach.

One of the other things I'm wondering about is how to manipulate indexes.
We'll need to roll documents around between indexes over time, or at least
migrate indexes from one set of shards to another as the documents 'age' and
merge/aggregate them with more 'stable' indexes.   I know about merging complete
indexes together, but what about migrating a subset of documents from one index
into another index?
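
To make the question concrete, here is one possible way to move a subset at the
application level, as a sketch only.  It assumes every field is stored, so that
a document can be read back out of the source index and re-added whole, and it
assumes field names "id" and "date" with ISO-8601 UTC date strings.  The
function plan_migration is hypothetical; it only builds Solr XML update
messages and does not talk to a server.

```python
import xml.etree.ElementTree as ET


def add_message(docs):
    """Build a Solr <add> message from a list of {field: value} dicts."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            ET.SubElement(doc_el, "field", name=name).text = str(value)
    return ET.tostring(add, encoding="unicode")


def delete_message(doc_id):
    """Build a Solr <delete> message for a single document id."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")


def plan_migration(docs, cutoff):
    """Select docs dated before cutoff; return (add_xml, [delete_xml, ...]).

    Dates are compared as strings, which is only valid when they all share
    the same ISO-8601 UTC format.
    """
    aged = [d for d in docs if d["date"] < cutoff]
    if not aged:
        return None, []
    return add_message(aged), [delete_message(d["id"]) for d in aged]
```

The add message would go to the destination index and be committed there
first; only then would the delete messages be sent to the source index, so a
failure part-way through leaves a duplicate rather than a lost document.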

In addition, what is generally considered a 'manageable' index of large size?  I
was attempting to find some information on the relationship between search
response times, the amount of memory used for a search, and the number of
documents in an index, but I wasn't having much luck.

I'm not sure if I'm making sense here, but just thought I would throw this out
there and see what people think.  There is the distinct possibility that I am
not asking the right questions or considering the right parameters, so feel free
to correct me, or ask questions as you see fit.

And yes, I will report how we are doing things when we get this all figured out,
and if there are items that we can contribute back to Solr we will.  If nothing
else there will be a nice article on how we manage terabytes of data with Solr.

enjoy,

-jeremy

-- 
========================================================================
 Jeremy Hinegardner                              [EMAIL PROTECTED] 
