Hi all,

This may be a bit rambling, but let's see how it goes. I'm not a Lucene or Solr guru by any means; I have been prototyping with Solr and working out how all the pieces and parts fit together.
We are migrating our current document storage infrastructure to a decent-sized Solr cluster, using 1.3-snapshots right now. Eventually this will hold a billion or more documents, with about 1M new documents added per day.

Our main sticking point right now is that a significant number of our documents will be updated at least once, and possibly more than once. The volatility of a document decreases over time.

With this in mind, we've been considering using a cascading series of shard clusters. That is:

1) A cluster of shards holding recent data (the most recent week or two). These are smaller indexes that take only a small amount of time to commit updates and optimise, since this cluster will hold the most volatile documents.

2) Following that, another cluster of shards that holds relatively recent (3-6 months?) but not super volatile documents. These are items that could potentially receive updates, but generally will not.

3) A final set of 'archive' shards that are the final resting place for documents. These would not receive updates. They would be online for searching and analysis "forever".

We are not sure if this is the best way to go, but it is the approach we are leaning toward right now. I would like some feedback from the folks here on whether you think it is a reasonable approach.

One of the other things I'm wondering about is how to manipulate indexes. We'll need to roll documents around between indexes over time, or at least migrate indexes from one set of shards to another as the documents 'age', and merge/aggregate them with more 'stable' indexes. I know about merging complete indexes together, but what about migrating a subset of documents from one index into another index?

In addition, what is generally considered a 'manageable' size for a large index? I was attempting to find some information on the relationship between search response times, the amount of memory used for a search, and the number of documents in an index, but I wasn't having much luck.
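To make the tiering idea a bit more concrete, here is a rough Python sketch of how a document might be assigned to one of the three shard clusters by age, and how we might pick out the documents that have aged out of their current cluster. Everything in it is an assumption for illustration: the tier names, the 14-day and 180-day cutoffs (the plan above only says "a week or two" and "3-6 months"), measuring age from the last update rather than from creation, and the helper names `tier_for` and `docs_to_migrate`. It does not talk to Solr at all.

```python
from datetime import datetime, timedelta

# Hypothetical tier boundaries, chosen only to illustrate the plan above.
RECENT_MAX_AGE = timedelta(days=14)    # "most recent week or two"
MIDDLE_MAX_AGE = timedelta(days=180)   # "3-6 months?"


def tier_for(last_updated, now):
    """Return the name of the shard cluster a document would live in.

    Assumption: age is measured from the document's last update.
    """
    age = now - last_updated
    if age <= RECENT_MAX_AGE:
        return "recent"
    if age <= MIDDLE_MAX_AGE:
        return "middle"
    return "archive"


def docs_to_migrate(docs, current_tier, now):
    """Return the ids of documents that no longer belong in current_tier.

    docs is an iterable of (doc_id, last_updated) pairs.
    """
    return [
        doc_id
        for doc_id, last_updated in docs
        if tier_for(last_updated, now) != current_tier
    ]
```

A periodic job could run something like `docs_to_migrate` against each cluster to decide what to roll forward. How to actually move that subset between indexes is the open question above.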
I'm not sure if I'm making sense here, but I thought I would throw this out there and see what people think. There is the distinct possibility that I am not asking the right questions or considering the right parameters, so feel free to correct me, or ask questions as you see fit.

And yes, I will report how we are doing things when we get this all figured out, and if there are items that we can contribute back to Solr, we will. If nothing else, there will be a nice article on how we manage terabytes of data with Solr.

enjoy,

-jeremy

-- 
========================================================================
Jeremy Hinegardner [EMAIL PROTECTED]