On 10/19/06, Steven Parkes <[EMAIL PROTECTED]> wrote:
You mention partitioning of indexes, though mostly around delete. What about scalability of corpus size?
Definitely in scope. Solr already has scalability of search volume via searchers behind a load balancer, all getting their index from a master. The problem comes when an index is too big to get decent latency for a single query, and that's when you need to partition the index into "shards", to use Google terminology.
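As a rough illustration of what partitioning an index into shards could look like, here is a minimal sketch that assigns documents to shards by hashing their ids. Hash-based assignment is an assumption made for the example, not something proposed in this thread, and `shard_for` and `partition` are hypothetical names.

```python
import zlib

def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard number using a stable hash."""
    # zlib.crc32 gives the same value in every process, unlike the
    # built-in hash() for strings, which is randomized per process.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def partition(doc_ids, num_shards):
    """Group document ids into num_shards lists, one list per shard."""
    shards = [[] for _ in range(num_shards)]
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards
```

Each shard then holds an index over only its own documents, so a single query does less work per node.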
Would partitioning be effective for that, too?
Yes, to a certain extent. At some point you run into network bandwidth issues if you go deep into rankings.
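To make the bandwidth point concrete: a shard cannot know which of its hits will land on a given page of the global ranking, so for a page starting at `offset` each shard has to return its own top `offset + page_size` hits. The sketch below merges per-shard result lists and counts the hits that cross the network. The `(score, doc_id)` tuple layout and the function names are assumptions made for the example.

```python
import heapq
from itertools import islice

def merge_page(shard_results, offset, page_size):
    """Merge per-shard hit lists and return one page of the global ranking.

    shard_results: list of lists of (score, doc_id) tuples, each list
    already sorted by descending score.
    """
    # Negating the score lets heapq.merge (an ascending merge) produce
    # a descending-score stream without re-sorting the inputs.
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return list(islice(merged, offset, offset + page_size))

def hits_transferred(num_shards, offset, page_size):
    """Hits sent to the merging client when every shard returns its
    top offset + page_size results."""
    return num_shards * (offset + page_size)
```

With 20 shards, the first page of 10 results costs 200 hits on the wire, while a page of 10 starting at rank 1000 costs 20200.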
What about scalability of ingest rate?
As it relates to indexing, I think Nutch already has that base covered.
What are you thinking, in terms of size? Is this a 10 node thing?
I'm personally interested in perhaps 10 to 20 index shards, with multiple replicas of each shard for HA and query load scalability.
A 1000 node thing? More? Bigger is cool, but raises a lot of issues.
Should be possible, but I won't personally be looking for that. I think scaling effectively will be partially in the hands of the client and how it chooses to merge results from shards.
How dynamic?
Can nodes come and go?
Unplanned: yes. HA is personally key for me. Planned (adding capacity gracefully): it would be nice. I actually hadn't planned it for Solr.
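One way a client might cope with a node disappearing unexpectedly is to try the replicas of a shard in turn until one answers. This is only a sketch under assumptions: `query_shard` is a hypothetical helper, `send` stands in for whatever transport is actually used, and a failed node is assumed to surface as a `ConnectionError`.

```python
def query_shard(replicas, send):
    """Return the first successful response from any replica of one shard.

    replicas: list of replica addresses (hypothetical, e.g. host names)
    send: callable(address) -> result, raising ConnectionError on failure
    """
    last_error = None
    for address in replicas:
        try:
            return send(address)
        except ConnectionError as err:
            # This replica is unreachable; remember why and try the next.
            last_error = err
    raise RuntimeError("no live replica for shard") from last_error
```

The query only fails if every replica of some shard is down at the same time.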
Are you going to assume homogeneity of nodes?
Hardware homogeneity? That might be out of scope... I'd start off without worrying about it in any case.
What about add/modify/delete to search visibility latency? Close to batch/once-a-day or real-time?
Anywhere in between I'd think. "Realtime" latencies of minutes or longer are normally fine.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server