On 10/19/06, Steven Parkes <[EMAIL PROTECTED]> wrote:
You mention partitioning of indexes, though mostly around delete. What
about scalability of corpus size?

Definitely in scope.  Solr already has scalability of search volume
via searchers behind a load balancer, all getting their index from a
master.  The problem comes when an index is too big to get decent
latency for a single query; that's when you need to partition the
index into "shards", to use Google terminology.
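For illustration, one common way to partition is to hash each
document's unique key to pick its shard.  A minimal sketch (the class
and method names are made up, not Solr APIs):

    // Hypothetical sketch: route each document to one of N shards by
    // hashing its unique key, so any document's shard can be found
    // deterministically at both index and query time.
    public class ShardRouter {
        private final int numShards;

        public ShardRouter(int numShards) {
            this.numShards = numShards;
        }

        public int shardFor(String uniqueKey) {
            // Mask the sign bit rather than using Math.abs, which can
            // return a negative value for Integer.MIN_VALUE.
            return (uniqueKey.hashCode() & 0x7fffffff) % numShards;
        }
    }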

Would partitioning be effective for that, too?

Yes, to a certain extent.  At some point you run into network
bandwidth issues if you go deep into rankings.
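To make the bandwidth issue concrete: no shard knows where its
documents rank globally, so each shard must return its own top
(start + rows) candidates for every query.  A rough sketch of the
arithmetic (hypothetical names, just to show the scaling):

    // Hypothetical sketch of why deep paging is expensive in a
    // sharded index: the merger can't know which shard holds global
    // result #991, so every shard sends its own top (start + rows).
    public class DeepPagingCost {
        public static int docsPerShard(int start, int rows) {
            return start + rows;    // each shard's candidate list
        }

        public static int docsOnTheWire(int start, int rows, int numShards) {
            return docsPerShard(start, rows) * numShards;
        }

        public static void main(String[] args) {
            // Page 100 (start=990, rows=10) across 20 shards:
            // 20,000 candidates cross the network to return 10 hits.
            System.out.println(docsOnTheWire(990, 10, 20));
        }
    }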

What about scalability of ingest rate?

As it relates to indexing, I think Nutch already has that base covered.

What are you thinking, in terms of size? Is this a 10 node thing?

I'm personally interested in perhaps 10 to 20 index shards, with
multiple replicas of each shard for HA and query load scalability.
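As a minimal sketch of what that could look like on the query side
(made-up names, and only one of several possible designs), each
incoming query picks one live replica per shard:

    import java.util.List;
    import java.util.Random;

    // Hypothetical sketch: each shard has several identical replicas;
    // a query is answered by one replica per shard, chosen at random.
    // Load spreads across replicas, and a dead replica can simply be
    // retried on another one, which is where the HA comes from.
    public class ReplicaPicker {
        private final Random random = new Random();

        public String pickReplica(List<String> replicaUrls) {
            return replicaUrls.get(random.nextInt(replicaUrls.size()));
        }
    }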

A 1000 node thing? More? Bigger is cool, but raises a lot of issues.

Should be possible, but I won't personally be looking for that.  I
think scaling effectively will be partially in the hands of the client
and how it chooses to merge results from shards.
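A minimal sketch of what that client-side merge might look like
(hypothetical names, not a Solr API): each shard returns its top-k
scored doc ids, and the client folds them into a global top-k with a
bounded min-heap:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.PriorityQueue;

    // Hypothetical sketch of client-side merging: each shard returns
    // its top-k (docId, score) pairs; the client keeps the k best seen
    // so far in a min-heap and evicts the current worst on overflow.
    public class ShardMerger {
        public record ScoredDoc(String docId, float score) {}

        public static List<ScoredDoc> mergeTopK(
                List<List<ScoredDoc>> shardResults, int k) {
            PriorityQueue<ScoredDoc> heap =
                new PriorityQueue<>((a, b) -> Float.compare(a.score(), b.score()));
            for (List<ScoredDoc> shard : shardResults) {
                for (ScoredDoc doc : shard) {
                    heap.offer(doc);
                    if (heap.size() > k) {
                        heap.poll();    // evict the lowest-scoring doc
                    }
                }
            }
            List<ScoredDoc> merged = new ArrayList<>(heap);
            merged.sort((a, b) -> Float.compare(b.score(), a.score())); // best first
            return merged;
        }
    }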

How dynamic?  Can nodes come and go?

Unplanned: yes.  HA is key for me, personally.
Planned (adding capacity gracefully): it would be nice.  I actually
hadn't planned it for Solr.

Are you going to assume homogeneity of nodes?

Hardware homogeneity?  That might be out of scope... I'd start off
without worrying about it in any case.

What about add/modify/delete to search visibility latency? Close to batch/once-a-day or real-time?

Anywhere in between, I'd think.  "Realtime" latencies of minutes or
longer are normally fine.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
