Doug Cutting wrote:
It seems that Nutch and Solr would benefit from a shared index serving
An RPC mechanism would be used to communicate between nodes (probably
Hadoop's). The system would be configured with a single master node
that keeps track of where indexes are located, and a number of slave
nodes that would maintain, search and replicate indexes. Clients would
talk to the master to find out which indexes to search or update, then
they'll talk directly to slaves to perform searches and updates.
Does this make sense? Does it sound like it would be useful to Solr? To
Nutch? To others? Who would be interested and able to work on it?
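
For reference, the lookup half of that flow (slaves report which indexes they hold, clients ask the master where to go) might look roughly like the sketch below. This is only an illustration: the class and method names are made up, there is no RPC layer, and nothing here is taken from Nutch, Solr, or Hadoop.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the master's bookkeeping; not an existing API.
class MasterSketch {
    // index name -> names of slave nodes that currently hold a copy
    private final Map<String, Set<String>> locations = new HashMap<>();

    // Called when a slave reports that it holds a replica of an index.
    public synchronized void register(String index, String slave) {
        locations.computeIfAbsent(index, k -> new TreeSet<>()).add(slave);
    }

    // Called by a client to learn which slaves to contact directly.
    // Returns a copy (sorted by slave name), empty if the index is unknown.
    public synchronized List<String> locate(String index) {
        Set<String> slaves = locations.getOrDefault(index, Collections.emptySet());
        return new ArrayList<>(slaves);
    }
}
```

After calling locate, a client would open its own connection to one of the returned slaves to search or update; the master stays out of the data path.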

Is there any way this could be generalized so that resources
other than Lucene indexes could be packaged up and distributed?
The reason I ask is that we have customers who are using
Lucene and SOLR and would like to pass other bits of their
applications around in the same way, including things we've
built from indexed data like spelling checkers, background
models for statistically interesting phrase detectors, statistical
models for topic/tag classifiers that get retrained as users
add more tags, language identifiers, etc.
From what I understand of Doug's proposal as well as
what I've seen in SOLR, there's not much that's actually
Lucene-specific about all this client/master/slave synching
other than that the data's a Lucene index.
I imagine this could be done with a generalization of the
kinds of callbacks found in SOLR, or by making what gets
passed around configurable in the proposed index server.
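
To make the second option a bit more concrete, here is a rough sketch of what "configurable" could mean at the API level. Everything in it is an assumption for the sake of discussion: the interface, the word-list example, and the version check are invented names, not anything that exists in Solr or in Doug's proposal.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical: anything a node can name, version, and serialize for shipping.
interface DistributableResource {
    String name();
    long version();
    byte[] snapshot(); // bytes that would be copied to other nodes
}

// Example of a non-Lucene resource: a word list backing a spelling checker.
class WordList implements DistributableResource {
    private final String name;
    private final long version;
    private final List<String> words;

    WordList(String name, long version, List<String> words) {
        this.name = name;
        this.version = version;
        this.words = words;
    }

    public String name() { return name; }
    public long version() { return version; }
    public byte[] snapshot() {
        return String.join("\n", words).getBytes(StandardCharsets.UTF_8);
    }
}

// Slave-side store that only accepts a strictly newer version of a resource.
class ResourceStore {
    private final Map<String, DistributableResource> installed = new HashMap<>();

    // Returns true if the resource was installed, false if it was stale.
    boolean install(DistributableResource r) {
        DistributableResource current = installed.get(r.name());
        if (current != null && current.version() >= r.version()) {
            return false;
        }
        installed.put(r.name(), r);
        return true;
    }

    // Returns the installed version, or -1 if nothing is installed.
    long versionOf(String name) {
        DistributableResource r = installed.get(name);
        return r == null ? -1 : r.version();
    }
}
```

A Lucene index would then be just one more implementation of the interface, and the master/slave machinery would not need to know what is inside the bytes it moves around.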
I'd be happy to test and help with API-level design/doc; I
don't know much about distribution mechanics, though, which
is why I'm so interested in this high level abstraction.
- Bob Carpenter