Doug Cutting
Wed, 18 Oct 2006 14:18:02 -0700
Here are some quick ideas about how this might work.An RPC mechanism would be used to communicate between nodes (probably Hadoop's). The system would be configured with a single master node that keeps track of where indexes are located, and a number of slave nodes that would maintain, search and replicate indexes. Clients would talk to the master to find out which indexes to search or update, then they'll talk directly to slaves to perform searches and updates.
Following is an outline of how this might look.We assume that, within an index, a file with a given name is written only once. Index versions are sets of files, and a new version of an index is likely to share most files with the prior version. Versions are numbered. An index server should keep old versions of each index for a while, not immediately removing old files.
public class IndexVersion {
String Id; // unique name of the index
int version; // the version of the index
}
public class IndexLocation {
IndexVersion indexVersion;
InetSocketAddress location;
}
public interface ClientToMasterProtocol {
IndexLocation[] getSearchableIndexes();
IndexLocation getUpdateableIndex(String id);
}
public interface ClientToSlaveProtocol {
// normal update
void addDocument(String index, Document doc);
int[] removeDocuments(String index, Term term);
void commitVersion(String index);
// batch update
void addIndex(String index, IndexLocation indexToAdd);
// search
SearchResults search(IndexVersion i, Query query, Sort sort, int n);
}
public interface SlaveToMasterProtocol {
// sends currently searchable indexes
// recieves updated indexes that we should replicate/update
public IndexLocation[] heartbeat(IndexVersion[] searchableIndexes);
}
public interface SlaveToSlaveProtocol {
String[] getFileSet(IndexVersion indexVersion);
byte[] getFileContent(IndexVersion indexVersion, String file);
// based on experience in Hadoop, we probably wouldn't really use
// RPC to send file content, but rather HTTP.
}
The master thus maintains the set of indexes that are available for
search, keeps track of which slave should handle changes to an index and
initiates index synchronization between slaves. The master can be
configured to replicate indexes a specified number of times.
The client library can cache the current set of searchable indexes and periodically refresh it. Searches are broadcast to one index with each id and return merged results. The client will load-balance both searches and updates.
Deletions could be broadcast to all slaves. That would probably be fast enough. Alternately, indexes could be partitioned by a hash of each document's unique id, permitting deletions to be routed to the appropriate slave.
Does this make sense? Does it sound like it would be useful to Solr? To Nutch? To others? Who would be interested and able to work on it?
Doug