It seems that Nutch and Solr would benefit from a shared index serving infrastructure. Other Lucene-based projects might also benefit from this. So perhaps we should start a new project to build such a thing. This could start either in java/contrib, or as a separate sub-project, depending on interest.

Here are some quick ideas about how this might work.

An RPC mechanism would be used to communicate between nodes (probably Hadoop's). The system would be configured with a single master node that keeps track of where indexes are located, and a number of slave nodes that would maintain, search and replicate indexes. Clients would talk to the master to find out which indexes to search or update, then talk directly to slaves to perform searches and updates.

Following is an outline of how this might look.

We assume that, within an index, a file with a given name is written only once. Index versions are sets of files, and a new version of an index is likely to share most files with the prior version. Versions are numbered. An index server should keep old versions of each index for a while, not immediately removing old files.

public class IndexVersion {
  String id;   // unique name of the index
  int version; // the version of the index
}

public class IndexLocation {
  IndexVersion indexVersion;
  InetSocketAddress location;
}

public interface ClientToMasterProtocol {
  IndexLocation[] getSearchableIndexes();
  IndexLocation getUpdateableIndex(String id);
}

public interface ClientToSlaveProtocol {
  // normal update
  void addDocument(String index, Document doc);
  int[] removeDocuments(String index, Term term);
  void commitVersion(String index);

  // batch update
  void addIndex(String index, IndexLocation indexToAdd);

  // search
  SearchResults search(IndexVersion i, Query query, Sort sort, int n);
}
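
To make the call sequence concrete, a single-document update might flow roughly as follows; connect() is a hypothetical helper that returns an RPC proxy for a slave, and "web" is just a sample index id:

void indexDocument(ClientToMasterProtocol master, Document doc) {
  // ask the master which index accepts updates, then talk to that slave
  IndexLocation loc = master.getUpdateableIndex("web"); // sample id
  ClientToSlaveProtocol slave = connect(loc.location);  // hypothetical helper
  slave.addDocument(loc.indexVersion.id, doc);
  slave.commitVersion(loc.indexVersion.id); // publishes a new searchable version
}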

public interface SlaveToMasterProtocol {
  // sends currently searchable indexes
  // receives updated indexes that we should replicate/update
  IndexLocation[] heartbeat(IndexVersion[] searchableIndexes);
}

public interface SlaveToSlaveProtocol {
  String[] getFileSet(IndexVersion indexVersion);
  byte[] getFileContent(IndexVersion indexVersion, String file);
  // based on experience in Hadoop, we probably wouldn't really use
  // RPC to send file content, but rather HTTP.
}
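
Since index files are write-once, replication between slaves reduces to fetching the files of a version that are not already present locally. A rough sketch, where the local-storage hooks are assumptions:

import java.io.IOException;

public abstract class Replicator {
  // copy an index version from another slave, skipping files we already
  // have; write-once files make this safe and incremental
  public void replicate(IndexVersion version, SlaveToSlaveProtocol source)
      throws IOException {
    for (String file : source.getFileSet(version)) {
      if (!haveLocally(version, file)) {
        writeLocally(version, file, source.getFileContent(version, file));
      }
    }
  }

  // hypothetical local-storage hooks
  protected abstract boolean haveLocally(IndexVersion version, String file);
  protected abstract void writeLocally(IndexVersion version, String file,
                                       byte[] data) throws IOException;
}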

The master thus maintains the set of indexes that are available for search, keeps track of which slave should handle changes to an index and initiates index synchronization between slaves. The master can be configured to replicate indexes a specified number of times.
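
To make that concrete, here is a minimal sketch of the master's side of the heartbeat, assuming an in-memory map of which slaves serve each version (and that IndexVersion implements equals()/hashCode()); callerAddress(), pickSource() and the two-argument IndexLocation constructor are illustrative, not part of the interfaces above:

import java.net.InetSocketAddress;
import java.util.*;

public abstract class Master implements SlaveToMasterProtocol {
  private final int replicationCount = 2; // configured replica count

  // which slaves currently serve each index version
  private final Map<IndexVersion, Set<InetSocketAddress>> serving =
    new HashMap<IndexVersion, Set<InetSocketAddress>>();

  public synchronized IndexLocation[] heartbeat(IndexVersion[] indexes) {
    InetSocketAddress slave = callerAddress(); // provided by the RPC layer

    // record what this slave reports as searchable
    for (IndexVersion v : indexes) {
      Set<InetSocketAddress> slaves = serving.get(v);
      if (slaves == null) {
        serving.put(v, slaves = new HashSet<InetSocketAddress>());
      }
      slaves.add(slave);
    }

    // hand back under-replicated versions for this slave to fetch
    List<IndexLocation> toFetch = new ArrayList<IndexLocation>();
    for (Map.Entry<IndexVersion, Set<InetSocketAddress>> e : serving.entrySet()) {
      Set<InetSocketAddress> copies = e.getValue();
      if (copies.size() < replicationCount && !copies.contains(slave)) {
        // assumes a convenience constructor on IndexLocation
        toFetch.add(new IndexLocation(e.getKey(), pickSource(copies)));
      }
    }
    return toFetch.toArray(new IndexLocation[toFetch.size()]);
  }

  // hypothetical helpers
  protected abstract InetSocketAddress callerAddress();
  protected abstract InetSocketAddress pickSource(Set<InetSocketAddress> copies);
}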

The client library can cache the current set of searchable indexes and periodically refresh it. Searches are broadcast to one index with each id, and the results are merged. The client will load-balance both searches and updates.
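
For example, the client's search path might look roughly like this; pickOnePerId(), connect() and merge() are hypothetical helpers standing in for load-balancing, proxy creation and result merging:

import java.util.*;

public abstract class SearchClient {
  private IndexLocation[] cached; // refreshed periodically from the master

  public SearchResults search(Query query, Sort sort, int n) {
    // choose one replica per index id, then fan the query out
    List<SearchResults> partials = new ArrayList<SearchResults>();
    for (IndexLocation loc : pickOnePerId(cached)) {
      ClientToSlaveProtocol slave = connect(loc.location);
      partials.add(slave.search(loc.indexVersion, query, sort, n));
    }
    return merge(partials, n); // keep the best n hits overall
  }

  // hypothetical helpers
  protected abstract Collection<IndexLocation> pickOnePerId(IndexLocation[] all);
  protected abstract ClientToSlaveProtocol connect(java.net.InetSocketAddress a);
  protected abstract SearchResults merge(List<SearchResults> partials, int n);
}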

Deletions could be broadcast to all slaves. That would probably be fast enough. Alternatively, indexes could be partitioned by a hash of each document's unique id, permitting deletions to be routed to the appropriate slave.
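
In the partitioned case, routing might look like the following; getPartitions() and connect() are hypothetical helpers, and "id" is assumed to be the unique-id field:

// route a deletion to the slave owning the document's hash bucket
int partition(String docId, int numPartitions) {
  return (docId.hashCode() & Integer.MAX_VALUE) % numPartitions;
}

void removeDocument(String docId) {
  IndexLocation[] partitions = getPartitions(); // hypothetical: one per bucket
  IndexLocation target = partitions[partition(docId, partitions.length)];
  connect(target.location).removeDocuments(target.indexVersion.id,
                                           new Term("id", docId));
}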

Does this make sense? Does it sound like it would be useful to Solr? To Nutch? To others? Who would be interested and able to work on it?

Doug
