[PROPOSAL] index server project

Doug Cutting Wed, 18 Oct 2006 14:18:02 -0700

It seems that Nutch and Solr would benefit from a shared index serving infrastructure. Other Lucene-based projects might also benefit from this. So perhaps we should start a new project to build such a thing. This could start either in java/contrib, or as a separate sub-project, depending on interest.


Here are some quick ideas about how this might work.

An RPC mechanism would be used to communicate between nodes (probably Hadoop's). The system would be configured with a single master node that keeps track of where indexes are located, and a number of slave nodes that would maintain, search and replicate indexes. Clients would talk to the master to find out which indexes to search or update, then they'll talk directly to slaves to perform searches and updates.


Following is an outline of how this might look.

We assume that, within an index, a file with a given name is written only once. Index versions are sets of files, and a new version of an index is likely to share most files with the prior version. Versions are numbered. An index server should keep old versions of each index for a while, not immediately removing old files.
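
To illustrate why write-once files make replication cheap, here is a minimal sketch of how a slave might work out which files it still needs for a new version. The class name VersionDiff, the method filesToFetch and the file names are hypothetical and not part of the proposal; the only assumption taken from the text is that a file name never changes content once written.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of the proposed API.
class VersionDiff {
  /**
   * Returns the files of the new version that are not held locally.
   * Because a file with a given name is written only once, a name match
   * is assumed to mean the content is already present.
   */
  static Set<String> filesToFetch(Set<String> localFiles, Set<String> newVersionFiles) {
    Set<String> missing = new TreeSet<>(newVersionFiles);
    missing.removeAll(localFiles);
    return missing;
  }

  public static void main(String[] args) {
    // Made-up file names standing in for two consecutive index versions.
    Set<String> v1 = new HashSet<>(Arrays.asList("_1.cfs", "segments_1"));
    Set<String> v2 = new HashSet<>(Arrays.asList("_1.cfs", "_2.cfs", "segments_2"));
    System.out.println(filesToFetch(v1, v2));
  }
}
```

Only the files that differ between versions would need to be copied, which is why keeping old versions around for a while costs little extra space.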


public class IndexVersion {
  String id;   // unique name of the index
  int version; // the version of the index
}

public class IndexLocation {
  IndexVersion indexVersion;
  InetSocketAddress location;
}

public interface ClientToMasterProtocol {
  IndexLocation[] getSearchableIndexes();
  IndexLocation getUpdateableIndex(String id);
}

public interface ClientToSlaveProtocol {
  // normal update
  void addDocument(String index, Document doc);
  int[] removeDocuments(String index, Term term);
  void commitVersion(String index);

  // batch update
  void addIndex(String index, IndexLocation indexToAdd);

  // search
  SearchResults search(IndexVersion i, Query query, Sort sort, int n);
}

public interface SlaveToMasterProtocol {
  // sends currently searchable indexes
  // receives updated indexes that we should replicate/update
  IndexLocation[] heartbeat(IndexVersion[] searchableIndexes);
}

public interface SlaveToSlaveProtocol {
  String[] getFileSet(IndexVersion indexVersion);
  byte[] getFileContent(IndexVersion indexVersion, String file);
  // based on experience in Hadoop, we probably wouldn't really use
  // RPC to send file content, but rather HTTP.
}

The master thus maintains the set of indexes that are available for search, keeps track of which slave should handle changes to an index and initiates index synchronization between slaves. The master can be configured to replicate indexes a specified number of times.
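
As a rough sketch of the master's bookkeeping, the following shows how a heartbeat could drive replication. It is heavily simplified and makes several assumptions that are not in the outline above: indexes and slaves are identified by plain strings, versions and source locations are ignored, and the class name ReplicationTracker is hypothetical. It may also hand the same index to several slaves at once, which a real master would need to avoid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical master-side bookkeeping; names are illustrative only.
class ReplicationTracker {
  private final int targetReplicas;
  // index id -> slaves that last reported it as searchable
  private final Map<String, Set<String>> holders = new TreeMap<>();

  ReplicationTracker(int targetReplicas) {
    this.targetReplicas = targetReplicas;
  }

  /** Records what a slave reports and returns the index ids it should copy. */
  List<String> heartbeat(String slave, List<String> searchableIds) {
    // Forget the slave's previous report, then record the current one.
    for (Set<String> slaves : holders.values()) {
      slaves.remove(slave);
    }
    for (String id : searchableIds) {
      holders.computeIfAbsent(id, k -> new HashSet<>()).add(slave);
    }
    // Ask the slave to copy every under-replicated index it does not hold.
    List<String> toReplicate = new ArrayList<>();
    for (Map.Entry<String, Set<String>> e : holders.entrySet()) {
      Set<String> slaves = e.getValue();
      if (!slaves.contains(slave) && slaves.size() < targetReplicas) {
        toReplicate.add(e.getKey());
      }
    }
    return toReplicate;
  }

  public static void main(String[] args) {
    ReplicationTracker master = new ReplicationTracker(2);
    master.heartbeat("slave1", Arrays.asList("indexA"));
    System.out.println(master.heartbeat("slave2", Collections.<String>emptyList()));
  }
}
```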

The client library can cache the current set of searchable indexes and periodically refresh it. Searches are broadcast to one index with each id and return merged results. The client will load-balance both searches and updates.
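
A minimal sketch of that client-side logic might look like the following. Everything here is an assumption made for illustration: the class SearchClientSketch and its Hit type are hypothetical stand-ins (not Lucene classes), replicas are chosen at random as one possible load-balancing policy, every index id is assumed to have at least one replica, and merging is by descending score only, ignoring the Sort parameter and assuming scores from different indexes are comparable.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical client-side helpers; not part of the proposed API.
class SearchClientSketch {
  /** Simplified stand-in for a scored hit. */
  static final class Hit {
    final String docId;
    final float score;

    Hit(String docId, float score) {
      this.docId = docId;
      this.score = score;
    }
  }

  /** Picks one replica address per index id at random, to spread load. */
  static Map<String, String> pickOnePerIndex(Map<String, List<String>> replicasById, Random rnd) {
    Map<String, String> chosen = new TreeMap<>();
    for (Map.Entry<String, List<String>> e : replicasById.entrySet()) {
      List<String> replicas = e.getValue(); // assumed non-empty
      chosen.put(e.getKey(), replicas.get(rnd.nextInt(replicas.size())));
    }
    return chosen;
  }

  /** Merges per-index hit lists into the overall top n by descending score. */
  static List<Hit> mergeTopN(List<List<Hit>> perIndexHits, int n) {
    List<Hit> all = new ArrayList<>();
    for (List<Hit> hits : perIndexHits) {
      all.addAll(hits);
    }
    all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
    return new ArrayList<>(all.subList(0, Math.min(n, all.size())));
  }
}
```

A caller would first use pickOnePerIndex on its cached view of the master's index list, send the query to each chosen slave, and pass the per-slave results to mergeTopN.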

Deletions could be broadcast to all slaves. That would probably be fast enough. Alternately, indexes could be partitioned by a hash of each document's unique id, permitting deletions to be routed to the appropriate slave.
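
The hash-partitioning alternative could be sketched as below. The class DeletionRouter and the choice of String.hashCode with a simple modulo are assumptions for illustration; the proposal does not specify a hash function. Note that with plain modulo partitioning, changing the number of partitions remaps most documents.

```java
// Hypothetical router; the hash function is an illustrative choice only.
class DeletionRouter {
  /**
   * Maps a document's unique id to a partition in [0, numPartitions).
   * Math.floorMod keeps the result non-negative even when hashCode()
   * is negative.
   */
  static int partitionFor(String docId, int numPartitions) {
    return Math.floorMod(docId.hashCode(), numPartitions);
  }

  public static void main(String[] args) {
    // The same id always maps to the same partition, so a deletion can be
    // sent to the one slave that received the corresponding add.
    System.out.println(partitionFor("doc-42", 4));
  }
}
```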

Does this make sense? Does it sound like it would be useful to Solr? To Nutch? To others? Who would be interested and able to work on it?


Doug
