Dear Wiki user, You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.
The following page has been changed by MarkButler:
http://wiki.apache.org/lucene-hadoop/DistributedLucene

------------------------------------------------------------------------------
  {{{
  public interface ClientToDataNodeProtocol extends VersionedProtocol {
    void addDocument(String index, Document doc) throws IOException;
+   int removeDocuments(String index, Term term) throws IOException;  // Change here, Doug suggested int[] but that is different to the current Lucene API
-
-   // Change here, Doug suggested int[] but that is different
-   // to current Lucene API
-
-   int removeDocuments(String index, Term term) throws IOException;
    IndexVersion commitVersion(String index) throws IOException;

    // batch update
-   void addIndex(String index) throws IOException;
+   void addIndex(String index) throws IOException;  // Shouldn't this be called createIndex?
    void addIndex(String index, IndexLocation indexToAdd) throws IOException;

    // search
@@ -67, +63 @@
  {{{
  public interface DataNodeToDataNodeProtocol extends VersionedProtocol {
    String[] getFileSet(IndexVersion indexVersion) throws IOException;
+   byte[] getFileContent(IndexVersion indexVersion, String file) throws IOException;  // based on experience in Hadoop we probably wouldn't use RPC to fetch file content, but HTTP instead
-   byte[] getFileContent(IndexVersion indexVersion, String file)
-     throws IOException;
-   // based on experience in Hadoop we probably wouldn't really use
-   // RPC to find file content, instead HTTP
  }
  }}}
@@ -97, +90 @@
  Design the client API.

+ One of the issues here is whether sharding should be handled solely by the client, using the API defined above. For example, myindex-1, myindex-2 and myindex-3 could be the shards of myindex, with the client taking full responsibility for sharding while the Master and Workers know nothing about it. The other approach is to extend the API outlined above so that it is shard-aware: the Workers store metadata about the relationship between shards and report it to the Master, so the client can query it rather than infer it.
+
+ To insert data, use a consistent hashing algorithm as described at http://problemsworthyofattack.blogspot.com/2007/11/consistent-hashing.html (see the routing sketch at the end of this message).
+
+ Then provide a query operation that calls all the shards (also sketched below).
+
+ Here is a proposal for the client API:
+
+ {{{
+ public interface ClientAPI {
+
+   void createIndex(String index, boolean sharded) throws IOException;
+
+   // Use IndexVersion because the client API does not need to know where the data is
+
+   IndexVersion[] getSearchableIndexes();
+   IndexVersion[] getUpdateableIndexes();
+   void addIndex(String index, IndexVersion indexToAdd) throws IOException;
+   void addDocument(String index, Document doc) throws IOException;
+   int removeDocuments(String index, Term term) throws IOException;  // Change here, Doug suggested int[] but that is different to the current Lucene API
+   IndexVersion commit(String index) throws IOException;
+   SearchResults search(IndexVersion i, Query query, Sort sort, int n) throws IOException;
+ }
+ }}}
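
For concreteness, here is a minimal sketch of the client-side consistent-hash routing in the spirit of the blog post linked above. The ShardRouter class, the shardFor method, the number of virtual nodes and the use of MD5 are all illustrative assumptions, not part of the proposed API.

{{{
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

/**
 * Maps a document key onto a fixed set of shard names (e.g. "myindex-1")
 * using a consistent hash ring, so adding or removing a shard only moves
 * a small fraction of the keys.
 */
public class ShardRouter {

  private static final int REPLICAS = 100;  // virtual nodes per shard
  private final SortedMap<Long, String> ring = new TreeMap<Long, String>();

  public ShardRouter(String[] shards) {
    for (String shard : shards) {
      for (int i = 0; i < REPLICAS; i++) {
        ring.put(hash(shard + "#" + i), shard);
      }
    }
  }

  /** Returns the shard responsible for the given document key. */
  public String shardFor(String key) {
    long h = hash(key);
    // walk clockwise to the first virtual node at or after the key's hash,
    // wrapping around to the start of the ring if necessary
    SortedMap<Long, String> tail = ring.tailMap(h);
    Long node = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
    return ring.get(node);
  }

  private static long hash(String s) {
    try {
      byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"));
      long h = 0;
      for (int i = 0; i < 8; i++) {   // fold the first 8 digest bytes into a long
        h = (h << 8) | (d[i] & 0xff);
      }
      return h;
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}
}}}

A client could then call something like addDocument(router.shardFor(docKey), doc), so the same key always lands on the same shard.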
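And a rough sketch of the "query operation which calls all the shards", written against the proposed ClientAPI above. ShardedSearcher and searchAll are made-up names, the per-shard calls are sequential for brevity, and the merge/re-sort of the top n hits is left out because SearchResults is not yet specified.

{{{
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;

// ClientAPI, IndexVersion and SearchResults are the types from the proposal above.
public class ShardedSearcher {

  private final ClientAPI client;

  public ShardedSearcher(ClientAPI client) {
    this.client = client;
  }

  /** Runs the query against every searchable index (i.e. every shard). */
  public List<SearchResults> searchAll(Query query, Sort sort, int n)
      throws IOException {
    List<SearchResults> perShard = new ArrayList<SearchResults>();
    // In practice these calls could run in parallel, one thread per shard,
    // and the caller would merge the per-shard hits into a single top-n list.
    for (IndexVersion shard : client.getSearchableIndexes()) {
      perShard.add(client.search(shard, query, sort, n));
    }
    return perShard;
  }
}
}}}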