Dear Wiki user, You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.
The following page has been changed by MarkButler:
http://wiki.apache.org/lucene-hadoop/DistributedLucene

------------------------------------------------------------------------------

=== Issues to be discussed ===

==== 1. Broadcasts versus IPC ====

Currently Hadoop does not support broadcasts, and there are problems getting broadcasts to work across clusters. Do we need to use broadcasts, or can we use the same approach as HDFS and HBase?

Current approach: does not use broadcasts.

==== 2. How do searches work? ====

'''Searches could be broadcast to one index with each id and return merged results. The client will load-balance both searches and updates.'''

Current approach: Sharding is implemented in the client API. Currently the master and the workers know nothing about shards. If a client needs to shard an index, it must take responsibility for that itself, perhaps using a convention that stores the index name and shard number as a composite value in the index ID. To perform a search, the client gets a list of all indexes, then selects replicas at random to query, which provides load balancing. The workers return results and the client API aggregates them.

==== 3. How do deletions work? ====

''Deletions could be broadcast to all slaves.
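The client-side search flow described above (pick one replica per shard at random, then merge the per-shard results) can be sketched as follows. This is only an illustration of the aggregation logic; all class and method names here are hypothetical and do not come from the actual client API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch of the client-side sharded search described above: query one
// randomly chosen replica of every shard, then merge the per-shard hits
// by descending score. All names are hypothetical.
public class ShardedSearchClient {

    // Minimal stand-in for a per-shard search hit.
    public static final class Hit {
        public final String docId;
        public final float score;
        public Hit(String docId, float score) { this.docId = docId; this.score = score; }
    }

    // Client-side load balancing: pick one replica of each shard at random.
    public static List<String> pickReplicas(Map<String, List<String>> replicasByShard, Random rnd) {
        List<String> chosen = new ArrayList<String>();
        for (List<String> replicas : replicasByShard.values()) {
            chosen.add(replicas.get(rnd.nextInt(replicas.size())));
        }
        return chosen;
    }

    // Aggregate per-shard results into one list, ordered by descending
    // score, keeping only the top n hits.
    public static List<Hit> merge(List<List<Hit>> perShardResults, int n) {
        List<Hit> all = new ArrayList<Hit>();
        for (List<Hit> hits : perShardResults) {
            all.addAll(hits);
        }
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<Hit>(all.subList(0, Math.min(n, all.size())));
    }
}
```

Because the master knows nothing about shards, both the replica selection and the merge live entirely in the client library.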
That would probably be fast enough. Alternately, indexes could be partitioned by a hash of each document's unique id, permitting deletions to be routed to the appropriate slave.''

Current approach: On non-sharded indexes, deletions are sent directly to the worker. On sharded ones, they work like the searches described above.

==== 4. How does update work? ====

''The master should be out of the loop as much as possible. One approach is that clients randomly assign documents to indexes and send the updates directly to the indexing node. One potential problem is a document overwrite implemented as a delete followed by an add: more than one client doing this for the same document could result in 0 or 2 documents instead of 1, so clients will need to be relatively coordinated in their activities.

Good point.
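The hash-partitioning alternative mentioned above can be sketched in a few lines: route each document id deterministically to one shard, so a delete (or a delete-then-add overwrite) goes to the single worker holding the document instead of being broadcast. The class and method names are hypothetical.

```java
// Sketch of partitioning by a hash of the document's unique id, as
// suggested above. Deterministic routing means every client sends
// operations for the same document to the same shard.
public class HashRouter {

    // Mask the sign bit rather than using Math.abs, which fails for
    // Integer.MIN_VALUE.
    public static int shardFor(String docId, int numShards) {
        return (docId.hashCode() & 0x7fffffff) % numShards;
    }
}
```

Note that this also narrows the concurrent-overwrite problem: two clients overwriting the same document at least contend on a single worker, where the delete and add can be serialized.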
Either the two clients must coordinate, to make sure that they're not updating the same document at the same time, or use a strategy where updates are routed to the slave that contained the old version of the document. That would require a broadcast query to figure out which slave that is.''

==== 5. How do additions work? ====

The master should not be involved in adds. Clients can cache the set of writable index locations and directly submit new documents without involving the master.

The master should be out of the loop as much as possible. One approach is that clients randomly assign documents to indexes and send the updates directly to the indexing node. '''Alternately, clients might index locally, then ship the updates to a node packaged as an index. That was the intent of the addIndex method.'''

==== 6. How do commits work? ====

''It seems like the master might want to be involved in commits too, or maybe we just rely on the slave-to-master heartbeat to kick off immediately after a commit so that index replication can be initiated? I like the latter approach.
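The heartbeat-driven commit handling preferred above can be sketched as follows: the worker bumps its index version on commit, and the master notices the newer version in the next heartbeat and schedules replication. This is a toy model of the control flow only; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of commit-then-heartbeat replication: the master is out of the
// commit path and only reacts to version changes reported by heartbeats.
public class HeartbeatReplication {

    public static final class Worker {
        public final String id;
        public long indexVersion = 0;
        public Worker(String id) { this.id = id; }
        // After the local Lucene commit succeeds, publish a new version number.
        public void commit() { indexVersion++; }
    }

    public static final class Master {
        public final Map<String, Long> knownVersions = new HashMap<String, Long>();
        public final List<String> replicationQueue = new ArrayList<String>();

        // Called on every heartbeat: schedule replication only when the
        // worker reports a version newer than the last one seen.
        public void heartbeat(Worker w) {
            long last = knownVersions.getOrDefault(w.id, -1L);
            if (w.indexVersion > last) {
                knownVersions.put(w.id, w.indexVersion);
                replicationQueue.add(w.id + "@" + w.indexVersion);
            }
        }
    }
}
```

A repeated heartbeat with an unchanged version is a no-op, so replication is triggered at most once per commit.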
New versions are only published as frequently as clients poll the master for updated IndexLocations. Clients keep a cache of both readable and updatable index locations that are periodically refreshed.''

==== 7. Finding updateable indexes ====

''Looking at''

{{{
IndexLocation[] getUpdateableIndex(String[] id);
}}}

''I'd assumed that the updateable version of an index does not move around very often. Perhaps a lease mechanism is required. For example, a call to getUpdateableIndex might be valid for ten minutes.''

==== 8. What is an Index ID? ====

''But what is an index id exactly? Looking at the example API you laid down, it must be a single physical index (as opposed to a logical index). In that case, is it entirely up to the client to manage multi-shard indices? For example, if we had a "photo" index broken up into 3 shards, each shard would have a separate index id, and it would be up to the client to know this and to query across the different "photo0", "photo1", "photo2" indices. The master would have no clue those indices were related. Hmmm, that doesn't work very well for deletes though. It seems like there should be the concept of a logical index that is composed of multiple shards, where each shard has multiple copies.
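One way to realize the composite index ids in the "photo" example above: a logical index split into N shards maps to physical ids "photo0" … "photoN-1", which only the client interprets; the master and workers still see opaque ids. This naming scheme is an assumption for illustration, not part of the proposal.

```java
// Sketch of a client-side naming convention for multi-shard logical
// indexes. The scheme (logical name + shard number) is hypothetical.
public class LogicalIndexNaming {

    public static String physicalId(String logicalName, int shard) {
        return logicalName + shard;
    }

    // All physical ids the client must query (and merge) for one logical index.
    public static String[] physicalIds(String logicalName, int numShards) {
        String[] ids = new String[numShards];
        for (int i = 0; i < numShards; i++) {
            ids[i] = physicalId(logicalName, i);
        }
        return ids;
    }
}
```

For a delete against the logical index, the client would either hash the document id to pick one of these physical ids, or send the delete to all of them.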
Or were you thinking that a cluster would only contain a single logical index, and hence all the different index ids are simply different shards of that single logical index? That would seem to be consistent with ClientToMasterProtocol.getSearchableIndexes() lacking an id argument.''

==== 9. What about SOLR? ====

''It depends on the project scope and how extensible things are. It seems like the master would be a WAR, capable of running stand-alone. What about the index servers (slaves)? Would this project include just the interfaces to be implemented by Solr/Nutch nodes, some common implementation code behind the interfaces in the form of a library, or also complete standalone WARs?

I'd need to be able to extend the ClientToSlave protocol to add additional methods for Solr (for passing in extra parameters and returning various extra data such as facets, highlighting, etc.).''

==== 10. How does versioning work? ====

''Could this be done in Lucene?
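As a concrete sketch of opening an older index version with lockless commits: each commit writes a new "segments_N" file, so an older version can be exposed by hiding segments files with a later generation. The code below models only the file-name filtering; a real implementation would wrap Lucene's Directory, and Lucene actually encodes N in base 36, which is ignored here for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of exposing a specific index version by hiding later
// "segments_N" files. Decimal generations are an assumption of this sketch.
public class VersionFilter {

    // Return the files visible at the requested generation: every
    // non-segments file, plus segments files at or below that generation.
    public static List<String> listAt(List<String> files, long maxGeneration) {
        List<String> visible = new ArrayList<String>();
        for (String f : files) {
            if (!f.startsWith("segments")) {
                visible.add(f);
            } else {
                long gen = f.equals("segments")
                        ? 0
                        : Long.parseLong(f.substring("segments_".length()));
                if (gen <= maxGeneration) {
                    visible.add(f);
                }
            }
        }
        return visible;
    }
}
```

A reader opening the filtered view finds the requested segments file as the latest one; data files written by later commits remain visible but are simply never referenced.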
It would also need a way to open a specific index version (rather than just the latest), but I guess that could also be hacked into Directory by hiding later "segments" files (this assumes lockless commits has been committed).''

=== Mark's comments ===