Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.
The following page has been changed by MarkButler:
http://wiki.apache.org/lucene-hadoop/DistributedLucene

------------------------------------------------------------------------------
}
}}}

- == Implementation Notes ==
+ === Implementation Notes ===

- === Issues to be discussed ===
+ === Current Status ===
+
+ Currently there is an alpha implementation of the design outlined above, specifically the master, worker, client library and unit tests. This is awaiting review by HP for contribution to the Apache Foundation.
+
+ Rather than using HDFS, the implementation (DLucene) is heavily inspired by HDFS. This is because the files used in Lucene indexes are quite different from the files that HDFS was designed for. It uses a replication algorithm similar to that of HDFS, and where possible it reuses HDFS code, although it was necessary to make some local changes to the visibility of some classes and methods.
+
+ Unlike HDFS, it currently uses a stateless master. In the event of a failure, the heartbeat information sent by each worker contains a list of all the indexes it owns, together with the current status of those indexes. This means it should be possible to swap over masters. The disadvantage is that this results in more network traffic per heartbeat.
+
+ Both the master and the workers have a heartbeat architecture. Workers have three types of threads: one to service requests, one to send heartbeats to the master to inform it that the worker is alive, and one to process replication tasks. The master has two types of threads: one to service requests, the other to perform failure detection and compute a replication plan. A segment of this plan is then sent back to each worker in response to its heartbeat.
+
+ One of the aims of the implementation is to better understand how Hadoop works, so that it is possible to create an architecture that simplifies the creation of other specialized storage or processing components for Hadoop.
+
+ There are also a number of outstanding items of functionality:
+
+  * There is no process to delete old versions of indexes after a predetermined time, as in HDFS.
+  * The implementation does not take advantage of Lucene's RAM-based indexes to improve efficiency.
+  * In HDFS there is a "throttler" to control client requests. There is no equivalent functionality in DLucene.
+  * A cluster contains a number of replicas of an index, to support high availability and load balancing. When a client writes to an index, the updates are sent to one replica, which creates a new version. When the changes are committed, the new version is then propagated to the other machines in the cluster. However, if several clients update the same index at the same time, their updates need to be synchronized through the same replica. I haven't worked out a way of doing this yet. See below for discussion of some design alternatives.
+  * Replication is quicker when a worker already holds an old copy of an index, but the replication assignment algorithm does not yet take advantage of this.
+  * No benchmarks yet.
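To make the stateless-master heartbeat described above more concrete, here is a minimal sketch of what the worker-to-master exchange could look like. It is only illustrative: the names (WorkerHeartbeat, IndexState, Master.heartbeat) are assumptions rather than the actual DLucene classes, and the real implementation would carry these messages over Hadoop IPC rather than direct method calls.

{{{
// Hypothetical sketch of a stateless-master heartbeat; not the DLucene API.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

enum IndexStatus { LIVE, REPLICATING, UNCOMMITTED }

/** One index replica held by a worker, as reported in every heartbeat. */
class IndexState {
  final String indexId;
  final int version;
  final IndexStatus status;
  IndexState(String indexId, int version, IndexStatus status) {
    this.indexId = indexId; this.version = version; this.status = status;
  }
}

/** Heartbeat payload: because the master keeps no durable state, the worker
    reports every index it owns on every beat. */
class WorkerHeartbeat {
  final String workerAddress;
  final List<IndexState> indexes;
  WorkerHeartbeat(String workerAddress, List<IndexState> indexes) {
    this.workerAddress = workerAddress; this.indexes = indexes;
  }
}

/** Master side: rebuilds its view of the cluster from heartbeats alone and
    hands back the slice of the replication plan meant for that worker. */
class Master {
  private final Map<String, List<IndexState>> clusterView =
      new HashMap<String, List<IndexState>>();
  private final Map<String, List<String>> replicationPlan =
      new HashMap<String, List<String>>();

  synchronized List<String> heartbeat(WorkerHeartbeat hb) {
    clusterView.put(hb.workerAddress, hb.indexes);   // overwrite, never merge
    List<String> tasks = replicationPlan.remove(hb.workerAddress);
    return tasks != null ? tasks : new ArrayList<String>();
  }

  public static void main(String[] args) {
    Master master = new Master();
    List<IndexState> owned = new ArrayList<IndexState>();
    owned.add(new IndexState("photo-index", 3, IndexStatus.LIVE));
    System.out.println(master.heartbeat(new WorkerHeartbeat("10.0.0.5:9000", owned)));
  }
}
}}}

Because the master keeps no durable state of its own, a standby master could rebuild this view simply by collecting one round of heartbeats; the cost is the larger per-heartbeat message noted above.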
+
+ === Design decisions ===

==== 1. Broadcasts versus IPC ====

- In the previous design discussion, there was talk about using broadcasts. However Hadoop does not support broadcasts, and there are problems getting broadcasts to work across clusters. Do we need to use broadcasts or can we use the same approach as HDFS and Hbase?
+ In the original design discussion by Doug Cutting and Yonik Seeley, broadcasts were considered for deletes and searches. The current implementation uses basic Hadoop IPC rather than broadcasts.

- Current approach: do not use broadcasts.
+ ==== 2. Sharding ====
+
+ The master and workers know nothing about sharding. It is handled at the client, in order to keep the master and worker implementations simple.
+
- ==== 2. How do searches work? ====
+ ==== 3. How do searches work? ====

''Searches could be broadcast to one index with each id and return merged results. The client will load-balance both searches and updates.''

@@ -139, +162 @@

In the sharded case, it does the same thing but queries one replica of each shard. The workers return results and the client API aggregates them.

- ==== 3. How do deletions work? ====
+ ==== 4. How do deletions work? ====

''Deletions could be broadcast to all slaves. That would probably be fast enough. Alternately, indexes could be partitioned by a hash of each document's unique id, permitting deletions to be routed to the appropriate slave.''

Current approach: On non-sharded indexes, deletions are sent directly to the worker. On sharded ones, they work like the searches described above.

- ==== 4. How does update work? ====
+ ==== 5. How does update work? ====

''The master should be out of the loop as much as possible. One approach is that clients randomly assign documents to indexes and send the updates directly to the indexing node.

@@ -157, +180 @@

Therefore there is a danger that two clients could edit the same index at the same time. One possibility here would be to bind a particular version of an index to a particular client. If the client fails, that is not a problem: the changes are just uncommitted. However, there is still a danger of a race condition when there are two different branches of the same index.

- ==== 5. How do additions work? ====
+ ==== 6. How do additions work? ====

The master should not be involved in adds. Clients can cache the set of writable index locations and directly submit new documents without involving the master. The master should be out of the loop as much as possible. One approach is that clients randomly assign documents to indexes and send the updates directly to the indexing node.

'''Alternately, clients might index locally, then ship the updates to a node packaged as an index. That was the intent of the addIndex method.'''

- TODO: Think how to do this in the client API.
-
- ==== 6. How do commits work? ====
+ ==== 7. How do commits work? ====

''It seems like the master might want to be involved in commits too, or maybe we just rely on the slave to master heartbeat to kick of immediately after a commit so that index replication can be initiated? I like the latter approach. New versions are only published as frequently as clients poll the master for updated IndexLocations. Clients keep a cache of both readable and updatable index locations that are periodically refreshed.''

@@ -173, +194 @@

There are probably some subtleties about heartbeating mechanisms on large clusters. For example, if heartbeats are staggered that is good (no collisions); if they are synchronized that is bad (collisions). Does the system settle to a steady state? If so, changing heartbeat intervals could cause problems. I'm just guessing here, though. The main point is that this is a complex system, so sometimes doing things that seem obvious can have unexpected (and possibly detrimental) effects. It's important to do the math to check that the likely result is what you expect.
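As a rough illustration of the "do the math" point above, the following self-contained snippet compares the peak per-second load on the master when heartbeats are synchronized versus randomly staggered. The numbers (1,000 workers, a 10-second heartbeat period) are invented for the example and are not DLucene defaults.

{{{
// Back-of-the-envelope check: how bursty is the master's heartbeat load?
import java.util.Random;

public class HeartbeatStagger {
  public static void main(String[] args) {
    final int workers = 1000;
    final int periodSeconds = 10;
    Random random = new Random(42);

    // If every worker picks a random offset within the period, count how many
    // heartbeats land in each one-second bucket of a single period.
    int[] perSecond = new int[periodSeconds];
    for (int w = 0; w < workers; w++) {
      perSecond[random.nextInt(periodSeconds)]++;
    }

    int peak = 0;
    for (int count : perSecond) {
      peak = Math.max(peak, count);
    }

    System.out.println("Synchronized heartbeats: peak of " + workers + " requests in one second");
    System.out.println("Randomly staggered:      peak of " + peak + " requests in one second"
        + " (average " + (workers / periodSeconds) + " per second)");
  }
}
}}}

With synchronized heartbeats all 1,000 requests arrive in the same second; with random offsets the peak stays close to the 100-per-second average, which is the kind of steady state the paragraph above is asking about.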
- ==== 7. Finding updateable indexes ====
+ ==== 8. Finding updateable indexes ====

''Looking at''

{{{
@@ -183, +204 @@

It's hard to give guarantees here as nodes can fail at any time ...

- ==== 8. What is an Index ID? ====
+ ==== 9. What is an Index ID? ====

''But what is index id exactly? Looking at the example API you laid down, it must be a single physical index (as opposed to a logical index). In which case, is it entirely up to the client to manage multi-shard indicies? For example, if we had a "photo" index broken up into 3 shards, each shard would have a separate index id and it would be up to the client to know this, and to query across the different "photo0", "photo1", "photo2" indicies. The master would have no clue those indicies were related. Hmmm, that doesn't work very well for deletes though.

@@ -194, +215 @@

This comes back to how we implement sharding, discussed above ...

- ==== 9. What about SOLR? ====
+ ==== 10. What about SOLR? ====

''It depends on the project scope and how extensible things are. It seems like the master would be a WAR, capable of running stand-alone. What about index servers (slaves)? Would this project include just the interfaces to be implemented by Solr/Nutch nodes, some common implementation code behind the interfaces in the form of a library, or also complete standalone WARs? I'd need to be able to extend the ClientToSlave protocol to add additional methods for Solr (for passing in extra parameters and returning various extra data such as facets, highlighting, etc).''

- ==== 10. How does versioning work? ====
+ ==== 11. How does versioning work? ====

''Could this be done in Lucene? It would also need a way to open a specific index version (rather than just the latest), but I guess that could also be hacked into Directory by hiding later "segments" files (assumes lockless is committed).''

Current approach: At the moment, a new version is created by copying the whole index, which is expensive in disk space. When an index is replicated, however, only recent changes are copied if a local copy of the index already exists, so network traffic should be efficient.
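As a sketch of the "only copy recent changes" idea above: because Lucene index files are largely write-once, a worker that already holds an older copy of an index only needs the files it is missing. The helper below is an assumption-laden illustration (the name IndexReplicator.syncIndex is invented, and it works on locally visible directories), whereas the actual DLucene transfer would run over the worker-to-worker protocol.

{{{
// Illustrative only: bring a local copy of an index up to date by copying just
// the files that are missing or have a different length. Not the DLucene API.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class IndexReplicator {

  /** Copy the source index into localDir, skipping files that already exist
      locally with the same name and length. */
  static void syncIndex(File sourceDir, File localDir) throws IOException {
    localDir.mkdirs();
    File[] files = sourceDir.listFiles();
    if (files == null) {
      throw new IOException("Cannot list " + sourceDir);
    }
    for (File src : files) {
      File dst = new File(localDir, src.getName());
      if (dst.exists() && dst.length() == src.length()) {
        continue; // unchanged file carried over from an older version
      }
      copyFile(src, dst);
    }
  }

  private static void copyFile(File src, File dst) throws IOException {
    InputStream in = new FileInputStream(src);
    try {
      OutputStream out = new FileOutputStream(dst);
      try {
        byte[] buffer = new byte[64 * 1024];
        int read;
        while ((read = in.read(buffer)) != -1) {
          out.write(buffer, 0, read);
        }
      } finally {
        out.close();
      }
    } finally {
      in.close();
    }
  }

  public static void main(String[] args) throws IOException {
    // Usage (illustrative): java IndexReplicator /path/to/source/index /path/to/local/index
    syncIndex(new File(args[0]), new File(args[1]));
  }
}
}}}

Comparing files by name and length is a simplification: anything that can be overwritten in place (for example the segments generation file) would need a checksum or an unconditional copy in a real implementation.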
- === Mark's comments ===
-
- Rather than using HDFS, DLucene is heavily inspired by HDFS. This is because the files uses in Lucene indexes are quite different from the files that HDFS was designed for. It uses a similar replication algorithm, and where possible HDFS code although it was necessary to make some local changes to the visibility of some classes and methods.
-
- Unlike HDFS it currently uses a state less Name node. In the event of a failure, the heartbeat information sent by each worker contains a list of all indexes they own, and also the current status of those indexes. This means it should be possible to swap over masters. However the disadvantage is this will result in more network traffic per heartbeat.
-
- Both the master and workers have a heart beat architecture. On a worker heartbeat, it sends information to the master about its status. In addition, there is a second thread that performs examines a queue of replication tasks, and performs them one at a time (there may be optimisations here). On a master heartbeat, the master performs failure detection and also computes a replication plan. A segment of this plan is then sent back to the correct worker on the next heartbeat.
-
- I have an abstract node class that both the worker and the master inherit from to simplify the code.
-
- == Next Steps ==
-
- Design the client API.
-
- One of the issues here is whether sharding should be handled solely at the client, using the API defined above. For example you could have myindex-1, myindex-2 and myindex-3 are the shards of my-index. However then the client takes responsibility for sharding, and the Master and Workers know nothing about it. The other approach would be to extend the API outlined above so that it knows about shards, so that the workers store metadata about the relationship between shards, which is then sent to the master, so the client can query it rather than inferring it.
-
- To insert data, use a consistent hashing algorithm as described here http://problemsworthyofattack.blogspot.com/2007/11/consistent-hashing.html
-
- Then provide a query operation which calls all the shards.
-
== Related Pages ==

http://wiki.apache.org/solr/DistributedSearch - Distributed search in SOLR