Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The following page has been changed by MarkButler:
http://wiki.apache.org/hadoop/DistributedLucene

------------------------------------------------------------------------------
  = Distributed Lucene =
+ 
+ This work has now been superseded by the Katta project.
+ 
+ Katta project - http://www.sourceforge.net/projects/katta
  
  Doug Cutting's original proposal: 
http://www.mail-archive.com/[email protected]/msg00338.html
  
+ Also see
- Code for this work is now available here:
- https://issues.apache.org/jira/browse/HADOOP-3394
  
- Also see
  Bailey project - http://www.sourceforge.net/projects/bailey
- Katta project - http://www.sourceforge.net/projects/katta
+ 
  Contrib for updating indexes using MapReduce - 
https://issues.apache.org/jira/browse/HADOOP-2951
  
- === Implementation Notes (Obsolete, retained for comments) ===
- 
- === Current Status ===
- 
- Currently there is an alpha implementation of the design outlined above, specifically the master, worker, client library, and unit tests. 
- 
- Rather than using HDFS itself, the implementation (DLucene) is heavily inspired by it, because the files used in Lucene indexes are quite different from the files that HDFS was designed for. It uses a replication algorithm similar to HDFS's, and where possible it reuses HDFS code, although it was necessary to make some local changes to the visibility of some classes and methods. 
- 
- Unlike HDFS, it currently uses a stateless Master. In the event of a failure, the heartbeat information sent by each worker contains a list of all the indexes it holds, together with the current status of those indexes, so it should be possible to swap over masters. The disadvantage is that this results in more network traffic per heartbeat.
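- 
- As a minimal sketch of what such a heartbeat payload might contain (the class and field names here are illustrative, not the actual DLucene types):
- {{{
- import java.util.List;
- 
- // Illustrative sketch only: a heartbeat report carrying full index state, so a
- // freshly started master can rebuild its view from worker heartbeats alone.
- public class WorkerHeartbeat {
- 
-     public enum IndexState { LIVE, UNCOMMITTED, REPLICATING }
- 
-     public static class IndexStatus {
-         public final String indexId;   // e.g. "myindex-1"
-         public final int version;      // latest version held by this worker
-         public final IndexState state;
- 
-         public IndexStatus(String indexId, int version, IndexState state) {
-             this.indexId = indexId;
-             this.version = version;
-             this.state = state;
-         }
-     }
- 
-     public final String workerAddress;       // host:port of the worker
-     public final List<IndexStatus> indexes;  // every index replica this worker holds
- 
-     public WorkerHeartbeat(String workerAddress, List<IndexStatus> indexes) {
-         this.workerAddress = workerAddress;
-         this.indexes = indexes;
-     }
- }
- }}}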
- 
- Both the master and the workers have a heartbeat architecture. Workers have three types of threads: one to service requests, one to send heartbeats to the master to inform it that the worker is alive, and one to process replication tasks. The master has two types of threads: one to service requests, the other to perform failure detection and compute a replication plan. A segment of this plan is then sent back to each worker in response to its heartbeat.
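- 
- A rough sketch of how the worker-side threads could be wired up (WorkerThreads, sendHeartbeat() and processNextReplicationTask() are assumed placeholder names, not the actual code):
- {{{
- import java.util.concurrent.Executors;
- import java.util.concurrent.ScheduledExecutorService;
- import java.util.concurrent.TimeUnit;
- 
- // Rough sketch of the three worker-side threads described above.
- public class WorkerThreads {
- 
-     private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
- 
-     public void start() {
-         // 1. Request-servicing thread: blocks serving client RPCs.
-         new Thread(WorkerThreads::serveRequests, "rpc-server").start();
- 
-         // 2. Heartbeat thread: tell the master this worker is alive every few seconds.
-         scheduler.scheduleAtFixedRate(WorkerThreads::sendHeartbeat, 0, 5, TimeUnit.SECONDS);
- 
-         // 3. Replication thread: work through tasks handed back by the master.
-         scheduler.scheduleAtFixedRate(WorkerThreads::processNextReplicationTask, 0, 1, TimeUnit.SECONDS);
-     }
- 
-     private static void serveRequests() { /* accept and service client requests */ }
-     private static void sendHeartbeat() { /* send index list and status to the master */ }
-     private static void processNextReplicationTask() { /* copy index files from a peer */ }
- }
- }}}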
- 
- One of the aims of the implementation is to better understand how Hadoop 
works, so that it is possible to create an architecture to simplify the 
creation of other specialized storage or processing components for Hadoop. 
- 
- There are also a number of outstanding items of functionality:
- 
-    * There is no process to delete old versions of indexes after a predetermined time, as there is in HDFS. 
-    * The implementation does not take advantage of Lucene's RAM-based indexes to improve efficiency.
-    * In HDFS there is a "throttler" to control client requests. There is no equivalent functionality in DLucene.
-    * A cluster contains a number of replicas of an index, to support high availability and load balancing. When a client writes to an index, the updates are sent to one replica, which creates a new version. When the changes are committed, the new version is then propagated to the other machines in the cluster. However, if several clients update the same index at the same time, their updates need to be synchronized on the same replica. I haven't worked out a way of doing this yet. See below for some discussion of design alternatives.
-    * Replication is quicker when a worker already has an old copy of an index, but the replication assignment algorithm does not yet take advantage of this. 
-    * No benchmarks yet.
- 
- === Design decisions ===
- 
- ==== 1. Broadcasts versus IPC ====
- 
- In the original design discussion by Doug Cutting and Yonik Seeley, 
broadcasts were considered for deletes and searches. The current implementation 
uses basic Hadoop IPC rather than broadcasts. 
- 
- ==== 2. Sharding ====
- 
- The master and workers know nothing about sharding; it is handled at the client, in order to simplify the master and worker implementations. 
- 
- ==== 3. How do searches work? ====
- 
- ''Searches could be broadcast to one index with each id and return merged 
results. The client will load-balance both searches and updates.''
- 
- Current approach: The master and the workers know nothing about shards, so sharding needs to be implemented in the client API. One way to do this is to adopt a convention that stores the index name and shard number as a composite value in the index ID, for example
- {{{
- myindex-1
- myindex-2
- myindex-3
- }}} 
- then, to perform a search in the non-sharded case, the client gets a list of all index locations and identifies the set of possible locations it could query. It selects a replica at random to query, performing load balancing in the process. 
- 
- In the sharded case, it does the same thing but queries one replica of each 
shard. The workers return results and the client API aggregates them. 
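- 
- A client-side sketch of that convention, assuming hypothetical helper names rather than the real client API: build the shard IDs from the index name, pick one replica of each shard at random, query each, and merge the results.
- {{{
- import java.util.ArrayList;
- import java.util.List;
- import java.util.Map;
- import java.util.Random;
- 
- // Sketch of client-side shard routing under the "name-shardNumber" convention.
- // searchOneReplica() and the replica map are assumed placeholders.
- public class ShardedSearchClient {
- 
-     private final Random random = new Random();
- 
-     // replicasByIndexId: indexId ("myindex-2") -> workers holding a replica of it
-     public List<String> search(Map<String, List<String>> replicasByIndexId,
-                                String indexName, int numShards, String query) {
-         List<String> merged = new ArrayList<>();
-         for (int shard = 1; shard <= numShards; shard++) {
-             String indexId = indexName + "-" + shard;        // composite ID convention
-             List<String> replicas = replicasByIndexId.get(indexId);
-             // Pick one replica at random: this is where load balancing happens.
-             String worker = replicas.get(random.nextInt(replicas.size()));
-             merged.addAll(searchOneReplica(worker, indexId, query));
-         }
-         return merged;                                        // client-side aggregation
-     }
- 
-     private List<String> searchOneReplica(String worker, String indexId, String query) {
-         return new ArrayList<>();                             // placeholder for the RPC call
-     }
- }
- }}}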
- 
- ==== 4. How do deletions work? ====
- 
- ''Deletions could be broadcast to all slaves. That would probably be fast 
enough. Alternately, indexes could be partitioned by a hash of each document's 
unique id, permitting deletions to be routed to the appropriate slave.''
- 
- Current approach: On non-sharded indexes, deletions are sent directly to the 
worker. On sharded ones, they work like searches described above. 
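- 
- If deletions were instead routed by a hash of the document's unique id, as suggested in the quote above, the client-side routing could look roughly like this (a sketch only, reusing the composite ID convention; shardFor is a hypothetical helper):
- {{{
- // Sketch: route a delete to the shard chosen by hashing the document's unique id,
- // so every delete for a given document always reaches the same shard.
- public class DeleteRouter {
- 
-     public static String shardFor(String indexName, int numShards, String docId) {
-         int shard = Math.floorMod(docId.hashCode(), numShards) + 1;  // shards numbered 1..numShards
-         return indexName + "-" + shard;
-     }
- 
-     public static void main(String[] args) {
-         System.out.println(shardFor("myindex", 3, "doc-42"));  // always the same shard for this document
-     }
- }
- }}}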
- 
- ==== 5. How does update work? ====
- 
- ''The master should be out of the loop as much as possible. One approach is 
that clients randomly assign documents to indexes and send the updates directly 
to the indexing node. 
- 
- One potential problem is a document overwrite implemented as a delete then an add. More than one client doing this for the same document could result in 0 or 2 documents, instead of 1.  I guess clients will just need to be relatively coordinated in their activities.
- 
- Good point. Either the two clients must coordinate, to make sure that they're 
not updating the same document at the same time, or use a strategy where 
updates are routed to the slave that contained the old version of the document. 
That would require a broadcast query to figure out which slave that is.''
- 
- This needs some discussion. Currently the implementation uses a versioning approach: when a client starts to make changes to an index, the relevant worker copies the index, increments the version number, and sets the IndexState to UNCOMMITTED. The client can then add and delete documents or add a remote index. When the client commits the index, it becomes available to be searched and replicated. 
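- 
- A sketch of that lifecycle as a small state machine (IndexState and UNCOMMITTED come from the description above; the other names, fields and methods are assumptions):
- {{{
- // Sketch of the versioned-update lifecycle: copy on edit, commit to publish.
- public class IndexVersion {
- 
-     public enum IndexState { LIVE, UNCOMMITTED }
- 
-     private final String indexId;
-     private final int version;
-     private IndexState state = IndexState.LIVE;
- 
-     public IndexVersion(String indexId, int version) {
-         this.indexId = indexId;
-         this.version = version;
-     }
- 
-     // A client asks to edit: copy the index, bump the version, mark it uncommitted.
-     public IndexVersion startUpdate() {
-         IndexVersion copy = new IndexVersion(indexId, version + 1);
-         copy.state = IndexState.UNCOMMITTED;
-         return copy;
-     }
- 
-     // Commit: the new version becomes searchable and eligible for replication.
-     public void commit() {
-         if (state != IndexState.UNCOMMITTED) {
-             throw new IllegalStateException("only an uncommitted version can be committed");
-         }
-         state = IndexState.LIVE;
-     }
- }
- }}}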
- 
- Therefore there is a danger that two clients could edit the same index at the same time. One possibility here would be to bind a particular version of an index to a particular client. If the client fails, that is not a problem: the changes are simply never committed. However, there is still a danger of a race condition producing two different branches of the same index. 
- 
- ==== 6. How do additions work? ====
- 
- The master should not be involved in adds. Clients can cache the set of 
writable index locations and directly submit new documents without involving 
the master.
- 
- The master should be out of the loop as much as possible. One approach is 
that clients randomly assign documents to indexes and send the updates directly 
to the indexing node. '''Alternately, clients might index locally, then ship 
the updates to a node packaged as an index. That was the intent of the addIndex 
method.'''
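- 
- A sketch of the client-side caching idea for adds (the refresh interval, helper names and single-location choice are all assumptions; the addIndex alternative is not shown):
- {{{
- import java.util.List;
- import java.util.concurrent.TimeUnit;
- 
- // Sketch: the client caches the set of writable index locations so that adds can
- // go straight to a worker without a master round trip on every document.
- public class WritableLocationCache {
- 
-     private static final long REFRESH_NANOS = TimeUnit.MINUTES.toNanos(10);
- 
-     private List<String> writableLocations;  // worker addresses currently accepting adds
-     private long fetchedAtNanos;
- 
-     public synchronized void addDocument(String indexId, String document) {
-         if (writableLocations == null || System.nanoTime() - fetchedAtNanos > REFRESH_NANOS) {
-             writableLocations = fetchWritableLocationsFromMaster(indexId);  // occasional master call
-             fetchedAtNanos = System.nanoTime();
-         }
-         // Send the add directly to a cached worker, keeping the master out of the loop.
-         sendAdd(writableLocations.get(0), indexId, document);
-     }
- 
-     private List<String> fetchWritableLocationsFromMaster(String indexId) {
-         return List.of("worker-1:9001");  // placeholder for the real master RPC
-     }
- 
-     private void sendAdd(String worker, String indexId, String document) {
-         /* placeholder for the RPC to the worker */
-     }
- }
- }}}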
- 
- ==== 7. How do commits work? ====
- 
- ''It seems like the master might want to be involved in commits too, or maybe we just rely on the slave-to-master heartbeat to kick off immediately after a commit so that index replication can be initiated? I like the latter approach. New versions are only published as frequently as clients poll the master for updated IndexLocations. Clients keep a cache of both readable and updatable index locations that are periodically refreshed.''
- 
- Current approach: When an index is committed, replication begins at the next heartbeat. It might be possible to bring this forward, but realistically I couldn't see any advantage in doing so. 
- 
- There are probably some subtleties about heartbeating mechanisms on large clusters. For example, staggered heartbeats are good (no collisions), whereas synchronized heartbeats are bad (collisions). Does the system settle to a steady state? If so, changing heartbeat intervals could cause problems. I'm just guessing here, though. The main point is that this is a complex system, so doing things that seem obvious can sometimes have unexpected (and possibly detrimental) effects. It's important to do the math to check that the likely result is what you expect. 
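- 
- As a toy illustration of "doing the math" (the numbers are entirely made up and have nothing to do with DLucene's actual parameters), a birthday-style estimate of heartbeat collisions:
- {{{
- // Toy calculation: if each of N workers picks an independent random offset among S
- // one-millisecond slots per heartbeat interval, estimate the probability that at
- // least two heartbeats land in the same slot.
- public class HeartbeatCollisionEstimate {
- 
-     public static double probabilityOfSomeCollision(int workers, int slots) {
-         double pNoCollision = 1.0;
-         for (int i = 0; i < workers; i++) {
-             pNoCollision *= (slots - i) / (double) slots;  // the i-th worker must avoid i used slots
-         }
-         return 1.0 - pNoCollision;
-     }
- 
-     public static void main(String[] args) {
-         // Example: 100 workers, 10-second heartbeat interval = 10,000 one-ms slots.
-         System.out.printf("P(some collision) ~ %.2f%n", probabilityOfSomeCollision(100, 10_000));
-     }
- }
- }}}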
- 
- ==== 8. Finding updateable indexes ====
- 
- ''Looking at''
- {{{
-        IndexLocation[] getUpdateableIndex(String[] id);
- }}}
- ''I'd assumed that the updateable version of an index does not move around 
very often. Perhaps a lease mechanism is required. For example, a call to 
getUpdateableIndex might be valid for ten minutes.''
- 
- It's hard to give guarantees here as nodes can fail at any time ...
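- 
- One way to express the lease idea from the quote (purely a sketch; the ten-minute figure comes from the quote, everything else here is assumed):
- {{{
- // Sketch of a leased index location: the master stamps an expiry on the result of
- // getUpdateableIndex, and the client must ask again once the lease has run out.
- public class LeasedIndexLocation {
- 
-     public final String indexId;
-     public final String workerAddress;
-     private final long expiresAtMillis;
- 
-     public LeasedIndexLocation(String indexId, String workerAddress, long leaseMillis) {
-         this.indexId = indexId;
-         this.workerAddress = workerAddress;
-         this.expiresAtMillis = System.currentTimeMillis() + leaseMillis;  // e.g. ten minutes
-     }
- 
-     public boolean isStillValid() {
-         return System.currentTimeMillis() < expiresAtMillis;
-     }
- }
- }}}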
- 
- ==== 9. What is an Index ID? ====
- 
- ''But what is index id exactly?  Looking at the example API you laid down, it must be a single physical index (as opposed to a logical index).  In which case, is it entirely up to the client to manage multi-shard indices?  For example, if we had a "photo" index broken up into 3 shards, each shard would have a separate index id and it would be up to the client to know this, and to query across the different "photo0", "photo1", "photo2" indices.  The master would have no clue those indices were related.  Hmmm, that doesn't work very well for deletes though.
- 
- It seems like there should be the concept of a logical index that is composed of multiple shards, where each shard has multiple copies.
- 
- Or were you thinking that a cluster would only contain a single logical index, and hence all the different index ids are simply different shards of that single logical index?  That would seem to be consistent with ClientToMasterProtocol.getSearchableIndexes() lacking an id argument.''
- 
- This comes back to how we implement sharding discussed above ...
- 
- ==== 10. What about SOLR? ====
- 
- ''It depends on the project scope and how extensible things are. It seems 
like the master would be a WAR, capable of running stand-alone. What about 
index servers (slaves)?  Would this project include just the interfaces to be 
implemented by Solr/Nutch nodes, some common implementation code behind the 
interfaces in the form of a library, or also complete standalone WARs?
- 
- I'd need to be able to extend the ClientToSlave protocol to add additional 
methods for Solr (for passing in extra parameters and returning various extra 
data such as facets, highlighting, etc).''
- 
- ==== 11. How does versioning work? ====
- 
- ''Could this be done in Lucene? It would also need a way to open a specific 
index version (rather than just the latest), but I guess that could also be 
hacked into Directory by hiding later "segments" files (assumes lockless is 
committed).''
- 
- Current approach: At the moment, it simply copies the entire index to create a new version, which is going to be expensive in disk space. When it replicates indexes, however, it only copies recent changes if a local copy of the index already exists, so network traffic should be kept low. 
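- 
- A sketch of that "only copy what is missing" replication step, treating both index directories as locally mounted paths for simplicity and comparing by file name only (the real code would also have to cope with deleted and rewritten files):
- {{{
- import java.io.IOException;
- import java.nio.file.Files;
- import java.nio.file.Path;
- import java.util.HashSet;
- import java.util.Set;
- import java.util.stream.Stream;
- 
- // Sketch: when replicating a new index version, copy only the files the local
- // replica does not already have, so network traffic stays proportional to the delta.
- public class IncrementalIndexCopy {
- 
-     public static void copyMissingFiles(Path sourceIndexDir, Path localIndexDir) throws IOException {
-         Set<String> alreadyHave = new HashSet<>();
-         try (Stream<Path> local = Files.list(localIndexDir)) {
-             local.forEach(p -> alreadyHave.add(p.getFileName().toString()));
-         }
-         try (Stream<Path> source = Files.list(sourceIndexDir)) {
-             for (Path file : (Iterable<Path>) source::iterator) {
-                 if (!alreadyHave.contains(file.getFileName().toString())) {
-                     Files.copy(file, localIndexDir.resolve(file.getFileName().toString()));
-                 }
-             }
-         }
-     }
- }
- }}}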
- 
- == Related Pages ==
- 
- http://wiki.apache.org/solr/DistributedSearch - Distributed search in SOLR
- 
- http://wiki.apache.org/solr/CollectionDistribution - Collection distribution 
in SOLR
- 
