Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.
The following page has been changed by MarkButler:
http://wiki.apache.org/lucene-hadoop/DistributedLucene

------------------------------------------------------------------------------
}
}}}

- == Implementation Notes ==
+ === Implementation Notes ===

- === Issues to be discussed ===
+ === Current Status ===
+
+ Currently there is an alpha implementation of the design outlined above, specifically the master, worker, client library and unit tests. This is awaiting review by HP for contribution to the Apache Foundation.
+
+ Rather than using HDFS, the implementation (DLucene) is heavily inspired by HDFS. This is because the files used in Lucene indexes are quite different from the files that HDFS was designed for. It uses a replication algorithm similar to that of HDFS, and where possible it reuses HDFS code, although it was necessary to make some local changes to the visibility of some classes and methods.
+
+ Unlike HDFS, it currently uses a stateless master. In the event of a failure, the heartbeat information sent by each worker contains a list of all the indexes it owns, together with the current status of those indexes. This means it should be possible to swap over masters. The disadvantage is that this results in more network traffic per heartbeat.
+
+ Both the master and the workers have a heartbeat architecture. Workers have three types of threads: one to service requests, one to send heartbeats to the master to inform it that the worker is alive, and one to process replication tasks. The master has two types of threads: one to service requests, the other to perform failure detection and compute a replication plan. A segment of this plan is then sent back to each worker in response to its heartbeat.
+
+ One of the aims of the implementation is to better understand how Hadoop works, so that it is possible to create an architecture that simplifies the creation of other specialized storage or processing components for Hadoop.
+
+ There are also a number of outstanding items of functionality:
+
+  * There is no process to delete old versions of indexes after a predetermined time, as in HDFS.
+  * The implementation does not take advantage of Lucene's RAM-based indexes to improve efficiency.
+  * In HDFS there is a "throttler" to control client requests. There is no equivalent functionality in DLucene.
+  * A cluster contains a number of replicas of an index, to support high availability and load balancing. When a client writes to an index, the updates are sent to one replica, which creates a new version. When the changes are committed, the new version is then propagated to the other machines in the cluster. However, if several clients update the same index at the same time, their updates need to be synchronized through the same replica. I haven't worked out a way of doing this yet. See below for discussion of some design alternatives.
+  * Replication is quicker when a worker already holds an old copy of an index, but the replication assignment algorithm does not yet take advantage of this.
+  * No benchmarks yet.
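To make the stateless-master heartbeat described above more concrete, here is a minimal sketch of what the worker-to-master exchange could look like. It is only illustrative: the names (WorkerHeartbeat, IndexState, Master.heartbeat) are assumptions rather than the actual DLucene classes, and the real implementation would carry these messages over Hadoop IPC rather than direct method calls.

{{{
// Hypothetical sketch of a stateless-master heartbeat; not the DLucene API.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

enum IndexStatus { LIVE, REPLICATING, UNCOMMITTED }

/** One index replica held by a worker, as reported in every heartbeat. */
class IndexState {
  final String indexId;
  final int version;
  final IndexStatus status;
  IndexState(String indexId, int version, IndexStatus status) {
    this.indexId = indexId; this.version = version; this.status = status;
  }
}

/** Heartbeat payload: because the master keeps no durable state, the worker
    reports every index it owns on every beat. */
class WorkerHeartbeat {
  final String workerAddress;
  final List<IndexState> indexes;
  WorkerHeartbeat(String workerAddress, List<IndexState> indexes) {
    this.workerAddress = workerAddress; this.indexes = indexes;
  }
}

/** Master side: rebuilds its view of the cluster from heartbeats alone and
    hands back the slice of the replication plan meant for that worker. */
class Master {
  private final Map<String, List<IndexState>> clusterView =
      new HashMap<String, List<IndexState>>();
  private final Map<String, List<String>> replicationPlan =
      new HashMap<String, List<String>>();

  synchronized List<String> heartbeat(WorkerHeartbeat hb) {
    clusterView.put(hb.workerAddress, hb.indexes);   // overwrite, never merge
    List<String> tasks = replicationPlan.remove(hb.workerAddress);
    return tasks != null ? tasks : new ArrayList<String>();
  }

  public static void main(String[] args) {
    Master master = new Master();
    List<IndexState> owned = new ArrayList<IndexState>();
    owned.add(new IndexState("photo-index", 3, IndexStatus.LIVE));
    System.out.println(master.heartbeat(new WorkerHeartbeat("10.0.0.5:9000", owned)));
  }
}
}}}

Because the master keeps no durable state of its own, a standby master could rebuild this view simply by collecting one round of heartbeats; the cost is the larger per-heartbeat message noted above.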
+
+ === Design decisions ===

==== 1. Broadcasts versus IPC ====

- In the previous design discussion, there was talk about using broadcasts. However Hadoop does not support broadcasts, and there are problems getting broadcasts to work across clusters. Do we need to use broadcasts or can we use the same approach as HDFS and Hbase?
+ In the original design discussion by Doug Cutting and Yonik Seeley, broadcasts were considered for deletes and searches. The current implementation uses basic Hadoop IPC rather than broadcasts.

- Current approach: do not use broadcasts.
+ ==== 2. Sharding ====
+
+ The master and workers know nothing about sharding. It is handled at the client, in order to keep the master and worker implementations simple.
+
- ==== 2. How do searches work? ====
+ ==== 3. How do searches work? ====

''Searches could be broadcast to one index with each id and return merged results. The client will load-balance both searches and updates.''

@@ -139, +162 @@

In the sharded case, it does the same thing but queries one replica of each shard. The workers return results and the client API aggregates them.

- ==== 3. How do deletions work? ====
+ ==== 4. How do deletions work? ====

''Deletions could be broadcast to all slaves. That would probably be fast enough. Alternately, indexes could be partitioned by a hash of each document's unique id, permitting deletions to be routed to the appropriate slave.''

Current approach: On non-sharded indexes, deletions are sent directly to the worker. On sharded ones, they work like the searches described above.

- ==== 4. How does update work? ====
+ ==== 5. How does update work? ====

''The master should be out of the loop as much as possible. One approach is that clients randomly assign documents to indexes and send the updates directly to the indexing node.

@@ -157, +180 @@

Therefore there is a danger that two clients could edit the same index at the same time. One possibility here would be to bind a particular version of an index to a particular client. If the client fails, that is not a problem: the changes are just uncommitted. However, there is still a danger of a race condition when there are two different branches of the same index.

- ==== 5. How do additions work? ====
+ ==== 6. How do additions work? ====

The master should not be involved in adds. Clients can cache the set of writable index locations and directly submit new documents without involving the master. The master should be out of the loop as much as possible. One approach is that clients randomly assign documents to indexes and send the updates directly to the indexing node.

'''Alternately, clients might index locally, then ship the updates to a node packaged as an index. That was the intent of the addIndex method.'''

- TODO: Think how to do this in the client API.
-
- ==== 6. How do commits work? ====
+ ==== 7. How do commits work? ====

''It seems like the master might want to be involved in commits too, or maybe we just rely on the slave to master heartbeat to kick of immediately after a commit so that index replication can be initiated? I like the latter approach. New versions are only published as frequently as clients poll the master for updated IndexLocations. Clients keep a cache of both readable and updatable index locations that are periodically refreshed.''

@@ -173, +194 @@

There are probably some subtleties about heartbeating mechanisms on large clusters. For example, if heartbeats are staggered that is good (no collisions); if they are synchronized that is bad (collisions). Does the system settle to a steady state? If so, changing heartbeat intervals could cause problems. I'm just guessing here, though. The main point is that this is a complex system, so sometimes doing things that seem obvious can have unexpected (and possibly detrimental) effects. It's important to do the math to check that the likely result is what you expect.
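As a rough illustration of the "do the math" point above, the following self-contained snippet compares the peak per-second load on the master when heartbeats are synchronized versus randomly staggered. The numbers (1,000 workers, a 10-second heartbeat period) are invented for the example and are not DLucene defaults.

{{{
// Back-of-the-envelope check: how bursty is the master's heartbeat load?
import java.util.Random;

public class HeartbeatStagger {
  public static void main(String[] args) {
    final int workers = 1000;
    final int periodSeconds = 10;
    Random random = new Random(42);

    // If every worker picks a random offset within the period, count how many
    // heartbeats land in each one-second bucket of a single period.
    int[] perSecond = new int[periodSeconds];
    for (int w = 0; w < workers; w++) {
      perSecond[random.nextInt(periodSeconds)]++;
    }

    int peak = 0;
    for (int count : perSecond) {
      peak = Math.max(peak, count);
    }

    System.out.println("Synchronized heartbeats: peak of " + workers + " requests in one second");
    System.out.println("Randomly staggered:      peak of " + peak + " requests in one second"
        + " (average " + (workers / periodSeconds) + " per second)");
  }
}
}}}

With synchronized heartbeats all 1,000 requests arrive in the same second; with random offsets the peak stays close to the 100-per-second average, which is the kind of steady state the paragraph above is asking about.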
- ==== 7. Finding updateable indexes ====
+ ==== 8. Finding updateable indexes ====

''Looking at''

{{{
@@ -183, +204 @@

It's hard to give guarantees here as nodes can fail at any time ...

- ==== 8. What is an Index ID? ====
+ ==== 9. What is an Index ID? ====

''But what is index id exactly? Looking at the example API you laid down, it must be a single physical index (as opposed to a logical index). In which case, is it entirely up to the client to manage multi-shard indicies? For example, if we had a "photo" index broken up into 3 shards, each shard would have a separate index id and it would be up to the client to know this, and to query across the different "photo0", "photo1", "photo2" indicies. The master would have no clue those indicies were related. Hmmm, that doesn't work very well for deletes though.

@@ -194, +215 @@

This comes back to how we implement sharding, discussed above ...

- ==== 9. What about SOLR? ====
+ ==== 10. What about SOLR? ====

''It depends on the project scope and how extensible things are. It seems like the master would be a WAR, capable of running stand-alone. What about index servers (slaves)? Would this project include just the interfaces to be implemented by Solr/Nutch nodes, some common implementation code behind the interfaces in the form of a library, or also complete standalone WARs? I'd need to be able to extend the ClientToSlave protocol to add additional methods for Solr (for passing in extra parameters and returning various extra data such as facets, highlighting, etc).''

- ==== 10. How does versioning work? ====
+ ==== 11. How does versioning work? ====

''Could this be done in Lucene? It would also need a way to open a specific index version (rather than just the latest), but I guess that could also be hacked into Directory by hiding later "segments" files (assumes lockless is committed).''

Current approach: At the moment, a new version is created by copying the whole index, which is expensive in disk space. When an index is replicated, however, only recent changes are copied if a local copy of the index already exists, so network traffic should be efficient.
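As a sketch of the "only copy recent changes" idea above: because Lucene index files are largely write-once, a worker that already holds an older copy of an index only needs the files it is missing. The helper below is an assumption-laden illustration (the name IndexReplicator.syncIndex is invented, and it works on locally visible directories), whereas the actual DLucene transfer would run over the worker-to-worker protocol.

{{{
// Illustrative only: bring a local copy of an index up to date by copying just
// the files that are missing or have a different length. Not the DLucene API.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class IndexReplicator {

  /** Copy the source index into localDir, skipping files that already exist
      locally with the same name and length. */
  static void syncIndex(File sourceDir, File localDir) throws IOException {
    localDir.mkdirs();
    File[] files = sourceDir.listFiles();
    if (files == null) {
      throw new IOException("Cannot list " + sourceDir);
    }
    for (File src : files) {
      File dst = new File(localDir, src.getName());
      if (dst.exists() && dst.length() == src.length()) {
        continue; // unchanged file carried over from an older version
      }
      copyFile(src, dst);
    }
  }

  private static void copyFile(File src, File dst) throws IOException {
    InputStream in = new FileInputStream(src);
    try {
      OutputStream out = new FileOutputStream(dst);
      try {
        byte[] buffer = new byte[64 * 1024];
        int read;
        while ((read = in.read(buffer)) != -1) {
          out.write(buffer, 0, read);
        }
      } finally {
        out.close();
      }
    } finally {
      in.close();
    }
  }

  public static void main(String[] args) throws IOException {
    // Usage (illustrative): java IndexReplicator /path/to/source/index /path/to/local/index
    syncIndex(new File(args[0]), new File(args[1]));
  }
}
}}}

Comparing files by name and length is a simplification: anything that can be overwritten in place (for example the segments generation file) would need a checksum or an unconditional copy in a real implementation.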
- === Mark's comments ===
-
- Rather than using HDFS, DLucene is heavily inspired by HDFS. This is because the files uses in Lucene indexes are quite different from the files that HDFS was designed for. It uses a similar replication algorithm, and where possible HDFS code although it was necessary to make some local changes to the visibility of some classes and methods.
-
- Unlike HDFS it currently uses a state less Name node. In the event of a failure, the heartbeat information sent by each worker contains a list of all indexes they own, and also the current status of those indexes. This means it should be possible to swap over masters. However the disadvantage is this will result in more network traffic per heartbeat.
-
- Both the master and workers have a heart beat architecture. On a worker heartbeat, it sends information to the master about its status. In addition, there is a second thread that performs examines a queue of replication tasks, and performs them one at a time (there may be optimisations here). On a master heartbeat, the master performs failure detection and also computes a replication plan. A segment of this plan is then sent back to the correct worker on the next heartbeat.
-
- I have an abstract node class that both the worker and the master inherit from to simplify the code.
-
- == Next Steps ==
-
- Design the client API.
-
- One of the issues here is whether sharding should be handled solely at the client, using the API defined above. For example you could have myindex-1, myindex-2 and myindex-3 are the shards of my-index. However then the client takes responsibility for sharding, and the Master and Workers know nothing about it. The other approach would be to extend the API outlined above so that it knows about shards, so that the workers store metadata about the relationship between shards, which is then sent to the master, so the client can query it rather than inferring it.
-
- To insert data, use a consistent hashing algorithm as described here http://problemsworthyofattack.blogspot.com/2007/11/consistent-hashing.html
-
- Then provide a query operation which calls all the shards.
-
== Related Pages ==

http://wiki.apache.org/solr/DistributedSearch - Distributed search in SOLR