Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for 
change notification.

The following page has been changed by MarkButler:
http://wiki.apache.org/lucene-hadoop/DistributedLucene

------------------------------------------------------------------------------
  
  One potential problem is a document overwrite implemented as a delete then an 
add. More than one client doing this for the same document could result in 0 or 
2 documents, instead of 1.  I guess clients will just need to be relatively 
coordinated in their activities. Either the two clients must coordinate, to 
make sure that they're not updating the same document at the same time, or use 
a strategy where updates are routed to the slave that contained the old version 
of the document. That would require a broadcast query to figure out which slave 
that is.
  
+ Good point. Either the two clients must coordinate, to make sure that they're 
not updating the same document at the same time, or use a strategy where 
updates are routed to the slave that contained the old version of the document. 
That would require a broadcast query to figure out which slave that is.
+ 
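The routing strategy above can be sketched as follows. This is a minimal illustration, not part of any real API: the `IndexSlave` and `UpdateRouter` classes, and the `hasDoc`/`update`/`route` names, are all hypothetical. The "broadcast" is modeled as a loop over slaves; the point is that the delete-then-add for a given document lands on a single node, so two concurrent overwrites cannot leave 0 or 2 copies.

```java
import java.util.*;

// Hypothetical stand-in for a slave holding one index partition.
class IndexSlave {
    final Map<String, String> docs = new HashMap<>();
    boolean hasDoc(String id) { return docs.containsKey(id); }
    void update(String id, String content) { docs.put(id, content); }
}

// Routes an overwrite to whichever slave already holds the old version.
class UpdateRouter {
    private final List<IndexSlave> slaves;
    UpdateRouter(List<IndexSlave> slaves) { this.slaves = slaves; }

    // "Broadcast" hasDoc to every slave; send the delete+add to the one
    // that has the old version, so concurrent overwrites of the same
    // document serialize on a single node.
    IndexSlave route(String id, String content) {
        for (IndexSlave s : slaves) {
            if (s.hasDoc(id)) {
                s.update(id, content);
                return s;
            }
        }
        return null; // genuinely new document: caller picks a slave
    }
}
```
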
  4. How do additions work?
  
- It's unfortunate the master needs to be involved on every document add? That 
should not normally be the case. Clients can cache the set of writable index 
locations and directly submit new documents without involving the master.
+ The master should not be involved in adds. Clients can cache the set of 
writable index locations and directly submit new documents without involving 
the master.
+ 
+ The master should be out of the loop as much as possible. One approach is 
that clients randomly assign documents to indexes and send the updates directly 
to the indexing node. Alternately, clients might index locally, then ship the 
updates to a node packaged as an index. That was the intent of the addIndex 
method.
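The client-side caching idea might look like the sketch below. Everything here is illustrative (there is no `AddClient` in the proposal): the client holds the writable index locations it last got from the master, assigns each new document to a random one, and only goes back to the master when its cache has expired.

```java
import java.util.*;

// Hypothetical client that keeps the master out of the add path.
class AddClient {
    private List<String> writableLocations = new ArrayList<>();
    private long fetchedAt = 0;
    private static final long TTL_MS = 10 * 60 * 1000; // assumed refresh interval
    private final Random rnd = new Random();

    // Called with fresh locations from the master when the cache expires.
    void refresh(List<String> fromMaster, long nowMs) {
        writableLocations = new ArrayList<>(fromMaster);
        fetchedAt = nowMs;
    }

    boolean stale(long nowMs) { return nowMs - fetchedAt > TTL_MS; }

    // Randomly assign a new document to an index; no master involvement.
    String assign(long nowMs) {
        if (stale(nowMs) || writableLocations.isEmpty())
            throw new IllegalStateException("cache expired: re-ask the master");
        return writableLocations.get(rnd.nextInt(writableLocations.size()));
    }
}
```
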
  
  5. How do commits work?
  
  It seems like the master might want to be involved in commits too, or maybe 
we just rely on the slave-to-master heartbeat to kick off immediately after a 
commit so that index replication can be initiated? I like the latter approach. 
New versions are only published as frequently as clients poll the master for 
updated IndexLocations. Clients keep a cache of both readable and updatable 
index locations that are periodically refreshed.
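The heartbeat-driven publication flow might be sketched like this. The class and method names are hypothetical: a slave's heartbeat after a commit carries its new index version, the master records the highest version seen, and clients polling for updated IndexLocations observe the published version on their next refresh.

```java
import java.util.*;

// Hypothetical master-side bookkeeping: publication happens via heartbeats,
// not via a commit call on the master.
class ReplicationMaster {
    private final Map<String, Long> publishedVersions = new HashMap<>();

    // Heartbeat from a slave; a higher version number "publishes" it.
    void heartbeat(String slave, long indexVersion) {
        publishedVersions.merge(slave, indexVersion, Math::max);
    }

    // What polling clients would see for this slave.
    long publishedVersion(String slave) {
        return publishedVersions.getOrDefault(slave, 0L);
    }
}
```

A stale or out-of-order heartbeat cannot roll a published version back, since the master keeps the maximum it has seen.
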
  
- other comments
+ 6. Broadcasts versus IPC
  
- If deletes were broadcast, and documents could go to any partition, that 
would be one way around it (with the downside of a less powerful master that 
could no longer implement certain distribution policies). Another way to lessen 
the master-in-the-middle cost is to make sure one can aggregate small requests.
+ Currently Hadoop does not support broadcasts, and there are problems getting 
broadcasts to work across clusters. Do we need to use broadcasts, or can we use 
the same approach as HDFS and HBase?
+ 
+ 7. Finding updateable indexes
  
  Looking at 
  {{{
@@ -112, +118 @@

  }}}
  I'd assumed that the updateable version of an index does not move around very 
often. Perhaps a lease mechanism is required. For example, a call to 
getUpdateableIndex might be valid for ten minutes.
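The lease idea could be as small as the sketch below. The `Lease` class is hypothetical: `getUpdateableIndex` would return a location together with an expiry, and the client must re-ask the master once the lease lapses, so the updateable index can still move, just not out from under an active writer.

```java
// Hypothetical lease returned by getUpdateableIndex: the location is only
// trustworthy until expiresAtMs.
class Lease {
    final String indexLocation;
    final long expiresAtMs;

    Lease(String location, long grantedAtMs, long durationMs) {
        this.indexLocation = location;
        this.expiresAtMs = grantedAtMs + durationMs;
    }

    // Client checks this before each write; on expiry it re-asks the master.
    boolean valid(long nowMs) { return nowMs < expiresAtMs; }
}
```
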
  
- We might consider a delete() on the master interface too. That way it could
-     1) simply do the delete locally if there is a single index partition
-     and this is a combination master/searcher
-     2) potentially do some batching of deletes
-     3) hide the delete policy (broadcast or direct-to-server-that-has-doc)
- 
- I'm reticent to put any frequently-made call on the master. I'd prefer to 
keep the master only involved at an executive level, with all per-document and 
per-query traffic going directly from client to slave.
- 
- 
- 
- I was not imagining a real-time system, where the next query after a document 
is added would always include that document. Is that a requirement? That's 
harder.
- 
- At this point I'm mostly trying to see if this functionality would meet the 
needs of Solr, Nutch and others.
- 
- Must we include a notion of document identity and/or document version in the 
mechanism? Would that facilitate updates and coherency?
- 
- In Nutch a typical case is that you have a bunch of URLs with content that 
may-or-may-not have been previously indexed. The approach I'm currently leaning 
towards is that we'd broadcast the deletions of all of these to all slaves, 
then add them to randomly assigned indexes. In Nutch multiple clients would 
naturally be coordinated, since each url is represented only once in each 
update cycle.
- 
  === Reply from Doug ===
- 
- Yonik Seeley wrote:
- 
-     On 10/18/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
- 
-         We assume that, within an index, a file with a given name is written
-         only once.
- 
-     Is this necessary, and will we need the lockless patch (that avoids
-     renaming or rewriting *any* files), or is Lucene's current index
-     behavior sufficient?
- 
- It's not strictly required, but it would make index synchronization a lot 
simpler. Yes, I was assuming the lockless patch would be committed to Lucene 
before this project gets very far. Something more than that would be required 
in order to keep old versions, but this could be as simple as a Directory 
subclass that refuses to remove files for a time.
- 
-     The search side seems straightforward enough, but I haven't totally
-     figured out how the update side should work.
- 
- The master should be out of the loop as much as possible. One approach is 
that clients randomly assign documents to indexes and send the updates directly 
to the indexing node. Alternately, clients might index locally, then ship the 
updates to a node packaged as an index. That was the intent of the addIndex 
method.
- 
-     One potential problem is a document overwrite implemented as a delete
-     then an add.
-     More than one client doing this for the same document could result in
-     0 or 2 documents, instead of 1.  I guess clients will just need to be
-     relatively coordinated in their activities.
- 
- Good point. Either the two clients must coordinate, to make sure that they're 
not updating the same document at the same time, or use a strategy where 
updates are routed to the slave that contained the old version of the document. 
That would require a broadcast query to figure out which slave that is.
- 
-     It's unfortunate the master needs to be involved on every document add.
  
  That should not normally be the case. Clients can cache the set of writable 
index locations and directly submit new documents without involving the master.
  
@@ -280, +240 @@

  I don't think it needs to be in the interfaces, so it depends
  on the scope of the index server implementations.
  
- -Yonik
- 
  
  === Mark's comments ===
  
