On 10/18/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
We assume that, within an index, a file with a given name is written only once.
Is this necessary, and will we need the lockless patch (that avoids renaming or rewriting *any* files), or is Lucene's current index behavior sufficient? I like the explicit index version and keeping the last few versions around. The whole idea of a master seems to lessen the amount of manual configuration in large clusters too. The search side seems straightforward enough, but I haven't totally figured out how the update side should work.
Deletions could be broadcast to all slaves. That would probably be fast enough.
Hmmm, that does allow one to move documents around the cluster and more easily resize things. One potential problem is a document overwrite implemented as a delete then an add. More than one client doing this for the same document could result in 0 or 2 documents, instead of 1. I guess clients will just need to be relatively coordinated in their activities.
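To make the 0-or-2 outcome concrete, here is a minimal sketch. It is a toy model, not Lucene code: each partition is a plain list of ids, every class and method name is made up for illustration, and it assumes deletes are broadcast asynchronously, so a client may issue its add before its delete has reached every slave.

```java
import java.util.ArrayList;
import java.util.List;

class OverwriteRace {
    // Two empty partitions, with the original copy of "doc1" on partition 0.
    static List<List<String>> newCluster() {
        List<List<String>> cluster = new ArrayList<>();
        cluster.add(new ArrayList<>());
        cluster.add(new ArrayList<>());
        cluster.get(0).add("doc1");
        return cluster;
    }

    // Removes every copy of the id from one partition.
    static void delete(List<String> partition, String id) {
        partition.removeIf(id::equals);
    }

    // Counts copies of the id across the whole cluster.
    static int count(List<List<String>> cluster, String id) {
        int n = 0;
        for (List<String> partition : cluster) {
            for (String doc : partition) {
                if (doc.equals(id)) n++;
            }
        }
        return n;
    }

    // Both clients finish their broadcast delete before either one adds.
    static int bothDeleteThenBothAdd() {
        List<List<String>> c = newCluster();
        delete(c.get(0), "doc1"); delete(c.get(1), "doc1"); // client A's delete
        delete(c.get(0), "doc1"); delete(c.get(1), "doc1"); // client B's delete
        c.get(0).add("doc1");                               // client A's add
        c.get(1).add("doc1");                               // client B's add
        return count(c, "doc1");
    }

    // Each client's delete reaches the other client's partition late.
    static int deletesArriveLate() {
        List<List<String>> c = newCluster();
        delete(c.get(0), "doc1"); // A's delete reaches partition 0
        delete(c.get(1), "doc1"); // B's delete reaches partition 1
        c.get(0).add("doc1");     // A adds to partition 0
        c.get(1).add("doc1");     // B adds to partition 1
        delete(c.get(1), "doc1"); // A's delayed delete removes B's copy
        delete(c.get(0), "doc1"); // B's delayed delete removes A's copy
        return count(c, "doc1");
    }

    public static void main(String[] args) {
        System.out.println("deletes first:  " + bothDeleteThenBothAdd()); // 2 copies
        System.out.println("deletes late:   " + deletesArriveLate());     // 0 copies
    }
}
```

Serializing the two overwrites (one client's delete and add completing before the other starts) leaves exactly one copy, which is the coordination the clients would have to provide.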
Alternately, indexes could be partitioned by a hash of each document's unique id, permitting deletions to be routed to the appropriate slave.
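A minimal sketch of what such routing might look like, assuming the simplest possible rule (the id's hash modulo the number of partitions). The class and method names are hypothetical and nothing here is an existing Lucene or Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;

class HashRouting {
    // Hypothetical routing rule: non-negative hash of the unique id, modulo the partition count.
    static int partitionFor(String id, int numPartitions) {
        return Math.floorMod(id.hashCode(), numPartitions);
    }

    // Counts how many ids map to a different partition when the partition count changes.
    static int moved(List<String> ids, int oldCount, int newCount) {
        int moved = 0;
        for (String id : ids) {
            if (partitionFor(id, oldCount) != partitionFor(id, newCount)) moved++;
        }
        return moved;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 1000; i++) ids.add("doc-" + i);

        // A delete for "doc-42" only needs to go to this one partition.
        System.out.println("doc-42 -> partition " + partitionFor("doc-42", 4));

        // The mapping depends on the partition count, so changing it relocates ids.
        System.out.println(moved(ids, 4, 5) + " of " + ids.size()
                + " ids change partition going from 4 to 5 partitions");
    }
}
```

With this rule the same id always routes to the same partition for a fixed partition count, which is what lets a delete skip the other slaves; the mapping is only stable while that count stays the same.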
A hash is nice, but then you can't resize the number of partitions your index is split into. It's unfortunate the master needs to be involved on every document add. If deletes were broadcast, and documents could go to any partition, that would be one way around it (with the downside of a less powerful master that could implement certain distribution policies).

Another way to lessen the master-in-the-middle cost is to make sure one can aggregate small requests:

  IndexLocation getUpdateableIndex(String id);

We might consider a delete() on the master interface too. That way it could
1) simply do the delete locally if there is a single index partition and this is a combination master/searcher
2) potentially do some batching of deletes
3) hide the delete policy (broadcast or direct-to-server-that-has-doc)

It seems like the master might want to be involved in commits too, or maybe we just rely on the slave-to-master heartbeat to kick off immediately after a commit so that index replication can be initiated?
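A rough sketch of how a delete() on the master could hide the policy and batch requests. Everything here is hypothetical: slaves are represented by plain name strings, the DeletePolicy interface and the Master class are invented for illustration, and flush() returns the batches it would send instead of doing any real RPC.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MasterSketch {
    /** Decides which slaves must see a delete; callers of the master never see this. */
    interface DeletePolicy {
        List<String> targets(String id, List<String> slaves);
    }

    /** Broadcast policy: every delete goes to every slave. */
    static final DeletePolicy BROADCAST = (id, slaves) -> slaves;

    /** Routed policy: the delete goes only to the slave picked by a hash of the id. */
    static final DeletePolicy ROUTED = (id, slaves) ->
            Collections.singletonList(slaves.get(Math.floorMod(id.hashCode(), slaves.size())));

    static final class Master {
        private final List<String> slaves;
        private final DeletePolicy policy;
        private final List<String> pending = new ArrayList<>();

        Master(List<String> slaves, DeletePolicy policy) {
            this.slaves = slaves;
            this.policy = policy;
        }

        /** Queues a delete; nothing is sent until flush(). */
        void delete(String id) {
            pending.add(id);
        }

        /** Groups queued deletes per slave, so each slave would receive one batched request. */
        Map<String, List<String>> flush() {
            Map<String, List<String>> batches = new LinkedHashMap<>();
            for (String id : pending) {
                for (String slave : policy.targets(id, slaves)) {
                    batches.computeIfAbsent(slave, s -> new ArrayList<>()).add(id);
                }
            }
            pending.clear();
            return batches;
        }
    }

    public static void main(String[] args) {
        Master master = new Master(Arrays.asList("slave1", "slave2"), BROADCAST);
        master.delete("doc1");
        master.delete("doc2");
        System.out.println(master.flush());
    }
}
```

With a single slave both policies pick the same target, which corresponds to the combined master/searcher case where the delete could simply be applied locally.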
Does this make sense? Does it sound like it would be useful to Solr? To Nutch? To others? Who would be interested and able to work on it?
Still interested, and able :-)

-Yonik