Damn Y! mail shortcut. The link to the project is in my Lucene group: http://www.simpy.com/group/363
Otis

----- Original Message ----
From: Alexandru Popescu <[EMAIL PROTECTED]>
To: general@lucene.apache.org
Sent: Thursday, October 19, 2006 10:19:00 AM
Subject: Re: [Fwd: [PROPOSAL] index server project]

I am not sure this is (somehow) related, but I think I have noticed
some project on a Sun contest (it was the big prize winner). I cannot
retrieve it now, but hopefully somebody else will.

./alex
--
.w( the_mindstorm )p.

On 10/19/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> Hi Doug,
>
> we discussed the need of such a tool several times internally and
> developed some workarounds for nutch, so I would be definitely
> interested to contribute to such a project.
> Having a separated project that depends on hadoop would be the best
> case for our usecases.
>
> Best,
> Stefan
>
>
> On 18.10.2006 at 23:35, Doug Cutting wrote:
>
> > FYI, I just pitched a new project you might be interested in on
> > [EMAIL PROTECTED] Dunno if you subscribe to that list, so I'm
> > spamming you. If it sounds interesting, please reply there. My
> > management at Y! is interested in this, so I'm 'in'.
> >
> > Doug
> >
> > -------- Original Message --------
> > Subject: [PROPOSAL] index server project
> > Date: Wed, 18 Oct 2006 14:17:30 -0700
> > From: Doug Cutting <[EMAIL PROTECTED]>
> > Reply-To: general@lucene.apache.org
> > To: general@lucene.apache.org
> >
> > It seems that Nutch and Solr would benefit from a shared index serving
> > infrastructure. Other Lucene-based projects might also benefit from
> > this. So perhaps we should start a new project to build such a thing.
> > This could start either in java/contrib, or as a separate sub-project,
> > depending on interest.
> >
> > Here are some quick ideas about how this might work.
> >
> > An RPC mechanism would be used to communicate between nodes (probably
> > Hadoop's). The system would be configured with a single master node
> > that keeps track of where indexes are located, and a number of slave
> > nodes that would maintain, search and replicate indexes. Clients would
> > talk to the master to find out which indexes to search or update, then
> > they'll talk directly to slaves to perform searches and updates.
> >
> > Following is an outline of how this might look.
> >
> > We assume that, within an index, a file with a given name is written
> > only once. Index versions are sets of files, and a new version of an
> > index is likely to share most files with the prior version. Versions
> > are numbered. An index server should keep old versions of each index
> > for a while, not immediately removing old files.
> >
> > public class IndexVersion {
> >   String Id;   // unique name of the index
> >   int version; // the version of the index
> > }
> >
> > public class IndexLocation {
> >   IndexVersion indexVersion;
> >   InetSocketAddress location;
> > }
> >
> > public interface ClientToMasterProtocol {
> >   IndexLocation[] getSearchableIndexes();
> >   IndexLocation getUpdateableIndex(String id);
> > }
> >
> > public interface ClientToSlaveProtocol {
> >   // normal update
> >   void addDocument(String index, Document doc);
> >   int[] removeDocuments(String index, Term term);
> >   void commitVersion(String index);
> >
> >   // batch update
> >   void addIndex(String index, IndexLocation indexToAdd);
> >
> >   // search
> >   SearchResults search(IndexVersion i, Query query, Sort sort, int n);
> > }
> >
> > public interface SlaveToMasterProtocol {
> >   // sends currently searchable indexes
> >   // receives updated indexes that we should replicate/update
> >   public IndexLocation[] heartbeat(IndexVersion[] searchableIndexes);
> > }
> >
> > public interface SlaveToSlaveProtocol {
> >   String[] getFileSet(IndexVersion indexVersion);
> >   byte[] getFileContent(IndexVersion indexVersion, String file);
> >   // based on experience in Hadoop, we probably wouldn't really use
> >   // RPC to send file content, but rather HTTP.
> > }
> >
> > The master thus maintains the set of indexes that are available for
> > search, keeps track of which slave should handle changes to an index
> > and initiates index synchronization between slaves. The master can be
> > configured to replicate indexes a specified number of times.
> >
> > The client library can cache the current set of searchable indexes and
> > periodically refresh it. Searches are broadcast to one index with each
> > id and return merged results. The client will load-balance both
> > searches and updates.
> >
> > Deletions could be broadcast to all slaves. That would probably be fast
> > enough. Alternately, indexes could be partitioned by a hash of each
> > document's unique id, permitting deletions to be routed to the
> > appropriate slave.
> >
> > Does this make sense? Does it sound like it would be useful to Solr?
> > To Nutch? To others? Who would be interested and able to work on it?
> >
> > Doug
> >
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 101tec Inc.
> search tech for web 2.1
> Menlo Park, California
> http://www.101tec.com
>
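
For reference, the hash-partitioning alternative for deletions mentioned near the end of the proposal could look roughly like the sketch below. It is only an illustration under stated assumptions: a fixed, ordered list of slave addresses and exactly one slave per partition. DeletionRouter, partitionFor and slaveFor are hypothetical names and are not part of the proposed protocols.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: routes a deletion to one slave by hashing the
// document's unique id, instead of broadcasting it to every slave.
final class DeletionRouter {
    // Assumed: slave addresses in a fixed order known to every client.
    private final List<String> slaves;

    DeletionRouter(List<String> slaves) {
        if (slaves.isEmpty()) {
            throw new IllegalArgumentException("need at least one slave");
        }
        this.slaves = new ArrayList<>(slaves);
    }

    // Maps a document id to a partition number in [0, slaves.size()).
    // floorMod keeps the result non-negative even for negative hash codes.
    int partitionFor(String docId) {
        return Math.floorMod(docId.hashCode(), slaves.size());
    }

    // The slave that would receive updates and deletions for this document.
    String slaveFor(String docId) {
        return slaves.get(partitionFor(docId));
    }

    public static void main(String[] args) {
        DeletionRouter router = new DeletionRouter(
                Arrays.asList("slave0:9000", "slave1:9000", "slave2:9000"));
        for (String id : new String[] {"doc-1", "doc-2", "doc-3"}) {
            System.out.println(id + " -> " + router.slaveFor(id));
        }
    }
}
```

String.hashCode() is specified by the language, so every client computes the same partition for a given id. With replicated indexes a partition would map to a set of slaves rather than one, and changing the number of slaves remaps most ids; neither case is handled in this sketch.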