Re: Unable to create lucene index

2006-10-18 Thread Fredrik Andersson

If you want to create an index, you have to pass true as the last
constructor argument to IndexWriter. The lock files use some kind of hash
for their IDs and may well persist even after you delete the directory.
So: delete the new directory (if it was ever created), delete any lock
files, change the constructor argument, and try again.
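
Something like this, assuming the Lucene 1.9/2.0 constructor you are
already using (the index path is a placeholder):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class BuildIndex {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StopAnalyzer();
    // true = create a new (empty) index at this path, overwriting any
    // existing one; false only works if an index is already there.
    IndexWriter writer = new IndexWriter("/path/to/mydir", analyzer, true);
    // ... add documents ...
    writer.close();
  }
}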

On 10/18/06, Deepa Paranjpe [EMAIL PROTECTED] wrote:


I want to create a Lucene index and I use the following code:

Analyzer analyzer = new StopAnalyzer();
IndexWriter writer = new IndexWriter( mydir, analyzer, false );

First it gives me a "Lock obtain timed out" error.
When I remove the lock file from /tmp and run it again, it says the
segments file was not found:

/segments (No such file or directory)
        at java.io.RandomAccessFile.open(Native Method)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:204)
        at org.apache.lucene.store.FSIndexInput$Descriptor.<init>(FSDirectory.java:430)
        at org.apache.lucene.store.FSIndexInput.<init>(FSDirectory.java:439)
        at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:329)
        at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:45)
        at org.apache.lucene.index.IndexWriter$1.doBody(IndexWriter.java:264)
        at org.apache.lucene.store.Lock$With.run(Lock.java:99)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:259)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:204)
        at BuildIndex.main(BuildIndex.java:28)


[PROPOSAL] index server project

2006-10-18 Thread Doug Cutting
It seems that Nutch and Solr would benefit from a shared index serving 
infrastructure.  Other Lucene-based projects might also benefit from 
this.  So perhaps we should start a new project to build such a thing. 
This could start either in java/contrib, or as a separate sub-project, 
depending on interest.


Here are some quick ideas about how this might work.

An RPC mechanism would be used to communicate between nodes (probably 
Hadoop's).  The system would be configured with a single master node 
that keeps track of where indexes are located, and a number of slave 
nodes that would maintain, search and replicate indexes.  Clients would 
talk to the master to find out which indexes to search or update, and 
then talk directly to slaves to perform the searches and updates.


Following is an outline of how this might look.

We assume that, within an index, a file with a given name is written 
only once.  Index versions are sets of files, and a new version of an 
index is likely to share most files with the prior version.  Versions 
are numbered.  An index server should keep old versions of each index 
for a while, not immediately removing old files.


public class IndexVersion {
  String id;   // unique name of the index
  int version; // the version of the index
}

public class IndexLocation {
  IndexVersion indexVersion;
  InetSocketAddress location;
}

public interface ClientToMasterProtocol {
  IndexLocation[] getSearchableIndexes();
  IndexLocation getUpdateableIndex(String id);
}

public interface ClientToSlaveProtocol {
  // normal update
  void addDocument(String index, Document doc);
  int[] removeDocuments(String index, Term term);
  void commitVersion(String index);

  // batch update
  void addIndex(String index, IndexLocation indexToAdd);

  // search
  SearchResults search(IndexVersion i, Query query, Sort sort, int n);
}
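
For example, a client update against these interfaces might look like
this (connect() is an assumed helper returning an RPC proxy, not part
of the proposal):

// Sketch of a client update: ask the master which slave handles changes
// for this index, then talk to that slave directly.
void addAndCommit(ClientToMasterProtocol master, String index, Document doc)
    throws IOException {
  IndexLocation loc = master.getUpdateableIndex(index);
  ClientToSlaveProtocol slave = connect(loc.location);  // assumed helper
  slave.addDocument(index, doc);
  slave.commitVersion(index);   // publish the change as a new version
}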

public interface SlaveToMasterProtocol {
  // sends currently searchable indexes
  // receives updated indexes that we should replicate/update
  public IndexLocation[] heartbeat(IndexVersion[] searchableIndexes);
}

public interface SlaveToSlaveProtocol {
  String[] getFileSet(IndexVersion indexVersion);
  byte[] getFileContent(IndexVersion indexVersion, String file);
  // based on experience in Hadoop, we probably wouldn't really use
  // RPC to send file content, but rather HTTP.
}
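
Since files within an index version are write-once, slave-to-slave
replication only needs to copy the files a slave is missing, e.g.:

// Sketch of replication, relying on files being write-once.
// connect(), haveLocally() and writeLocally() are assumed helpers.
void replicateFrom(IndexLocation loc) throws IOException {
  SlaveToSlaveProtocol peer = connect(loc.location);
  for (String file : peer.getFileSet(loc.indexVersion)) {
    if (!haveLocally(loc.indexVersion, file)) {
      writeLocally(loc.indexVersion, file,
                   peer.getFileContent(loc.indexVersion, file));
    }
  }
}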

The master thus maintains the set of indexes that are available for 
search, keeps track of which slave should handle changes to an index, 
and initiates index synchronization between slaves.  The master can be 
configured to replicate indexes a specified number of times.
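
Illustratively, on each heartbeat the master could assign
under-replicated versions to the reporting slave (locationOf() and the
field names are invented; IndexVersion is assumed to implement
equals/hashCode):

// Sketch of master-side bookkeeping: which slaves hold each index version.
Map<IndexVersion, Set<InetSocketAddress>> replicas =
  new HashMap<IndexVersion, Set<InetSocketAddress>>();
int replicationFactor = 3;   // configured number of copies per index

// Anything held by fewer than replicationFactor slaves (and not already
// on this slave) gets assigned to it.
IndexLocation[] assignmentsFor(InetSocketAddress slave) {
  List<IndexLocation> assign = new ArrayList<IndexLocation>();
  for (Map.Entry<IndexVersion, Set<InetSocketAddress>> e : replicas.entrySet()) {
    Set<InetSocketAddress> holders = e.getValue();
    if (holders.size() < replicationFactor && !holders.contains(slave)) {
      assign.add(locationOf(e.getKey(), holders));  // pick a source to copy from
    }
  }
  return assign.toArray(new IndexLocation[assign.size()]);
}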


The client library can cache the current set of searchable indexes and 
periodically refresh it.  Searches would be broadcast to one index with 
each id, and the merged results returned.  The client would load-balance 
both searches and updates.
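
A broadcast search from the client library might look like this
(cachedSearchableIndexes, connect() and merge() are assumptions; one
location is chosen per index id for load balancing):

// Sketch of a client-side broadcast search over the cached index set.
SearchResults searchAll(Query query, Sort sort, int n) throws IOException {
  List<SearchResults> partials = new ArrayList<SearchResults>();
  for (IndexLocation loc : cachedSearchableIndexes) {   // one per index id
    ClientToSlaveProtocol slave = connect(loc.location);
    partials.add(slave.search(loc.indexVersion, query, sort, n));
  }
  return merge(partials, n);   // combine per-index hits, keep the top n
}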


Deletions could be broadcast to all slaves.  That would probably be fast 
enough.  Alternatively, indexes could be partitioned by a hash of each 
document's unique id, permitting deletions to be routed to the 
appropriate slave.
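
With hash partitioning, the routing could be as simple as the following
(the partition table, indexIdFor() and the "id" field name are
assumptions for the sketch):

// Sketch of routing a deletion to the slave owning the document's partition.
void removeDocument(String uniqueId) throws IOException {
  int p = (uniqueId.hashCode() & 0x7fffffff) % partitions.length;
  IndexLocation loc = partitions[p];                   // assumed partition table
  ClientToSlaveProtocol slave = connect(loc.location); // assumed helper
  slave.removeDocuments(indexIdFor(p), new Term("id", uniqueId));
}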


Does this make sense?  Does it sound like it would be useful to Solr? 
To Nutch?  To others?  Who would be interested and able to work on it?


Doug


Re: [PROPOSAL] index server project

2006-10-18 Thread Yonik Seeley

On 10/18/06, Doug Cutting [EMAIL PROTECTED] wrote:

Does this make sense?  Does it sound like it would be useful to Solr?
To Nutch?  To others?  Who would be interested and able to work on it?


Rather than holding my tongue until I wrap my head around all the
issues, I'll say that I'm definitely interested!

-Yonik