Hello all,

Before I ask my questions, I would like to describe the background of my problem.

I am developing a meta search engine in Java with a web front end (HTML) for our agency. Our users should have a single "point of search" that searches different information sources on their behalf. These sources can be:

   * HTTP-/WebDAV-Server
   * Fileserver (SMB) with authorization check
   * Applications with documents in databases with proprietary
     authorization check

My application offers a Java interface (ISearchSource) through which different search engines can be plugged in. So far, Oracle UltraSearch and the search engine of Microsoft's SharePoint Portal Server (SPS) can be used. These search engines do the crawling, indexing, etc. In short, my application forwards the user's search request to these engines (for SPS, via its web service), collects the results, and finally returns them to the user as one merged list. Applications with a proprietary authorization check are called via a web service with information about the user, and their results are merged with the results of the other search engines.

The user can restrict a search request to selected information sources (e.g. HTTP server A, HTTP server B and application C).
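
To make the plug-in and merge step above more concrete, here is a minimal sketch. Only the name ISearchSource is taken from my application; the methods name() and search(), the SearchResult class and the MetaSearch helper are simplified assumptions for illustration, and the relevance values are assumed to be already normalized to 0..1000.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

/** One hit; relevance is assumed to be already normalized to 0..1000. */
final class SearchResult {
    final String source;
    final String url;
    final int relevance;

    SearchResult(String source, String url, int relevance) {
        this.source = source;
        this.url = url;
        this.relevance = relevance;
    }
}

/** Hypothetical shape of the plug-in interface; the real one may differ. */
interface ISearchSource {
    /** Name under which the user can select this source. */
    String name();

    /** Runs the query against the underlying engine. */
    List<SearchResult> search(String query);
}

final class MetaSearch {
    /** Queries only the selected sources and returns one list, best hits first. */
    static List<SearchResult> search(String query,
                                     List<ISearchSource> sources,
                                     Set<String> selected) {
        List<SearchResult> merged = new ArrayList<>();
        for (ISearchSource source : sources) {
            if (selected.contains(source.name())) {
                merged.addAll(source.search(query));
            }
        }
        // Sort by normalized relevance, highest first.
        merged.sort(Comparator.comparingInt((SearchResult r) -> r.relevance).reversed());
        return merged;
    }
}
```

A Nutch-backed source would then be just one more implementation of this interface next to the UltraSearch and SPS ones.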

Now I would like to implement an interface for Nutch (to get rid of UltraSearch, and perhaps SPS too). For that, I have to teach Nutch to search in different information sources (e.g. HTTP server A, HTTP server B, etc.). I would also like to update the indexes of the different information sources at different time intervals.

After reading the mailing list, I see two approaches (my knowledge of Nutch is still very limited, so I hope these approaches are not totally impossible):

  1. Use one index per information source. Then instantiate an
     implementation of my interface (ISearchSource) for every index,
     each with its own NutchBean for running the queries. My
     application merges the results of these queries with the results
     of the other engines.
  2. Index the different sources at different times and merge these
     "subindexes" into one "whole index". Instantiate one implementation
     of my interface with one NutchBean. For queries, the information
     source is selected by an attribute (site? url?).

I hope I was able to describe my situation reasonably well. For now, I have the following problem: all search results from the different engines must have a "normalized relevance". For example, search engine A has relevance scores between 0 and 100 (0 = poor relevance, 50 = medium relevance, 100 = top relevance), and search engine B between 0 and 1000. My application uses relevance scores between 0 and 1000, so the scores of engine A must be stretched to that range (a score of 50 from engine A becomes 500).
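
The rescaling I have in mind is a simple linear mapping. The sketch below assumes that each engine's scores start at 0 and that its maximum score is known; the class and method names are made up for illustration. Whether Nutch has a fixed maximum score at all is exactly what I am asking below.

```java
/** Rescales an engine's native relevance score to the application's 0..1000 range. */
final class RelevanceNormalizer {
    static final int TARGET_MAX = 1000;

    /**
     * Linearly maps a score from [0, sourceMax] to [0, TARGET_MAX].
     * Scores outside the source range are clamped to it first.
     */
    static int normalize(double score, double sourceMax) {
        if (sourceMax <= 0) {
            throw new IllegalArgumentException("sourceMax must be positive");
        }
        double clamped = Math.max(0.0, Math.min(score, sourceMax));
        return (int) Math.round(clamped / sourceMax * TARGET_MAX);
    }
}
```

With this, RelevanceNormalizer.normalize(50, 100) gives 500 for engine A, while scores from engine B (maximum 1000) pass through unchanged.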

Now (at last) my questions:

  1. Is there a third approach I have missed?
  2. What is the maximum value of the score of a Nutch result? I
     looked for this information in different places and ended up at
     a complex Nutch/Lucene algorithm which, to my shame, I do not
     understand yet.
  3. Is the relevance uniformly distributed between the minimum and
     maximum value ("linear relevance")? By "linear relevance" I mean
     that a relevance of 50% (half of the maximum value) is half as
     relevant as a relevance score of 100% (the maximum value).
  4. Regarding approach 2 above: when merging indexes, do I have to
     stop my search engine? In one mailing I read that this is only
     necessary when running on Windows. Is this right?

Thanks in advance for your answers, and sorry for my poor English...

Markus
