Fusing UltraSearch, SharePoint Portal Server and Applications with Nutch

Kreuzbube Sat, 17 Jun 2006 14:09:39 -0700

Hello all,

Before I ask my questions, I would like to describe the circumstances of my problem(s) to you.

I am developing a meta-search engine in Java with a web frontend (HTML) for our agency. Our users should have one "point of search", which searches different information sources on their behalf. These sources can be:


   * HTTP-/WebDAV-Server
   * Fileserver (SMB) with authorization check
   * Applications with documents in databases with proprietary
     authorization check

My application offers a Java interface (ISearchSource) through which different search engines can be plugged in. So far, Oracle UltraSearch and the search engine of Microsoft's SharePoint Portal Server (SPS) can be used. These search engines do the crawling, indexing, etc. In short, my application forwards the user's search request to these engines (SPS: via its web service), collects the results, and finally returns them merged to the user. Applications with a proprietary authorization check are called via web service with information about the user; the result is merged with the results of the other search engines.
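To make the setup above concrete, here is a minimal sketch of how such a plug-in interface and the merging step could look. Only the name ISearchSource comes from my application; the method names, the SearchResult class and the MetaSearch class are assumptions made up for this illustration, and relevance is assumed to be already on the application's 0..1000 scale.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical result type; the real fields are not described in this post.
final class SearchResult {
    final String source;    // name of the information source that produced the hit
    final String url;       // location of the document
    final double relevance; // assumed to be on the application's 0..1000 scale

    SearchResult(String source, String url, double relevance) {
        this.source = source;
        this.url = url;
        this.relevance = relevance;
    }
}

// Assumed shape of the plug-in interface (only the interface name is real).
interface ISearchSource {
    String name();
    List<SearchResult> search(String query, String user);
}

// Forwards a query to the selected sources and merges the hits by relevance.
final class MetaSearch {
    private final List<ISearchSource> sources;

    MetaSearch(List<ISearchSource> sources) {
        this.sources = sources;
    }

    // An empty selection means "search all sources".
    List<SearchResult> search(String query, String user, Set<String> selected) {
        List<SearchResult> merged = new ArrayList<>();
        for (ISearchSource source : sources) {
            if (selected.isEmpty() || selected.contains(source.name())) {
                merged.addAll(source.search(query, user));
            }
        }
        // Highest relevance first.
        merged.sort(Comparator.comparingDouble((SearchResult r) -> r.relevance).reversed());
        return merged;
    }
}
```

A Nutch-backed source would then be one more implementation of this interface next to the UltraSearch and SPS ones.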

The user can restrict a search request to some information sources (e.g. HTTP-server A and HTTP-server B and application C).

Now I would like to implement an interface for Nutch (to get rid of UltraSearch, perhaps SPS too). For that, I have to teach Nutch to search in different information sources (e.g. HTTP-server A, HTTP-server B, etc.). I would also like to update the indexes of the different information sources at different time intervals.

After reading the mailing list, I see two approaches (my knowledge of Nutch is very poor so far; I hope these approaches are not totally impossible):

  1. Use an index for every information source. Then instantiate an
     implementation of my interface (ISearchSource) for every index,
     each with its own NutchBean for doing the queries. The results of
     these queries are merged by my application with the results of the
     other engines.
  2. Index the different sources at different times and merge these
     "subindexes" into one "whole index". Instantiate one implementation
     of my interface with one NutchBean. For queries, the information
     source is selected by an attribute (site? url?).
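For approach 2, one conceivable way to select the information source is to compare the host part of each hit's URL against the hosts the user picked. The following is only a sketch under that assumption: the class SourceFilter and its method are invented for illustration, it uses nothing from Nutch itself, and whether Nutch can do this filtering at query time is exactly what I do not know yet.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: decides whether a hit belongs to one of the selected sources,
// assuming each HTTP source is identified by its (lower-case) host name.
final class SourceFilter {
    static boolean belongsTo(String url, Set<String> allowedHosts) {
        try {
            String host = new URI(url).getHost();
            return host != null && allowedHosts.contains(host.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            // Unparseable URLs are treated as belonging to no source.
            return false;
        }
    }
}
```

Filtering after the query like this would be wasteful on a large index, so a query-time attribute would be preferable if it exists.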

I hope I was able to describe my circumstances reasonably well. For now, I have the following problem. All search results from the different engines must have a "normalized relevance". For example, search engine A has relevance scores between 0 and 100 (0 = poor relevance, 50 = medium relevance, 100 = top relevance), search engine B between 0 and 1000. My application uses relevances between 0 and 1000, so the relevance of engine A must be stretched to a range between 0 and 1000.
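The stretching described above is a plain linear rescaling. A minimal sketch, assuming the minimum and maximum score of each engine are known and fixed (which is exactly what I do not know for Nutch); the class and method names are made up for this example:

```java
// Hypothetical helper for mapping an engine's score range onto the application's range.
final class ScoreNormalizer {
    // Linearly maps score from [srcMin, srcMax] to [dstMin, dstMax].
    // Scores outside the source range are clamped to its bounds first.
    static double rescale(double score, double srcMin, double srcMax,
                          double dstMin, double dstMax) {
        if (srcMax <= srcMin) {
            throw new IllegalArgumentException("srcMax must be greater than srcMin");
        }
        double clamped = Math.max(srcMin, Math.min(srcMax, score));
        return dstMin + (clamped - srcMin) * (dstMax - dstMin) / (srcMax - srcMin);
    }
}
```

With this, a score of 50 from engine A (range 0 to 100) becomes `ScoreNormalizer.rescale(50, 0, 100, 0, 1000)`, i.e. 500 on the application's scale. This only makes the ranges comparable; it says nothing about whether a given score means the same thing in two different engines.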


Now (at last) my questions:

  1. Is there a third approach I have missed?
  2. What is the maximum value of the score of Nutch results? I
     looked for this information in different places and ended up at a
     complex Nutch/Lucene algorithm, which to my shame I don't
     understand yet.
  3. Is the relevance uniformly distributed between the minimum and
     maximum value ("linear relevance")? By "linear relevance" I mean
     that a relevance of 50% (half of the maximum value) is half as
     relevant as a relevance score of 100% (the maximum value).
  4. Approach 2 above: When merging indexes, do I have to stop my
     search engine? In one mailing I read that this is only necessary
     when running on Windows. Is this right?

Thanks in advance for your answers, and sorry for my poor English...

Markus
