Hello all,
Before I ask my questions, I would like to describe the background of
my problem(s).
I am developing a meta search engine in Java with a web front end (HTML)
for our agency.
Our users should have a single "point of search" that searches different
information sources on their behalf. These sources can be:
* HTTP/WebDAV servers
* File servers (SMB) with authorization check
* Applications with documents in databases with a proprietary
authorization check
My application offers a Java interface (ISearchSource) through which
different search engines can be plugged in. So far, Oracle UltraSearch
and the search engine of Microsoft's SharePoint Portal Server (SPS)
can be used. These search engines do the crawling, indexing etc.
In short, my application forwards the user's search request to these
engines (SPS: via its web service), collects the results, and finally
returns them to the user as one merged list. Applications with a
proprietary authorization check are called via web service with
information about the user; their results are merged with the results
of the other search engines.
The user can restrict a search request to some information sources (e.g.
HTTP server A, HTTP server B and application C).
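
To make the setup above concrete, here is a minimal sketch of the
fan-out and merge step. Only the name ISearchSource comes from my
application; the method signatures, the SearchHit class and the
MetaSearch class are hypothetical and only illustrate the idea. The
sketch assumes every source already returns relevances in the common
range 0..1000.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical result type; the real class is not shown in this mail.
final class SearchHit {
    final String sourceId;
    final String url;
    final int relevance; // assumed to be normalized to 0..1000 already

    SearchHit(String sourceId, String url, int relevance) {
        this.sourceId = sourceId;
        this.url = url;
        this.relevance = relevance;
    }
}

// Assumed signatures; only the interface name is taken from the mail.
interface ISearchSource {
    String getId();
    List<SearchHit> search(String query, String userName);
}

final class MetaSearch {
    private final List<ISearchSource> sources;

    MetaSearch(List<ISearchSource> sources) {
        this.sources = sources;
    }

    // Queries only the selected sources and returns all hits, best first.
    List<SearchHit> search(String query, String userName, Set<String> selectedIds) {
        List<SearchHit> merged = new ArrayList<>();
        for (ISearchSource source : sources) {
            if (selectedIds.contains(source.getId())) {
                merged.addAll(source.search(query, userName));
            }
        }
        // Sort by relevance, highest first; ties keep their original order.
        merged.sort(Comparator.comparingInt((SearchHit h) -> h.relevance).reversed());
        return merged;
    }
}
```

A Nutch-backed source would then be one more ISearchSource
implementation next to the existing ones.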
Now I would like to implement an interface for Nutch (to get rid of
UltraSearch, perhaps SPS too). For that, I have to teach Nutch to search
in different information sources (e.g. HTTP server A, HTTP server B
etc.). I would also like to update the indexes of the different
information sources at different time intervals.
After reading the mailing list, I see two approaches (my knowledge of
Nutch is still very limited, so I hope these approaches are not
totally impossible):
1. Use one index per information source. Then instantiate an
implementation of my interface (ISearchSource) for every index,
each with its own NutchBean for running the queries. My application
merges the results of these queries with the results of the other
engines.
2. Index the different sources at different times and merge these
"subindexes" into one "whole index". Instantiate one implementation
of my interface with one NutchBean. For queries, the information
source is selected by an attribute (site? url?).
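
For approach 2, selecting the source "by an attribute" could look
roughly like the sketch below, which filters hit URLs by host name
after the query has run. This is only a stand-in to show the idea: the
SourceFilter class and its methods are hypothetical, no Nutch API is
used, and in a real setup the restriction would presumably belong in
the query itself rather than in a post-filter.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not part of Nutch or of my application.
final class SourceFilter {

    // Returns the lower-cased host of a URL, or null if it cannot be parsed.
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // Keeps only the URLs whose host is one of the selected (lower-case) hosts.
    static List<String> restrictToHosts(List<String> hitUrls, Set<String> selectedHosts) {
        List<String> kept = new ArrayList<>();
        for (String url : hitUrls) {
            String host = hostOf(url);
            if (host != null && selectedHosts.contains(host)) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

Filtering after the fact wastes hits that are thrown away again, so
this only makes sense as an illustration of which attribute would
identify the source.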
I hope I was able to describe my situation reasonably well. For now, I
have the following problem. All search results from the different
engines must have a "normalized relevance". For example, search engine A
has relevance scores between 0 and 100 (0 = poor relevance, 50 = medium
relevance, 100 = top relevance), search engine B between 0 and 1000. My
application uses relevances between 0 and 1000, so the relevance of
engine A must be scaled up to a range between 0 and 1000.
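
The scaling described above is a plain linear rescaling. A minimal
sketch, assuming the minimum and maximum score of an engine are known
(the ScoreNormalizer class and its names are hypothetical):

```java
// Hypothetical helper for mapping an engine's scores onto 0..1000.
final class ScoreNormalizer {
    static final int TARGET_MAX = 1000;

    // Linearly rescales score from [sourceMin, sourceMax] to [0, 1000].
    // Scores outside the source range are clamped to its bounds first.
    static int normalize(double score, double sourceMin, double sourceMax) {
        if (sourceMax <= sourceMin) {
            throw new IllegalArgumentException("sourceMax must be greater than sourceMin");
        }
        double clamped = Math.max(sourceMin, Math.min(sourceMax, score));
        double fraction = (clamped - sourceMin) / (sourceMax - sourceMin);
        return (int) Math.round(fraction * TARGET_MAX);
    }
}
```

For engine A, normalize(50, 0, 100) gives 500; for engine B the
scores pass through unchanged. The sketch only works once the maximum
score of an engine is known, which is exactly what I am missing for
Nutch.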
Now (at last) my questions:
1. Is there a third approach I have missed?
2. What is the maximum value of the score of Nutch results? I
looked for this information in different places and ended up at a
complex algorithm in Nutch/Lucene which, to my shame, I do not
understand yet.
3. Is the relevance uniformly distributed between the minimum and
maximum value ("linear relevance")? By "linear relevance" I mean
that a relevance of 50% (half of the maximum value) is half as
relevant as a relevance score of 100% (maximum value).
4. Approach 2 above: when merging indexes, do I have to stop my
search engine? In one mail I read that this is only necessary
when running on Windows. Is this right?
Thanks in advance for your answers and sorry for my poor English...
Markus