> -----Original Message-----
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
>
> You could use HitCollector for this:
> http://lucene.apache.org/java/docs/api/org/apache/lucene/search/HitCollector.html
>
After playing around, I'm a bit stuck :-\
I use Lucene as a client/server application with the help of
RemoteSearchable and MultiSearcher.
My first approach was a client-side wrapper around Hits that only
delivers hits with a "good" score.
+ easy to implement
+ works on normalized scores
- poor performance
The test query was:
(NAME:peter AUTHOR:peter^0.9 NAME_AUTHOR:peter^0.6 SUBTITLE:peter^0.2)
LANG_PRIO:100^0.0010
Because of "LANG_PRIO:100^0.0010", Lucene found ~200,000 hits (~85% of the
documents have LANG_PRIO=100).
In the wrapper class I determine the real length() of Hits (excluding the docs
below myThresh) with a kind of binary search:
private int getLength(int nFrom, int nTo) {
    // Hits are sorted by descending score; nTo is an exclusive upper bound.
    int nHalf = (nFrom + (nTo - nFrom) / 2);
    if (nFrom == nTo) return nFrom;
    if (score(nHalf) * 100 < myThresh) {
        return getLength(nFrom, nHalf);
    }
    return getLength(nHalf + 1, nTo);
}
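For readers who want to try the cutoff search without a Hits object, here is a minimal iterative sketch of the same idea. It assumes the scores are already in a float array sorted in descending order; the class and method names are made up for illustration and are not part of Lucene.

```java
// Hypothetical stand-in for the Hits wrapper: scores must be sorted in descending order.
class ThresholdCutoff {
    /** Returns how many leading scores satisfy score * 100 >= threshPercent. */
    static int cutoff(float[] scores, float threshPercent) {
        int from = 0;
        int to = scores.length; // exclusive upper bound, like length()
        while (from < to) {
            int half = from + (to - from) / 2;
            if (scores[half] * 100 < threshPercent) {
                to = half;       // half is already below the threshold
            } else {
                from = half + 1; // half is still good, look further right
            }
        }
        return from;
    }

    public static void main(String[] args) {
        float[] scores = {1.0f, 0.8f, 0.5f, 0.2f, 0.1f};
        System.out.println(cutoff(scores, 30f)); // prints 3
    }
}
```

The loop needs about log2(n) score lookups, but with Hits each lookup far into the result list can trigger a new server-side search, which is where the cost comes from.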
On the server side the wrapper results in two IndexSearcher calls:
search([EMAIL PROTECTED], null, 100)
search: 391ms
search([EMAIL PROTECTED], null, 220420)
search: 813ms
I think "getMoreDocs(int min)" doesn't work well with my queries, because it
prefetches too many TopDocs:
int n = min * 2; // double # retrieved
Additionally, "getMoreDocs()" scores all docs again on every call, so some work
is repeated that was already done in the first call.
It's a bit tricky to know in advance how many docs are needed :-\
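A toy model of that doubling, to show where the large second request comes from. This is not Lucene's code; it only reuses the "min * 2" line quoted above, and the class and method names are hypothetical. If I read the log right, the first probe of the binary search lands on index 110210, and doubling that gives the 220420 seen in the second search call.

```java
// Toy model of the prefetch behaviour described above, not Lucene's actual code.
class PrefetchModel {
    /** Number of documents requested from the searcher when hit number min is needed. */
    static int requested(int min) {
        return min * 2; // the doubling line quoted from getMoreDocs
    }

    /**
     * Sums the documents requested over a sequence of accessed hit indexes,
     * assuming every request starts again at the first document.
     */
    static long totalRequested(int[] accessedIndexes, int initiallyCached) {
        int cached = initiallyCached;
        long total = 0;
        for (int index : accessedIndexes) {
            if (index >= cached) {
                int n = requested(index);
                total += n; // earlier work is repeated, not reused
                cached = n;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(requested(110210)); // prints 220420
    }
}
```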
My second try was to use a ThresholdHitCollector.
When calling
searcher.search(query, filter, new ThresholdHitCollector(...));
I got the following exception:
java.io.NotSerializableException: org.apache.lucene.search.MultiSearcher$1
java.rmi.MarshalException: error marshalling arguments; nested exception is:
java.io.NotSerializableException:
org.apache.lucene.search.MultiSearcher$1
at sun.rmi.server.UnicastRef.invoke(Unknown Source)
at org.apache.lucene.search.RemoteSearchable_Stub.search(Unknown Source)
at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:245)
at org.apache.lucene.search.Searcher.search(Searcher.java:110)
...
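As far as I can tell, the exception means that an anonymous inner class of MultiSearcher (MultiSearcher$1) is passed as an argument over RMI, and RMI can only marshal Serializable (or remote) objects. A minimal sketch that reproduces the same kind of failure with plain Java serialization; the Collector interface here is a hypothetical stand-in, not Lucene's HitCollector.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

class MarshalDemo {
    // Hypothetical stand-in for a collector callback; it does not extend Serializable.
    interface Collector {
        void collect(int doc, float score);
    }

    /** Returns true if Java serialization accepts the object, false if it is not serializable. */
    static boolean canMarshal(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Collector anonymous = new Collector() {
            @Override
            public void collect(int doc, float score) { }
        };
        System.out.println(canMarshal(anonymous));  // prints false
        System.out.println(canMarshal("a string")); // prints true
    }
}
```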
My current approach is to call
searcher.search(query, filter);
on the client side and to subclass IndexSearcher on the server side.
The class MyIndexSearcher uses the ThresholdHitCollector:
public TopDocs search(Weight weight, Filter filter, final int nDocs)
        throws IOException {
    // nDocs is ignored. return all TopDocs instead
    Scorer scorer = weight.scorer(getIndexReader());
    if (scorer == null) return new TopDocs(0, new ScoreDoc[0]);
    ThresholdHitCollector hc = new ThresholdHitCollector();
    hc.setScoreThreshold(0.0025f);
    hc.setFilter(filter);
    scorer.score(hc);
    return new TopDocs(hc.getTotalHits(), hc.getScoreDocs());
}
search([EMAIL PROTECTED], null, 50)
search: 234ms
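To make the idea behind the ThresholdHitCollector concrete, here is a standalone sketch of what such a collector does. It is not my real class: the real one extends Lucene's HitCollector and returns ScoreDoc[]. The Hit type, the BitSet used in place of a Filter, and setAllowedDocs are stand-ins invented so the example runs without Lucene.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Standalone sketch; the types below are stand-ins for the Lucene classes.
class ThresholdHitCollector {
    /** Stand-in for Lucene's ScoreDoc: document id plus raw score. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final List<Hit> hits = new ArrayList<>();
    private float scoreThreshold = 0.0f;
    private BitSet allowedDocs; // stand-in for a Filter's bits; null means no filter

    void setScoreThreshold(float threshold) { scoreThreshold = threshold; }
    void setAllowedDocs(BitSet bits) { allowedDocs = bits; }

    /** Called once per matching document with its raw (unnormalized) score. */
    void collect(int doc, float score) {
        if (score < scoreThreshold) return;                        // below threshold
        if (allowedDocs != null && !allowedDocs.get(doc)) return;  // filtered out
        hits.add(new Hit(doc, score));
    }

    int getTotalHits() { return hits.size(); }

    /** Kept hits, best score first; ties are broken by lower document id. */
    List<Hit> getScoreDocs() {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> a.score != b.score
                ? Float.compare(b.score, a.score)
                : Integer.compare(a.doc, b.doc));
        return sorted;
    }
}
```

Because low-scoring documents are dropped inside collect(), nothing below the threshold is ever sorted or sent back to the client.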
Unfortunately this solution has two disadvantages:
- the threshold works on raw scores
- Lucene has to be patched (access privileges, making Hits an interface, ...)
+ but: good performance (for me)
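To illustrate the raw-score problem: as far as I understand, Hits scales all scores by 1/maxScore when the best raw score is above 1.0, so the same raw threshold means something different for every query. A minimal sketch of applying the threshold after such a scaling, assuming that rule holds; the class and method names are hypothetical and nothing here is Lucene API.

```java
class ScoreNormalizer {
    /** Scales raw scores so the best one is at most 1.0 (assumed to mirror Hits). */
    static float[] normalize(float[] rawScores) {
        float max = 0.0f;
        for (float s : rawScores) {
            max = Math.max(max, s);
        }
        // Only scale down; scores that are already <= 1.0 stay untouched.
        float norm = max > 1.0f ? 1.0f / max : 1.0f;
        float[] out = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            out[i] = rawScores[i] * norm;
        }
        return out;
    }

    /** Counts the scores that reach the threshold after normalization. */
    static int countAbove(float[] rawScores, float normalizedThreshold) {
        int count = 0;
        for (float s : normalize(rawScores)) {
            if (s >= normalizedThreshold) count++;
        }
        return count;
    }
}
```

The catch is that the maximum score is only known after all documents have been collected, so this cannot be done inside collect() in a single pass.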
1.)
Is it possible to get normalized scores in a HitCollector?
(e.g. via a custom Similarity?)
2.)
Is it a good idea to patch Lucene to allow subclassing?
Oh oh, I hope somebody understands my weird mail ;)
Thanks,
Kai Gulzau