Re: New PhrasePrefixQuery.java

Doug Cutting Wed, 20 Nov 2002 15:04:51 -0800

Konrad Scherer wrote:

I think it would be good to get this functionality into the query parser. There is currently a gap between what is trivially available in the query parser (strings with wildcard characters) and the PhrasePrefixQuery API (an array of terms). What seems to be needed is just a utility method somewhere that expands a wildcarded string into an array of terms. This is probably best done in PhrasePrefixQuery.scorer, where an IndexReader is available. So the approach I would suggest is to extend the API of PhrasePrefixQuery with a method like:
PhrasePrefixQuery.addTermPrefix(Term term);
or
PhrasePrefixQuery.addWildcardTerm(Term term);
where the term.text() contains either a term prefix or a wildcard pattern. Then, in the scorer() implementation this can be expanded. PhrasePrefixQuery would then need to do some bookkeeping to identify which terms need expansion.
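
To make the bookkeeping concrete, here is a minimal sketch under some loud assumptions: PrefixPhraseSketch is a hypothetical stand-in, not the real PhrasePrefixQuery, terms are plain strings rather than Term objects, and a sorted set of strings stands in for the term dictionary an IndexReader would expose. It only shows the idea of flagging positions at add time and expanding them later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

// Hypothetical stand-in for PhrasePrefixQuery; not the Lucene class.
class PrefixPhraseSketch {
    // One entry per phrase position: the text, and whether it must be expanded later.
    private final List<String> texts = new ArrayList<>();
    private final List<Boolean> needsExpansion = new ArrayList<>();

    // Adds an exact term at the next phrase position.
    void addTerm(String text) {
        texts.add(text);
        needsExpansion.add(false);
    }

    // Adds a prefix at the next phrase position; expansion is deferred.
    void addTermPrefix(String prefix) {
        texts.add(prefix);
        needsExpansion.add(true);
    }

    // Expands flagged positions against a sorted term dictionary
    // (an assumed stand-in for enumerating an index's terms).
    List<List<String>> expand(SortedSet<String> dictionary) {
        List<List<String>> positions = new ArrayList<>();
        for (int i = 0; i < texts.size(); i++) {
            String text = texts.get(i);
            List<String> alternatives = new ArrayList<>();
            if (needsExpansion.get(i)) {
                // Terms are sorted, so all matches are contiguous from the prefix onward.
                for (String candidate : dictionary.tailSet(text)) {
                    if (!candidate.startsWith(text)) {
                        break;
                    }
                    alternatives.add(candidate);
                }
            } else {
                alternatives.add(text);
            }
            positions.add(alternatives);
        }
        return positions;
    }
}
```

The result is one list of alternative terms per phrase position, which is roughly the shape the existing array-of-terms API already accepts.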

Does this make sense?

Yes, it makes sense, but there is a problem. To expand a wildcard, an IndexReader is necessary. I chose the prepare method because then the wildcard term can be expanded before the function sumOfSquaredWeights is called.

Good point. Keep in mind that with MultiSearcher, more than one IndexReader may be involved. The correct thing to do is to take the union of the wildcard expansions across all readers. More on this below.
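
As a rough illustration of taking that union, assume each reader's term dictionary can be viewed as a sorted set of strings. That is a simplification, and MultiReaderExpansion and expandPrefix are hypothetical names, not Lucene API.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper; each SortedSet stands in for one reader's term dictionary.
class MultiReaderExpansion {
    // Returns the union of all terms starting with the prefix, across every dictionary.
    static SortedSet<String> expandPrefix(String prefix, List<SortedSet<String>> dictionaries) {
        SortedSet<String> union = new TreeSet<>();
        for (SortedSet<String> dictionary : dictionaries) {
            // Matches are contiguous in sorted order, so stop at the first non-match.
            for (String candidate : dictionary.tailSet(prefix)) {
                if (!candidate.startsWith(prefix)) {
                    break;
                }
                union.add(candidate);
            }
        }
        return union;
    }
}
```

A term that appears in only one index still ends up in the expansion, so every sub-search scores the same set of terms.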

I must admit to not understanding the weighting system at all; I haven't taken the time to think about it yet.

The value of sumOfSquaredWeights only alters the absolute value of scores, not relative ranking. It is part of code that attempts to normalize scores based on the query, so that scores for different queries are somewhat comparable. However, absolute values of Lucene scores are not very meaningful anyway. So it might be acceptable to take shortcuts with the value returned and just use a constant value for wildcarded terms. Unfortunately, sumOfSquaredWeights also has a side effect of computing the idf weight for the phrase, which does affect ranking. So the correct solution is more complex.
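
A small worked sketch of why the returned value cannot change ranking, assuming the normalization is a single factor of 1/sqrt(sumOfSquaredWeights) applied uniformly to every document's score for the query. QueryNormSketch and its methods are hypothetical names, and the weights and scores are made-up numbers.

```java
// Hypothetical illustration; not Lucene code.
class QueryNormSketch {
    // Assumed form of the norm: 1 / sqrt(sum of squared term weights).
    static double queryNorm(double[] termWeights) {
        double sumOfSquares = 0.0;
        for (double weight : termWeights) {
            sumOfSquares += weight * weight;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    // Scales every raw document score by the same positive factor.
    static double[] normalize(double[] rawScores, double norm) {
        double[] scaled = new double[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            scaled[i] = rawScores[i] * norm;
        }
        return scaled;
    }
}
```

Because every score is multiplied by the same positive number, the order of documents is unchanged whatever that number is, including a constant used as a shortcut. The idf computed as a side effect is different: it enters each term's weight individually, so getting it wrong does change the order.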

The only way I can see to fix this correctly would be to change the Searchable, Query and Scorer APIs as follows:

1. Add a term expansion or term iteration method to Searchable, so that prefix and wildcard expansion can be done across all IndexReaders in a MultiSearcher before term weighting.

2. Change sumOfSquaredWeights implementations not to alter the query, but rather to just compute the returned value, using Searchable methods.

3. Move the normalize() method from Query to Scorer and eliminate the Query.prepare() method.

4. Change scorer implementations to compute idfs using Searchable methods.

It's a shame to compute the IDFs both in the query's sumOfSquaredWeights methods and again in the scorer. Perhaps Searchable implementations could cache docFreq() values so that this is not expensive.
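
The caching idea might look something like the following sketch. CachingDocFreq is a hypothetical wrapper, terms are plain strings, and the underlying lookup is passed in as a function rather than tied to any real Searchable; invalidation when the index changes is not handled.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical wrapper that remembers document frequencies per term.
class CachingDocFreq {
    private final ToIntFunction<String> delegate;
    private final Map<String, Integer> cache = new HashMap<>();
    // Counts how often the underlying (expensive) lookup was actually used.
    int delegateCalls = 0;

    CachingDocFreq(ToIntFunction<String> delegate) {
        this.delegate = delegate;
    }

    // Returns the cached value if present, otherwise asks the delegate once and stores it.
    int docFreq(String term) {
        Integer cached = cache.get(term);
        if (cached == null) {
            delegateCalls++;
            cached = delegate.applyAsInt(term);
            cache.put(term, cached);
        }
        return cached;
    }
}
```

With something like this, the second idf computation in the scorer would hit the cache instead of the index.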

As you can see, these are rather involved changes, not to be done lightly, but I think they would also fix some longstanding bugs. In the short term, the simple approach might be to only operate correctly when an IndexSearcher is used, and not when a MultiSearcher is used. Sigh. Longer term, I will add revising these APIs to my queue of tasks.

Doug
