On 04/04/2013 10:59, Paul Taylor wrote:
On 27/02/2013 10:28, Uwe Schindler wrote:
Hi Paul,
QueryParser and MTQ's rewrite method have nothing to do with each
other. The rewrite method is (explained as simply as possible) a
class that is responsible for "rewriting" a MultiTermQuery into another
query type (generally a query to which "Term" instances can be added,
e.g. a BooleanQuery of TermQuery clauses or a DisjunctionMaxQuery of
terms). The rewrite method takes the "filtered" terms enum provided by
the query and creates a combined query out of it. Lucene ships with
some ready-made rewrite methods, based on abstract classes that
handle the most common cases:
- ScoringRewrite handles the case where you want to collect the terms
from the terms enum and place them as "clauses" in a top-level query
(e.g. a scoring BooleanQuery). You have to implement two abstract
methods: one that produces the top-level query and one that creates
the clauses to be added to it. This class is generic over the
top-level query type, because clauses can only be added to the
matching top-level query. To make this work without casting, all
methods are declared in terms of the generic type, so addClause()
takes the generic top-level query and a term. The rewrite method
itself returns the top-level query.
- TopTermsRewrite has almost the same API, but a major difference in
its internal implementation: it never hits the BooleanQuery max
clause count, because the collected terms are ordered in a priority
queue and only the top-ranking terms are added to the resulting
top-level query. This class is also generic over the top-level query
type, and rewrite returns an instance of the top-level query.
- The very base class MultiTermQuery.RewriteMethod is the most flexible
but has no concrete implementation. It is used to rewrite an MTQ to a
query that is not a composite top-level one with a number of terms,
e.g. a filter that's handled in a totally different stage of rewriting.
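The generic design described in the list above can be sketched in plain Java. This is only a model of the idea, not Lucene's code: OrQuery, CollectingRewrite and OrRewrite are hypothetical stand-ins, terms are plain strings, and every term gets a boost of 1.0.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of a generic, clause-collecting rewrite (not Lucene's classes). */
final class ScoringRewriteSketch {

    /** Hypothetical stand-in for a top-level query that collects clauses. */
    static final class OrQuery {
        final List<String> clauses = new ArrayList<>();
    }

    /** Generic over the top-level type Q, so addClause() needs no cast. */
    abstract static class CollectingRewrite<Q> {
        /** Produces the empty top-level query. */
        protected abstract Q getTopLevelQuery();

        /** Adds one clause for the given term to the top-level query. */
        protected abstract void addClause(Q topLevel, String term, float boost);

        /** Walks the already filtered terms and adds one clause per term. */
        final Q rewrite(List<String> filteredTerms) {
            Q top = getTopLevelQuery();
            for (String term : filteredTerms) {
                addClause(top, term, 1.0f);
            }
            return top;
        }
    }

    /** Concrete rewrite bound to OrQuery; both methods see the precise type. */
    static final class OrRewrite extends CollectingRewrite<OrQuery> {
        @Override
        protected OrQuery getTopLevelQuery() {
            return new OrQuery();
        }

        @Override
        protected void addClause(OrQuery topLevel, String term, float boost) {
            topLevel.clauses.add(term + "^" + boost);
        }
    }
}
```

Calling new ScoringRewriteSketch.OrRewrite().rewrite(terms) returns an OrQuery directly, which is the point of making the base class generic.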
You can use the same MTQ rewrite for different MTQ types, e.g. you
can rewrite a FuzzyQuery to a simple ConstantScoreQuery or to a
DisjunctionMaxQuery - but only the second one makes sense. On the
other hand, it makes no sense to rewrite Prefix and Wildcard queries
using TopTermsRewrite, as those queries have terms enums without term
boosts (only Fuzzy assigns a boost to every term, depending on its
Levenshtein distance).
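To illustrate why a top-terms rewrite only pays off when terms carry different boosts, here is a minimal plain-Java sketch of keeping the N best terms in a priority queue. It is not Lucene's implementation; ScoredTerm and topTerms are hypothetical names, and the boosts are made-up inputs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Plain-Java model of bounded top-N term selection (not Lucene's classes). */
final class TopTermsSketch {

    /** Hypothetical stand-in for a term plus the boost its terms enum assigned. */
    static final class ScoredTerm {
        final String text;
        final float boost;

        ScoredTerm(String text, float boost) {
            this.text = text;
            this.boost = boost;
        }
    }

    private static final Comparator<ScoredTerm> BY_BOOST =
        Comparator.comparingDouble((ScoredTerm t) -> t.boost);

    /** Returns at most maxSize terms, highest boost first. */
    static List<ScoredTerm> topTerms(List<ScoredTerm> candidates, int maxSize) {
        // Min-heap on boost: the head is always the weakest term kept so far.
        PriorityQueue<ScoredTerm> queue = new PriorityQueue<>(BY_BOOST);
        for (ScoredTerm t : candidates) {
            queue.add(t);
            if (queue.size() > maxSize) {
                queue.poll(); // evict the weakest, so the size never exceeds maxSize
            }
        }
        List<ScoredTerm> result = new ArrayList<>(queue);
        result.sort(BY_BOOST.reversed());
        return result;
    }
}
```

If every candidate has the same boost, as with prefix or wildcard terms, which terms survive the eviction step is arbitrary, so the ranking adds nothing.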
Things to note:
A rewrite method in MTQ would never rewrite to another MTQ like
PrefixQuery - it could do this, but only in the lowest base class
(see above)! -> If you rely on that, your code has a major problem.
In that case the correct approach would be to create a completely
"own" org.apache.lucene.search.Query (one that does not extend MTQ)
and implement standard rewrite logic. This query could of course
rewrite to MTQs like Fuzzy or Prefix. IndexSearcher rewrites the query
until it is completely rewritten, so your custom query would create a
PrefixQuery, which itself rewrites to something else.
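The "rewrite until completely rewritten" behaviour can be modelled in a few lines of plain Java. This is a sketch under assumptions: Node, CustomNode, PrefixNode and TermNode are hypothetical stand-ins for query classes, and the prefix node pretends to expand to a single term to keep the example short.

```java
/** Plain-Java model of rewriting a query until it stops changing (not Lucene's classes). */
final class RewriteLoopSketch {

    /** Hypothetical stand-in for a query; rewrite() returns this once nothing is left to do. */
    interface Node {
        Node rewrite();
    }

    /** Primitive node: already fully rewritten. */
    static final class TermNode implements Node {
        final String term;

        TermNode(String term) {
            this.term = term;
        }

        @Override
        public Node rewrite() {
            return this;
        }
    }

    /** Stand-in for a prefix query; here it simply resolves to one term. */
    static final class PrefixNode implements Node {
        final String prefix;

        PrefixNode(String prefix) {
            this.prefix = prefix;
        }

        @Override
        public Node rewrite() {
            return new TermNode(prefix);
        }
    }

    /** Stand-in for a custom query that delegates to a prefix query. */
    static final class CustomNode implements Node {
        final String text;

        CustomNode(String text) {
            this.text = text;
        }

        @Override
        public Node rewrite() {
            return new PrefixNode(text);
        }
    }

    /** Keeps rewriting until a node returns itself. */
    static Node rewriteFully(Node original) {
        Node current = original;
        for (Node next = current.rewrite(); next != current; next = current.rewrite()) {
            current = next;
        }
        return current;
    }
}
```

Here rewriteFully(new CustomNode("foo")) passes through a PrefixNode and ends at a TermNode, without the custom node needing to know what the prefix node rewrites to.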
QueryParser is just a factory for queries; it's not related to MTQ. It
only has an option to set a "default" rewrite method for common
queries. But as you have a custom QueryParser, you can return the
queries, configured the way you want, to the caller.
Uwe
Hi Uwe
Okay, I think I have it now. I now have a working rewrite method for
fuzzy queries:
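// Imports needed by the enclosing source file
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermContext;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopTermsRewrite;

// Bound directly to DisjunctionMaxQuery, so addClause() needs no cast
public static class FuzzyTermRewrite extends TopTermsRewrite<DisjunctionMaxQuery> {
    public FuzzyTermRewrite(int size) {
        super(size);
    }

    @Override
    protected int getMaxSize() {
        return BooleanQuery.getMaxClauseCount();
    }

    @Override
    protected DisjunctionMaxQuery getTopLevelQuery() {
        // 0.1 is the tiebreaker multiplier
        return new DisjunctionMaxQuery(0.1f);
    }

    @Override
    protected void addClause(DisjunctionMaxQuery topLevel, Term term, int docCount,
                             float boost, TermContext states) {
        // Constant score per term, scaled by the boost from the terms enum
        final Query tq = new ConstantScoreQuery(new TermQuery(term, states));
        tq.setBoost(boost);
        topLevel.add(tq);
    }
}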
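For reference, the effect of the 0.1 tiebreaker passed to DisjunctionMaxQuery can be checked with a small plain-Java sketch of the scoring rule: the best sub-score plus the tiebreaker times the sum of the remaining sub-scores. The class and method names are hypothetical, and the sketch assumes non-negative sub-scores.

```java
/** Plain-Java model of dismax scoring versus plain summing (not Lucene's classes). */
final class DisMaxSketch {

    /** Best sub-score plus tieBreaker times the sum of all other sub-scores. */
    static float disMaxScore(float[] subScores, float tieBreaker) {
        float max = 0f; // assumes sub-scores are non-negative
        float sum = 0f;
        for (float s : subScores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + tieBreaker * (sum - max);
    }

    /** What a summing (boolean-style) combination would give instead. */
    static float sumScore(float[] subScores) {
        float sum = 0f;
        for (float s : subScores) {
            sum += s;
        }
        return sum;
    }
}
```

With sub-scores 2.0 and 1.0, the summing variant gives 3.0 while the dismax variant with a 0.1 tiebreaker gives roughly 2.1, so the weaker match only nudges the score.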
I am now writing a separate class for prefix queries, so that it
actually modifies the idf.
Paul
And this is my prefix rewrite method:
// Imports needed by the enclosing source file
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.search.similarities.TFIDFSimilarity;

/**
 * Prefix matches are rewritten to a DisjunctionMaxQuery instead of the more
 * usual BooleanQuery, so that if the search term matches multiple fields we
 * just take the best field rather than summing all matches as a boolean
 * query would. The 0.1 tiebreaker is there to favour documents that contain
 * all words over documents that contain the same word in multiple fields.
 *
 * We set the idf to that of an exact match, so that a wildcard match on a
 * term which happens to be rarer than the exact term we were searching for
 * does not get an unfairly high idf.
 */
public static class PrefixTermRewrite extends MultiTermQuery.RewriteMethod {
    private final TFIDFSimilarity similarity;
    private final FuzzyTermRewrite rewrite;

    public PrefixTermRewrite(int size) {
        this.rewrite = new FuzzyTermRewrite(size);
        this.similarity = new DefaultSimilarity();
    }

    protected float getQueryBoost(final IndexReader reader, final MultiTermQuery query)
            throws IOException {
        float idf = 1f;
        PrefixQuery fq = (PrefixQuery) query;
        int df = reader.docFreq(fq.getPrefix());
        if (df >= 1) {
            // Same as the idf value for the search term; 0.5 acts as length norm
            idf = (float) Math.pow(similarity.idf(df, reader.numDocs()), 2) * 0.5f;
        }
        return idf;
    }

    @Override
    public Query rewrite(final IndexReader reader, final MultiTermQuery query)
            throws IOException {
        // Collect the matching terms with the top-terms rewrite above
        DisjunctionMaxQuery dmq = (DisjunctionMaxQuery) rewrite.rewrite(reader, query);
        float idfBoost = getQueryBoost(reader, query);
        // Scale every clause by the idf-based boost of the typed prefix
        for (Query clause : dmq) {
            clause.setBoost(clause.getBoost() * idfBoost);
        }
        return dmq;
    }
}
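The boost arithmetic in getQueryBoost can be tried out without an index. The sketch below is plain Java, not the class above: it assumes the DefaultSimilarity idf formula, 1 + ln(numDocs / (docFreq + 1)), and the names IdfBoostSketch, idf and prefixBoost are hypothetical.

```java
/** Plain-Java model of the idf-based prefix boost (not Lucene's classes). */
final class IdfBoostSketch {

    /** idf in the style of DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1)). */
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** idf squared times 0.5, or 1.0 when the prefix itself is not an indexed term. */
    static float prefixBoost(long docFreq, long numDocs) {
        if (docFreq < 1) {
            return 1f;
        }
        float idf = idf(docFreq, numDocs);
        return idf * idf * 0.5f;
    }
}
```

For example, prefixBoost(9, 10) is 0.5 because the idf is exactly 1.0, and a prefix that occurs in fewer documents yields a larger boost than one that occurs in many.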