Re: DistributingMultiFieldQueryParser and DisjunctionMaxQuery

Chuck Williams Wed, 14 Dec 2005 17:02:53 -0800

----- Original Message -----
*From:* Miles Barr <[EMAIL PROTECTED]>
*To:* java-user@lucene.apache.org
*Sent:* 12/14/2005 12:43:04 AM
*Subject:* DistributingMultiFieldQueryParser and DisjunctionMaxQuery



>On Tue, 2005-12-13 at 11:51 -0800, Chris Hostetter wrote:
>  
>
>>As I mentioned in the comments for LUCENE-323,
>>DistributingMultiFieldQueryParser seems to be more of a demo of what's
>>possible with DisjunctionMaxQuery -- not necessarily a full-fledged
>>QueryParser.  I think that's why it wasn't committed (even though
>>DisjunctionMaxQuery was), and the issue was left open.
>>    
>>
It is not intended to be a demo.  I use it in a real application and
believe it is complete and correct.  At the moment, it uses a QueryParser
API that is deprecated in the latest Lucene source, but it still works.
There are a couple of TODOs marked where it could be improved, but its
current behavior is acceptable.

>I've only had a quick play with it so this problem is probably down to
>my misuse of the class but I found that negations weren't handled
>properly. e.g.
>
>fruit AND -apples
>
>The DistributingMultiFieldQueryParser would correctly generate a query
>that would find fruit in one of the fields, but would only ensure that
>apples did not appear in one field, rather than in none of the fields,
>which was the behaviour I wanted. Hence negations didn't really work if
>the term appeared in more than one field.
>  
>
>
>I just tested putting together a prohibited boolean query with a
>DisjunctionMaxQuery programmatically rather than via the
>DistributingMultiFieldQueryParser and it works fine. 
>  
>
Would you mind submitting a test case that shows the problem?  I cannot
replicate it.  E.g., the attached test case runs an equivalent query,
"fruit AND -plum", and works properly.  Negation should work fine in
general.  The transformation performed on a BooleanQuery is this:
  BooleanQuery (q1.occur1 ... qn.occurn) applied to fields (f1 ... fm)
==> ((q1 applied to f1...fm).occur1 ... (qn applied to f1...fm).occurn)
So for MUST_NOT clauses, the NOT is scoped around the OR over the fields,
and a document matches only if the prohibited term appears in none of them.
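
To make the scoping concrete, here is a minimal sketch of the same boolean
logic in plain Java, with no Lucene involved.  The class and method names
are hypothetical, the tokenizing is a crude split on spaces and commas, and
scoring is ignored entirely; it only illustrates that the prohibited clause
is an OR over the fields that is then negated as a whole.  The three
documents are the ones from the attached test case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only (hypothetical names, not Lucene): a document maps a field
// name to the set of tokens in that field.
class DistributedNegationSketch {

    /** True if the term occurs in at least one of the fields (the OR over fields). */
    static boolean inAnyField(Map<String, Set<String>> doc, String term, String[] fields) {
        for (String f : fields) {
            Set<String> tokens = doc.get(f);
            if (tokens != null && tokens.contains(term)) return true;
        }
        return false;
    }

    /** Evaluates +(required in any field) -(prohibited in any field). */
    static boolean matches(Map<String, Set<String>> doc, String required,
                           String prohibited, String[] fields) {
        // The NOT wraps the whole OR, so the prohibited term must be absent from every field.
        return inAnyField(doc, required, fields) && !inAnyField(doc, prohibited, fields);
    }

    /** Builds a toy document, splitting field text on spaces and commas. */
    static Map<String, Set<String>> doc(String title, String body) {
        Map<String, Set<String>> d = new HashMap<>();
        d.put("title", new HashSet<>(Arrays.asList(title.split("[ ,]+"))));
        d.put("body", new HashSet<>(Arrays.asList(body.split("[ ,]+"))));
        return d;
    }

    /** Runs the toy query over the three documents from the test case. */
    static List<String> search(String required, String prohibited) {
        String[] fields = {"title", "body"};
        Map<String, Map<String, Set<String>>> docs = new LinkedHashMap<>();
        docs.put("doc1", doc("fruit", "apple, pear, plum"));
        docs.put("doc2", doc("plum fruit", "delicious ripe plum"));
        docs.put("doc3", doc("fruit medley", "peach, banana, pear, cherry"));
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Map<String, Set<String>>> e : docs.entrySet()) {
            if (matches(e.getValue(), required, prohibited, fields)) hits.add(e.getKey());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(search("fruit", "plum"));
    }
}
```

In this toy model, doc1 is excluded because "plum" is in its body and doc2
because "plum" is in both fields, which leaves only doc3.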

If there are bugs in DistributingMultiFieldQueryParser, I will be happy
to fix them.  If there is some specific reason it is not deemed suitable
to commit, please let me know.  It is much harder to use
DisjunctionMaxQuery without this parser.

FYI, here is the output I get from the attached test case (running my
version of DistributingMultiFieldQueryParser, which is also attached in
case it is different than what you have):

------------- Standard Output ---------------
Collection:
  uid:{doc1}    title:{fruit}    body:{apple, pear, plum}
  uid:{doc2}    title:{plum fruit}    body:{delicious ripe plum}
  uid:{doc3}    title:{fruit medley}    body:{peach, banana, pear, cherry}

testParse
  Query:{fruit AND -plum} ==> +(title:fruit^5.0 | body:fruit)~0.1 -(title:plum^5.0 | body:plum)~0.1

  title:{fruit medley}    body:{peach, banana, pear, cherry}

------------- ---------------- ---------------
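
A note on the "~0.1" in the printed query: it is the tie-breaker multiplier
of each DisjunctionMaxQuery (EXPANDED_FIELD_TIE_BREAKER in the attached
parser).  The disjunction takes the best per-field score and adds the
tie-breaker times the sum of the other per-field scores.  The following is
a rough sketch of that combination only, not the Lucene implementation; the
class name and the per-field scores passed in are made up for illustration.

```java
// Sketch only (hypothetical names, not Lucene code): how a disjunction-max
// score combines per-field scores for one document.
class DisMaxScoreSketch {

    /**
     * Returns the maximum sub-score plus tieBreaker times the sum of the
     * remaining sub-scores.  Sub-scores are assumed to be non-negative.
     */
    static float score(float[] subScores, float tieBreaker) {
        float max = 0f;
        float sum = 0f;
        for (float s : subScores) {
            sum += s;
            max = Math.max(max, s);
        }
        // (sum - max) is the total contributed by every field except the best one.
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        // Made-up scores for a term matching in title (boosted) and body.
        System.out.println(score(new float[] {2.0f, 0.5f}, 0.1f));
    }
}
```

With a tie-breaker of 0 the result is a pure maximum over the fields; with
1 it is the plain sum, as a BooleanQuery of optional clauses would give
before coordination factors.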

Chuck

/*
 * DistributingMultiFieldQueryParser.java
 *
 * Created on December 13, 2004, 11:32 AM
 */

package org.apache.lucene.queryParser;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RangeQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.WildcardQuery;

/**
 * Version of MultiFieldQueryParser to work with DisjunctionMaxQuery.
 * Extracted from real app -- NOT COMPLETE as a Lucene QueryParser.
 * @author Williams
 */
public class DistributingMultiFieldQueryParser {
    
    /** The tieBreakerMultiplier used in the automatically generated DisjunctionMaxQuery's that search for a single term
     * in all of the default fields */
    public static float EXPANDED_FIELD_TIE_BREAKER = 0.1f;
    
    private static final String DEFAULT_FIELD = "DMFQP_DEFAULT";
    
    /** Parse a query to search multiple fields, using QueryParser's default operator (OR).
     * @param queryString the string containing the query in standard Lucene surface form
     * @param fields the names of the fields that terms and phrases search by default
     * @param boosts the corresponding boosts to be placed on fields
     * @param analyzer the analyzer to use for lexical analysis of query
     * @return the parsed Query
     */
    public static Query parse(String queryString, String[] fields, float[] boosts, Analyzer analyzer) throws ParseException {
        return expandDefaultFields(QueryParser.parse(queryString, DEFAULT_FIELD, analyzer), fields, boosts);
    }
    
    /** Parse a query to search multiple fields, with AND as the default operator (QueryParser.AND_OPERATOR).
     * @param queryString the string containing the query in standard Lucene surface form
     * @param fields the names of the fields that terms and phrases search by default
     * @param boosts the corresponding boosts to be placed on fields
     * @param analyzer the analyzer to use for lexical analysis of query
     * @return the parsed Query
     */
    public static Query parseDefaultAnd(String queryString, String[] fields, float[] boosts, Analyzer analyzer) throws ParseException {
        QueryParser qp = new QueryParser(DEFAULT_FIELD, analyzer);
        qp.setDefaultOperator(QueryParser.AND_OPERATOR);
        return expandDefaultFields(qp.parse(queryString), fields, boosts);
    }
     
    /**
     * Rewrite query such that every query term that does not specify a field searches all of the specified fields.
     * @param query the parsed query prior to field expansion of unfielded terms
     * @param fields the fields across which terms with no explicit fields should search
     * @param boosts the respective boosts for fields, in 1-1 correspondence
     * @throws ParseException if we cannot interpret the query
     * @return a new Query in which the unfielded terms will search the fields with the corresponding boosts
     */
    public static Query expandDefaultFields(Query query, String[] fields, float[] boosts) throws ParseException {
        if (query instanceof TermQuery) {
            Term t = ((TermQuery) query).getTerm();
            if (t.field().equals(DEFAULT_FIELD)) {
                String text = t.text();
                DisjunctionMaxQuery mdq = new DisjunctionMaxQuery(EXPANDED_FIELD_TIE_BREAKER);
                mdq.setBoost(query.getBoost());
                for (int i = 0; i < fields.length; i++) {
                    TermQuery tq = new TermQuery(new Term(fields[i], text));
                    tq.setBoost(boosts[i]);
                    mdq.add(tq);
                }
                return mdq;
            }
            else return query;
        }
        else if (query instanceof PhraseQuery) {
            PhraseQuery pq = (PhraseQuery)query;
            Term[] terms = pq.getTerms();
            if (terms.length == 0) return query;    // No need to expand empty phrase!
            if (terms[0].field().equals(DEFAULT_FIELD)) {
                int slop = pq.getSlop();
                DisjunctionMaxQuery mdq = new DisjunctionMaxQuery(EXPANDED_FIELD_TIE_BREAKER);
                mdq.setBoost(query.getBoost());
                for (int i=0; i<fields.length; i++) {
                    PhraseQuery npq = new PhraseQuery();
                    npq.setBoost(boosts[i]);
                    npq.setSlop(slop);
                    for (int j=0; j<terms.length; j++) {
                        npq.add(new Term(fields[i], terms[j].text()));
                    }
                    mdq.add(npq);
                }
                return mdq;
            }
            else return query;
        }
        else if (query instanceof BooleanQuery) {
            BooleanClause[] clauses = ((BooleanQuery) query).getClauses();
            BooleanQuery bq = new BooleanQuery();
            bq.setBoost(query.getBoost());
            for (int i=0; i<clauses.length; i++)
                bq.add(expandDefaultFields(clauses[i].getQuery(), fields, boosts), clauses[i].getOccur());
            return bq;
        }
        else if (query instanceof RangeQuery) {
            if (((RangeQuery)query).getField().equals(DEFAULT_FIELD))
                throw new ParseException("Range queries are meaningless without specifying a range field:  " + query.toString());
            return query;
        }
        else if (query instanceof MultiTermQuery) { // WildcardQuery or FuzzyQuery
            // ***** TODO:  DisjunctionMaxQuery vs. BooleanQuery for expansion is not clear in this case.
            // Ideally would get the disjunction of terms, put those in BooleanQuery, and then use
            // DisjunctionMaxQuery on each term to expand the fields.
            Term t = ((MultiTermQuery) query).getTerm();
            if (t.field().equals(DEFAULT_FIELD)) {
                String text = t.text();
                DisjunctionMaxQuery mdq = new DisjunctionMaxQuery(EXPANDED_FIELD_TIE_BREAKER);
                mdq.setBoost(query.getBoost());
                for (int i = 0; i < fields.length; i++) {
                    Term nt = new Term(fields[i], text);
                    Query nq;
                    if (query instanceof WildcardQuery) nq = new WildcardQuery(nt);
                    else if (query instanceof FuzzyQuery) nq = new FuzzyQuery(nt);
                    else throw new ParseException("Unknown type of MultiTermQuery:  " + query.toString());
                    nq.setBoost(boosts[i]);
                    mdq.add(nq);
                }
                return mdq;
            }
            else return query;
        }
        else if (query instanceof PrefixQuery) {
            // ***** TODO: See comment for MultiTermQuery.  Same applies here.
            Term t = ((PrefixQuery) query).getPrefix();
            if (t.field().equals(DEFAULT_FIELD)) {
                String text = t.text();
                DisjunctionMaxQuery mdq = new DisjunctionMaxQuery(EXPANDED_FIELD_TIE_BREAKER);
                mdq.setBoost(query.getBoost());
                for (int i = 0; i < fields.length; i++) {
                    Term nt = new Term(fields[i], text);
                    Query nq = new PrefixQuery(nt);
                    nq.setBoost(boosts[i]);
                    mdq.add(nq);
                }
                return mdq;
            }
            else return query;            
        }
        /* None of DisjunctionMaxQuery, PhrasePrefixQuery nor FilteredQuery should be possible results of a query string parse. */
        else throw new ParseException("UNKNOWN type of query:  " + query.getClass() + " PRINTS AS " + query.toString());
    }
 
}

/*
 * DistributingMultiFieldQueryParserTest.java
 * JUnit based test
 *
 * Created on December 14, 2005, 12:38 PM
 */

package org.apache.lucene.queryParser;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import junit.framework.*;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

/**
 *
 * @author chuck
 */
public class DistributingMultiFieldQueryParserTest extends TestCase {
    
    Directory index = new RAMDirectory();
    String[] defaultFields = {"title", "body"};
    float[] defaultBoosts = {5.0f, 1.0f};
    Analyzer analyzer = new StandardAnalyzer();
    
    public DistributingMultiFieldQueryParserTest(String testName) {
        super(testName);
    }

    protected void setUp() throws Exception {
        IndexWriter writer = new IndexWriter(index, analyzer, true);
        System.out.println("Collection:");
        writer.addDocument(createDoc("doc1", "fruit", "apple, pear, plum"));
        writer.addDocument(createDoc("doc2", "plum fruit", "delicious ripe plum"));
        writer.addDocument(createDoc("doc3", "fruit medley", "peach, banana, pear, cherry"));
        System.out.println();
        writer.close();
    }
    
    private Document createDoc(String uid, String title, String body) {
        Document doc = new Document();
        doc.add(new Field("uid", uid, Store.YES, Index.NO));
        doc.add(new Field("title", title, Store.YES, Index.TOKENIZED));
        doc.add(new Field("body", body, Store.YES, Index.TOKENIZED));
        System.out.println("  " + "uid:{" + uid + "}    title:{" + title + "}    body:{" + body + "}");
        return doc;
    }

    protected void tearDown() throws Exception {
        index.close();
    }

    public static Test suite() {
        TestSuite suite = new TestSuite(DistributingMultiFieldQueryParserTest.class);
        
        return suite;
    }

    /**
     * Test of parse method, of class org.apache.lucene.queryParser.DistributingMultiFieldQueryParser.
     */
    public void test() {
        System.out.println("testParse");
        
        try {
            IndexSearcher searcher = new IndexSearcher(index);
            String querystr1 = "fruit AND -plum";
            Query query1 = DistributingMultiFieldQueryParser.parse(querystr1, defaultFields, defaultBoosts, analyzer);
            System.out.println("  Query:{" + querystr1 + "} ==> " + query1);
            System.out.println();
            List<Document> results = query(searcher, query1);
            for (Document doc : results)
                System.out.println("  title:{" + doc.get("title") + "}    body:{" + doc.get("body") + "}");
            System.out.println();
            // JUnit's assertEquals takes (expected, actual).
            assertEquals(1, results.size());
            assertEquals("doc3", results.get(0).get("uid"));

        } catch (Exception e) {
            System.out.println(e.getMessage());
            fail("Test caused exception");
        }
    }
    
    private List<Document> query(IndexSearcher searcher, Query query) throws ParseException, IOException {
        Hits hits = searcher.search(query);
        List<Document> results = new ArrayList<Document>();
        for (int i=0; i<hits.length(); i++)
            results.add(hits.doc(i));
        return results;    
    }
    
}

