Doug Cutting wrote:

David Spencer wrote:

I worked with Chuck to put up a test page that shows search results from two versions of Similarity side by side.


David,

This looks great!  Thanks for doing this.

Is the default operator AND or OR? It appears to be OR, but it should probably be AND. That's become the industry standard since QueryParser was first written. Also, any chance we can get explanations for hits?
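(As a rough sketch, not from the original page: per-hit explanations could be dumped with something like the following, assuming an IndexSearcher 'searcher' and the parsed Query 'q'; the "title" field is illustrative.)

    Hits hits = searcher.search( q);
    for ( int i = 0; i < Math.min( 10, hits.length()); i++)
    {
        System.out.println( hits.doc( i).get( "title"));        // assumes a stored "title" field
        System.out.println( searcher.explain( q, hits.id( i))); // Lucene's per-hit scoring explanation
    }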

It is difficult to decipher what's doing what. I think we should evaluate query formulation and boosting separately from changes to tf/idf.

We ought to first compare searching the body only, ignoring titles, and only then try different query formulations over multiple fields with a fixed weighting algorithm. Yes, ignoring titles when searching Wikipedia might not be the best approach, but the point is not to over-optimize for Wikipedia but rather to find algorithms that work well on general text collections. Removing titles makes the problem harder, which should in turn make it easier to see deficiencies.

Simpler yet, we ought to first try body-only with no proximity, just AND, in order to select good tf/idf formulations. Then we should add auto-proximity searching into the mix, and finally add multiple fields. Does this make sense?

MultiFieldQueryParser is known to be deficient. A better general-purpose multi-field query formulator might be like that used by Nutch. It would translate a query "t1 t2" given fields f1 and f2 into something like:

+(f1:t1^b1 f2:t1^b2)
+(f1:t2^b1 f2:t2^b2)
f1:"t1 t2"~s1^b3
f2:"t1 t2"~s2^b4

Where b1 and b2 are boosts for f1 and f2, and b3 and b4 are boosts for phrase matching in f1 and f2, and s1 and s2 are slop for f1 and f2. We'd really only need to vary b1 and b3, and could use 1.0 for b2 and b4 and infinity for s1 and s2.
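For example, with b2 = b4 = 1.0 and a single large slop s for both fields, the expansion reduces to:

+(f1:t1^b1 f2:t1)
+(f1:t2^b1 f2:t2)
f1:"t1 t2"~s^b3
f2:"t1 t2"~s

leaving only b1 and b3 to tune per field.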

Do folks agree that this is a good general formulation? If so, would someone like to contribute a version of MultiFieldQueryParser that implements this? The API should probably be something like:

  static Query parse(String queryString,
                     String[] fields,
                     float[] boolBoosts,
                     float[] phraseBoosts,
                     int[] slops);

A simplified version might be:

  static Query parse(String queryString,
                     String[] fields,
                     float[] boosts);


I think I've done the code (but no, the test URL we're playing with has not been updated yet).


[1] Test Driver:


// 1a: "AND" semantics
q = formMegaQuery( "t1 t2", null, FIELDS, BOOL_BOOSTS, PH_BOOSTS, SLOPS, true); // true -> AND

o.println( q.toString( "f2"));


// 1b: same as 1a but OR semantics
q = formMegaQuery( "t1 t2", null, FIELDS, BOOL_BOOSTS, PH_BOOSTS, SLOPS, false);

o.println( q.toString( "f2"));

// 1c: more terms
q = formMegaQuery( "t1 t2 t3 t4 t5",
        null,
        FIELDS,
        BOOL_BOOSTS,
        PH_BOOSTS,
        SLOPS,
        false);

o.println( q.toString( "f2"));
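(The constants and output stream used by the driver aren't shown; judging by the output in [2] they would presumably be something along these lines.)

static final String[] FIELDS      = { "f1", "f2" };
static final float[]  BOOL_BOOSTS = { 2.0f, 1.0f }; // per-field term boosts (b1, b2)
static final float[]  PH_BOOSTS   = { 3.0f, 1.5f }; // per-field phrase boosts (b3, b4)
static final int[]    SLOPS       = { 5, 2 };       // per-field phrase slop (s1, s2)
static final java.io.PrintStream o = System.out;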


[2] Output

+(f1:t1^2.0 t1) +(f1:t2^2.0 t2) f1:"t1 t2"~5^3.0 "t1 t2"~2^1.5

(f1:t1^2.0 t1) (f1:t2^2.0 t2) f1:"t1 t2"~5^3.0 "t1 t2"~2^1.5

(f1:t1^2.0 t1) (f1:t2^2.0 t2) (f1:t3^2.0 t3) (f1:t4^2.0 t4) (f1:t5^2.0 t5) f1:"t1 t2 t3 t4 t5"~5^3.0 "t1 t2 t3 t4 t5"~2^1.5



[3] Code - more or less as per Doug's spec, but I pass in an optional Analyzer for parsing the search string, and the last arg, 'mand', determines whether AND semantics are used.

public static Query formMegaQuery( String srch,
                                   Analyzer a,
                                   String[] fields,
                                   float[] boolBoosts,
                                   float[] phraseBoosts,
                                   int[] slops,
                                   boolean mand)
{
    if ( a == null) a = new WhitespaceAnalyzer();
    BooleanQuery bq = new BooleanQuery();

    TokenStream ts = a.tokenStream( "contents", new StringReader( srch));
    org.apache.lucene.analysis.Token toke;
    try
    {
        TermQuery[] tt = new TermQuery[ fields.length];
        List lis = new LinkedList();

        // [1] For every word, make a clause so it matches in at least one field
        while ( (toke = ts.next()) != null) // for every token in search string
        {
            String word = toke.termText();
            if ( lis.contains( word)) continue; // ignore dup words
            lis.add( word);

            BooleanQuery tmp = new BooleanQuery();
            for ( int i = 0; i < tt.length; i++)
            {
                tt[ i] = new TermQuery( new Term( fields[ i], word));
                tt[ i].setBoost( boolBoosts[ i]);
                tmp.add( tt[ i], false, false); // optional: any field may match
            }
            bq.add( tmp, mand, false); // required if 'mand' is true (AND semantics)
        }

        String[] ar = (String[]) lis.toArray( new String[ 0]);
        for ( int j = 0; j < fields.length; j++) // for every field
        {
            PhraseQuery pq = new PhraseQuery();
            for ( int i = 0; i < ar.length; i++)
                pq.add( new Term( fields[ j], ar[ i]));
            pq.setSlop( slops[ j]);
            pq.setBoost( phraseBoosts[ j]);
            bq.add( pq, false, false); // make opt
        }
    }
    catch( IOException ioe)
    {
        // can't happen as we're using a string reader
    }
    return bq;
}


This could use infinity for slops and assume boolBoosts[i] == phraseBoosts[i].
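A simplified overload along the lines of Doug's three-boost version could then just delegate (a sketch, using Integer.MAX_VALUE as the stand-in for infinite slop):

public static Query formMegaQuery( String srch,
                                   Analyzer a,
                                   String[] fields,
                                   float[] boosts,
                                   boolean mand)
{
    // reuse the same boosts for the boolean and phrase clauses,
    // and make the phrase slop effectively unlimited
    int[] slops = new int[ fields.length];
    for ( int i = 0; i < slops.length; i++)
        slops[ i] = Integer.MAX_VALUE; // "infinity"
    return formMegaQuery( srch, a, fields, boosts, boosts, slops, mand);
}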


Doug

