Otis Gospodnetic wrote:

> Lots of params in that mlt method, but it seems flexible.

Code rewritten - it automagically chooses lots of defaults and lets you override the defaults through the static vars at the bottom, or the non-static vars, also at the bottom. The quickest way to use it, choosing all defaults, is:

MoreLikeThis mlt = new MoreLikeThis( r); // pass in IndexReader
Query q = mlt.like( x);                  // x is a URL/InputStream/File "source doc"
// then pass q to IndexSearcher

To see what parameters it's using, display the return value of mlt.describeParams(). The main driver takes args of "-i INDEX" and then "-f FILE" or "-url URL", and shows the params and up to 25 "similar" docs. HTML docs won't work right, as I'm just blindly running StandardAnalyzer on the stream. I'm trying to keep everything simple so it fits in one file....

> I'll try it. Small optimization suggestion: use int[] with a single
> element for that words Map, instead of creating lots of Integer()s.

Done, sort of - I put in my own mutable int to avoid object creation.

> Actually, maybe JVMs are smart and don't allocate new objects for the
> same int wrapped in Integer.... I don't know for sure.

Dubious that they do..

> Otis
>
> --- David Spencer <[EMAIL PROTECTED]> wrote:

Doug Cutting wrote:

Karl Koch wrote:

Do you know good papers about strategies for how to select keywords effectively, beyond the scope of stopword lists and stemming? Using term frequencies of the document is not really possible, since Lucene does not provide access to a document vector, is it?

Lucene does let you access the document frequency of terms, with IndexReader.docFreq(). Term frequencies can be computed by re-tokenizing the text, which, for a single document, is usually fast enough. But looking up the docFreq() of every term in the document is probably too slow.

You can use some heuristics to prune the set of terms, to avoid calling docFreq() too much, or at all. Since you're trying to maximize a tf*idf score, you're probably most interested in terms with a high tf. Choosing a tf threshold even as low as two or three will radically reduce the number of terms under consideration. Another heuristic is that terms with a high idf (i.e., a low df) tend to be longer. So you could threshold the terms by the number of characters, not selecting anything less than, e.g., six or seven characters. With these sorts of heuristics you can usually find a small set of, e.g., ten or fewer terms that do a pretty good job of characterizing a document.

It all depends on what you're trying to do. If you're trying to eke out that last percent of precision and recall regardless of computational difficulty so that you can win a TREC competition, then the techniques I mention above are useless. But if you're trying to provide a "more like this" button on a search results page that does a decent job and has good performance, such techniques might be useful.

An efficient, effective "more-like-this" query generator would be a great contribution, if anyone's interested. I'd imagine that it would take a Reader or a String (the document's text), an Analyzer, and return a set of representative terms using heuristics like those above. The frequency and length thresholds could be parameters, etc.

Doug

Well, I've done a prelim impl of the above. Maybe someone could proofread my code. The code is hot off the presses and seems to work... Questions are:

[a] is the code right?
[b] are any more (or fewer) params needed to properly "genericize" the algorithm? e.g. max "words" to return?
[c] I can tweak the code to be a little more usable.. does it make sense to return, say, a Query?
[d] then the eternal question - I think these things are interesting, but my theory is that Queries (is-a Query impls) which are not implemented into the QueryParser will never really be used....

Anyway, there are two parts. First, the main() quick test I did, which is not set up to run on another system right now but shows how the mlt routine (mlt -> MoreLikeThis) is called:

public static void main( String[] a) throws Throwable
{
    Hashtable stopTable = StopFilter.makeStopTable( StopAnalyzer.ENGLISH_STOP_WORDS);
    String fn = "c:/Program Files/Apache Group/Apache/htdocs/manual/vhosts/index.html.en";
    PrintStream o = System.out;
    final IndexReader r = IndexReader.open( "localhost_index");
    String body = new com.tropo.html.HTMLTextMuncher( new FileInputStream( fn)).getText();
    PriorityQueue q = mlt( new StringReader( body), getDefAnalyzer(), r, "contents", 2, stopTable, 0, 0);
    o.println( "res..." + q.size());
    o.println();
    Object cur;
    while ( (cur = q.pop()) != null)
    {
        Object[] ar = (Object[]) cur;
        o.println( ar[ 0] + " = " + ar[ 1]);
    }
}

And the impl, which will compile with the appropriate imports:

import java.io.*;
import java.util.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.document.*;
import org.apache.lucene.search.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.util.*;

/**
 * Find words for a more-like-this query former.
 *
 * @param r the reader that has the content of the document
 * @param a the analyzer to parse the reader with
 * @param field the field of interest in the document
 * @param minFreq filter out terms that occur less than this in the document
 * @param stop a table of stopwords to ignore
 * @param minLen ignore words less than this length, or pass in 0 to not use this
 * @param maxLen ignore words greater than this length, or pass in 0 to not use this
 * @return a priority queue ordered by words with the largest score (tf*idf)
 */
public static PriorityQueue mlt( Reader r, Analyzer a, IndexReader ir, String field,
                                 int minFreq, Hashtable stop, int minLen, int maxLen)
    throws IOException
{
    Similarity sim = new DefaultSimilarity(); // for idf
    TokenStream ts = a.tokenStream( field, r);
    org.apache.lucene.analysis.Token toke;
    Map words = new HashMap();
    while ( (toke = ts.next()) != null)
    {
        String word = toke.termText().toLowerCase();
        if ( stop.contains( word)) continue;
        int len = word.length();
        if ( minLen > 0 && len < minLen) continue;
        if ( maxLen > 0 && len > maxLen) continue;
        if ( words.containsKey( word))
        {
            Integer x = (Integer) words.get( word);
            words.put( word, new Integer( x.intValue() + 1));
        }
        else
            words.put( word, new Integer( 1));
    }
    Iterator it = words.keySet().iterator();
    int numDocs = ir.numDocs();
    FreqQ res = new FreqQ( words.size());
    while ( it.hasNext())
    {
        String word = (String) it.next();
        int tf = ((Integer) words.get( word)).intValue();
        if ( tf < minFreq) continue;
        Term t = new Term( field, word);
        int docFreq = ir.docFreq( t);
        float idf = sim.idf( docFreq, numDocs);
        float score = tf * idf;
        res.insert( new Object[] { word, new Float( score), new Float( idf), new Integer( docFreq)});
    }
    return res;
}

/**
 * Priority queue that pops the highest-scoring word first.
 */
static class FreqQ extends PriorityQueue
{
    FreqQ( int s)
    {
        initialize( s);
    }

    protected boolean lessThan( Object a, Object b)
    {
        Object[] aa = (Object[]) a;
        Object[] bb = (Object[]) b;
        Float fa = (Float) aa[ 1];
        Float fb = (Float) bb[ 1];
        return fa.floatValue() > fb.floatValue();
    }
}
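As an aside on the Integer-vs-int[] point above, here is a minimal sketch contrasting the two counting idioms side by side: Otis's single-element int[] suggestion and the plain Integer approach used in the prelim impl. The map and the word here are only illustrative; the attached file below ends up using a tiny mutable Int class instead, which amounts to the same thing as the int[].

Map words = new HashMap();
String word = "apache"; // some token produced by the analyzer

// Idiom 1: a single-element int[] as a mutable counter -
// only one object is allocated per distinct word.
int[] cnt = (int[]) words.get( word);
if ( cnt == null)
    words.put( word, new int[] { 1});
else
    cnt[ 0]++;

// Idiom 2: immutable Integer values -
// every increment allocates a new Integer.
Integer x = (Integer) words.get( word);
words.put( word, x == null ? new Integer( 1) : new Integer( x.intValue() + 1));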
package com.tropo.lucene;

import java.io.*;
import java.util.*;
import java.net.*;

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.document.*;
import org.apache.lucene.search.*;
import org.apache.lucene.queryParser.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.util.*;

//import com.tropo.html.*;
//import com.tropo.lang.*;

/**
 * Generate "more like this" similarity queries.
 * Based on this mail:
 * <p>
 * <pre>
 * Lucene does let you access the document frequency of terms, with
 * IndexReader.docFreq().  Term frequencies can be computed by re-tokenizing
 * the text, which, for a single document, is usually fast enough.  But
 * looking up the docFreq() of every term in the document is probably too slow.
 *
 * You can use some heuristics to prune the set of terms, to avoid calling
 * docFreq() too much, or at all.  Since you're trying to maximize a tf*idf
 * score, you're probably most interested in terms with a high tf.  Choosing
 * a tf threshold even as low as two or three will radically reduce the
 * number of terms under consideration.  Another heuristic is that terms
 * with a high idf (i.e., a low df) tend to be longer.  So you could
 * threshold the terms by the number of characters, not selecting anything
 * less than, e.g., six or seven characters.  With these sorts of heuristics
 * you can usually find a small set of, e.g., ten or fewer terms that do a
 * pretty good job of characterizing a document.
 *
 * It all depends on what you're trying to do.  If you're trying to eke out
 * that last percent of precision and recall regardless of computational
 * difficulty so that you can win a TREC competition, then the techniques I
 * mention above are useless.  But if you're trying to provide a "more like
 * this" button on a search results page that does a decent job and has good
 * performance, such techniques might be useful.
 *
 * An efficient, effective "more-like-this" query generator would be a great
 * contribution, if anyone's interested.  I'd imagine that it would take a
 * Reader or a String (the document's text), an Analyzer, and return a set
 * of representative terms using heuristics like those above.  The frequency
 * and length thresholds could be parameters, etc.
 *
 * Doug
 * </pre>
 */
public final class MoreLikeThis
{
    /**
     * Construct a MoreLikeThis over the index that "similar" docs will be looked up in.
     */
    public MoreLikeThis( IndexReader ir)
    {
        this.ir = ir;
    }

    /**
     * Return a query that will return docs like the passed file.
     */
    public Query like( File f) throws IOException
    {
        return like( new FileReader( f));
    }

    /**
     * Return a query that will return docs like the passed URL.
     */
    public Query like( URL u) throws IOException
    {
        return like( new InputStreamReader( u.openConnection().getInputStream()));
    }

    /**
     * Return a query that will return docs like the passed stream.
     */
    public Query like( java.io.InputStream is) throws IOException
    {
        return like( new InputStreamReader( is));
    }

    /**
     * Return a query that will return docs like the passed Reader.
     */
    public Query like( Reader r) throws IOException
    {
        PriorityQueue q = mlt( r);
        BooleanQuery query = new BooleanQuery();
        Object cur;
        int qterms = 0;
        float bestScore = 0;
        while ( ( (cur = q.pop()) != null))
        {
            Object[] ar = (Object[]) cur;
            if ( qterms == 0)
            {
                bestScore = ((Float) ar[ 1]).floatValue();
            }
            float myScore = ((Float) ar[ 1]).floatValue();

            TermQuery tq = new TermQuery( new Term( fieldName, (String) ar[ 0]));
            if ( boost)
            {
                tq.setBoost( myScore / bestScore);
            }
            try
            {
                query.add( tq, false, false);
            }
            catch ( BooleanQuery.TooManyClauses ignore)
            {
                break;
            }

            qterms++;
            if ( maxQueryTerms > 0 && qterms >= maxQueryTerms) break;
        }
        return query;
    }

    /**
     * Find words for a more-like-this query former.
     *
     * @param r the reader that has the content of the document
     * @return a priority queue ordered by words with the largest score (tf*idf)
     */
    public PriorityQueue mlt( Reader r) throws IOException
    {
        TokenStream ts = an.tokenStream( fieldName, r);
        org.apache.lucene.analysis.Token toke;
        Map words = new HashMap();
        while ( (toke = ts.next()) != null) // for every token
        {
            String word = toke.termText();

            // filter out stop words
            if ( stopTable != null && stopTable.contains( word)) continue;

            // filter out words too small or large
            int len = word.length();
            if ( minWordLen > 0 && len < minWordLen) continue;
            if ( maxWordLen > 0 && len > maxWordLen) continue;

            // increment frequency
            Int cnt = (Int) words.get( word);
            if ( cnt == null)
                words.put( word, new Int());
            else
                cnt.x++;
        }

        // have collected all words in doc and their freqs
        int numDocs = ir.numDocs();
        FreqQ res = new FreqQ( words.size()); // will order words by score
        Iterator it = words.keySet().iterator();
        while ( it.hasNext()) // for every word
        {
            String word = (String) it.next();

            int tf = ((Int) words.get( word)).x; // term freq in the source doc
            if ( minTermFreq > 0 && tf < minTermFreq) continue; // filter out words that don't occur enough times in the source

            Term t = new Term( fieldName, word);
            int docFreq = ir.docFreq( t);
            if ( minDocFreq > 0 && docFreq < minDocFreq) continue; // filter out words that don't occur in enough docs
            if ( docFreq == 0) continue; // index update problem?

            float idf = sim.idf( docFreq, numDocs);
            float score = tf * idf;

            // only really need 1st 2 entries, other ones are for troubleshooting
            res.insert( new Object[] { word,                  // the word
                                       new Float( score),     // overall score
                                       new Float( idf),       // idf
                                       new Integer( docFreq), // freq in all docs
                                       new Integer( tf) });
        }
        return res;
    }

    /**
     * Describe the parameters that control how the "more like this" query is formed.
     */
    public String describeParams()
    {
        StringBuffer sb = new StringBuffer();
        sb.append( "\t" + "maxQueryTerms : " + maxQueryTerms + "\n");
        sb.append( "\t" + "minWordLen : " + minWordLen + "\n");
        sb.append( "\t" + "maxWordLen : " + maxWordLen + "\n");
        sb.append( "\t" + "fieldName : \"" + fieldName + "\"\n");
        sb.append( "\t" + "boost : " + boost + "\n");
        sb.append( "\t" + "minTermFreq : " + minTermFreq + "\n");
        sb.append( "\t" + "minDocFreq : " + minDocFreq + "\n");
        sb.append( "\t" + "stopWords : " + (stopTable == null ? "none" : ("" + stopTable.size())) + "\n");
        return sb.toString();
    }

    /**
     * Test driver.
     * Pass in "-i INDEX" and then either "-f FILE" or "-url URL".
     */
    public static void main( String[] a) throws Throwable
    {
        String indexName = "localhost_index";
        String fn = "c:/Program Files/Apache Group/Apache/htdocs/manual/vhosts/index.html.en";
        URL url = null;
        for ( int i = 0; i < a.length; i++)
        {
            if ( a[ i].equals( "-i"))
                indexName = a[ ++i];
            else if ( a[ i].equals( "-f"))
                fn = a[ ++i];
            else if ( a[ i].equals( "-url"))
                url = new URL( a[ ++i]);
        }

        PrintStream o = System.out;
        IndexReader r = IndexReader.open( indexName);
        o.println( "Open index " + indexName + " which has " + r.numDocs() + " docs");

        MoreLikeThis mlt = new MoreLikeThis( r);
        o.println( "Query generation parameters:");
        o.println( mlt.describeParams());
        o.println();

        Query query = null;
        if ( url != null)
        {
            o.println( "Parsing URL: " + url);
            query = mlt.like( url);
        }
        else if ( fn != null)
        {
            o.println( "Parsing file: " + fn);
            query = mlt.like( new File( fn));
        }

        o.println( "q: " + query);
        o.println();

        IndexSearcher searcher = new IndexSearcher( indexName);
        Hits hits = searcher.search( query);
        int len = hits.length();
        o.println( "found: " + len + " documents matching");
        o.println();
        for ( int i = 0; i < Math.min( 25, len); i++)
        {
            Document d = hits.doc( i);
            o.println( "score : " + hits.score( i));
            o.println( "url : " + d.get( "url"));
            o.println( "\ttitle : " + d.get( "title"));
            o.println( "\tsummary: " + d.get( "summary"));
            o.println();
        }
    }

    /**
     * Priority queue that pops the highest-scoring word first.
     */
    private static class FreqQ extends PriorityQueue
    {
        FreqQ( int s)
        {
            initialize( s);
        }

        protected boolean lessThan( Object a, Object b)
        {
            Object[] aa = (Object[]) a;
            Object[] bb = (Object[]) b;
            Float fa = (Float) aa[ 1];
            Float fb = (Float) bb[ 1];
            return fa.floatValue() > fb.floatValue();
        }
    }

    /**
     * Use for frequencies and to avoid creating lots of Integers.
     */
    private static class Int
    {
        int x;

        Int()
        {
            x = 1;
        }
    }

    /**
     * Default list of stopwords to ignore.
     */
    public static Hashtable defStopTable = StopFilter.makeStopTable( StopAnalyzer.ENGLISH_STOP_WORDS);

    /**
     * Default analyzer to parse source doc with.
     */
    public static Analyzer defAnalyzer = new StandardAnalyzer();

    /**
     * Ignore terms with less than this frequency in the source doc.
     */
    public static int defMinTermFreq = 2;

    /**
     * Ignore words which do not occur in at least this many docs.
     */
    public static int defMinDocFreq = 5;

    /**
     * Boost terms in query based on "score"?
     */
    public static boolean defBoost = false;

    /**
     * Default field name.
     */
    public static String defFieldName = "contents";

    /**
     * Ignore words less than this length, or if 0 then this has no effect.
     */
    public static int defMinWordLen = 0;

    /**
     * Ignore words greater than this length, or if 0 then this has no effect.
     */
    public static int defMaxWordLen = 0;

    /**
     * Return a Query with no more than this many terms.
     * @see BooleanQuery#getMaxClauseCount
     * @see BooleanQuery#setMaxClauseCount
     */
    public static int defMaxQueryTerms = 25;

    /**
     * List of stop words that will be ignored.
     * Use StopFilter.makeStopTable( ...) to set this.
     */
    public Hashtable stopTable = defStopTable;

    /**
     * Analyzer that will be used to parse the doc.
     */
    public Analyzer an = defAnalyzer;

    /**
     * Ignore words less frequent than this.
     */
    public int minTermFreq = defMinTermFreq;

    /**
     * Ignore words which do not occur in at least this many docs.
     */
    public int minDocFreq = defMinDocFreq;

    /**
     * Should we apply a boost to the Query based on the scores?
     */
    public boolean boost = defBoost;

    /**
     * Field name we'll analyze.
     */
    public String fieldName = defFieldName;

    /**
     * Ignore words if less than this len.
     */
    public int minWordLen = defMinWordLen;

    /**
     * Ignore words if greater than this len.
     */
    public int maxWordLen = defMaxWordLen;

    /**
     * Don't return a query longer than this.
     */
    public int maxQueryTerms = defMaxQueryTerms;

    /**
     * For idf() calculations.
     */
    public Similarity sim = new DefaultSimilarity();

    /**
     * The index that similar docs will be looked up in.
     */
    private final IndexReader ir;
}
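To make the "quickest way to use" description above concrete, here is a rough end-to-end sketch. The index name, source-document path, and the stored field being printed ("url") are only examples - substitute whatever your index actually contains:

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import com.tropo.lucene.MoreLikeThis;

public class MoreLikeThisExample
{
    public static void main( String[] args) throws Exception
    {
        IndexReader r = IndexReader.open( "localhost_index");    // example index name
        MoreLikeThis mlt = new MoreLikeThis( r);                  // all defaults
        System.out.println( mlt.describeParams());                // show the params in use

        // build a "more like this" query from a source doc (File, URL or InputStream)
        Query q = mlt.like( new File( "some-source-doc.html"));   // example path

        // then pass q to IndexSearcher as usual
        IndexSearcher searcher = new IndexSearcher( "localhost_index");
        Hits hits = searcher.search( q);
        for ( int i = 0; i < Math.min( 25, hits.length()); i++)
            System.out.println( hits.score( i) + "\t" + hits.doc( i).get( "url"));

        searcher.close();
        r.close();
    }
}

The test driver in main() above does roughly the same thing from the command line, e.g. (classpath details omitted):

java com.tropo.lucene.MoreLikeThis -i localhost_index -f some-source-doc.html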
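Finally, a tiny sketch of the tf*idf score that the candidate words are ranked by; it just calls the same Similarity.idf() that mlt() uses above. The index name, field, term, and tf value are again only examples:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.Similarity;

public class ScoreOneTerm
{
    public static void main( String[] args) throws Exception
    {
        IndexReader ir = IndexReader.open( "localhost_index");      // example index name
        Similarity sim = new DefaultSimilarity();

        int tf = 3;                                                  // how often the word occurs in the source doc
        int docFreq = ir.docFreq( new Term( "contents", "vhosts")); // how many indexed docs contain it
        float idf = sim.idf( docFreq, ir.numDocs());

        // words that are frequent in the source doc but rare in the index score highest
        float score = tf * idf;
        System.out.println( "score = " + score);

        ir.close();
    }
}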