Andrzej Bialecki wrote:
David Spencer wrote:
I can/should send the code out. The logic is that for any terms in a query that have zero matches, go through all the terms(!), calculate the Levenshtein string distance, and return the best matches. A more intelligent way of doing this is to instead look for terms that also match on the first "n" (probably 3) characters.
...or prepare in advance a fast lookup index - split all existing terms into bi- or trigrams, create a separate lookup index, and then for each term simply ask a phrase query (phrase = all n-grams from an input term), with a slop > 0, to get similar existing terms. This should be fast, and you could provide a "did you mean" function too...
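(For reference, the brute-force fallback described in the first quoted paragraph amounts to roughly the sketch below: scan every term in the original index and keep the closest one by edit distance. The index name, field name, and the levenshtein() helper are placeholders for illustration; the TRStringDistance class in the attached source is one such edit-distance routine.)

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class BruteForceSuggest {
    // Walk every term in the original index and remember the one closest
    // (by Levenshtein distance) to the misspelled word. Slow, since it
    // touches every term -- which is what the ngram lookup index avoids.
    static String closestTerm(String indexName, String field, String goal) throws Exception {
        IndexReader reader = IndexReader.open(indexName);
        TermEnum terms = reader.terms();
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        while (terms.next()) {
            Term t = terms.term();
            if (!t.field().equals(field))
                continue;                               // only the field we actually query on
            int dist = levenshtein(goal, t.text());
            if (dist < bestDist) {
                bestDist = dist;
                best = t.text();
            }
        }
        terms.close();
        reader.close();
        return best;
    }

    // Plain dynamic-programming Levenshtein distance; any implementation works here.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }
}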
Based on this mail I wrote an "ngram speller" for Lucene. It runs in two phases. First you build a "fast lookup index" as mentioned above. Then, to correct a word, you run a query against this index based on the ngrams in the misspelled word.
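In code, the two phases look roughly like this sketch; the index names ("orig_index", "ngram_index"), the field name, and the boost/limit values are just example settings, not anything fixed:

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.spell.NGramSpeller;

public class SpellCheckDemo {
    public static void main(String[] args) throws Throwable {
        // Phase 1: build the ngram lookup index from an existing index,
        // easiest via the command-line driver in the attached source:
        //   java org.apache.lucene.spell.NGramSpeller -f contents -i orig_index -o ngram_index

        // Phase 2: query the lookup index with the ngrams of a misspelled word.
        IndexSearcher searcher = new IndexSearcher("ngram_index");
        String[] suggestions = NGramSpeller.suggestUsingNGrams(
            searcher,
            "recursixe", // the misspelled word
            3, 4,        // min/max ngram length, matching what the index was built with
            5,           // max results to return
            2.0f,        // prefix boost
            1.0f,        // suffix boost
            5,           // max edit distance filter (0 to ignore it)
            null);       // optional details list, mainly useful for testing
        for (int i = 0; i < suggestions.length; i++)
            System.out.println(suggestions[i]);
        searcher.close();
    }
}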
Let's see.
[1] Source is attached and I'd like to contribute it to the sandbox, especially if someone can validate that what it's doing is reasonable and useful.
[2] Here's a demo page. I built an ngram index for ngrams of length 3 and 4 based on the existing index I have of approximately 100k javadoc-generated pages. You type in a misspelled word like "recursixe" or whatnot to see what suggestions it returns. Note this is not a normal search index query -- rather, it is a test page for spelling corrections.
http://www.searchmorph.com/kat/spell.jsp
[3] Here's the javadoc:
http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html
[4] Here's the source in HTML:
http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
[5] A few more details:
Based on a subsequent mail in this thread I set boosts for the words in the ngram index. The background is that each word (er, each term for a given field) in the original index becomes a separate Document in the ngram index. This Document contains all the ngrams (in my test case, as in #2 above, of length 3 and 4) of the word. I also set a boost of log(word_freq)/log(num_docs) so that more frequent words will tend to be suggested more often.
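Concretely, here is a sketch of roughly what gets stored for one term; the field names and boost formula follow the attached formNGramIndex(), and for the javadoc's "kings" example it would produce gram3:kin/ing/ngs, gram4:king/ings, plus the start/end grams:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class NGramDocSketch {
    // Roughly what the attached formNGramIndex() writes for one term.
    // word = the term (e.g. "kings"), docFreq = how many docs contain it,
    // numDocs = total docs in the original index.
    static Document makeNGramDoc(String word, int docFreq, int numDocs) {
        Document doc = new Document();
        doc.add(Field.Keyword("word", word));                   // the original term
        for (int ng = 3; ng <= 4; ng++) {                       // ngram lengths 3 and 4
            String last = null;
            for (int i = 0; i + ng <= word.length(); i++) {
                String gram = word.substring(i, i + ng);
                doc.add(Field.Keyword("gram" + ng, gram));      // e.g. gram3:kin, gram3:ing, gram3:ngs
                if (i == 0)
                    doc.add(Field.Keyword("start" + ng, gram)); // prefix gram, e.g. start3:kin
                last = gram;
            }
            if (last != null)
                doc.add(Field.Keyword("end" + ng, last));       // suffix gram, e.g. end3:ngs
        }
        // boost = log(word_freq)/log(num_docs), plus a small nudge (as in the
        // attached source) so frequent words tend to be suggested more often.
        doc.setBoost((float) (Math.log(docFreq) / Math.log(numDocs)) + 0.01f);
        return doc;
    }
}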
I think in "plain" English then the way a word is suggested as a spelling correction is:
- frequently occuring words score higher
- words that share more ngrams with the orig word score higher
- words that share rare ngrams with the orig word score higher
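To make those three rules concrete, here is roughly the query that gets built for a misspelled word; this follows the attached suggestUsingNGrams(). Words sharing more grams match more optional clauses, rare grams count for more via Lucene's normal idf weighting, and frequent words win through the per-Document boost set at index time:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class NGramQuerySketch {
    // Roughly the query suggestUsingNGrams() builds for a misspelled word.
    // Every ngram becomes an optional clause; the prefix/suffix grams get
    // extra boosts on the heuristic that people usually get the start
    // (and often the end) of a word right.
    static BooleanQuery makeQuery(String goal, int ng1, int ng2,
                                  float prefixBoost, float suffixBoost) {
        BooleanQuery query = new BooleanQuery();
        for (int ng = ng1; ng <= ng2; ng++) {
            int count = goal.length() - ng + 1;
            if (count <= 0)
                continue;                                 // word too short for this ngram length
            for (int i = 0; i < count; i++) {
                String gram = goal.substring(i, i + ng);
                // e.g. gram3:kin gram3:ing gram3:ngz for the goal "kingz"
                query.add(new BooleanClause(new TermQuery(new Term("gram" + ng, gram)), false, false));
                if (i == 0) {
                    TermQuery start = new TermQuery(new Term("start" + ng, gram));
                    start.setBoost(prefixBoost);          // e.g. start3:kin^2.0
                    query.add(new BooleanClause(start, false, false));
                }
                if (i == count - 1) {
                    TermQuery end = new TermQuery(new Term("end" + ng, gram));
                    end.setBoost(suffixBoost);            // e.g. end3:ngz^1.0
                    query.add(new BooleanClause(end, false, false));
                }
            }
        }
        return query;
    }
}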
[6]
If people want to vote me in as a committer to the sandbox then I can check this code in - though again, I'd appreciate feedback.
thx, Dave
package org.apache.lucene.spell;
/* ==================================================================== * The Apache Software License, Version 1.1 * * Copyright (c) 2001-2003 The Apache Software Foundation. All rights * reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in * the documentation and/or other materials provided with the * distribution. * * 3. The end-user documentation included with the redistribution, * if any, must include the following acknowledgment: * "This product includes software developed by the * Apache Software Foundation (http://www.apache.org/)." * Alternately, this acknowledgment may appear in the software itself, * if and wherever such third-party acknowledgments normally appear. * * 4. The names "Apache" and "Apache Software Foundation" and * "Apache Lucene" must not be used to endorse or promote products * derived from this software without prior written permission. For * written permission, please contact [EMAIL PROTECTED] * * 5. Products derived from this software may not be called "Apache", * "Apache Lucene", nor may "Apache" appear in their name, without * prior written permission of the Apache Software Foundation. * * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE * DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * ==================================================================== * * This software consists of voluntary contributions made by many * individuals on behalf of the Apache Software Foundation. For more * information on the Apache Software Foundation, please see * <http://www.apache.org/>. */ import org.apache.lucene.analysis.*; import org.apache.lucene.index.*; import org.apache.lucene.store.*; import org.apache.lucene.document.*; import org.apache.lucene.util.*; import org.apache.lucene.search.*; import org.apache.lucene.queryParser.*; import java.io.*; import java.util.*; import java.text.*; /** * Do spelling correction based on ngram frequency of terms in an index. * * Developed based on <a href="http://marc.theaimsgroup.com/?l=lucene-user&m=109474652805339&w=2">this message</a> in the lucene-user list. * * <p> * There are two parts to this algorithm. * First a ngram lookup table is formed for all terms in an index. * Then suggested spelling corrections can be done based on this table. * <p> * The "lookup table" is actually another Lucene index. * It is built by going through all terms in * your original index and storing the term in a Document with all ngrams that make it up. * Ngrams of length 3 and 4 are suggested. 
* <p> * * In addition the prefix and suffix ngrams are stored in case you want to use a heuristic that people usually * know the first few characters of a word. * * <p> * The entry's boost is set by default to log(word_freq)/log(num_docs). * * <p> * * For a word like "kings" a [EMAIL PROTECTED] Document} with the following fields is made in the ngram index: * * <pre> * word:kings * gram3:kin * gram3:ing * gram3:ngs * gram4:king * gram4:ings * start3:kin * start4:king * end3:ngs * end4:ings * * boost: log(freq('kings'))/log(num_docs). * </pre> * * * When a lookup is done a query is formed with all ngrams in the misspelled word. * * <p> * For a word like <code>"kingz"</code> a query is formed like this. * * Query: <br> * <code> * gram3:kin gram3:ing gram3:ngz start3:kin^B1 end3:ngz^B2 start4:king^B1 end4:ingz^B2 * </code> * <br> * * Above B1 and B2 are the prefix and suffix boosts. * The prefix boost should probably be >= 2.0 and the suffix boost should probably be just a little above 1. * * <p> * <b>To build</b> the ngram index based on the "contents" field in an existing index 'orig_index' you run the main() driver like this:<br> * <code> * java org.apache.lucene.spell.NGramSpeller -f contents -i orig_index -o ngram_index * </code> * * <p> * Once you build an index you can <b>perform spelling corrections using</b> * [EMAIL PROTECTED] #suggestUsingNGrams suggestUsingNGrams(...)}. * * * <p> * * To play around with the code against an index approx 100k javadoc-generated web pages circa Sept/2004 * go here: * <a href='http://www.searchmorph.com/kat/spell.jsp'>http://www.searchmorph.com/kat/spell.jsp</a>. * * <p> * Of interest might be the <a href="http://secondstring.sourceforge.net/">secondstring</a> string matching package * and * <a href="http://specialist.nlm.nih.gov/nls/gspell/doc/apiDoc/overview-summary.html">gspell</a>. * * * * @author <a href="mailto:dave@tropo.com?subject=NGramSpeller">David Spencer</a> * */ public final class NGramSpeller { /** * */ private NGramSpeller() { } /** * Main driver, used to build an index. * You probably want invoke like this: * <br> * <code> * java org.apache.lucene.spell.NGramSpeller -f contents -i orig_index -o ngram_index * </code> */ public static void main(String[] args) throws Throwable { int minThreshold = 5; int ng1 = 3; int ng2 = 4; int maxr = 10; int maxd = 5; String out = "gram_index"; String gi = "gram_index"; String name = null; String field = "contents"; for ( int i = 0; i< args.length; i++) { if ( args[i].equals( "-i")) name = args[ ++i]; else if ( args[i].equals( "-gi")) gi = args[ ++i]; else if ( args[i].equals( "-o")) out = args[ ++i]; else if ( args[i].equals( "-q")) { String goal = args[ ++i]; o.println( "[NGrams] for " + goal + " from " + gi); float bStart = 1.1f; float bEnd = 1.0f; IndexReader ir = IndexReader.open( gi); IndexSearcher searcher = new IndexSearcher( gi); List lis = new ArrayList( maxr); String[] res = suggestUsingNGrams( searcher, goal, ng1, ng2, maxr, bStart, bEnd, maxd, lis); Iterator it = lis.iterator(); while ( it.hasNext()) { Object[] ar = (Object[]) it.next(); String word = (String) ar[ 0]; o.println( "word=" + ar[ 0] + " score=" + ar[ 1] + " dist=" + ar[ 2] + " :: " + ir.docFreq( new Term( "title", word)) + " : " + ir.docFreq( new Term( "contents", word))); } searcher.close(); ir.close(); return; } else if ( args[i].equals( "-f")) field = args[ ++i]; else { o.println( "hmm? 
" + args[ i]); System.exit( 1); } } if ( name == null) { o.println( "opps"); System.exit( 1); } o.println( "Opening "+ name); IndexReader.unlock( FSDirectory.getDirectory( name, false)); final IndexReader r = IndexReader.open( name); o.println( "Docs: " + nf.format( r.numDocs())); o.println( "Using field: " + field); IndexWriter writer = new IndexWriter( out, new WhitespaceAnalyzer(), true); writer.mergeFactor *= 50; writer.minMergeDocs *= 50; o.println( "Forming index from " + name + " to " + out); int res = formNGramIndex( r, writer, ng1, ng2, field, minThreshold); o.println( "done, did "+ res + " ngrams"); writer.optimize(); writer.close(); r.close(); } /** * Using an NGram algorithm try to find alternate spellings for a "goal" word based on the ngrams in it. * * @param searcher the searcher for the "ngram" index * * @param goal the word you want a spell check done on * * @param ng1 the min ngram length to use, probably 3 and it defaults to 3 if you pass in a value <= 0 * * @param ng2 the max ngram length to use, probably 3 or 4 * * @param maxr max results to return, probably a small number like 5 for normal use or 10-100 for testing * * @param bStart how to boost matches that start the same way as the goal word, probably greater than 2 * * @param bEnd how to boost matches that end the same way as the goal word, probably greater than or equal to 1 * * @param maxd filter for the max Levenshtein string distance for matches, probably a number like 3, 4, or 5, or use 0 for it to be ignored. This prevents words radically longer but similar to the goal word from being returned. * * @param details if non null is a list with one entry per match. * Each entry is an array of ([String] word, [Double] score, [Integer] Levenshtein string distance). * * @return the strings suggested with the best one first */ public static String[] suggestUsingNGrams( Searcher searcher, String goal, int ng1, int ng2, int maxr, float bStart, float bEnd, int maxd, List details) throws Throwable { List res = new ArrayList( maxr); BooleanQuery query = new BooleanQuery(); if ( ng1 <= 0) ng1 = 3; // guess if ( ng2 < ng1) ng2 = ng1; if ( bStart < 0) bStart = 0; if ( bEnd < 0) bEnd = 0; TRStringDistance sd = new TRStringDistance( goal); for ( int ng = ng1; ng <= ng2; ng++) // for every ngram in range { String[] grams = formGrams( goal, ng); // form word into ngrams (allow dups too) if ( grams.length == 0) continue; // hmm String key = "gram" + ng; // form key if ( bStart > 0) // should we boost prefixes? add( query, "start"+ng, grams[ 0], bStart); // matches start of word if ( bEnd > 0) // should we boost suffixes add( query, "end"+ng, grams[grams.length-1], bEnd); // matches end of word // match ngrams anywhere, w/o a boost for ( int i = 0; i < grams.length; i++) { add( query, key, grams[ i]); } } Hits hits = searcher.search(query); int len = hits.length(); int remain = maxr; int stop = Math.min( len, 10*maxr); // go thru more than 'maxr' matches in case the distance filter triggers for (int i = 0; i < Math.min( len, maxr) && remain > 0; i++) { Document d = hits.doc( i); String word = d.get( F_WORD); // get orig word int dist = sd.getDistance( word); // use distance filter if ( maxd > 0 && dist > maxd) continue; remain--; res.add( word); if ( details != null) // only non-null for testing probably details.add( new Object[] { word, new Double( hits.score( i)), new Integer( dist)}); } return (String[]) res.toArray( new String[ 0]); } /** * Add a clause to a boolean query. 
*/ private static void add( BooleanQuery q, String k, String v) { q.add( new BooleanClause( new TermQuery( new Term( k, v)), false, false)); } /** * Go thru all terms and form an index of the "ngrams" of length 'ng1' to 'ng2' in each term. * The ngrams have field names like "gram3" for a 3 char ngram, and "gram4" for a 4 char one. * The starting and ending (or prefix and suffix) "n" characters are also * stored for each word with field names "start3" and "end3". * * * @param r the index to read terms from * * @param w the writer to write the ngrams to, or if null an index named "gram_index" will be created. If you pass in non-null then you should optimize and close the index. * * @param ng1 the min number of chars to form ngrams with (3 is suggested) * * @param ng2 the max number of chars to form ngrams with, can be equal to ng1 * * @param fields the field name to process ngrams from. * * @param minThreshold terms must appear in at least this many docs else they're ignored as the assumption is that they're so rare (...) * * @return the number of ngrams added * */ private static int formNGramIndex( IndexReader r, IndexWriter _w, int ng1, int ng2, String field, int minThreshold) throws IOException { int mins = 0; float nudge = 0.01f; // don't allow boosts to be too small IndexWriter w; if ( _w == null) w = new IndexWriter( "gram_index", new WhitespaceAnalyzer(), // should have no effect true); else w = _w; int mod = 10000; // for status int nd = r.numDocs(); final float base = (float) Math.log( 1.0d / ((double)nd)); if ( field == null) field = "contents"; // def field field = field.intern(); // is it doced that you can use == on fields? int grams = 0; // # of ngrams added final TermEnum te = r.terms(); int n = 0; int skips = 0; while ( te.next()) { boolean show = false; // for debugging Term t = te.term(); int df = te.docFreq(); if ( (++n % mod) == 0) { show = true; o.println( "term: " + t + " n=" + nf.format( n) + " grams=" +nf.format( grams) + " mins="+ nf.format( mins) + " skip=" + nf.format( skips) + " docFreq=" + df); } if ( df < minThreshold) // not freq enough, too rare to consider { mins++; continue; } String have = t.field(); if ( have != field && ! have.equals( field)) { skips++; continue; } String text = t.text(); int len = text.length(); if ( len < ng1) continue; // too short we bail, "too long" is fine Document doc = new Document(); doc.add( Field.Keyword( F_WORD, text)); // orig term // now loop thru all ngrams of lengths 'ng1' to 'ng2' for ( int ng = ng1; ng <= ng2; ng++) { String key = "gram" + ng; String end = null; for ( int i = 0; i < len-ng+1; i++) { String gram = text.substring( i, i+ng); doc.add( Field.Keyword( key, gram)); if ( i == 0) doc.add( Field.Keyword( "start"+ng, gram)); end = gram; grams++; } if ( end != null) // may not be present if len==ng1 doc.add( Field.Keyword( "end"+ng, end)); } float f1 = te.docFreq(); float f2 = nd; float bo = (float)((Math.log( f1) / Math.log( f2)) + nudge); doc.setBoost( bo); if ( show) o.println( "f1=" +f1 + " nd=" +nd + " boost=" +bo + " base=" + base + " word=" + text); w.addDocument( doc); } if ( _w == null) // else you have to optimize/close { w.optimize(); w.close(); } return grams; } /** * Add a clause to a boolean query. */ private static void add( BooleanQuery q, String k, String v, float boost) { Query tq = new TermQuery( new Term( k, v)); tq.setBoost( boost); q.add( new BooleanClause( tq, false, false)); } /** * Form all ngrams for a given word. * @param text the word to parse * @param ng the ngram length e.g. 
3 * @return an array of all ngrams in the word and note that duplicates are not removed */ public static String[] formGrams( String text, int ng) { List res = new ArrayList( text.length()-ng+1); int len = text.length(); StringBuffer sb = new StringBuffer( ng); for ( int i = 0; i < len-ng+1; i++) { res.add( text.substring( i, i+ng)); } return (String[]) res.toArray( new String[ 0]); } /** * Field name for each word in the ngram index. */ public static final String F_WORD = "word"; /** * */ private static final PrintStream o = System.out; /** * */ private static final NumberFormat nf = NumberFormat.getInstance(); /** * Presumably this is implemented somewhere in the apache/jakarta/commons area but I couldn't find it. * * @link http://www.merriampark.com/ld.htm * */ private static class TRStringDistance { final char[] sa; final int n; final int[][][] cache = new int[ 30][][]; /** * Optimized to run a bit faster than the static getDistance(). * In one benchmark times were 5.3sec using ctr vs 8.5sec w/ static method, thus 37% faster. */ private TRStringDistance( String target) { sa = target.toCharArray(); n = sa.length; } //***************************** // Compute Levenshtein distance //***************************** public int getDistance( String other) { int d[][]; // matrix int cost; // cost // Step 1 final char[] ta = other.toCharArray(); final int m = ta.length; if (n == 0) { return m; } if (m == 0) { return n; } if ( m >= cache.length) d = form( n, m); else if ( cache[ m] != null) d = cache[ m]; else d = cache[ m] = form( n, m); // Step 3 for (int i = 1; i <= n; i++) { final char s_i = sa[ i - 1]; // Step 4 for (int j = 1; j <= m; j++) { final char t_j = ta[ j - 1]; // Step 5 if (s_i == t_j) // same cost = 0; else // not a match cost = 1; // Step 6 d[i][j] = min3( d[i-1][j]+1, d[i][j-1]+1, d[i-1][j-1] + cost); } } // Step 7 return d[n][m]; } /** * */ private static int[][] form( int n, int m) { int[][] d = new int[n+1][m+1]; // Step 2 for (int i = 0; i <= n; i++) d[i][0] = i; for (int j = 0; j <= m; j++) d[0][j] = j; return d; } //**************************** // Get minimum of three values //**************************** private static int min3( int a, int b, int c) { int mi; mi = a; if (b < mi) { mi = b; } if (c < mi) { mi = c; } return mi; } //***************************** // Compute Levenshtein distance //***************************** public static int getDistance( String s, String t) { return getDistance( s.toCharArray(), t.toCharArray()); } //***************************** // Compute Levenshtein distance //***************************** public static int getDistance( final char[] sa, final char[] ta) { int d[][]; // matrix int i; // iterates through s int j; // iterates through t char s_i; // ith character of s char t_j; // jth character of t int cost; // cost // Step 1 final int n = sa.length; final int m = ta.length; if (n == 0) { return m; } if (m == 0) { return n; } d = new int[n+1][m+1]; // Step 2 for (i = 0; i <= n; i++) { d[i][0] = i; } for (j = 0; j <= m; j++) { d[0][j] = j; } // Step 3 for (i = 1; i <= n; i++) { s_i = sa[ i - 1]; // Step 4 for (j = 1; j <= m; j++) { t_j = ta[ j - 1]; // Step 5 if (s_i == t_j) { cost = 0; } else { cost = 1; } // Step 6 d[i][j] = min3( d[i-1][j]+1, d[i][j-1]+1, d[i-1][j-1] + cost); } } // Step 7 return d[n][m]; } } }
