Hi,
The settings look reasonable. But for testing purposes, I would get rid of
the other query filters and put some print statements in the query-stemmer
to see what's happening.
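
For example, something like this in the plugin's filter() method (just a
sketch based on the StemmerQueryFilter source further down in this thread;
the exact messages don't matter):

  public BooleanQuery filter(Query input, BooleanQuery output) {
    // Temporary trace: if this never prints, the plugin isn't being called.
    System.out.println("StemmerQueryFilter.filter() called, input = " + input);
    addTerms(input, output);
    addSloppyPhrases(input, output);
    // Temporary trace: the Lucene query the filter actually built.
    System.out.println("StemmerQueryFilter.filter() output = " + output);
    return output;
  }

If the first line never prints, the plugin isn't in the query path at all;
if it prints but the output query doesn't contain the stemmed term, the
stemming itself is the problem.
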
Howie
>In my nutch-site.xml I overrode the plugin.includes property as below:
>
><property>
> <name>plugin.includes</name>
>
><value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
> <description>Regular expression naming plugin directory names to
> include. Any plugin not matching this expression is excluded.
> In any case you need at least include the nutch-extensionpoints plugin.
> By default Nutch includes crawling just HTML and plain text via HTTP,
> and basic indexing and search plugins.
> </description>
></property>
>
>
>However, it is still only letting me search for the stemmed term (i.e.
>"interview" returns results but "interviewed" doesn't, even though that's
>the word that's actually on the page).
>
>I tried a different approach and removed the query-stemmer value from
>nutch-site.xml to attempt to disable the plugin. I reran the crawl and it
>didn't load the plugin. However, it still had the same stemming
>functionality. I'm guessing this is due to editing the main files such as
>CommonGrams.java and NutchDocumentAnalyzer.java. Should I attempt to copy
>the needed methods into StemmerQueryFilter.java and try to isolate all
>functionality to the plugin alone?
>
>Thanks,
> Matt
>
>Howie Wang wrote:
>>It sounds like the query-stemmer is not being called.
>>The query string "interviews" needs to be processed
>>into "interview". Are you sure that your nutch-default.xml
>>is including the query-stemmer correctly? Put print statements
>>in to see if it's getting there.
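>>
>>Just to make that concrete, here's a tiny standalone check (an untested
>>sketch, using the same Lucene analysis classes as the query-stemmer source
>>below) that should print the stemmed form of the query word:
>>
>>import java.io.IOException;
>>import java.io.StringReader;
>>import org.apache.lucene.analysis.LowerCaseTokenizer;
>>import org.apache.lucene.analysis.PorterStemFilter;
>>import org.apache.lucene.analysis.Token;
>>import org.apache.lucene.analysis.TokenStream;
>>
>>public class StemCheck {
>>  public static void main(String[] args) throws IOException {
>>    // Run a query word through a lowercase + Porter stemmer chain,
>>    // as getStemmedWords() in the query-stemmer source does.
>>    TokenStream ts = new PorterStemFilter(
>>        new LowerCaseTokenizer(new StringReader("interviews")));
>>    Token token = ts.next();
>>    System.out.println(token.termText());  // prints "interview"
>>  }
>>}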
>>
>>By the way, someone recently told me that they
>>were able to put all the stemming code into an indexing
>>filter without touching any of the main code. All they
>>did was to copy some of the code that is being done
>>in NutchDocumentAnalyzer and CommonGrams into
>>their custom index filter. Haven't tried it myself.
>>
>>HTH
>>Howie
>>
>>>OK. I did this for Nutch 0.8 (I had to edit the listed code a bit to
>>>account for changes from 0.7.2 to 0.8, mostly to do with the Configuration
>>>type now being required).
>>>
>>>It partially works.
>>>
>>>If the page I'm trying to index contains the word "interviews" and I type
>>>"interview" into the search engine, the stemming takes place and the page
>>>with the word "interviews" is returned.
>>>However, if I type in the word "interviews", no page is returned. (The
>>>page with the word "interviews" on it should be returned.)
>>>
>>>Any ideas??
>>>Matt
>>>
>>>Dima Mazmanov wrote:
>>>>Hi,
>>>>
>>>>I've gotten a couple of questions offlist about stemming
>>>>so I thought I'd just post here with my changes. Sorry that
>>>>some of the changes are in the main code and not in a plugin. It
>>>>seemed more efficient to put it in the main analyzer. It
>>>>would be nice if later releases could add support for plugging
>>>>in a custom stemmer/analyzer.
>>>>
>>>>The first change I made is in NutchDocumentAnalyzer.java.
>>>>
>>>>Import the following classes at the top of the file:
>>>>import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>import org.apache.lucene.analysis.LowerCaseFilter;
>>>>import org.apache.lucene.analysis.PorterStemFilter;
>>>>
>>>>Change tokenStream to:
>>>>
>>>>  public TokenStream tokenStream(String field, Reader reader) {
>>>>    TokenStream ts =
>>>>        CommonGrams.getFilter(new NutchDocumentTokenizer(reader), field);
>>>>    if (field.equals("content") || field.equals("title")) {
>>>>      ts = new LowerCaseFilter(ts);
>>>>      return new PorterStemFilter(ts);
>>>>    } else {
>>>>      return ts;
>>>>    }
>>>>  }
>>>>
>>>>The second change is in CommonGrams.java.
>>>>Import the following classes near the top:
>>>>
>>>>import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>import org.apache.lucene.analysis.LowerCaseFilter;
>>>>import org.apache.lucene.analysis.PorterStemFilter;
>>>>
>>>>In optimizePhrase, after this line:
>>>>
>>>> TokenStream ts = getFilter(new ArrayTokens(phrase), field);
>>>>
>>>>Add:
>>>>
>>>> ts = new PorterStemFilter(new LowerCaseFilter(ts));
>>>>
>>>>And the rest is a new QueryFilter plugin that I'm calling query-stemmer.
>>>>Here's the full source for the Java file. You can copy the build.xml
>>>>and plugin.xml from query-basic, and alter the names for query-stemmer.
>>>>
>>>>/* Copyright (c) 2003 The Nutch Organization. All rights reserved. */
>>>>/* Use subject to the conditions in http://www.nutch.org/LICENSE.txt. */
>>>>
>>>>package org.apache.nutch.searcher.stemmer;
>>>>
>>>>import org.apache.lucene.search.BooleanQuery;
>>>>import org.apache.lucene.search.PhraseQuery;
>>>>import org.apache.lucene.search.TermQuery;
>>>>import org.apache.lucene.analysis.TokenFilter;
>>>>import org.apache.lucene.analysis.TokenStream;
>>>>import org.apache.lucene.analysis.Token;
>>>>import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>import org.apache.lucene.analysis.LowerCaseFilter;
>>>>import org.apache.lucene.analysis.PorterStemFilter;
>>>>
>>>>import org.apache.nutch.analysis.NutchDocumentAnalyzer;
>>>>import org.apache.nutch.analysis.CommonGrams;
>>>>
>>>>import org.apache.nutch.searcher.QueryFilter;
>>>>import org.apache.nutch.searcher.Query;
>>>>import org.apache.nutch.searcher.Query.*;
>>>>
>>>>import java.io.IOException;
>>>>import java.util.HashSet;
>>>>import java.io.StringReader;
>>>>
>>>>/** A stemming query filter. Query terms in the default query field are
>>>> * stemmed and expanded to search the url, anchor, content and title
>>>> * document fields. */
>>>>public class StemmerQueryFilter implements QueryFilter {
>>>>
>>>>  private static float URL_BOOST = 4.0f;
>>>>  private static float ANCHOR_BOOST = 2.0f;
>>>>
>>>>  private static int SLOP = Integer.MAX_VALUE;
>>>>  private static float PHRASE_BOOST = 1.0f;
>>>>
>>>>  private static final String[] FIELDS =
>>>>    {"url", "anchor", "content", "title"};
>>>>  private static final float[] FIELD_BOOSTS =
>>>>    {URL_BOOST, ANCHOR_BOOST, 1.0f, 2.0f};
>>>>
>>>>  /** Set the boost factor for url matches, relative to content and
>>>>   * anchor matches. */
>>>>  public static void setUrlBoost(float boost) { URL_BOOST = boost; }
>>>>
>>>>  /** Set the boost factor for title/anchor matches, relative to url and
>>>>   * content matches. */
>>>>  public static void setAnchorBoost(float boost) { ANCHOR_BOOST = boost; }
>>>>
>>>>  /** Set the boost factor for sloppy phrase matches relative to unordered
>>>>   * term matches. */
>>>>  public static void setPhraseBoost(float boost) { PHRASE_BOOST = boost; }
>>>>
>>>>  /** Set the maximum number of terms permitted between matching terms in
>>>>   * a sloppy phrase match. */
>>>>  public static void setSlop(int slop) { SLOP = slop; }
>>>>
>>>>  public BooleanQuery filter(Query input, BooleanQuery output) {
>>>>    addTerms(input, output);
>>>>    addSloppyPhrases(input, output);
>>>>    return output;
>>>>  }
>>>>
>>>>  private static void addTerms(Query input, BooleanQuery output) {
>>>>    Clause[] clauses = input.getClauses();
>>>>    for (int i = 0; i < clauses.length; i++) {
>>>>      Clause c = clauses[i];
>>>>
>>>>      if (!c.getField().equals(Clause.DEFAULT_FIELD))
>>>>        continue;                              // skip non-default fields
>>>>
>>>>      BooleanQuery out = new BooleanQuery();
>>>>      for (int f = 0; f < FIELDS.length; f++) {
>>>>
>>>>        Clause o = c;
>>>>        String[] opt;
>>>>
>>>>        // TODO: I'm a little nervous about stemming for all default
>>>>        // fields. Should keep an eye on this.
>>>>        if (c.isPhrase()) {                    // optimize phrase clauses
>>>>          opt = CommonGrams.optimizePhrase(c.getPhrase(), FIELDS[f]);
>>>>        } else {
>>>>          System.out.println("o.getTerm = " + o.getTerm().toString());
>>>>          opt = getStemmedWords(o.getTerm().toString());
>>>>        }
>>>>        if (opt.length == 1) {
>>>>          o = new Clause(new Term(opt[0]), c.isRequired(), c.isProhibited());
>>>>        } else {
>>>>          o = new Clause(new Phrase(opt), c.isRequired(), c.isProhibited());
>>>>        }
>>>>
>>>>        out.add(o.isPhrase()
>>>>                ? exactPhrase(o.getPhrase(), FIELDS[f], FIELD_BOOSTS[f])
>>>>                : termQuery(FIELDS[f], o.getTerm(), FIELD_BOOSTS[f]),
>>>>                false, false);
>>>>      }
>>>>      output.add(out, c.isRequired(), c.isProhibited());
>>>>    }
>>>>    System.out.println("query = " + output.toString());
>>>>  }
>>>>
>>>>  private static String[] getStemmedWords(String value) {
>>>>    StringReader sr = new StringReader(value);
>>>>    TokenStream ts = new PorterStemFilter(new LowerCaseTokenizer(sr));
>>>>
>>>>    String stemmedValue = "";
>>>>    try {
>>>>      Token token = ts.next();
>>>>      int count = 0;
>>>>      while (token != null) {
>>>>        System.out.println("token = " + token.termText());
>>>>        System.out.println("type = " + token.type());
>>>>
>>>>        if (count == 0)
>>>>          stemmedValue = token.termText();
>>>>        else
>>>>          stemmedValue = stemmedValue + " " + token.termText();
>>>>
>>>>        token = ts.next();
>>>>        count++;
>>>>      }
>>>>    } catch (Exception e) {
>>>>      stemmedValue = value;
>>>>    }
>>>>
>>>>    if (stemmedValue.equals("")) {
>>>>      stemmedValue = value;
>>>>    }
>>>>
>>>>    String[] stemmedValues = stemmedValue.split("\\s+");
>>>>
>>>>    for (int j = 0; j < stemmedValues.length; j++) {
>>>>      System.out.println("stemmedValues = " + stemmedValues[j]);
>>>>    }
>>>>    return stemmedValues;
>>>>  }
>>>>
>>>>  private static void addSloppyPhrases(Query input, BooleanQuery output) {
>>>>    Clause[] clauses = input.getClauses();
>>>>    for (int f = 0; f < FIELDS.length; f++) {
>>>>
>>>>      PhraseQuery sloppyPhrase = new PhraseQuery();
>>>>      sloppyPhrase.setBoost(FIELD_BOOSTS[f] * PHRASE_BOOST);
>>>>      sloppyPhrase.setSlop("anchor".equals(FIELDS[f])
>>>>                           ? NutchDocumentAnalyzer.INTER_ANCHOR_GAP
>>>>                           : SLOP);
>>>>      int sloppyTerms = 0;
>>>>
>>>>      for (int i = 0; i < clauses.length; i++) {
>>>>        Clause c = clauses[i];
>>>>
>>>>        if (!c.getField().equals(Clause.DEFAULT_FIELD))
>>>>          continue;                            // skip non-default fields
>>>>
>>>>        if (c.isPhrase())                      // skip exact phrases
>>>>          continue;
>>>>
>>>>        if (c.isProhibited())                  // skip prohibited terms
>>>>          continue;
>>>>
>>>>        sloppyPhrase.add(luceneTerm(FIELDS[f], c.getTerm()));
>>>>        sloppyTerms++;
>>>>      }
>>>>
>>>>      if (sloppyTerms > 1)
>>>>        output.add(sloppyPhrase, false, false);
>>>>    }
>>>>  }
>>>>
>>>>  private static org.apache.lucene.search.Query
>>>>      termQuery(String field, Term term, float boost) {
>>>>    TermQuery result = new TermQuery(luceneTerm(field, term));
>>>>    result.setBoost(boost);
>>>>    return result;
>>>>  }
>>>>
>>>>  /** Utility to construct a Lucene exact phrase query for a Nutch
>>>>   * phrase. */
>>>>  private static org.apache.lucene.search.Query
>>>>      exactPhrase(Phrase nutchPhrase, String field, float boost) {
>>>>    Term[] terms = nutchPhrase.getTerms();
>>>>    PhraseQuery exactPhrase = new PhraseQuery();
>>>>    for (int i = 0; i < terms.length; i++) {
>>>>      exactPhrase.add(luceneTerm(field, terms[i]));
>>>>    }
>>>>    exactPhrase.setBoost(boost);
>>>>    return exactPhrase;
>>>>  }
>>>>
>>>>  /** Utility to construct a Lucene Term given a Nutch query term and
>>>>   * field. */
>>>>  private static org.apache.lucene.index.Term luceneTerm(String field,
>>>>                                                         Term term) {
>>>>    return new org.apache.lucene.index.Term(field, term.toString());
>>>>  }
>>>>}
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>>
>>