Re: [Nutch-general] stemming - RESOLVED

Matthew Holt Fri, 28 Jul 2006 12:04:30 -0700

Howie,
   Thanks for all the help configuring your stemming add-on for version 
0.8. I compared query-basic and query-stemmer, and the only new feature 
that was added is a "host" boost. I made the changes and everything 
works perfectly.
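
For anyone repeating the port, here is a rough sketch of what "adding a host boost" could look like in terms of the field list used by the filter quoted further down. The first four entries mirror the values in that quoted source; the "host" entry and its 2.0f value are assumptions for illustration, not taken from the uploaded code, and the class name FieldBoostSketch is made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FieldBoostSketch {
    // Field name -> boost. "url", "anchor", "content" and "title" use the
    // values from the filter source quoted later in this thread.
    // ASSUMPTION: the "host" entry and its 2.0f boost are placeholders.
    static Map<String, Float> fieldBoosts() {
        Map<String, Float> boosts = new LinkedHashMap<>();
        boosts.put("url", 4.0f);
        boosts.put("anchor", 2.0f);
        boosts.put("content", 1.0f);
        boosts.put("title", 2.0f);
        boosts.put("host", 2.0f); // hypothetical value
        return boosts;
    }

    public static void main(String[] args) {
        // Print each searched field with the boost applied to its matches.
        for (Map.Entry<String, Float> e : fieldBoosts().entrySet()) {
            System.out.println(e.getKey() + " boost = " + e.getValue());
        }
    }
}
```

Keeping the field and its boost in one map entry avoids the two parallel arrays drifting out of sync when a field is added.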


I uploaded the code to the wiki for both version 0.7.2 and 0.8. You can 
access it at the URL below.

http://wiki.apache.org/nutch/FAQ#head-fa0c678473eeecf3771e490b22d385054697232c

Take care,
  Matt

Howie Wang wrote:
> Hi, Matt,
>
> In 0.7, you wouldn't miss anything. That code was written to
> replace the basic query filter, and handled all the fields that
> basic query filter was handling. For 0.8, I'm really not sure.
> I'm guessing the code is fairly simple still in 0.8. You can probably
> figure out if query-basic in 0.8 is doing something appreciably different
> than query-stemmer by just visually comparing the files.
>
> Howie
>
>> Howie,
>>  The query-stemmer works great as long as query-basic is not enabled. 
>> However, if I don't have query-basic enabled, won't I be missing some 
>> needed functionality?
>>  Matt
>>
>> Howie Wang wrote:
>>> Hi,
>>>
>>> The settings look reasonable. But for testing purposes, I would get
>>> rid of the other query filters and put in some print statements in
>>> the query-stemmer to see what's happening.
>>>
>>> Howie
>>>
>>>> In my nutch-site.xml I overrode the plugin.includes property as below:
>>>>
>>>> <property>
>>>>  <name>plugin.includes</name>
>>>>  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
>>>>  <description>Regular expression naming plugin directory names to
>>>>  include.  Any plugin not matching this expression is excluded.
>>>>  In any case you need at least include the nutch-extensionpoints
>>>>  plugin. By default Nutch includes crawling just HTML and plain text
>>>>  via HTTP, and basic indexing and search plugins.
>>>>  </description>
>>>> </property>
>>>>
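
A quick way to check which plugin directory names a plugin.includes value lets through is to test it as an ordinary regular expression. The sketch below uses the value quoted above verbatim; it assumes that a plugin is included only when its whole name matches the expression, and the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

final class PluginIncludesCheck {
    // The plugin.includes value from the message above, copied verbatim.
    static final String INCLUDES =
        "protocol-httpclient|urlfilter-regex"
        + "|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)"
        + "|index-(basic|more)|query-(more|site|stemmer|url)"
        + "|summary-basic|scoring-opic";

    private static final Pattern PATTERN = Pattern.compile(INCLUDES);

    // ASSUMPTION: inclusion requires the entire plugin name to match.
    static boolean isIncluded(String pluginName) {
        return PATTERN.matcher(pluginName).matches();
    }

    public static void main(String[] args) {
        String[] names = {"query-stemmer", "query-basic", "index-more"};
        for (String name : names) {
            System.out.println(name + " included: " + isIncluded(name));
        }
    }
}
```

Under that assumption, query-stemmer is let through by this value and query-basic is not.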
>>>>
>>>> However, it is still only letting me search for the stemmed term 
>>>> (i.e., "Interview" returns results but "interviewed" doesn't, even 
>>>> though that's the word that's actually on the page).
>>>>
>>>> I tried a different approach and removed the query-stemmer value 
>>>> from nutch-site.xml to attempt to disable the plugin. I reran the 
>>>> crawl and it didn't load the plugin. However, it still had the same 
>>>> stemming functionality. I'm guessing this is due to editing the 
>>>> main files such as CommonGrams.java and NutchDocumentAnalyzer.java. 
>>>> Should I attempt to copy the needed methods into 
>>>> StemmerQueryFilter.java and try to isolate all functionality to the 
>>>> plugin alone?
>>>>
>>>> Thanks,
>>>>    Matt
>>>>
>>>> Howie Wang wrote:
>>>>> It sounds like the query-stemmer is not being called.
>>>>> The query string "interviews" needs to be processed
>>>>> into "interview". Are you sure that your nutch-default.xml
>>>>> is including the query-stemmer correctly? Put print statements
>>>>> in to see if it's getting there.
>>>>>
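
The symptom described in this thread (a search for the stem finds the page, a search for the word as written does not) is what one would expect when the index holds stemmed terms but the query is passed through unstemmed. A minimal, self-contained sketch of that mismatch follows. The toyStem method is a deliberately crude stand-in that only strips a trailing "ed" or "s"; it is not the Porter algorithm and not Lucene's PorterStemFilter, and all names here are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class StemMismatchDemo {
    // Toy stand-in for a stemmer: lowercases, then strips a trailing
    // "ed" or "s". NOT the Porter algorithm; for illustration only.
    static String toyStem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.endsWith("ed") && w.length() > 4) {
            return w.substring(0, w.length() - 2);
        }
        if (w.endsWith("s") && w.length() > 3) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    // Index time: every word on the page is stored in stemmed form.
    static Set<String> index(String... pageWords) {
        Set<String> terms = new HashSet<>();
        for (String w : pageWords) {
            terms.add(toyStem(w));
        }
        return terms;
    }

    // Query time: the lookup succeeds only if the query term has the
    // same form as the stored term.
    static boolean search(Set<String> index, String query, boolean stemQuery) {
        String term = stemQuery ? toyStem(query) : query.toLowerCase(Locale.ROOT);
        return index.contains(term);
    }

    public static void main(String[] args) {
        Set<String> idx = index("interviews");
        System.out.println(search(idx, "interview", false));  // true
        System.out.println(search(idx, "interviews", false)); // false
        System.out.println(search(idx, "interviews", true));  // true
    }
}
```

The second lookup fails because the page's "interviews" was stored as "interview", so the query has to go through the same stemming step before it can match.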
>>>>> By the way, someone recently told me that they
>>>>> were able to put all the stemming code into an indexing
>>>>> filter without touching any of the main code. All they
>>>>> did was to copy some of the code that is being done
>>>>> in NutchDocumentAnalyzer and CommonGrams into
>>>>> their custom index filter. Haven't tried it myself.
>>>>>
>>>>> HTH
>>>>> Howie
>>>>>
>>>>>> Ok. I did this for Nutch 0.8 (had to edit the listed code some to 
>>>>>> make up for changes from .7.2 to .8 - mostly having to do with 
>>>>>> the Configuration type being needed).
>>>>>>
>>>>>> It partially works.
>>>>>>
>>>>>> If the page I'm trying to index contains the word "interviews" 
>>>>>> and I type in the search engine "interview", the stemming takes 
>>>>>> place and the page with the word "interviews" is returned.
>>>>>> However, if I type in the word "interviews" no page is returned. 
>>>>>> (The page with the word interviews on it should be returned).
>>>>>>
>>>>>> Any ideas??
>>>>>> Matt
>>>>>>
>>>>>> Dima Mazmanov wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I've gotten a couple of questions offlist about stemming
>>>>>>> so I thought I'd just post here with my changes. Sorry that
>>>>>>> some of the changes are in the main code and not in a plugin. It
>>>>>>> seemed that it's more efficient to put in the main analyzer. It
>>>>>>> would be nice if later releases could add support for plugging
>>>>>>> in a custom stemmer/analyzer.
>>>>>>>
>>>>>>> The first change I made is in NutchDocumentAnalyzer.java.
>>>>>>>
>>>>>>> Import the following classes at the top of the file:
>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>
>>>>>>> Change tokenStream to:
>>>>>>>
>>>>>>>    public TokenStream tokenStream(String field, Reader reader) {
>>>>>>>      TokenStream ts = CommonGrams.getFilter(
>>>>>>>          new NutchDocumentTokenizer(reader), field);
>>>>>>>      if (field.equals("content") || field.equals("title")) {
>>>>>>>        ts = new LowerCaseFilter(ts);
>>>>>>>        return new PorterStemFilter(ts);
>>>>>>>      } else {
>>>>>>>        return ts;
>>>>>>>      }
>>>>>>>    }
>>>>>>>
>>>>>>> The second change is in CommonGrams.java.
>>>>>>> Import the following classes near the top:
>>>>>>>
>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>
>>>>>>> In optimizePhrase, after this line:
>>>>>>>
>>>>>>>    TokenStream ts = getFilter(new ArrayTokens(phrase), field);
>>>>>>>
>>>>>>> Add:
>>>>>>>
>>>>>>>    ts = new PorterStemFilter(new LowerCaseFilter(ts));
>>>>>>>
>>>>>>> And the rest is a new QueryFilter plugin that I'm calling
>>>>>>> query-stemmer. Here's the full source for the Java file. You can
>>>>>>> copy the build.xml and plugin.xml from query-basic, and alter the
>>>>>>> names for query-stemmer.
>>>>>>>
>>>>>>> /* Copyright (c) 2003 The Nutch Organization.  All rights reserved.   */
>>>>>>> /* Use subject to the conditions in http://www.nutch.org/LICENSE.txt. */
>>>>>>>
>>>>>>> package org.apache.nutch.searcher.stemmer;
>>>>>>>
>>>>>>> import org.apache.lucene.search.BooleanQuery;
>>>>>>> import org.apache.lucene.search.PhraseQuery;
>>>>>>> import org.apache.lucene.search.TermQuery;
>>>>>>> import org.apache.lucene.analysis.TokenFilter;
>>>>>>> import org.apache.lucene.analysis.TokenStream;
>>>>>>> import org.apache.lucene.analysis.Token;
>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>
>>>>>>> import org.apache.nutch.analysis.NutchDocumentAnalyzer;
>>>>>>> import org.apache.nutch.analysis.CommonGrams;
>>>>>>>
>>>>>>> import org.apache.nutch.searcher.QueryFilter;
>>>>>>> import org.apache.nutch.searcher.Query;
>>>>>>> import org.apache.nutch.searcher.Query.*;
>>>>>>>
>>>>>>> import java.io.IOException;
>>>>>>> import java.util.HashSet;
>>>>>>> import java.io.StringReader;
>>>>>>>
>>>>>>> /** The default query filter.  Query terms in the default query field are
>>>>>>>  * expanded to search the url, anchor and content document fields. */
>>>>>>> public class StemmerQueryFilter implements QueryFilter {
>>>>>>>
>>>>>>>   private static float URL_BOOST = 4.0f;
>>>>>>>   private static float ANCHOR_BOOST = 2.0f;
>>>>>>>
>>>>>>>   private static int SLOP = Integer.MAX_VALUE;
>>>>>>>   private static float PHRASE_BOOST = 1.0f;
>>>>>>>
>>>>>>>   private static final String[] FIELDS =
>>>>>>>     {"url", "anchor", "content", "title"};
>>>>>>>   private static final float[] FIELD_BOOSTS =
>>>>>>>     {URL_BOOST, ANCHOR_BOOST, 1.0f, 2.0f};
>>>>>>>
>>>>>>>   /** Set the boost factor for url matches, relative to content and
>>>>>>>    * anchor matches */
>>>>>>>   public static void setUrlBoost(float boost) { URL_BOOST = boost; }
>>>>>>>
>>>>>>>   /** Set the boost factor for title/anchor matches, relative to url and
>>>>>>>    * content matches. */
>>>>>>>   public static void setAnchorBoost(float boost) { ANCHOR_BOOST = boost; }
>>>>>>>
>>>>>>>   /** Set the boost factor for sloppy phrase matches relative to
>>>>>>>    * unordered term matches. */
>>>>>>>   public static void setPhraseBoost(float boost) { PHRASE_BOOST = boost; }
>>>>>>>
>>>>>>>   /** Set the maximum number of terms permitted between matching terms
>>>>>>>    * in a sloppy phrase match. */
>>>>>>>   public static void setSlop(int slop) { SLOP = slop; }
>>>>>>>
>>>>>>>   public BooleanQuery filter(Query input, BooleanQuery output) {
>>>>>>>     addTerms(input, output);
>>>>>>>     addSloppyPhrases(input, output);
>>>>>>>     return output;
>>>>>>>   }
>>>>>>>
>>>>>>>   private static void addTerms(Query input, BooleanQuery output) {
>>>>>>>     Clause[] clauses = input.getClauses();
>>>>>>>     for (int i = 0; i < clauses.length; i++) {
>>>>>>>       Clause c = clauses[i];
>>>>>>>
>>>>>>>       if (!c.getField().equals(Clause.DEFAULT_FIELD))
>>>>>>>         continue;                    // skip non-default fields
>>>>>>>
>>>>>>>       BooleanQuery out = new BooleanQuery();
>>>>>>>       for (int f = 0; f < FIELDS.length; f++) {
>>>>>>>
>>>>>>>         Clause o = c;
>>>>>>>         String[] opt;
>>>>>>>
>>>>>>>         // TODO: I'm a little nervous about stemming for all default
>>>>>>>         //       fields. Should keep an eye on this.
>>>>>>>         if (c.isPhrase()) {          // optimize phrase clauses
>>>>>>>             opt = CommonGrams.optimizePhrase(c.getPhrase(), FIELDS[f]);
>>>>>>>         } else {
>>>>>>>             System.out.println("o.getTerm = " + o.getTerm().toString());
>>>>>>>             opt = getStemmedWords(o.getTerm().toString());
>>>>>>>         }
>>>>>>>         if (opt.length==1) {
>>>>>>>             o = new Clause(new Term(opt[0]),
>>>>>>>                            c.isRequired(), c.isProhibited());
>>>>>>>         } else {
>>>>>>>             o = new Clause(new Phrase(opt),
>>>>>>>                            c.isRequired(), c.isProhibited());
>>>>>>>         }
>>>>>>>
>>>>>>>         out.add(o.isPhrase()
>>>>>>>                 ? exactPhrase(o.getPhrase(), FIELDS[f], FIELD_BOOSTS[f])
>>>>>>>                 : termQuery(FIELDS[f], o.getTerm(), FIELD_BOOSTS[f]),
>>>>>>>                 false, false);
>>>>>>>       }
>>>>>>>       output.add(out, c.isRequired(), c.isProhibited());
>>>>>>>     }
>>>>>>>     System.out.println("query = " + output.toString());
>>>>>>>   }
>>>>>>>
>>>>>>>     private static String[] getStemmedWords(String value) {
>>>>>>>           StringReader sr = new StringReader(value);
>>>>>>>           TokenStream ts =
>>>>>>>               new PorterStemFilter(new LowerCaseTokenizer(sr));
>>>>>>>
>>>>>>>           String stemmedValue = "";
>>>>>>>           try {
>>>>>>>               Token token = ts.next();
>>>>>>>               int count = 0;
>>>>>>>               while (token != null) {
>>>>>>>                   System.out.println("token = " + token.termText());
>>>>>>>                   System.out.println("type = " + token.type());
>>>>>>>
>>>>>>>                   if (count == 0)
>>>>>>>                       stemmedValue = token.termText();
>>>>>>>                   else
>>>>>>>                       stemmedValue = stemmedValue + " " + token.termText();
>>>>>>>
>>>>>>>                   token = ts.next();
>>>>>>>                   count++;
>>>>>>>               }
>>>>>>>           } catch (Exception e) {
>>>>>>>               stemmedValue = value;
>>>>>>>           }
>>>>>>>
>>>>>>>           if (stemmedValue.equals("")) {
>>>>>>>               stemmedValue = value;
>>>>>>>           }
>>>>>>>
>>>>>>>           String[] stemmedValues = stemmedValue.split("\\s+");
>>>>>>>
>>>>>>>           for (int j=0; j<stemmedValues.length; j++) {
>>>>>>>               System.out.println("stemmedValues = " + stemmedValues[j]);
>>>>>>>           }
>>>>>>>           return stemmedValues;
>>>>>>>     }
>>>>>>>
>>>>>>>
>>>>>>>   private static void addSloppyPhrases(Query input,
>>>>>>>                                        BooleanQuery output) {
>>>>>>>     Clause[] clauses = input.getClauses();
>>>>>>>     for (int f = 0; f < FIELDS.length; f++) {
>>>>>>>
>>>>>>>       PhraseQuery sloppyPhrase = new PhraseQuery();
>>>>>>>       sloppyPhrase.setBoost(FIELD_BOOSTS[f] * PHRASE_BOOST);
>>>>>>>       sloppyPhrase.setSlop("anchor".equals(FIELDS[f])
>>>>>>>                            ? NutchDocumentAnalyzer.INTER_ANCHOR_GAP
>>>>>>>                            : SLOP);
>>>>>>>       int sloppyTerms = 0;
>>>>>>>
>>>>>>>       for (int i = 0; i < clauses.length; i++) {
>>>>>>>         Clause c = clauses[i];
>>>>>>>
>>>>>>>         if (!c.getField().equals(Clause.DEFAULT_FIELD))
>>>>>>>           continue;                  // skip non-default fields
>>>>>>>
>>>>>>>         if (c.isPhrase())            // skip exact phrases
>>>>>>>           continue;
>>>>>>>
>>>>>>>         if (c.isProhibited())        // skip prohibited terms
>>>>>>>           continue;
>>>>>>>
>>>>>>>         sloppyPhrase.add(luceneTerm(FIELDS[f], c.getTerm()));
>>>>>>>         sloppyTerms++;
>>>>>>>       }
>>>>>>>
>>>>>>>       if (sloppyTerms > 1)
>>>>>>>         output.add(sloppyPhrase, false, false);
>>>>>>>     }
>>>>>>>   }
>>>>>>>
>>>>>>>
>>>>>>>   private static org.apache.lucene.search.Query
>>>>>>>         termQuery(String field, Term term, float boost) {
>>>>>>>     TermQuery result = new TermQuery(luceneTerm(field, term));
>>>>>>>     result.setBoost(boost);
>>>>>>>     return result;
>>>>>>>   }
>>>>>>>
>>>>>>>   /** Utility to construct a Lucene exact phrase query for a Nutch
>>>>>>>    * phrase. */
>>>>>>>   private static org.apache.lucene.search.Query
>>>>>>>        exactPhrase(Phrase nutchPhrase,
>>>>>>>                    String field, float boost) {
>>>>>>>     Term[] terms = nutchPhrase.getTerms();
>>>>>>>     PhraseQuery exactPhrase = new PhraseQuery();
>>>>>>>     for (int i = 0; i < terms.length; i++) {
>>>>>>>       exactPhrase.add(luceneTerm(field, terms[i]));
>>>>>>>     }
>>>>>>>     exactPhrase.setBoost(boost);
>>>>>>>     return exactPhrase;
>>>>>>>   }
>>>>>>>
>>>>>>>   /** Utility to construct a Lucene Term given a Nutch query term and
>>>>>>>    * field. */
>>>>>>>   private static org.apache.lucene.index.Term
>>>>>>>         luceneTerm(String field, Term term) {
>>>>>>>     return new org.apache.lucene.index.Term(field, term.toString());
>>>>>>>   }
>>>>>>> }
>>>>>>>
