Re: [Nutch-general] stemming - RESOLVED

Matthew Holt Mon, 31 Jul 2006 06:00:01 -0700

We could, although other than readability, it won't make any difference.

[EMAIL PROTECTED] wrote:
> Hi, Matthew
>
> I think we should use fieldName instead of field, or not...
>
> ===============stemming code begin=======================
>
> public TokenStream tokenStream(String field, Reader reader) {
>     Analyzer analyzer;
>     if ("anchor".equals(field)) {
>         analyzer = ANCHOR_ANALYZER;
>     }
>     else {
>         analyzer = CONTENT_ANALYZER;
>     }
>
>     // Stem only the content and title fields; other fields pass through.
>     TokenStream ts = analyzer.tokenStream(field, reader);
>     if (field.equals("content") || field.equals("title")) {
>         ts = new LowerCaseFilter(ts);
>         return new PorterStemFilter(ts);
>     }
>     else {
>         return ts;
>     }
> }
>
> ===============stemming code end=======================
>
> P.S. This patch has no effect on Russian-language text.
>
> Regards,
> Alexey
>
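The per-field rule in the code above (lowercase and stem only "content" and "title", leave every other field alone) can be sketched without Lucene. This is only an illustration: toyStem is a made-up stand-in that drops one trailing "s", not the Porter algorithm, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class FieldAwareNormalizer {
    // Fields that the patch stems; every other field is passed through unchanged.
    static final Set<String> STEMMED_FIELDS =
        new HashSet<>(Arrays.asList("content", "title"));

    // Toy stand-in for PorterStemFilter: drops one trailing "s" from longer words.
    static String toyStem(String w) {
        return (w.length() > 3 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w;
    }

    // Lowercase first, then stem, but only for the stemmed fields.
    static String normalize(String field, String token) {
        if (!STEMMED_FIELDS.contains(field)) {
            return token;
        }
        return toyStem(token.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(normalize("content", "Interviews")); // interview
        System.out.println(normalize("anchor", "Interviews"));  // Interviews
    }
}
```

Whatever rule is chosen per field at index time has to be applied to the same field at query time, otherwise the stored and searched terms will not line up.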
> ------------------------------
>
> Howie,
>    Thanks for all the help configuring your stemming add-on for version
> 0.8. I compared query-basic and query-stemmer, and the only new feature
> that was added is a "host" boost. I made the changes and everything
> works perfectly.
>
> I uploaded the code to the wiki for both version 0.7.2 and 0.8. You can
> access it at the URL below:
>
> http://wiki.apache.org/nutch/FAQ#head-fa0c678473eeecf3771e490b22d385054697232c
>
> Take care,
>   Matt
>
> Howie Wang wrote:
>   
>> Hi, Matt,
>>
>> In 0.7, you wouldn't miss anything. That code was written to
>> replace the basic query filter, and handled all the fields that
>> basic query filter was handling. For 0.8, I'm really not sure.
>> I'm guessing the code is fairly simple still in 0.8. You can probably
>> figure out if query-basic in 0.8 is doing something appreciably different
>> than query-stemmer by just visually comparing the files.
>>
>> Howie
>>
>>     
>>> Howie,
>>>  The query-stemmer works great as long as query-basic is not enabled. 
>>> However, if I don't have query-basic enabled, won't I be missing some 
>>> needed functionality?
>>>  Matt
>>>
>>> Howie Wang wrote:
>>>       
>>>> Hi,
>>>>
>>>> The settings look reasonable. But for testing purposes, I would get rid of
>>>> the other query filters and put in some print statements in the
>>>> query-stemmer to see what's happening.
>>>>
>>>> Howie
>>>>
>>>>         
>>>>> In my nutch-site.xml I overrode the plugin.includes property as below:
>>>>>
>>>>> <property>
>>>>>  <name>plugin.includes</name>
>>>>>  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
>>>>>  <description>Regular expression naming plugin directory names to
>>>>>  include.  Any plugin not matching this expression is excluded.
>>>>>  In any case you need at least include the nutch-extensionpoints plugin. By
>>>>>  default Nutch includes crawling just HTML and plain text via HTTP,
>>>>>  and basic indexing and search plugins.
>>>>>  </description>
>>>>> </property>
>>>>>
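One quick way to see which plugin names the value above admits is to test it as an ordinary Java regular expression. This is a sketch under an assumption: that the whole plugin directory name has to match the pattern. The class and method names are made up for illustration and are not part of Nutch.

```java
import java.util.regex.Pattern;

class PluginIncludesCheck {
    // The plugin.includes value quoted above, copied verbatim.
    static final Pattern INCLUDES = Pattern.compile(
        "protocol-httpclient|urlfilter-regex"
        + "|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)"
        + "|index-(basic|more)|query-(more|site|stemmer|url)"
        + "|summary-basic|scoring-opic");

    // Assumption: a plugin counts as included only on a full match of its name.
    static boolean included(String pluginId) {
        return INCLUDES.matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {"query-stemmer", "query-basic", "nutch-extensionpoints"};
        for (String id : ids) {
            System.out.println(id + " -> " + included(id));
        }
        // prints:
        // query-stemmer -> true
        // query-basic -> false
        // nutch-extensionpoints -> false
    }
}
```

Under that assumption the value does select query-stemmer and leaves out query-basic. It also does not match nutch-extensionpoints, which the description calls required; whether that matters on 0.8 is not something this check can tell.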
>>>>>
>>>>> However, it is still only letting me search for the stemmed term
>>>>> (i.e. "Interview" returns results but "interviewed" doesn't, even
>>>>> though that's the word that's actually on the page).
>>>>>
>>>>> I tried a different approach and removed the query-stemmer value 
>>>>> from nutch-site.xml to attempt to disable the plugin. I reran the 
>>>>> crawl and it didn't load the plugin. However, it still had the same 
>>>>> stemming functionality. I'm guessing this is due to editing the 
>>>>> main files such as CommonGrams.java and NutchDocumentAnalyzer.java. 
>>>>> Should I attempt to copy the needed methods into
>>>>> StemmerQueryFilter.java and try to isolate all functionality to the 
>>>>> plugin alone?
>>>>>
>>>>> Thanks,
>>>>>    Matt
>>>>>
>>>>> Howie Wang wrote:
>>>>>           
>>>>>> It sounds like the query-stemmer is not being called.
>>>>>> The query string "interviews" needs to be processed
>>>>>> into "interview". Are you sure that your nutch-default.xml
>>>>>> is including the query-stemmer correctly? Put print statements
>>>>>> in to see if it's getting there.
>>>>>>
>>>>>> By the way, someone recently told me that they
>>>>>> were able to put all the stemming code into an indexing
>>>>>> filter without touching any of the main code. All they
>>>>>> did was to copy some of the code that is being done
>>>>>> in NutchDocumentAnalyzer and CommonGrams into
>>>>>> their custom index filter. Haven't tried it myself.
>>>>>>
>>>>>> HTH
>>>>>> Howie
>>>>>>
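The symptom discussed here (a stemmed index that only answers queries typed in already-stemmed form) can be reproduced without Nutch. The sketch below is a toy under stated assumptions: toyStem only drops one trailing "s" and stands in for Lucene's PorterStemFilter, and every name in it is hypothetical. It shows that a lookup succeeds only when the query term is normalized the same way as the indexed words.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class StemMismatchDemo {
    // Toy stand-in for a real stemmer: lowercase, then drop one trailing "s".
    static String toyStem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        return (w.length() > 3 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w;
    }

    // Index-time analysis: every word is stored in stemmed form.
    static Set<String> index(String text) {
        Set<String> terms = new HashSet<>();
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty()) {
                terms.add(toyStem(word));
            }
        }
        return terms;
    }

    // Query without stemming: the term is looked up as typed (lowercased only).
    static boolean searchRaw(Set<String> index, String query) {
        return index.contains(query.toLowerCase(Locale.ROOT));
    }

    // Query with stemming: the term is normalized like the indexed words.
    static boolean searchStemmed(Set<String> index, String query) {
        return index.contains(toyStem(query));
    }

    public static void main(String[] args) {
        Set<String> idx = index("Two interviews were recorded");
        System.out.println(searchRaw(idx, "interview"));      // true
        System.out.println(searchRaw(idx, "interviews"));     // false
        System.out.println(searchStemmed(idx, "interviews")); // true
    }
}
```

The middle case is the one reported in this thread: the page contains "interviews", the index holds "interview", and an unstemmed query for "interviews" finds nothing.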
>>>>>>             
>>>>>>> Ok. I did this for Nutch 0.8 (had to edit the listed code some to
>>>>>>> make up for changes from 0.7.2 to 0.8, mostly having to do with
>>>>>>> the Configuration type being needed).
>>>>>>>
>>>>>>> It partially works.
>>>>>>>
>>>>>>> If the page I'm trying to index contains the word "interviews" 
>>>>>>> and I type in the search engine "interview", the stemming takes 
>>>>>>> place and the page with the word "interviews" is returned.
>>>>>>> However, if I type in the word "interviews" no page is returned. 
>>>>>>> (The page with the word interviews on it should be returned).
>>>>>>>
>>>>>>> Any ideas??
>>>>>>> Matt
>>>>>>>
>>>>>>> Dima Mazmanov wrote:
>>>>>>>               
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I've gotten a couple of questions offlist about stemming
>>>>>>>> so I thought I'd just post here with my changes. Sorry that
>>>>>>>> some of the changes are in the main code and not in a plugin. It
>>>>>>>> seemed that it's more efficient to put in the main analyzer. It
>>>>>>>> would be nice if later releases could add support for plugging
>>>>>>>> in a custom stemmer/analyzer.
>>>>>>>>
>>>>>>>> The first change I made is in NutchDocumentAnalyzer.java.
>>>>>>>>
>>>>>>>> Import the following classes at the top of the file:
>>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>>
>>>>>>>> Change tokenStream to:
>>>>>>>>
>>>>>>>>    public TokenStream tokenStream(String field, Reader reader) {
>>>>>>>>      TokenStream ts = CommonGrams.getFilter(
>>>>>>>>          new NutchDocumentTokenizer(reader), field);
>>>>>>>>      if (field.equals("content") || field.equals("title")) {
>>>>>>>>        ts = new LowerCaseFilter(ts);
>>>>>>>>        return new PorterStemFilter(ts);
>>>>>>>>      } else {
>>>>>>>>        return ts;
>>>>>>>>      }
>>>>>>>>    }
>>>>>>>>
>>>>>>>> The second change is in CommonGrams.java.
>>>>>>>> Import the following classes near the top:
>>>>>>>>
>>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>>
>>>>>>>> In optimizePhrase, after this line:
>>>>>>>>
>>>>>>>>    TokenStream ts = getFilter(new ArrayTokens(phrase), field);
>>>>>>>>
>>>>>>>> Add:
>>>>>>>>
>>>>>>>>    ts = new PorterStemFilter(new LowerCaseFilter(ts));
>>>>>>>>
>>>>>>>> And the rest is a new QueryFilter plugin that I'm calling query-stemmer.
>>>>>>>> Here's the full source for the Java file. You can copy the build.xml
>>>>>>>> and plugin.xml from query-basic, and alter the names for query-stemmer.
>>>>>>>>
>>>>>>>> /* Copyright (c) 2003 The Nutch Organization.  All rights reserved.   */
>>>>>>>> /* Use subject to the conditions in http://www.nutch.org/LICENSE.txt. */
>>>>>>>>
>>>>>>>> package org.apache.nutch.searcher.stemmer;
>>>>>>>>
>>>>>>>> import org.apache.lucene.search.BooleanQuery;
>>>>>>>> import org.apache.lucene.search.PhraseQuery;
>>>>>>>> import org.apache.lucene.search.TermQuery;
>>>>>>>> import org.apache.lucene.analysis.TokenFilter;
>>>>>>>> import org.apache.lucene.analysis.TokenStream;
>>>>>>>> import org.apache.lucene.analysis.Token;
>>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>>
>>>>>>>> import org.apache.nutch.analysis.NutchDocumentAnalyzer;
>>>>>>>> import org.apache.nutch.analysis.CommonGrams;
>>>>>>>>
>>>>>>>> import org.apache.nutch.searcher.QueryFilter;
>>>>>>>> import org.apache.nutch.searcher.Query;
>>>>>>>> import org.apache.nutch.searcher.Query.*;
>>>>>>>>
>>>>>>>> import java.io.IOException;
>>>>>>>> import java.util.HashSet;
>>>>>>>> import java.io.StringReader;
>>>>>>>>
>>>>>>>> /** The default query filter.  Query terms in the default query field are
>>>>>>>> * expanded to search the url, anchor and content document fields.*/
>>>>>>>> public class StemmerQueryFilter implements QueryFilter {
>>>>>>>>
>>>>>>>>   private static float URL_BOOST = 4.0f;
>>>>>>>>   private static float ANCHOR_BOOST = 2.0f;
>>>>>>>>
>>>>>>>>   private static int SLOP = Integer.MAX_VALUE;
>>>>>>>>   private static float PHRASE_BOOST = 1.0f;
>>>>>>>>
>>>>>>>>   private static final String[] FIELDS =
>>>>>>>>     {"url", "anchor", "content", "title"};
>>>>>>>>   private static final float[] FIELD_BOOSTS =
>>>>>>>>     {URL_BOOST, ANCHOR_BOOST, 1.0f, 2.0f};
>>>>>>>>
>>>>>>>>   /** Set the boost factor for url matches, relative to content and anchor
>>>>>>>>    * matches */
>>>>>>>>   public static void setUrlBoost(float boost) { URL_BOOST = boost; }
>>>>>>>>
>>>>>>>>   /** Set the boost factor for title/anchor matches, relative to url and
>>>>>>>>    * content matches. */
>>>>>>>>   public static void setAnchorBoost(float boost) { ANCHOR_BOOST = boost; }
>>>>>>>>
>>>>>>>>   /** Set the boost factor for sloppy phrase matches relative to unordered
>>>>>>>>    * term matches. */
>>>>>>>>   public static void setPhraseBoost(float boost) { PHRASE_BOOST = boost; }
>>>>>>>>
>>>>>>>>   /** Set the maximum number of terms permitted between matching terms in a
>>>>>>>>    * sloppy phrase match. */
>>>>>>>>   public static void setSlop(int slop) { SLOP = slop; }
>>>>>>>>
>>>>>>>>   public BooleanQuery filter(Query input, BooleanQuery output) {
>>>>>>>>     addTerms(input, output);
>>>>>>>>     addSloppyPhrases(input, output);
>>>>>>>>     return output;
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   private static void addTerms(Query input, BooleanQuery output) {
>>>>>>>>     Clause[] clauses = input.getClauses();
>>>>>>>>     for (int i = 0; i < clauses.length; i++) {
>>>>>>>>       Clause c = clauses[i];
>>>>>>>>
>>>>>>>>       if (!c.getField().equals(Clause.DEFAULT_FIELD))
>>>>>>>>         continue;                      // skip non-default fields
>>>>>>>>
>>>>>>>>       BooleanQuery out = new BooleanQuery();
>>>>>>>>       for (int f = 0; f < FIELDS.length; f++) {
>>>>>>>>
>>>>>>>>         Clause o = c;
>>>>>>>>         String[] opt;
>>>>>>>>
>>>>>>>>         // TODO: I'm a little nervous about stemming for all default
>>>>>>>>         //       fields.  Should keep an eye on this.
>>>>>>>>         if (c.isPhrase()) {            // optimize phrase clauses
>>>>>>>>             opt = CommonGrams.optimizePhrase(c.getPhrase(), FIELDS[f]);
>>>>>>>>         } else {
>>>>>>>>             System.out.println("o.getTerm = " + o.getTerm().toString());
>>>>>>>>             opt = getStemmedWords(o.getTerm().toString());
>>>>>>>>         }
>>>>>>>>         if (opt.length==1) {
>>>>>>>>             o = new Clause(new Term(opt[0]), c.isRequired(),
>>>>>>>>                            c.isProhibited());
>>>>>>>>         } else {
>>>>>>>>             o = new Clause(new Phrase(opt), c.isRequired(),
>>>>>>>>                            c.isProhibited());
>>>>>>>>         }
>>>>>>>>         }
>>>>>>>>
>>>>>>>>         out.add(o.isPhrase()
>>>>>>>>                 ? exactPhrase(o.getPhrase(), FIELDS[f], FIELD_BOOSTS[f])
>>>>>>>>                 : termQuery(FIELDS[f], o.getTerm(), FIELD_BOOSTS[f]),
>>>>>>>>                 false, false);
>>>>>>>>       }
>>>>>>>>       output.add(out, c.isRequired(), c.isProhibited());
>>>>>>>>     }
>>>>>>>>     System.out.println("query = " + output.toString());
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   private static String[] getStemmedWords(String value) {
>>>>>>>>     StringReader sr = new StringReader(value);
>>>>>>>>     TokenStream ts = new PorterStemFilter(new LowerCaseTokenizer(sr));
>>>>>>>>
>>>>>>>>     String stemmedValue = "";
>>>>>>>>     try {
>>>>>>>>       Token token = ts.next();
>>>>>>>>       int count = 0;
>>>>>>>>       while (token != null) {
>>>>>>>>         System.out.println("token = " + token.termText());
>>>>>>>>         System.out.println("type = " + token.type());
>>>>>>>>
>>>>>>>>         if (count == 0)
>>>>>>>>           stemmedValue = token.termText();
>>>>>>>>         else
>>>>>>>>           stemmedValue = stemmedValue + " " + token.termText();
>>>>>>>>
>>>>>>>>         token = ts.next();
>>>>>>>>         count++;
>>>>>>>>       }
>>>>>>>>     } catch (Exception e) {
>>>>>>>>       stemmedValue = value;
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     if (stemmedValue.equals("")) {
>>>>>>>>       stemmedValue = value;
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     String[] stemmedValues = stemmedValue.split("\\s+");
>>>>>>>>
>>>>>>>>     for (int j=0; j<stemmedValues.length; j++) {
>>>>>>>>       System.out.println("stemmedValues = " + stemmedValues[j]);
>>>>>>>>     }
>>>>>>>>     return stemmedValues;
>>>>>>>>   }
>>>>>>>>
>>>>>>>>
>>>>>>>>   private static void addSloppyPhrases(Query input,
>>>>>>>>                                        BooleanQuery output) {
>>>>>>>>     Clause[] clauses = input.getClauses();
>>>>>>>>     for (int f = 0; f < FIELDS.length; f++) {
>>>>>>>>
>>>>>>>>       PhraseQuery sloppyPhrase = new PhraseQuery();
>>>>>>>>       sloppyPhrase.setBoost(FIELD_BOOSTS[f] * PHRASE_BOOST);
>>>>>>>>       sloppyPhrase.setSlop("anchor".equals(FIELDS[f])
>>>>>>>>                            ? NutchDocumentAnalyzer.INTER_ANCHOR_GAP
>>>>>>>>                            : SLOP);
>>>>>>>>       int sloppyTerms = 0;
>>>>>>>>
>>>>>>>>       for (int i = 0; i < clauses.length; i++) {
>>>>>>>>         Clause c = clauses[i];
>>>>>>>>
>>>>>>>>         if (!c.getField().equals(Clause.DEFAULT_FIELD))
>>>>>>>>           continue;                    // skip non-default fields
>>>>>>>>
>>>>>>>>         if (c.isPhrase())              // skip exact phrases
>>>>>>>>           continue;
>>>>>>>>
>>>>>>>>         if (c.isProhibited())          // skip prohibited terms
>>>>>>>>           continue;
>>>>>>>>
>>>>>>>>         sloppyPhrase.add(luceneTerm(FIELDS[f], c.getTerm()));
>>>>>>>>         sloppyTerms++;
>>>>>>>>       }
>>>>>>>>
>>>>>>>>       if (sloppyTerms > 1)
>>>>>>>>         output.add(sloppyPhrase, false, false);
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>
>>>>>>>>
>>>>>>>>   private static org.apache.lucene.search.Query
>>>>>>>>         termQuery(String field, Term term, float boost) {
>>>>>>>>     TermQuery result = new TermQuery(luceneTerm(field, term));
>>>>>>>>     result.setBoost(boost);
>>>>>>>>     return result;
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   /** Utility to construct a Lucene exact phrase query for a Nutch phrase.
>>>>>>>>    */
>>>>>>>>   private static org.apache.lucene.search.Query
>>>>>>>>        exactPhrase(Phrase nutchPhrase,
>>>>>>>>                    String field, float boost) {
>>>>>>>>     Term[] terms = nutchPhrase.getTerms();
>>>>>>>>     PhraseQuery exactPhrase = new PhraseQuery();
>>>>>>>>     for (int i = 0; i < terms.length; i++) {
>>>>>>>>       exactPhrase.add(luceneTerm(field, terms[i]));
>>>>>>>>     }
>>>>>>>>     exactPhrase.setBoost(boost);
>>>>>>>>     return exactPhrase;
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   /** Utility to construct a Lucene Term given a Nutch query term and
>>>>>>>>    * field. */
>>>>>>>>   private static org.apache.lucene.index.Term luceneTerm(String field,
>>>>>>>>                                                          Term term) {
>>>>>>>>     return new org.apache.lucene.index.Term(field, term.toString());
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>

