Howie,

Thanks for all the help configuring your stemming add-on for version 0.8. I compared query-basic and query-stemmer, and the only new feature that query-basic adds is a "host" boost. I made the changes and everything works perfectly.
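For anyone making the same change, here is a rough sketch of what adding a host boost could look like. It is not the code posted on the wiki: the class name HostBoostSketch, the helper boostFor, and the 2.0f host boost are assumptions made for illustration. The other four fields and boosts mirror the FIELDS and FIELD_BOOSTS arrays in the query-stemmer source quoted further down in this thread.

```java
// Sketch only. HostBoostSketch and boostFor are hypothetical names, and the
// 2.0f boost for "host" is an assumed value, not taken from the wiki code.
class HostBoostSketch {

    // Same parallel-array layout as the quoted StemmerQueryFilter,
    // with one extra "host" entry appended to each array.
    static final String[] FIELDS = {"url", "anchor", "content", "title", "host"};
    static final float[] FIELD_BOOSTS = {4.0f, 2.0f, 1.0f, 2.0f, 2.0f};

    // Look up the boost for a field name by scanning the parallel arrays.
    static float boostFor(String field) {
        for (int i = 0; i < FIELDS.length; i++) {
            if (FIELDS[i].equals(field)) {
                return FIELD_BOOSTS[i];
            }
        }
        throw new IllegalArgumentException("unknown field: " + field);
    }

    public static void main(String[] args) {
        for (String field : FIELDS) {
            System.out.println(field + " -> " + boostFor(field));
        }
    }
}
```

Because the filter loops over FIELDS by index, the two arrays have to stay the same length; adding a field to one without the other would fail at query time.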
I uploaded the code to the wiki for both version 0.7.2 and 0.8. You can access it at the URL below:

http://wiki.apache.org/nutch/FAQ#head-fa0c678473eeecf3771e490b22d385054697232c

Take care,
Matt

Howie Wang wrote:
> Hi, Matt,
>
> In 0.7, you wouldn't miss anything. That code was written to
> replace the basic query filter, and handled all the fields that
> basic query filter was handling. For 0.8, I'm really not sure.
> I'm guessing the code is fairly simple still in 0.8. You can probably
> figure out if query-basic in 0.8 is doing something appreciably different
> than query-stemmer by just visually comparing the files.
>
> Howie
>
>> Howie,
>> The query-stemmer works great as long as query-basic is not enabled.
>> However, if I don't have query-basic enabled, won't I be missing some
>> needed functionality?
>> Matt
>>
>> Howie Wang wrote:
>>> Hi,
>>>
>>> The settings look reasonable. But for testing purposes, I would get
>>> rid of the other query filters and put in some print statements in the
>>> query-stemmer to see what's happening.
>>>
>>> Howie
>>>
>>>> In my nutch-site.xml I overrode the plugin.includes property as below:
>>>>
>>>> <property>
>>>>   <name>plugin.includes</name>
>>>>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
>>>>   <description>Regular expression naming plugin directory names to
>>>>   include. Any plugin not matching this expression is excluded.
>>>>   In any case you need at least include the nutch-extensionpoints
>>>>   plugin. By default Nutch includes crawling just HTML and plain
>>>>   text via HTTP, and basic indexing and search plugins.
>>>>   </description>
>>>> </property>
>>>>
>>>> However, it is still only letting me search for the stemmed term
>>>> (i.e., "Interview" returns results but "interviewed" doesn't, even
>>>> though that's the word that's actually on the page).
>>>>
>>>> I tried a different approach and removed the query-stemmer value
>>>> from nutch-site.xml to attempt to disable the plugin. I reran the
>>>> crawl and it didn't load the plugin. However, it still had the same
>>>> stemming functionality. I'm guessing this is due to editing the
>>>> main files such as CommonGrams.java and NutchDocumentAnalyzer.java.
>>>> Should I attempt to copy the needed methods into
>>>> StemmerQueryFilter.java and try to isolate all functionality to the
>>>> plugin alone?
>>>>
>>>> Thanks,
>>>> Matt
>>>>
>>>> Howie Wang wrote:
>>>>> It sounds like the query-stemmer is not being called.
>>>>> The query string "interviews" needs to be processed
>>>>> into "interview". Are you sure that your nutch-default.xml
>>>>> is including the query-stemmer correctly? Put print statements
>>>>> in to see if it's getting there.
>>>>>
>>>>> By the way, someone recently told me that they
>>>>> were able to put all the stemming code into an indexing
>>>>> filter without touching any of the main code. All they
>>>>> did was to copy some of the code that is being done
>>>>> in NutchDocumentAnalyzer and CommonGrams into
>>>>> their custom index filter. Haven't tried it myself.
>>>>>
>>>>> HTH
>>>>> Howie
>>>>>
>>>>>> Ok. I did this for Nutch 0.8 (had to edit the listed code some to
>>>>>> make up for changes from 0.7.2 to 0.8, mostly having to do with
>>>>>> the Configuration type being needed).
>>>>>>
>>>>>> It partially works.
>>>>>>
>>>>>> If the page I'm trying to index contains the word "interviews"
>>>>>> and I type in the search engine "interview", the stemming takes
>>>>>> place and the page with the word "interviews" is returned.
>>>>>> However, if I type in the word "interviews" no page is returned.
>>>>>> (The page with the word "interviews" on it should be returned.)
>>>>>>
>>>>>> Any ideas?
>>>>>> Matt
>>>>>>
>>>>>> Dima Mazmanov wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I've gotten a couple of questions offlist about stemming
>>>>>>> so I thought I'd just post here with my changes. Sorry that
>>>>>>> some of the changes are in the main code and not in a plugin. It
>>>>>>> seemed that it's more efficient to put in the main analyzer. It
>>>>>>> would be nice if later releases could add support for plugging
>>>>>>> in a custom stemmer/analyzer.
>>>>>>>
>>>>>>> The first change I made is in NutchDocumentAnalyzer.java.
>>>>>>>
>>>>>>> Import the following classes at the top of the file:
>>>>>>>
>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>
>>>>>>> Change tokenStream to:
>>>>>>>
>>>>>>> public TokenStream tokenStream(String field, Reader reader) {
>>>>>>>   TokenStream ts = CommonGrams.getFilter(new NutchDocumentTokenizer(reader),
>>>>>>>                                          field);
>>>>>>>   if (field.equals("content") || field.equals("title")) {
>>>>>>>     ts = new LowerCaseFilter(ts);
>>>>>>>     return new PorterStemFilter(ts);
>>>>>>>   } else {
>>>>>>>     return ts;
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> The second change is in CommonGrams.java.
>>>>>>> Import the following classes near the top:
>>>>>>>
>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>
>>>>>>> In optimizePhrase, after this line:
>>>>>>>
>>>>>>> TokenStream ts = getFilter(new ArrayTokens(phrase), field);
>>>>>>>
>>>>>>> Add:
>>>>>>>
>>>>>>> ts = new PorterStemFilter(new LowerCaseFilter(ts));
>>>>>>>
>>>>>>> And the rest is a new QueryFilter plugin that I'm calling
>>>>>>> query-stemmer.
>>>>>>> Here's the full source for the Java file. You can copy the build.xml
>>>>>>> and plugin.xml from query-basic, and alter the names for
>>>>>>> query-stemmer.
>>>>>>>
>>>>>>> /* Copyright (c) 2003 The Nutch Organization. All rights reserved. */
>>>>>>> /* Use subject to the conditions in http://www.nutch.org/LICENSE.txt. */
>>>>>>>
>>>>>>> package org.apache.nutch.searcher.stemmer;
>>>>>>>
>>>>>>> import org.apache.lucene.search.BooleanQuery;
>>>>>>> import org.apache.lucene.search.PhraseQuery;
>>>>>>> import org.apache.lucene.search.TermQuery;
>>>>>>> import org.apache.lucene.analysis.TokenFilter;
>>>>>>> import org.apache.lucene.analysis.TokenStream;
>>>>>>> import org.apache.lucene.analysis.Token;
>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>
>>>>>>> import org.apache.nutch.analysis.NutchDocumentAnalyzer;
>>>>>>> import org.apache.nutch.analysis.CommonGrams;
>>>>>>>
>>>>>>> import org.apache.nutch.searcher.QueryFilter;
>>>>>>> import org.apache.nutch.searcher.Query;
>>>>>>> import org.apache.nutch.searcher.Query.*;
>>>>>>>
>>>>>>> import java.io.IOException;
>>>>>>> import java.util.HashSet;
>>>>>>> import java.io.StringReader;
>>>>>>>
>>>>>>> /** The default query filter. Query terms in the default query field are
>>>>>>>  * expanded to search the url, anchor and content document fields.*/
>>>>>>> public class StemmerQueryFilter implements QueryFilter {
>>>>>>>
>>>>>>>   private static float URL_BOOST = 4.0f;
>>>>>>>   private static float ANCHOR_BOOST = 2.0f;
>>>>>>>
>>>>>>>   private static int SLOP = Integer.MAX_VALUE;
>>>>>>>   private static float PHRASE_BOOST = 1.0f;
>>>>>>>
>>>>>>>   private static final String[] FIELDS = {"url", "anchor", "content",
>>>>>>>                                           "title"};
>>>>>>>   private static final float[] FIELD_BOOSTS = {URL_BOOST, ANCHOR_BOOST,
>>>>>>>                                                1.0f, 2.0f};
>>>>>>>
>>>>>>>   /** Set the boost factor for url matches, relative to content and anchor
>>>>>>>    * matches */
>>>>>>>   public static void setUrlBoost(float boost) { URL_BOOST = boost; }
>>>>>>>
>>>>>>>   /** Set the boost factor for title/anchor matches, relative to url and
>>>>>>>    * content matches. */
>>>>>>>   public static void setAnchorBoost(float boost) { ANCHOR_BOOST = boost; }
>>>>>>>
>>>>>>>   /** Set the boost factor for sloppy phrase matches relative to unordered
>>>>>>>    * term matches. */
>>>>>>>   public static void setPhraseBoost(float boost) { PHRASE_BOOST = boost; }
>>>>>>>
>>>>>>>   /** Set the maximum number of terms permitted between matching terms in a
>>>>>>>    * sloppy phrase match. */
>>>>>>>   public static void setSlop(int slop) { SLOP = slop; }
>>>>>>>
>>>>>>>   public BooleanQuery filter(Query input, BooleanQuery output) {
>>>>>>>     addTerms(input, output);
>>>>>>>     addSloppyPhrases(input, output);
>>>>>>>     return output;
>>>>>>>   }
>>>>>>>
>>>>>>>   private static void addTerms(Query input, BooleanQuery output) {
>>>>>>>     Clause[] clauses = input.getClauses();
>>>>>>>     for (int i = 0; i < clauses.length; i++) {
>>>>>>>       Clause c = clauses[i];
>>>>>>>
>>>>>>>       if (!c.getField().equals(Clause.DEFAULT_FIELD))
>>>>>>>         continue;                       // skip non-default fields
>>>>>>>
>>>>>>>       BooleanQuery out = new BooleanQuery();
>>>>>>>       for (int f = 0; f < FIELDS.length; f++) {
>>>>>>>
>>>>>>>         Clause o = c;
>>>>>>>         String[] opt;
>>>>>>>
>>>>>>>         // TODO: I'm a little nervous about stemming for all default fields.
>>>>>>>         // Should keep an eye on this.
>>>>>>>         if (c.isPhrase()) {             // optimize phrase clauses
>>>>>>>           opt = CommonGrams.optimizePhrase(c.getPhrase(), FIELDS[f]);
>>>>>>>         } else {
>>>>>>>           System.out.println("o.getTerm = " + o.getTerm().toString());
>>>>>>>           opt = getStemmedWords(o.getTerm().toString());
>>>>>>>         }
>>>>>>>         if (opt.length==1) {
>>>>>>>           o = new Clause(new Term(opt[0]), c.isRequired(), c.isProhibited());
>>>>>>>         } else {
>>>>>>>           o = new Clause(new Phrase(opt), c.isRequired(), c.isProhibited());
>>>>>>>         }
>>>>>>>
>>>>>>>         out.add(o.isPhrase()
>>>>>>>                 ? exactPhrase(o.getPhrase(), FIELDS[f], FIELD_BOOSTS[f])
>>>>>>>                 : termQuery(FIELDS[f], o.getTerm(), FIELD_BOOSTS[f]),
>>>>>>>                 false, false);
>>>>>>>       }
>>>>>>>       output.add(out, c.isRequired(), c.isProhibited());
>>>>>>>     }
>>>>>>>     System.out.println("query = " + output.toString());
>>>>>>>   }
>>>>>>>
>>>>>>>   private static String[] getStemmedWords(String value) {
>>>>>>>     StringReader sr = new StringReader(value);
>>>>>>>     TokenStream ts = new PorterStemFilter(new LowerCaseTokenizer(sr));
>>>>>>>
>>>>>>>     String stemmedValue = "";
>>>>>>>     try {
>>>>>>>       Token token = ts.next();
>>>>>>>       int count = 0;
>>>>>>>       while (token != null) {
>>>>>>>         System.out.println("token = " + token.termText());
>>>>>>>         System.out.println("type = " + token.type());
>>>>>>>
>>>>>>>         if (count == 0)
>>>>>>>           stemmedValue = token.termText();
>>>>>>>         else
>>>>>>>           stemmedValue = stemmedValue + " " + token.termText();
>>>>>>>
>>>>>>>         token = ts.next();
>>>>>>>         count++;
>>>>>>>       }
>>>>>>>     } catch (Exception e) {
>>>>>>>       stemmedValue = value;
>>>>>>>     }
>>>>>>>
>>>>>>>     if (stemmedValue.equals("")) {
>>>>>>>       stemmedValue = value;
>>>>>>>     }
>>>>>>>
>>>>>>>     String[] stemmedValues = stemmedValue.split("\\s+");
>>>>>>>
>>>>>>>     for (int j=0; j<stemmedValues.length; j++) {
>>>>>>>       System.out.println("stemmedValues = " + stemmedValues[j]);
>>>>>>>     }
>>>>>>>     return stemmedValues;
>>>>>>>   }
>>>>>>>
>>>>>>>
>>>>>>>   private static void addSloppyPhrases(Query input, BooleanQuery output) {
>>>>>>>     Clause[] clauses = input.getClauses();
>>>>>>>     for (int f = 0; f < FIELDS.length; f++) {
>>>>>>>
>>>>>>>       PhraseQuery sloppyPhrase = new PhraseQuery();
>>>>>>>       sloppyPhrase.setBoost(FIELD_BOOSTS[f] * PHRASE_BOOST);
>>>>>>>       sloppyPhrase.setSlop("anchor".equals(FIELDS[f])
>>>>>>>                            ? NutchDocumentAnalyzer.INTER_ANCHOR_GAP
>>>>>>>                            : SLOP);
>>>>>>>       int sloppyTerms = 0;
>>>>>>>
>>>>>>>       for (int i = 0; i < clauses.length; i++) {
>>>>>>>         Clause c = clauses[i];
>>>>>>>
>>>>>>>         if (!c.getField().equals(Clause.DEFAULT_FIELD))
>>>>>>>           continue;                     // skip non-default fields
>>>>>>>
>>>>>>>         if (c.isPhrase())               // skip exact phrases
>>>>>>>           continue;
>>>>>>>
>>>>>>>         if (c.isProhibited())           // skip prohibited terms
>>>>>>>           continue;
>>>>>>>
>>>>>>>         sloppyPhrase.add(luceneTerm(FIELDS[f], c.getTerm()));
>>>>>>>         sloppyTerms++;
>>>>>>>       }
>>>>>>>
>>>>>>>       if (sloppyTerms > 1)
>>>>>>>         output.add(sloppyPhrase, false, false);
>>>>>>>     }
>>>>>>>   }
>>>>>>>
>>>>>>>
>>>>>>>   private static org.apache.lucene.search.Query
>>>>>>>     termQuery(String field, Term term, float boost) {
>>>>>>>     TermQuery result = new TermQuery(luceneTerm(field, term));
>>>>>>>     result.setBoost(boost);
>>>>>>>     return result;
>>>>>>>   }
>>>>>>>
>>>>>>>   /** Utility to construct a Lucene exact phrase query for a Nutch phrase. */
>>>>>>>   private static org.apache.lucene.search.Query
>>>>>>>     exactPhrase(Phrase nutchPhrase, String field, float boost) {
>>>>>>>     Term[] terms = nutchPhrase.getTerms();
>>>>>>>     PhraseQuery exactPhrase = new PhraseQuery();
>>>>>>>     for (int i = 0; i < terms.length; i++) {
>>>>>>>       exactPhrase.add(luceneTerm(field, terms[i]));
>>>>>>>     }
>>>>>>>     exactPhrase.setBoost(boost);
>>>>>>>     return exactPhrase;
>>>>>>>   }
>>>>>>>
>>>>>>>   /** Utility to construct a Lucene Term given a Nutch query term and field. */
>>>>>>>   private static org.apache.lucene.index.Term luceneTerm(String field,
>>>>>>>                                                          Term term) {
>>>>>>>     return new org.apache.lucene.index.Term(field, term.toString());
>>>>>>>   }
>>>>>>> }
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
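
The symptom discussed in this thread (a query for "interview" matches a page containing "interviews", but a query for "interviews" returns nothing) is what you would expect when terms are stemmed at index time but the query string is not. Below is a minimal sketch of that effect. It assumes a toy suffix stripper in place of Lucene's PorterStemFilter, and ToyStemDemo with its stem, index and matches methods are hypothetical names, not Nutch or Lucene APIs.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustration only: a crude suffix stripper standing in for a real stemmer.
// This is NOT the Porter algorithm and none of these names exist in Nutch.
class ToyStemDemo {

    // Lowercase the word and strip one of a few common English suffixes.
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.endsWith("ing") && w.length() > 5) {
            return w.substring(0, w.length() - 3);
        }
        if (w.endsWith("ed") && w.length() > 4) {
            return w.substring(0, w.length() - 2);
        }
        if (w.endsWith("s") && !w.endsWith("ss") && w.length() > 3) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    // Index a page the way a stemming analyzer would: store only stems.
    static Set<String> index(String text) {
        Set<String> terms = new HashSet<>();
        for (String word : text.split("\\W+")) {
            if (!word.isEmpty()) {
                terms.add(stem(word));
            }
        }
        return terms;
    }

    // Look a query term up in the index, optionally stemming it first.
    static boolean matches(Set<String> indexTerms, String queryTerm, boolean stemQuery) {
        String term = stemQuery ? stem(queryTerm) : queryTerm.toLowerCase(Locale.ROOT);
        return indexTerms.contains(term);
    }

    public static void main(String[] args) {
        Set<String> idx = index("She gave two interviews last week");
        System.out.println(matches(idx, "interview", false));   // true
        System.out.println(matches(idx, "interviews", false));  // false
        System.out.println(matches(idx, "interviews", true));   // true
    }
}
```

The middle case is the one reported above: the index holds only the stem, so the literal query term finds nothing until the query goes through the same stemming step as the indexed text.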
