Hi,
The settings look reasonable. But for testing purposes, I would get rid of
the other query filters and put some print statements in the query-stemmer
to see what's happening.
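
For example, something like this in the plugin's filter() method (just a
sketch based on the StemmerQueryFilter source further down in this thread;
the exact messages don't matter):

  public BooleanQuery filter(Query input, BooleanQuery output) {
    // Temporary trace: if this never prints, the plugin isn't being called.
    System.out.println("StemmerQueryFilter.filter() called, input = " + input);
    addTerms(input, output);
    addSloppyPhrases(input, output);
    // Temporary trace: the Lucene query the filter actually built.
    System.out.println("StemmerQueryFilter.filter() output = " + output);
    return output;
  }

If the first line never prints, the plugin isn't in the query path at all;
if it prints but the output query doesn't contain the stemmed term, the
stemming itself is the problem.
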
Howie
>In my nutch-site.xml I overrode the plugin.includes property as below:
>
><property>
> <name>plugin.includes</name>
>
><value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
> <description>Regular expression naming plugin directory names to
> include. Any plugin not matching this expression is excluded.
> In any case you need at least include the nutch-extensionpoints plugin.
> By default Nutch includes crawling just HTML and plain text via HTTP,
> and basic indexing and search plugins.
> </description>
></property>
>
>
>However, it is still only letting me search for the stemmed term (i.e.
>"interview" returns results but "interviewed" doesn't, even though that's
>the word that's actually on the page).
>
>I tried a different approach and removed the query-stemmer value from
>nutch-site.xml to attempt to disable the plugin. I reran the crawl and it
>didn't load the plugin. However, it still had the same stemming
>functionality. I'm guessing this is due to editing the main files such as
>CommonGrams.java and NutchDocumentAnalyzer.java. Should I attempt to copy
>the needed methods into StemmerQueryFilter.java and try to isolate all
>functionality to the plugin alone?
>
>Thanks,
> Matt
>
>Howie Wang wrote:
>>It sounds like the query-stemmer is not being called.
>>The query string "interviews" needs to be processed
>>into "interview". Are you sure that your nutch-default.xml
>>is including the query-stemmer correctly? Put print statements
>>in to see if it's getting there.
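>>
>>Just to make that concrete, here's a tiny standalone check (an untested
>>sketch, using the same Lucene analysis classes as the query-stemmer source
>>below) that should print the stemmed form of the query word:
>>
>>import java.io.IOException;
>>import java.io.StringReader;
>>import org.apache.lucene.analysis.LowerCaseTokenizer;
>>import org.apache.lucene.analysis.PorterStemFilter;
>>import org.apache.lucene.analysis.Token;
>>import org.apache.lucene.analysis.TokenStream;
>>
>>public class StemCheck {
>>  public static void main(String[] args) throws IOException {
>>    // Run a query word through a lowercase + Porter stemmer chain,
>>    // as getStemmedWords() in the query-stemmer source does.
>>    TokenStream ts = new PorterStemFilter(
>>        new LowerCaseTokenizer(new StringReader("interviews")));
>>    Token token = ts.next();
>>    System.out.println(token.termText());  // prints "interview"
>>  }
>>}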
>>
>>By the way, someone recently told me that they
>>were able to put all the stemming code into an indexing
>>filter without touching any of the main code. All they
>>did was to copy some of the code that is being done
>>in NutchDocumentAnalyzer and CommonGrams into
>>their custom index filter. Haven't tried it myself.
>>
>>HTH
>>Howie
>>
>>>OK. I did this for Nutch 0.8 (I had to edit the listed code a bit to
>>>account for changes from 0.7.2 to 0.8, mostly to do with the Configuration
>>>type now being required).
>>>
>>>It partially works.
>>>
>>>If the page I'm trying to index contains the word "interviews" and I type
>>>"interview" into the search engine, the stemming takes place and the page
>>>with the word "interviews" is returned.
>>>However, if I type in the word "interviews", no page is returned. (The
>>>page with the word "interviews" on it should be returned.)
>>>
>>>Any ideas??
>>>Matt
>>>
>>>Dima Mazmanov wrote:
>>>>Hi,
>>>>
>>>>I've gotten a couple of questions offlist about stemming
>>>>so I thought I'd just post here with my changes. Sorry that
>>>>some of the changes are in the main code and not in a plugin. It
>>>>seemed more efficient to put it in the main analyzer. It
>>>>would be nice if later releases could add support for plugging
>>>>in a custom stemmer/analyzer.
>>>>
>>>>The first change I made is in NutchDocumentAnalyzer.java.
>>>>
>>>>Import the following classes at the top of the file:
>>>>import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>import org.apache.lucene.analysis.LowerCaseFilter;
>>>>import org.apache.lucene.analysis.PorterStemFilter;
>>>>
>>>>Change tokenStream to:
>>>>
>>>>  public TokenStream tokenStream(String field, Reader reader) {
>>>>    TokenStream ts =
>>>>        CommonGrams.getFilter(new NutchDocumentTokenizer(reader), field);
>>>>    if (field.equals("content") || field.equals("title")) {
>>>>      ts = new LowerCaseFilter(ts);
>>>>      return new PorterStemFilter(ts);
>>>>    } else {
>>>>      return ts;
>>>>    }
>>>>  }
>>>>
>>>>The second change is in CommonGrams.java.
>>>>Import the following classes near the top:
>>>>
>>>>import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>import org.apache.lucene.analysis.LowerCaseFilter;
>>>>import org.apache.lucene.analysis.PorterStemFilter;
>>>>
>>>>In optimizePhrase, after this line:
>>>>
>>>> TokenStream ts = getFilter(new ArrayTokens(phrase), field);
>>>>
>>>>Add:
>>>>
>>>> ts = new PorterStemFilter(new LowerCaseFilter(ts));
>>>>
>>>>And the rest is a new QueryFilter plugin that I'm calling query-stemmer.
>>>>Here's the full source for the Java file. You can copy the build.xml
>>>>and plugin.xml from query-basic, and alter the names for query-stemmer.
>>>>
>>>>/* Copyright (c) 2003 The Nutch Organization. All rights reserved. */
>>>>/* Use subject to the conditions in http://www.nutch.org/LICENSE.txt. */
>>>>
>>>>package org.apache.nutch.searcher.stemmer;
>>>>
>>>>import org.apache.lucene.search.BooleanQuery;
>>>>import org.apache.lucene.search.PhraseQuery;
>>>>import org.apache.lucene.search.TermQuery;
>>>>import org.apache.lucene.analysis.TokenFilter;
>>>>import org.apache.lucene.analysis.TokenStream;
>>>>import org.apache.lucene.analysis.Token;
>>>>import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>import org.apache.lucene.analysis.LowerCaseFilter;
>>>>import org.apache.lucene.analysis.PorterStemFilter;
>>>>
>>>>import org.apache.nutch.analysis.NutchDocumentAnalyzer;
>>>>import org.apache.nutch.analysis.CommonGrams;
>>>>
>>>>import org.apache.nutch.searcher.QueryFilter;
>>>>import org.apache.nutch.searcher.Query;
>>>>import org.apache.nutch.searcher.Query.*;
>>>>
>>>>import java.io.IOException;
>>>>import java.util.HashSet;
>>>>import java.io.StringReader;
>>>>
>>>>/** A stemming query filter. Query terms in the default query field are
>>>> * stemmed and expanded to search the url, anchor, content and title
>>>> * document fields. */
>>>>public class StemmerQueryFilter implements QueryFilter {
>>>>
>>>>  private static float URL_BOOST = 4.0f;
>>>>  private static float ANCHOR_BOOST = 2.0f;
>>>>
>>>>  private static int SLOP = Integer.MAX_VALUE;
>>>>  private static float PHRASE_BOOST = 1.0f;
>>>>
>>>>  private static final String[] FIELDS =
>>>>    {"url", "anchor", "content", "title"};
>>>>  private static final float[] FIELD_BOOSTS =
>>>>    {URL_BOOST, ANCHOR_BOOST, 1.0f, 2.0f};
>>>>
>>>>  /** Set the boost factor for url matches, relative to content and
>>>>   * anchor matches. */
>>>>  public static void setUrlBoost(float boost) { URL_BOOST = boost; }
>>>>
>>>>  /** Set the boost factor for title/anchor matches, relative to url and
>>>>   * content matches. */
>>>>  public static void setAnchorBoost(float boost) { ANCHOR_BOOST = boost; }
>>>>
>>>>  /** Set the boost factor for sloppy phrase matches relative to unordered
>>>>   * term matches. */
>>>>  public static void setPhraseBoost(float boost) { PHRASE_BOOST = boost; }
>>>>
>>>>  /** Set the maximum number of terms permitted between matching terms in
>>>>   * a sloppy phrase match. */
>>>>  public static void setSlop(int slop) { SLOP = slop; }
>>>>
>>>>  public BooleanQuery filter(Query input, BooleanQuery output) {
>>>>    addTerms(input, output);
>>>>    addSloppyPhrases(input, output);
>>>>    return output;
>>>>  }
>>>>
>>>>  private static void addTerms(Query input, BooleanQuery output) {
>>>>    Clause[] clauses = input.getClauses();
>>>>    for (int i = 0; i < clauses.length; i++) {
>>>>      Clause c = clauses[i];
>>>>
>>>>      if (!c.getField().equals(Clause.DEFAULT_FIELD))
>>>>        continue;                              // skip non-default fields
>>>>
>>>>      BooleanQuery out = new BooleanQuery();
>>>>      for (int f = 0; f < FIELDS.length; f++) {
>>>>
>>>>        Clause o = c;
>>>>        String[] opt;
>>>>
>>>>        // TODO: I'm a little nervous about stemming for all default
>>>>        // fields. Should keep an eye on this.
>>>>        if (c.isPhrase()) {                    // optimize phrase clauses
>>>>          opt = CommonGrams.optimizePhrase(c.getPhrase(), FIELDS[f]);
>>>>        } else {
>>>>          System.out.println("o.getTerm = " + o.getTerm().toString());
>>>>          opt = getStemmedWords(o.getTerm().toString());
>>>>        }
>>>>        if (opt.length == 1) {
>>>>          o = new Clause(new Term(opt[0]), c.isRequired(), c.isProhibited());
>>>>        } else {
>>>>          o = new Clause(new Phrase(opt), c.isRequired(), c.isProhibited());
>>>>        }
>>>>
>>>>        out.add(o.isPhrase()
>>>>                ? exactPhrase(o.getPhrase(), FIELDS[f], FIELD_BOOSTS[f])
>>>>                : termQuery(FIELDS[f], o.getTerm(), FIELD_BOOSTS[f]),
>>>>                false, false);
>>>>      }
>>>>      output.add(out, c.isRequired(), c.isProhibited());
>>>>    }
>>>>    System.out.println("query = " + output.toString());
>>>>  }
>>>>
>>>>  private static String[] getStemmedWords(String value) {
>>>>    StringReader sr = new StringReader(value);
>>>>    TokenStream ts = new PorterStemFilter(new LowerCaseTokenizer(sr));
>>>>
>>>>    String stemmedValue = "";
>>>>    try {
>>>>      Token token = ts.next();
>>>>      int count = 0;
>>>>      while (token != null) {
>>>>        System.out.println("token = " + token.termText());
>>>>        System.out.println("type = " + token.type());
>>>>
>>>>        if (count == 0)
>>>>          stemmedValue = token.termText();
>>>>        else
>>>>          stemmedValue = stemmedValue + " " + token.termText();
>>>>
>>>>        token = ts.next();
>>>>        count++;
>>>>      }
>>>>    } catch (Exception e) {
>>>>      stemmedValue = value;
>>>>    }
>>>>
>>>>    if (stemmedValue.equals("")) {
>>>>      stemmedValue = value;
>>>>    }
>>>>
>>>>    String[] stemmedValues = stemmedValue.split("\\s+");
>>>>
>>>>    for (int j = 0; j < stemmedValues.length; j++) {
>>>>      System.out.println("stemmedValues = " + stemmedValues[j]);
>>>>    }
>>>>    return stemmedValues;
>>>>  }
>>>>
>>>>  private static void addSloppyPhrases(Query input, BooleanQuery output) {
>>>>    Clause[] clauses = input.getClauses();
>>>>    for (int f = 0; f < FIELDS.length; f++) {
>>>>
>>>>      PhraseQuery sloppyPhrase = new PhraseQuery();
>>>>      sloppyPhrase.setBoost(FIELD_BOOSTS[f] * PHRASE_BOOST);
>>>>      sloppyPhrase.setSlop("anchor".equals(FIELDS[f])
>>>>                           ? NutchDocumentAnalyzer.INTER_ANCHOR_GAP
>>>>                           : SLOP);
>>>>      int sloppyTerms = 0;
>>>>
>>>>      for (int i = 0; i < clauses.length; i++) {
>>>>        Clause c = clauses[i];
>>>>
>>>>        if (!c.getField().equals(Clause.DEFAULT_FIELD))
>>>>          continue;                            // skip non-default fields
>>>>
>>>>        if (c.isPhrase())                      // skip exact phrases
>>>>          continue;
>>>>
>>>>        if (c.isProhibited())                  // skip prohibited terms
>>>>          continue;
>>>>
>>>>        sloppyPhrase.add(luceneTerm(FIELDS[f], c.getTerm()));
>>>>        sloppyTerms++;
>>>>      }
>>>>
>>>>      if (sloppyTerms > 1)
>>>>        output.add(sloppyPhrase, false, false);
>>>>    }
>>>>  }
>>>>
>>>>  private static org.apache.lucene.search.Query
>>>>      termQuery(String field, Term term, float boost) {
>>>>    TermQuery result = new TermQuery(luceneTerm(field, term));
>>>>    result.setBoost(boost);
>>>>    return result;
>>>>  }
>>>>
>>>>  /** Utility to construct a Lucene exact phrase query for a Nutch
>>>>   * phrase. */
>>>>  private static org.apache.lucene.search.Query
>>>>      exactPhrase(Phrase nutchPhrase, String field, float boost) {
>>>>    Term[] terms = nutchPhrase.getTerms();
>>>>    PhraseQuery exactPhrase = new PhraseQuery();
>>>>    for (int i = 0; i < terms.length; i++) {
>>>>      exactPhrase.add(luceneTerm(field, terms[i]));
>>>>    }
>>>>    exactPhrase.setBoost(boost);
>>>>    return exactPhrase;
>>>>  }
>>>>
>>>>  /** Utility to construct a Lucene Term given a Nutch query term and
>>>>   * field. */
>>>>  private static org.apache.lucene.index.Term luceneTerm(String field,
>>>>                                                         Term term) {
>>>>    return new org.apache.lucene.index.Term(field, term.toString());
>>>>  }
>>>>}
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>>
>>