Hi, Matthew
I think we should use fieldName instead of field, or not...
===============stemming code begin=======================
public TokenStream tokenStream(String field, Reader reader) {
  // Anchors use their own analyzer; every other field uses the content one.
  Analyzer analyzer;
  if ("anchor".equals(field)) {
    analyzer = ANCHOR_ANALYZER;
  } else {
    analyzer = CONTENT_ANALYZER;
  }
  TokenStream ts = analyzer.tokenStream(field, reader);
  // Lowercase and stem only the content and title fields.
  if (field.equals("content") || field.equals("title")) {
    ts = new LowerCaseFilter(ts);
    return new PorterStemFilter(ts);
  } else {
    return ts;
  }
}
===============stemming code end=======================
P.S. This patch has no effect on Russian text.
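
To make the field check above concrete, here is a minimal, self-contained sketch. It is not Lucene code: `FieldStemSketch`, `analyze` and `toyStem` are hypothetical names, and `toyStem` is a crude suffix stripper standing in for PorterStemFilter. It only shows the shape of the idea: "content" and "title" are lowercased and stemmed, other fields pass through, and a Cyrillic token is left alone because none of the English suffix rules apply to it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy stand-in for the field-dependent filter chain; not Lucene code. */
class FieldStemSketch {

    /** Hypothetical suffix stripper standing in for PorterStemFilter. */
    static String toyStem(String token) {
        if (token.endsWith("ing") && token.length() > 5) {
            return token.substring(0, token.length() - 3);
        }
        if (token.endsWith("ed") && token.length() > 4) {
            return token.substring(0, token.length() - 2);
        }
        if (token.endsWith("s") && !token.endsWith("ss") && token.length() > 3) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    /** Splits on whitespace; lowercases and stems only "content" and "title". */
    static List<String> analyze(String field, String text) {
        boolean stem = "content".equals(field) || "title".equals(field);
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            tokens.add(stem ? toyStem(raw.toLowerCase(Locale.ROOT)) : raw);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("content", "Interviews interviewed"));
        System.out.println(analyze("url", "Interviews interviewed"));
        // A Cyrillic token: none of the English suffix rules match it.
        System.out.println(analyze("content", "\u0438\u043d\u0442\u0435\u0440\u0432\u044c\u044e"));
    }
}
```

Whether Lucene's real Porter stemmer treats every Russian word this way is not something the sketch proves; it only illustrates why an English-oriented stemmer would be expected to leave such tokens unchanged.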
Regards,
Alexey
------------------------------
Howie,
Thanks for all the help configuring your stemming addon for version
0.8. I compared query-basic and query-stemmer, and the only new feature
that was added is a "host" boost. I made the changes and everything
works perfectly.
I uploaded the code to the wiki for both version 0.7.2 and 0.8. You can
access it at the URL below:
http://wiki.apache.org/nutch/FAQ#head-fa0c678473eeecf3771e490b22d385054697232c
Take care,
Matt
Howie Wang wrote:
> Hi, Matt,
>
> In 0.7, you wouldn't miss anything. That code was written to
> replace the basic query filter, and handled all the fields that
> basic query filter was handling. For 0.8, I'm really not sure.
> I'm guessing the code is fairly simple still in 0.8. You can probably
> figure out if query-basic in 0.8 is doing something appreciably different
> than query-stemmer by just visually comparing the files.
>
> Howie
>
>> Howie,
>> The query-stemmer works great as long as query-basic is not enabled.
>> However, if I don't have query-basic enabled, won't I be missing some
>> needed functionality?
>> Matt
>>
>> Howie Wang wrote:
>>> Hi,
>>>
>>> The settings look reasonable. But for testing purposes, I would get
>>> rid of the other query filters and put in some print statements in
>>> the query-stemmer to see what's happening.
>>>
>>> Howie
>>>
>>>> In my nutch-site.xml I overrode the plugin.includes property as below:
>>>>
>>>> <property>
>>>>   <name>plugin.includes</name>
>>>>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
>>>>   <description>Regular expression naming plugin directory names to
>>>>   include. Any plugin not matching this expression is excluded.
>>>>   In any case you need at least include the nutch-extensionpoints
>>>>   plugin. By default Nutch includes crawling just HTML and plain text
>>>>   via HTTP, and basic indexing and search plugins.
>>>>   </description>
>>>> </property>
>>>>
>>>>
>>>> However, it still only lets me search for the stemmed term
>>>> (i.e. "Interview" returns results but "interviewed" doesn't, even
>>>> though that's the word that's actually on the page).
>>>>
>>>> I tried a different approach and removed the query-stemmer value
>>>> from nutch-site.xml to attempt to disable the plugin. I reran the
>>>> crawl and it didn't load the plugin. However, it still had the same
>>>> stemming functionality. I'm guessing this is due to editing the
>>>> main files such as CommonGrams.java and NutchDocumentAnalyzer.java.
>>>> Should I attempt to copy the needed methods into
>>>> StemmerQueryFilter.java and try to isolate all functionality to the
>>>> plugin alone?
>>>>
>>>> Thanks,
>>>> Matt
>>>>
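
The plugin.includes value quoted above is an ordinary regular expression, so it can be checked outside Nutch. The sketch below assumes the expression has to match the whole plugin directory name, which is how the property description reads; `PluginIncludesSketch` and `included` are hypothetical names, not Nutch code.

```java
import java.util.regex.Pattern;

class PluginIncludesSketch {

    // The plugin.includes value from the quoted nutch-site.xml.
    static final String INCLUDES =
        "protocol-httpclient|urlfilter-regex"
        + "|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)"
        + "|index-(basic|more)|query-(more|site|stemmer|url)"
        + "|summary-basic|scoring-opic";

    /** Assumption: the expression must match the entire directory name. */
    static boolean included(String pluginDir) {
        return Pattern.matches(INCLUDES, pluginDir);
    }

    public static void main(String[] args) {
        String[] dirs = {
            "query-stemmer", "query-basic", "query-site", "nutch-extensionpoints"
        };
        for (String dir : dirs) {
            System.out.println(dir + " -> " + included(dir));
        }
    }
}
```

Under that assumption query-stemmer is selected and query-basic is not, which matches the intent described in the thread. The value as quoted also does not name nutch-extensionpoints, which the description says is needed; whether that matters in this setup is not established here.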
>>>> Howie Wang wrote:
>>>>> It sounds like the query-stemmer is not being called.
>>>>> The query string "interviews" needs to be processed
>>>>> into "interview". Are you sure that your nutch-default.xml
>>>>> is including the query-stemmer correctly? Put print statements
>>>>> in to see if it's getting there.
>>>>>
>>>>> By the way, someone recently told me that they
>>>>> were able to put all the stemming code into an indexing
>>>>> filter without touching any of the main code. All they
>>>>> did was to copy some of the code that is being done
>>>>> in NutchDocumentAnalyzer and CommonGrams into
>>>>> their custom index filter. Haven't tried it myself.
>>>>>
>>>>> HTH
>>>>> Howie
>>>>>
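
Howie's point above (the query string "interviews" has to be reduced to "interview" before the lookup) can be illustrated with a tiny inverted index. This is a sketch under stated assumptions, not Nutch or Lucene code: `StemMismatchSketch` and its methods are hypothetical, and `toyStem` only strips a plural "s" in place of the real Porter stemmer.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class StemMismatchSketch {

    /** Hypothetical stand-in for the Porter stemmer: strips a plural "s". */
    static String toyStem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        boolean plural = w.length() > 3 && w.endsWith("s") && !w.endsWith("ss");
        return plural ? w.substring(0, w.length() - 1) : w;
    }

    // term -> ids of pages containing it; terms are stemmed at index time.
    static final Map<String, Set<Integer>> INDEX = new HashMap<>();

    static void indexPage(int id, String text) {
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            INDEX.computeIfAbsent(toyStem(word), k -> new HashSet<>()).add(id);
        }
    }

    /** Looks the term up as typed, as if no stemming query filter ran. */
    static Set<Integer> searchRaw(String term) {
        Set<Integer> hits = INDEX.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<Integer>emptySet() : hits;
    }

    /** Stems the term first, which is the step the query filter has to do. */
    static Set<Integer> searchStemmed(String term) {
        Set<Integer> hits = INDEX.get(toyStem(term));
        return hits == null ? Collections.<Integer>emptySet() : hits;
    }

    public static void main(String[] args) {
        indexPage(1, "annual interviews");
        System.out.println("raw 'interview':      " + searchRaw("interview"));
        System.out.println("raw 'interviews':     " + searchRaw("interviews"));
        System.out.println("stemmed 'interviews': " + searchStemmed("interviews"));
    }
}
```

The raw lookup reproduces the reported symptom: the index only holds "interview", so typing the exact word on the page finds nothing unless the query side applies the same stemming.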
>>>>>> Ok. I did this for Nutch 0.8 (had to edit the listed code some to
>>>>>> make up for changes from 0.7.2 to 0.8, mostly having to do with
>>>>>> the Configuration type being needed).
>>>>>>
>>>>>> It partially works.
>>>>>>
>>>>>> If the page I'm trying to index contains the word "interviews"
>>>>>> and I type in the search engine "interview", the stemming takes
>>>>>> place and the page with the word "interviews" is returned.
>>>>>> However, if I type in the word "interviews" no page is returned.
>>>>>> (The page with the word interviews on it should be returned).
>>>>>>
>>>>>> Any ideas??
>>>>>> Matt
>>>>>>
>>>>>> Dima Mazmanov wrote:
>>>>>>> Hi, .
>>>>>>>
>>>>>>> I've gotten a couple of questions offlist about stemming
>>>>>>> so I thought I'd just post here with my changes. Sorry that
>>>>>>> some of the changes are in the main code and not in a plugin. It
>>>>>>> seemed that it's more efficient to put in the main analyzer. It
>>>>>>> would be nice if later releases could add support for plugging
>>>>>>> in a custom stemmer/analyzer.
>>>>>>>
>>>>>>> The first change I made is in NutchDocumentAnalyzer.java.
>>>>>>>
>>>>>>> Import the following classes at the top of the file:
>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>
>>>>>>> Change tokenStream to:
>>>>>>>
>>>>>>> public TokenStream tokenStream(String field, Reader reader) {
>>>>>>>   TokenStream ts =
>>>>>>>       CommonGrams.getFilter(new NutchDocumentTokenizer(reader), field);
>>>>>>>   if (field.equals("content") || field.equals("title")) {
>>>>>>>     ts = new LowerCaseFilter(ts);
>>>>>>>     return new PorterStemFilter(ts);
>>>>>>>   } else {
>>>>>>>     return ts;
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> The second change is in CommonGrams.java.
>>>>>>> Import the following classes near the top:
>>>>>>>
>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>
>>>>>>> In optimizePhrase, after this line:
>>>>>>>
>>>>>>> TokenStream ts = getFilter(new ArrayTokens(phrase), field);
>>>>>>>
>>>>>>> Add:
>>>>>>>
>>>>>>> ts = new PorterStemFilter(new LowerCaseFilter(ts));
>>>>>>>
>>>>>>> And the rest is a new QueryFilter plugin that I'm calling
>>>>>>> query-stemmer. Here's the full source for the Java file. You can
>>>>>>> copy the build.xml and plugin.xml from query-basic, and alter the
>>>>>>> names for query-stemmer.
>>>>>>>
>>>>>>> /* Copyright (c) 2003 The Nutch Organization. All rights
>>>>>>> reserved. */
>>>>>>> /* Use subject to the conditions in
>>>>>>> http://www.nutch.org/LICENSE.txt. */
>>>>>>>
>>>>>>> package org.apache.nutch.searcher.stemmer;
>>>>>>>
>>>>>>> import org.apache.lucene.search.BooleanQuery;
>>>>>>> import org.apache.lucene.search.PhraseQuery;
>>>>>>> import org.apache.lucene.search.TermQuery;
>>>>>>> import org.apache.lucene.analysis.TokenFilter;
>>>>>>> import org.apache.lucene.analysis.TokenStream;
>>>>>>> import org.apache.lucene.analysis.Token;
>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>
>>>>>>> import org.apache.nutch.analysis.NutchDocumentAnalyzer;
>>>>>>> import org.apache.nutch.analysis.CommonGrams;
>>>>>>>
>>>>>>> import org.apache.nutch.searcher.QueryFilter;
>>>>>>> import org.apache.nutch.searcher.Query;
>>>>>>> import org.apache.nutch.searcher.Query.*;
>>>>>>>
>>>>>>> import java.io.IOException;
>>>>>>> import java.util.HashSet;
>>>>>>> import java.io.StringReader;
>>>>>>>
>>>>>>> /** A stemming variant of the default query filter. Query terms in the
>>>>>>>  * default query field are stemmed and expanded to search the url,
>>>>>>>  * anchor, content and title document fields. */
>>>>>>> public class StemmerQueryFilter implements QueryFilter {
>>>>>>>
>>>>>>>   private static float URL_BOOST = 4.0f;
>>>>>>>   private static float ANCHOR_BOOST = 2.0f;
>>>>>>>
>>>>>>>   private static int SLOP = Integer.MAX_VALUE;
>>>>>>>   private static float PHRASE_BOOST = 1.0f;
>>>>>>>
>>>>>>>   private static final String[] FIELDS =
>>>>>>>       {"url", "anchor", "content", "title"};
>>>>>>>   // Note: filled in once at class load, so later calls to setUrlBoost
>>>>>>>   // or setAnchorBoost do not change these values.
>>>>>>>   private static final float[] FIELD_BOOSTS =
>>>>>>>       {URL_BOOST, ANCHOR_BOOST, 1.0f, 2.0f};
>>>>>>>
>>>>>>>   /** Set the boost factor for url matches, relative to content and
>>>>>>>    * anchor matches. */
>>>>>>>   public static void setUrlBoost(float boost) { URL_BOOST = boost; }
>>>>>>>
>>>>>>>   /** Set the boost factor for title/anchor matches, relative to url
>>>>>>>    * and content matches. */
>>>>>>>   public static void setAnchorBoost(float boost) { ANCHOR_BOOST = boost; }
>>>>>>>
>>>>>>>   /** Set the boost factor for sloppy phrase matches relative to
>>>>>>>    * unordered term matches. */
>>>>>>>   public static void setPhraseBoost(float boost) { PHRASE_BOOST = boost; }
>>>>>>>
>>>>>>>   /** Set the maximum number of terms permitted between matching terms
>>>>>>>    * in a sloppy phrase match. */
>>>>>>>   public static void setSlop(int slop) { SLOP = slop; }
>>>>>>>
>>>>>>>   public BooleanQuery filter(Query input, BooleanQuery output) {
>>>>>>>     addTerms(input, output);
>>>>>>>     addSloppyPhrases(input, output);
>>>>>>>     return output;
>>>>>>>   }
>>>>>>>
>>>>>>>   private static void addTerms(Query input, BooleanQuery output) {
>>>>>>>     Clause[] clauses = input.getClauses();
>>>>>>>     for (int i = 0; i < clauses.length; i++) {
>>>>>>>       Clause c = clauses[i];
>>>>>>>
>>>>>>>       if (!c.getField().equals(Clause.DEFAULT_FIELD))
>>>>>>>         continue;                       // skip non-default fields
>>>>>>>
>>>>>>>       BooleanQuery out = new BooleanQuery();
>>>>>>>       for (int f = 0; f < FIELDS.length; f++) {
>>>>>>>
>>>>>>>         Clause o = c;
>>>>>>>         String[] opt;
>>>>>>>
>>>>>>>         // TODO: I'm a little nervous about stemming for all default
>>>>>>>         // fields. Should keep an eye on this.
>>>>>>>         if (c.isPhrase()) {             // optimize phrase clauses
>>>>>>>           opt = CommonGrams.optimizePhrase(c.getPhrase(), FIELDS[f]);
>>>>>>>         } else {
>>>>>>>           System.out.println("o.getTerm = " + o.getTerm().toString());
>>>>>>>           opt = getStemmedWords(o.getTerm().toString());
>>>>>>>         }
>>>>>>>         if (opt.length == 1) {
>>>>>>>           o = new Clause(new Term(opt[0]), c.isRequired(),
>>>>>>>                          c.isProhibited());
>>>>>>>         } else {
>>>>>>>           o = new Clause(new Phrase(opt), c.isRequired(),
>>>>>>>                          c.isProhibited());
>>>>>>>         }
>>>>>>>
>>>>>>>         out.add(o.isPhrase()
>>>>>>>                 ? exactPhrase(o.getPhrase(), FIELDS[f], FIELD_BOOSTS[f])
>>>>>>>                 : termQuery(FIELDS[f], o.getTerm(), FIELD_BOOSTS[f]),
>>>>>>>                 false, false);
>>>>>>>       }
>>>>>>>       output.add(out, c.isRequired(), c.isProhibited());
>>>>>>>     }
>>>>>>>     System.out.println("query = " + output.toString());
>>>>>>>   }
>>>>>>>
>>>>>>>   private static String[] getStemmedWords(String value) {
>>>>>>>     StringReader sr = new StringReader(value);
>>>>>>>     TokenStream ts = new PorterStemFilter(new LowerCaseTokenizer(sr));
>>>>>>>
>>>>>>>     String stemmedValue = "";
>>>>>>>     try {
>>>>>>>       Token token = ts.next();
>>>>>>>       int count = 0;
>>>>>>>       while (token != null) {
>>>>>>>         System.out.println("token = " + token.termText());
>>>>>>>         System.out.println("type = " + token.type());
>>>>>>>
>>>>>>>         if (count == 0)
>>>>>>>           stemmedValue = token.termText();
>>>>>>>         else
>>>>>>>           stemmedValue = stemmedValue + " " + token.termText();
>>>>>>>
>>>>>>>         token = ts.next();
>>>>>>>         count++;
>>>>>>>       }
>>>>>>>     } catch (Exception e) {
>>>>>>>       stemmedValue = value;
>>>>>>>     }
>>>>>>>
>>>>>>>     if (stemmedValue.equals("")) {
>>>>>>>       stemmedValue = value;
>>>>>>>     }
>>>>>>>
>>>>>>>     String[] stemmedValues = stemmedValue.split("\\s+");
>>>>>>>
>>>>>>>     for (int j = 0; j < stemmedValues.length; j++) {
>>>>>>>       System.out.println("stemmedValues = " + stemmedValues[j]);
>>>>>>>     }
>>>>>>>     return stemmedValues;
>>>>>>>   }
>>>>>>>
>>>>>>>
>>>>>>>   private static void addSloppyPhrases(Query input,
>>>>>>>                                        BooleanQuery output) {
>>>>>>>     Clause[] clauses = input.getClauses();
>>>>>>>     for (int f = 0; f < FIELDS.length; f++) {
>>>>>>>
>>>>>>>       PhraseQuery sloppyPhrase = new PhraseQuery();
>>>>>>>       sloppyPhrase.setBoost(FIELD_BOOSTS[f] * PHRASE_BOOST);
>>>>>>>       sloppyPhrase.setSlop("anchor".equals(FIELDS[f])
>>>>>>>                            ? NutchDocumentAnalyzer.INTER_ANCHOR_GAP
>>>>>>>                            : SLOP);
>>>>>>>       int sloppyTerms = 0;
>>>>>>>
>>>>>>>       for (int i = 0; i < clauses.length; i++) {
>>>>>>>         Clause c = clauses[i];
>>>>>>>
>>>>>>>         if (!c.getField().equals(Clause.DEFAULT_FIELD))
>>>>>>>           continue;                     // skip non-default fields
>>>>>>>
>>>>>>>         if (c.isPhrase())               // skip exact phrases
>>>>>>>           continue;
>>>>>>>
>>>>>>>         if (c.isProhibited())           // skip prohibited terms
>>>>>>>           continue;
>>>>>>>
>>>>>>>         sloppyPhrase.add(luceneTerm(FIELDS[f], c.getTerm()));
>>>>>>>         sloppyTerms++;
>>>>>>>       }
>>>>>>>
>>>>>>>       if (sloppyTerms > 1)
>>>>>>>         output.add(sloppyPhrase, false, false);
>>>>>>>     }
>>>>>>>   }
>>>>>>>
>>>>>>>
>>>>>>>   private static org.apache.lucene.search.Query
>>>>>>>       termQuery(String field, Term term, float boost) {
>>>>>>>     TermQuery result = new TermQuery(luceneTerm(field, term));
>>>>>>>     result.setBoost(boost);
>>>>>>>     return result;
>>>>>>>   }
>>>>>>>
>>>>>>>   /** Utility to construct a Lucene exact phrase query for a Nutch
>>>>>>>    * phrase. */
>>>>>>>   private static org.apache.lucene.search.Query
>>>>>>>       exactPhrase(Phrase nutchPhrase, String field, float boost) {
>>>>>>>     Term[] terms = nutchPhrase.getTerms();
>>>>>>>     PhraseQuery exactPhrase = new PhraseQuery();
>>>>>>>     for (int i = 0; i < terms.length; i++) {
>>>>>>>       exactPhrase.add(luceneTerm(field, terms[i]));
>>>>>>>     }
>>>>>>>     exactPhrase.setBoost(boost);
>>>>>>>     return exactPhrase;
>>>>>>>   }
>>>>>>>
>>>>>>>   /** Utility to construct a Lucene Term given a Nutch query term and
>>>>>>>    * field. */
>>>>>>>   private static org.apache.lucene.index.Term
>>>>>>>       luceneTerm(String field, Term term) {
>>>>>>>     return new org.apache.lucene.index.Term(field, term.toString());
>>>>>>>   }
>>>>>>> }
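
The per-field expansion that addTerms performs in the quoted filter can be pictured with a small stand-alone sketch. It reuses the field names and boost values from the quoted source (url 4.0, anchor 2.0, content 1.0, title 2.0), but `FieldExpansionSketch` and `expand` are hypothetical, and the string it builds is only a readable rendering, not the output of Lucene's own query classes.

```java
import java.util.StringJoiner;

class FieldExpansionSketch {

    // Field names and boosts as they appear in the quoted StemmerQueryFilter.
    static final String[] FIELDS = {"url", "anchor", "content", "title"};
    static final float[] FIELD_BOOSTS = {4.0f, 2.0f, 1.0f, 2.0f};

    /** Renders one already-stemmed term as an OR over all four fields. */
    static String expand(String stemmedTerm) {
        StringJoiner clause = new StringJoiner(" ", "(", ")");
        for (int f = 0; f < FIELDS.length; f++) {
            String part = FIELDS[f] + ":" + stemmedTerm;
            if (FIELD_BOOSTS[f] != 1.0f) {
                part += "^" + FIELD_BOOSTS[f];   // show non-default boosts only
            }
            clause.add(part);
        }
        return clause.toString();
    }

    public static void main(String[] args) {
        // prints (url:interview^4.0 anchor:interview^2.0 content:interview title:interview^2.0)
        System.out.println(expand("interview"));
    }
}
```

Each default-field term in the user's query becomes one such group, which is why the same stem has to exist in the index for the content and title fields.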
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general