Hello Ahmad and all nutch-users,

Thank you for your response.
As you say, the interface in version 1.0 of Nutch is not the same as in 0.9.
I no longer get any errors when building the plugin, but the problem is
that no new field appears in the index when I inspect it with Luke.

Below are the 3 new classes of the "author" plugin, which compile without
errors:

1) Class AuthorIndexer:
----------------------------

package org.apache.nutch.parse.author;

// Commons imports
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Hadoop imports
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;

// Nutch imports
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.lucene.LuceneWriter;
import org.apache.nutch.parse.Parse;

public class AuthorIndexer implements IndexingFilter {

  public static final Log LOG =
      LogFactory.getLog(AuthorIndexer.class.getName());

  private Configuration conf;

  public AuthorIndexer() {
  }

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {

    // "Author" is the metadata key written by AuthorParser
    String recommendation = parse.getData().getMeta("Author");

    if (recommendation != null) {
      doc.add("author", recommendation);
      LOG.info("Added " + recommendation + " to the author field");
    }

    return doc;
  }

  public void addIndexBackendOptions(Configuration conf) {
    // stored, indexed and un-tokenized
    LuceneWriter.addFieldOptions("author", LuceneWriter.STORE.YES,
        LuceneWriter.INDEX.UNTOKENIZED, conf);
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return this.conf;
  }

}



2) Class AuthorParser:
----------------------------

package org.apache.nutch.parse.author;

// JDK imports
import java.util.Enumeration;
import java.util.Properties;

// Nutch imports
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.protocol.Content;

// Commons imports
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// W3C imports
import org.w3c.dom.DocumentFragment;

public class AuthorParser implements HtmlParseFilter {

  private static final Log LOG =
      LogFactory.getLog(AuthorParser.class.getName());

  private Configuration conf;

  /** The Author meta data attribute name */
  public static final String META_RECOMMENDED_NAME="Author";

  /**
   * Scan the HTML document looking for an author meta tag.
   */
  public ParseResult filter(Content content, ParseResult parse,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    // Try to find the document's author meta tag
    Properties generalMetaTags = metaTags.getGeneralTags();
    String recommendation = generalMetaTags.getProperty("author");

    if (recommendation == null) {
      LOG.info("No Recommendation");
    } else {
      LOG.info("Adding Recommendation for " + recommendation);

      // Inject the value into the content metadata of this page's parse.
      // The lookup uses the page URL as key: as far as I can tell the
      // ParseResult entries are keyed by URL, so parse.get("author")
      // would find no entry.
      parse.get(content.getUrl()).getData().getContentMeta()
          .set(META_RECOMMENDED_NAME, recommendation);
    }

    return parse;
  }


  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return this.conf;
  }
}
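
One point worth checking in the filter above is which key is used for the
lookup on the ParseResult. The sketch below is only a stand-alone
illustration, assuming that the entries are keyed by the page URL. It uses a
plain Map as a stand-in, and the class UrlKeyedLookup and its methods are
hypothetical names, not part of Nutch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a URL-keyed parse result; no Nutch classes are used.
final class UrlKeyedLookup {

    // One metadata map per page URL (assumption: entries are keyed by URL).
    private final Map<String, Map<String, String>> metaByUrl = new HashMap<>();

    /** Registers an empty metadata map for a parsed page. */
    void addPage(String url) {
        metaByUrl.put(url, new HashMap<>());
    }

    /**
     * Stores a metadata value under the entry found for the given key.
     * Returns false when no entry exists for that key.
     */
    boolean setMeta(String key, String name, String value) {
        Map<String, String> meta = metaByUrl.get(key);
        if (meta == null) {
            return false; // nothing is stored under this key
        }
        meta.put(name, value);
        return true;
    }

    /** Returns the stored metadata value, or null if the key or name is unknown. */
    String getMeta(String key, String name) {
        Map<String, String> meta = metaByUrl.get(key);
        return meta == null ? null : meta.get(name);
    }
}
```

With a page registered under its URL, a lookup with the field name "author"
finds nothing, while a lookup with the URL succeeds. In the real filter the
equivalent of the failed lookup would be a null result on which getData() is
then called.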


3) Class AuthorQueryFilter:
---------------------------------

package org.apache.nutch.parse.author;

import org.apache.nutch.searcher.FieldQueryFilter;

// Commons imports
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;


public class AuthorQueryFilter extends FieldQueryFilter {
    private static final Log LOG =
        LogFactory.getLog(AuthorQueryFilter.class.getName());

    public AuthorQueryFilter() {
        super("author", 5f);
        LOG.info("Added a author query");
    }

}

In addition, in nutch-site.xml I added "author" to the list of plugins
to be used.

In the build.xml at the root of the plugins directory I added the line
<ant dir="author" target="deploy"/>

Also, in Nutch's schema.xml file I added the line
<field name="author" type="string" stored="true" indexed="true"/>

I then built Nutch with ant, and the build works correctly.

I don't know why the field "author" doesn't appear in the final index.


Afterwards I removed the "author" plugin and activated the "feed" plugin,
which ships with Nutch in the plugin directory (disabled by default) and
which contains a field named "author".

With this, the new field appears in the index, along with 3 other fields.



My actual goal is to add a new field that is not exactly the "author"
field: for example COUNTRY or ACTIVITY, which I want to add manually to
the Nutch results, perhaps by using the URL domain name.
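
To make the COUNTRY idea concrete, here is a minimal sketch in plain Java of
deriving a country value from the top-level domain of a URL. The class
CountryFromUrl, the method countryFor and the small lookup table are
hypothetical and only for illustration; none of this is Nutch API.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: maps a URL's top-level domain to a country.
final class CountryFromUrl {

    // Illustrative table only; a real mapping would need many more entries.
    private static final Map<String, String> TLD_TO_COUNTRY = new HashMap<>();
    static {
        TLD_TO_COUNTRY.put("fr", "France");
        TLD_TO_COUNTRY.put("de", "Germany");
        TLD_TO_COUNTRY.put("tn", "Tunisia");
    }

    /** Returns the country for the URL's top-level domain, or null if unknown. */
    static String countryFor(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // the string is not a valid URI
        }
        if (host == null) {
            return null; // no host part, e.g. a relative URL
        }
        int dot = host.lastIndexOf('.');
        if (dot < 0 || dot == host.length() - 1) {
            return null; // no top-level domain to inspect
        }
        String tld = host.substring(dot + 1).toLowerCase(Locale.ROOT);
        return TLD_TO_COUNTRY.get(tld);
    }

    public static void main(String[] args) {
        System.out.println(countryFor("http://www.example.fr/index.html")); // France
        System.out.println(countryFor("http://www.example.com/"));          // null
    }
}
```

In an indexing filter like AuthorIndexer, the returned value could presumably
be added with doc.add("country", value) when it is not null, in the same way
the "author" value is added.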


So I am currently stuck on adding the new "author" field, and I don't
know where the problem comes from.

NOTE:
I am using Nutch with Solr, so I am not sure whether the problem is
related to that.


I would appreciate your help.
Thanks.


2010/3/22 Ahmad Al-Amri <amri...@yahoo.com>

>
> Hello;
>
> The filter method in the 0.9 example does not match the interface
> that is implemented in version 1.0.
>
> Note that the 0.9 method returns "Document", but the 1.0 one returns
> "NutchDocument",
> and there is a slight difference in how meta tags are read.
>
> Check these very helpful links:
>
> http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html
> http://sujitpal.blogspot.com/2009/07/nutch-getting-my-feet-wet.html
>
> I assume you added your "author" folder, which contains the .jar file, to
> your plugin directory!
>
>
> Regards;
> Ahmad Al-Amri
>
>
>
>
>
>
>
> ________________________________
> From: Arnaud Garcia <arnaud1...@gmail.com>
> To: nutch-user@lucene.apache.org
> Sent: Wed, March 17, 2010 8:25:26 AM
> Subject: Re: Plugin installed , deployed and works correctly but no new
> field  in the index ????????????
>
> 2010/3/17 Arnaud Garcia <arnaud1...@gmail.com>
>
> >
> >
> > 2010/3/17 Arnaud Garcia <arnaud1...@gmail.com>
> >
> > Hello everybody,
> >>
> >> I am trying to add a new plugin to Nutch as explained in the howto
> >> WritingPluginExample on the Apache wiki.
> >>
> >> Because the plugin example on the wiki is for version 0.9, I switched
> >> to Nutch 0.9 after getting a lot of errors with Nutch 1.0.
> >>
> >> My new plugin, named "author", will extract the value of the author tag
> >> from the crawled pages.
> >>
> >> The method used is exactly the same as on the wiki, with the name
> >> "author" in place of "recommended".
> >>
> >> Everything builds successfully (the plugin separately, and Nutch),
> >> the name of the plugin ("author") was added to the nutch-site.xml file,
> >>
> >> the tag <ant dir="author" target="deploy"/> was added correctly
> >> to the file /nutch/src/plugin/ and
> >>
> >> the "author" directory was created in the directory /nutch/build/ .
> >>
> >>
> >> THE PROBLEM IS:
> >>
> >> No new field named "author" exists in the index.
> >>
> >> I am using Luke to read and display the index, but there isn't any trace
> >> of the new field "author".
> >>
> >> I have verified that the tag named author exists in the Web page which I
> >> crawl.
> >>
> >>
> >> Does anyone know where the problem may come from?
> >>
> >> Can anyone help me, please?
> >> Best regards
> >> Thanks
> >>