Hello,

This may be the wrong place to post this request, so forgive me, but I really
need some help with Nutch 2.2.1.

I need to add a new field to be indexed by ElasticSearch.

In 1.7, we had:

The HtmlParseFilter extension, with:
  ParseResult filter(Content content, ParseResult parseResult,
                     HTMLMetaTags metaTags, DocumentFragment doc)
  <http://nutch.apache.org/apidocs-1.7/org/apache/nutch/parse/HtmlParseFilter.html#filter%28org.apache.nutch.protocol.Content,%20org.apache.nutch.parse.ParseResult,%20org.apache.nutch.parse.HTMLMetaTags,%20org.w3c.dom.DocumentFragment%29>

The IndexingFilter extension, with:
  NutchDocument filter(NutchDocument doc, Parse parse,
                       org.apache.hadoop.io.Text url, CrawlDatum datum,
                       Inlinks inlinks)
  <http://nutch.apache.org/apidocs-1.7/org/apache/nutch/indexer/IndexingFilter.html#filter%28org.apache.nutch.indexer.NutchDocument,%20org.apache.nutch.parse.Parse,%20org.apache.hadoop.io.Text,%20org.apache.nutch.crawl.CrawlDatum,%20org.apache.nutch.crawl.Inlinks%29>

With these two extensions, adding a field worked fine.
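
For context, here is a simplified sketch of that 1.7 pattern: the parse filter stores a value in the parse metadata, and the indexing filter copies it into the document. The classes `ParseStandIn` and `DocStandIn` and the field name `myfield` are hypothetical stand-ins, not Nutch classes, so the sketch runs without Nutch on the classpath.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Stand-alone sketch of the two-step pattern; no real Nutch classes are used. */
public class CustomFieldSketch {

    /** Hypothetical stand-in for the parse metadata reached through Parse.getData() in 1.7. */
    static class ParseStandIn {
        final Map<String, String> parseMeta = new HashMap<>();
    }

    /** Hypothetical stand-in for NutchDocument: field name mapped to its values. */
    static class DocStandIn {
        final Map<String, List<String>> fields = new HashMap<>();

        void add(String name, String value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
    }

    /** Step 1 (parse filter role): remember the extracted value in the parse metadata. */
    static ParseStandIn parseFilter(ParseStandIn parse, String extracted) {
        parse.parseMeta.put("myfield", extracted);
        return parse;
    }

    /** Step 2 (indexing filter role): copy the value from the parse metadata into the document. */
    static DocStandIn indexingFilter(DocStandIn doc, ParseStandIn parse) {
        String value = parse.parseMeta.get("myfield");
        if (value != null) {
            doc.add("myfield", value);
        }
        return doc;
    }

    public static void main(String[] args) {
        ParseStandIn parse = parseFilter(new ParseStandIn(), "hello");
        DocStandIn doc = indexingFilter(new DocStandIn(), parse);
        System.out.println(doc.fields);
    }
}
```

The point is that the second step relies on receiving the parse object that the first step wrote to.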

In 2.2.1, we have:

The ParseFilter extension:
  Parse filter(String url, WebPage page, Parse parse,
               HTMLMetaTags metaTags, DocumentFragment doc)
  <http://nutch.apache.org/apidocs-2.2/org/apache/nutch/parse/ParseFilter.html#filter%28java.lang.String,%20org.apache.nutch.storage.WebPage,%20org.apache.nutch.parse.Parse,%20org.apache.nutch.parse.HTMLMetaTags,%20org.w3c.dom.DocumentFragment%29>
The Parse type no longer has a getData() method, so we can't add new metadata to it.

The IndexingFilter extension:
  NutchDocument filter(NutchDocument doc, String url, WebPage page)
  <http://nutch.apache.org/apidocs-2.2/org/apache/nutch/indexer/IndexingFilter.html#filter%28org.apache.nutch.indexer.NutchDocument,%20java.lang.String,%20org.apache.nutch.storage.WebPage%29>
There is no Parse parameter here, so we have no parse data from which to add a field to the NutchDocument.

So what is the new way to add a custom field to the index? Maybe I am missing
something...
Thank you very much!
