Hi Yann,

> In Parse type, we don't have "getData()" so we can't add new metadata.
...
> So what is the new way to add custom field to index ? Maybe i miss
> something ...

In 2.x data for custom fields can be added to the WebPage's metadata
in ParseFilter via

  page.putToMetadata(Utf8 key, ByteBuffer value)

It's then read in IndexingFilter by

  page.getFromMetadata(Utf8 key)

Sebastian

On 04/02/2014 05:42 PM, Yann Levreau wrote:
> Hello,
>
> Maybe this is the wrong place to post a request so forgive me, but I really
> need some help (Nutch 2.2.1) :
>
> I need to add a new field to be indexed by ElasticSearch.
>
> in 1.7, we had :
> The HtmlParseFilter extension with :
> ParseResult filter(Content content, ParseResult parseResult,
>                    HTMLMetaTags metaTags, DocumentFragment doc)
>
> The IndexingFilter extension with :
> NutchDocument filter(NutchDocument doc, Parse parse,
>                      org.apache.hadoop.io.Text url, CrawlDatum datum,
>                      Inlinks inlinks)
>
> All was ok to add field.
>
> in 2.2.1 we have :
> The ParseFilter extension :
> Parse filter(String url, WebPage page, Parse parse,
>              HTMLMetaTags metaTags, DocumentFragment doc)
> In Parse type, we don't have "getData()" so we can't add new metadata.
>
> The IndexingFilter extension :
> NutchDocument filter(NutchDocument doc, String url, WebPage page)
> We don't have Parse type in parameter to add field to NutchDocument type.
>
> So what is the new way to add custom field to index ? Maybe i miss
> something ...
> Thank you very much !
>
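
To illustrate the reply above (store the value with putToMetadata in the parse step, read it back with getFromMetadata in the indexing step), here is a minimal self-contained sketch of the round trip. It deliberately does not use the Nutch classes: FakeWebPage is a hypothetical stand-in for WebPage that takes String keys where the real methods take Utf8, and the key name "my_custom_field" and the helpers parseStep/indexStep are invented for the example. The part that should carry over is how a String is wrapped into a ByteBuffer and decoded again.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class MetadataRoundTrip {

    // Hypothetical stand-in for org.apache.nutch.storage.WebPage.
    // The real class uses Utf8 keys; plain Strings keep this sketch standalone.
    static class FakeWebPage {
        private final Map<String, ByteBuffer> metadata = new HashMap<>();

        void putToMetadata(String key, ByteBuffer value) {
            metadata.put(key, value);
        }

        ByteBuffer getFromMetadata(String key) {
            return metadata.get(key);
        }
    }

    // Hypothetical metadata key shared by both steps.
    static final String KEY = "my_custom_field";

    // What a ParseFilter would do: encode the value and attach it to the page.
    static void parseStep(FakeWebPage page, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        page.putToMetadata(KEY, ByteBuffer.wrap(bytes));
    }

    // What an IndexingFilter would do: fetch the value and decode it.
    static String indexStep(FakeWebPage page) {
        ByteBuffer raw = page.getFromMetadata(KEY);
        if (raw == null) {
            return null; // nothing was stored for this page
        }
        // duplicate() leaves the stored buffer's position untouched,
        // so the value can be read more than once.
        return StandardCharsets.UTF_8.decode(raw.duplicate()).toString();
    }

    public static void main(String[] args) {
        FakeWebPage page = new FakeWebPage();
        parseStep(page, "hello");
        System.out.println(indexStep(page)); // prints "hello"
    }
}
```

In an actual plugin the decoded string would then be added to the NutchDocument handed to the IndexingFilter. The same key has to be used in both filters.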

