Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "WritingPluginExample-1.2" page has been changed by NiccoloBecchi.
http://wiki.apache.org/nutch/WritingPluginExample-1.2?action=diff&rev1=13&rev2=14

--------------------------------------------------

  ## page was copied from WritingPluginExample-0.9
  ## page was renamed from WritingPluginExample-0.8
- Most of the text and original code from this page are originally from 
[[WritingPluginExample|WritingPluginExample]].  It's been updated to work with 
the trunk as of revision 506842, and to add unit testing.
+ Most of the text and original code from this page are originally from 
WritingPluginExample.  It's been updated to work with the trunk as of revision 
506842, and to add unit testing.
  
  == The Example ==
- 
  Consider this as a plugin example: We want to be able to recommend specific 
web pages for given search terms.  For this example we'll assume we're indexing 
this site.  As you may have noticed, there are a number of pages that talk 
about plugins.  What we want to do is have it so that if someone searches for 
the term "plugins" we recommend that they start at the PluginCentral page, but 
we also want to return all the normal hits in the expected ranking.  We'll 
separate the search results page into a section of recommendations and then a 
section with the normal search results.
  
  You go through your site and add meta-tags to pages that list what terms they 
should be recommended for.  The tags look something like this:
@@ -13, +12 @@

  {{{
  <meta name="recommended" content="plugins" />
  }}}
- 
  In order to do this we need to write a plugin that extends 3 different 
extension points.  We need to extend the HTMLParser in order to get the 
recommended terms out of the meta tags.  The !IndexingFilter will need to be 
extended to add a recommended field to the index.  The !QueryFilter needs to be 
extended to add the ability to search against the new field in the index.
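
As a rough illustration of what the HTML parser extension has to pull out of a page, the sketch below extracts the content of a recommended meta tag from raw HTML with a regular expression. It is a standalone sketch, not Nutch code: the class MetaTagSketch and the method findRecommended are hypothetical names, and the real plugin reads the value from the meta tags Nutch has already parsed rather than from raw text.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical standalone helper for illustration; not part of Nutch.
class MetaTagSketch {

  // Matches <meta name="recommended" content="..."> with name before content.
  private static final Pattern RECOMMENDED = Pattern.compile(
      "<meta\\s+name=\"recommended\"\\s+content=\"([^\"]*)\"\\s*/?>",
      Pattern.CASE_INSENSITIVE);

  // Returns the recommended term if the tag is present, otherwise empty.
  static Optional<String> findRecommended(String html) {
    Matcher m = RECOMMENDED.matcher(html);
    return m.find() ? Optional.of(m.group(1)) : Optional.empty();
  }

  public static void main(String[] args) {
    String page =
        "<html><head><meta name=\"recommended\" content=\"plugins\" /></head></html>";
    System.out.println(findRecommended(page).orElse("No Recommendation"));
  }
}
```

A regular expression is only good enough for a quick check like this; attribute order and quoting vary in real pages, which is one reason the plugin works on the parsed meta tags instead.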
  
  == Setup ==
- 
  Start by 
[[http://www.apache.org/dev/version-control.html#anon-svn|downloading]] the 
Nutch source code.  Once you've got that make sure it compiles as is before you 
make any changes.  You should be able to get it to compile by running ant from 
the directory you downloaded the source to.  If you have trouble you can write 
to one of the [[Mailing|Mailing Lists]].
  
  Use the source code for the plugins distributed with Nutch as a reference.  
They're in [!YourCheckoutDir]/src/plugin.
@@ -25, +22 @@

  For the example we're going to assume that this plugin is something we want 
to contribute back to the Nutch community, so we're going to use the 
directory/package structure of "org/apache/nutch".  If you're writing a plugin 
solely for the use of your organization you'd want to replace that with 
something like "org/my_organization/nutch".
  
  == Required Files ==
- 
  You're going to need to create a directory inside of the plugin directory 
with the name of your plugin ('recommended' in this case) and inside that 
directory you need the following:
  
   * A plugin.xml file that tells nutch about your plugin.
   * A build.xml file that tells ant how to build your plugin.
   * The source code of your plugin in the directory structure 
recommended/src/java/org/apache/nutch/parse/recommended/[Source_Here].
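
The layout listed above can be created by hand; for illustration, the following sketch builds the same skeleton with java.nio.file. The class PluginSkeletonSketch and the method createSkeleton are hypothetical names, and the plugin.xml and build.xml files it creates are empty placeholders to be filled in as described in the following sections.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of Nutch.
class PluginSkeletonSketch {

  // Creates <pluginRoot>/recommended with the source tree and two empty files.
  static Path createSkeleton(Path pluginRoot) {
    Path pluginDir = pluginRoot.resolve("recommended");
    Path sourceDir =
        pluginDir.resolve("src/java/org/apache/nutch/parse/recommended");
    try {
      Files.createDirectories(sourceDir);
      for (String name : new String[] {"plugin.xml", "build.xml"}) {
        Path file = pluginDir.resolve(name);
        if (Files.notExists(file)) {
          Files.createFile(file); // placeholder, content comes later
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return pluginDir;
  }

  public static void main(String[] args) throws IOException {
    // A temporary directory stands in for [YourCheckoutDir]/src/plugin.
    Path root = Files.createTempDirectory("plugin-sketch");
    System.out.println("Created " + createSkeleton(root));
  }
}
```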
  
- 
  == Plugin.xml ==
- 
  Your plugin.xml file should look like this:
  
  {{{
@@ -74, +68 @@

     <!-- The RecommendedQueryFilter gets called when you perform a search. It 
runs a
          search for the user's query against the recommended fields.  In order 
to
          add this to the list of filters that gets run by default, you have to 
use
-         "fields=DEFAULT". -->   
+         "fields=DEFAULT". -->
     <extension id="org.apache.nutch.parse.recommended.recommendedSearcher"
                name="Recommended Search Query Filter"
                point="org.apache.nutch.searcher.QueryFilter">
        <implementation id="RecommendedQueryFilter"
                        
class="org.apache.nutch.parse.recommended.RecommendedQueryFilter">
-       <parameter name="fields" value="recommended"/>
+         <parameter name="fields" value="recommended"/>
-       </implementation>
+         </implementation>
     </extension>
  
  </plugin>
  }}}
- 
  == Build.xml ==
- 
  In its simplest form:
  
  {{{
@@ -100, +92 @@

  
  </project>
  }}}
- 
  For Nutch-1.0 write the following:
  
  {{{
@@ -109, +100 @@

  <project name="recommended" default="jar-core">
  
    <import file="../build-plugin.xml"/>
-   
+ 
   <!-- Build compilation dependencies -->
   <target name="deps-jar">
     <ant target="jar" inheritall="false" dir="../lib-xml"/>
@@ -129, +120 @@

     <ant target="deploy" inheritall="false" dir="../protocol-file"/>
   </target>
  
-  
+ 
    <!-- for junit test -->
    <mkdir dir="${build.test}/data"/>
    <copy file="data/recommended.html" todir="${build.test}/data"/>
  </project>
  }}}
- 
  Save this file in directory [!YourCheckoutDir]/src/plugin/recommended
  
  == The HTML Parser Extension ==
- 
  NOTE: Nutch-1.0 users make sure that you save all your java files in this 
directory 
C:\nutch-1.0\src\plugin\recommended\src\java\org\apache\nutch\parse\recommended
  
  This is the source code for the HTML Parser extension.  It tries to grab the 
contents of the recommended meta tag and add them to the document being parsed. 
In that directory, create a file called RecommendedParser.java and add this as 
the contents:
@@ -157, +146 @@

  import org.apache.nutch.parse.HTMLMetaTags;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.parse.HtmlParseFilter;
+ import org.apache.nutch.parse.ParseResult;
  import org.apache.nutch.protocol.Content;
  
  // Commons imports
@@ -169, +159 @@

  public class RecommendedParser implements HtmlParseFilter {
  
    private static final Log LOG = 
LogFactory.getLog(RecommendedParser.class.getName());
-   
+ 
    private Configuration conf;
  
    /** The Recommended meta data attribute name */
@@ -178, +168 @@

    /**
     * Scan the HTML document looking for a recommended meta tag.
     */
-   public Parse filter(Content content, Parse parse, 
+   public ParseResult filter(Content content, ParseResult parseResult,
-     HTMLMetaTags metaTags, DocumentFragment doc) {
+       HTMLMetaTags metaTags, DocumentFragment doc) {
-     // Trying to find the document's recommended term
+ 
      String recommendation = null;
  
      Properties generalMetaTags = metaTags.getGeneralTags();
@@ -192, +182 @@

          }
      }
  
+     Parse parse = parseResult.get(content.getUrl());
+ 
      if (recommendation == null) {
          LOG.info("No Recommendation");
      } else {
@@ -199, +191 @@

          parse.getData().getContentMeta().set(META_RECOMMENDED_NAME, 
recommendation);
      }
  
-     return parse;
+     return parseResult;
-   }
-   
+   }
-   
+ 
+ 
    public void setConf(Configuration conf) {
      this.conf = conf;
    }
  
    public Configuration getConf() {
      return this.conf;
-   }  
+   }
+ 
  }
  }}}
- 
  == The Indexer Extension ==
- 
  The following is the code for the Indexing Filter extension.  If the document 
being indexed had a recommended meta tag this extension adds a lucene text 
field to the index called "recommended" with the content of that meta tag. 
Create a file called RecommendedIndexer.java in the source code directory:
  
  {{{
@@ -233, +224 @@

  import org.apache.nutch.fetcher.FetcherOutput;
  import org.apache.nutch.indexer.IndexingFilter;
  import org.apache.nutch.indexer.IndexingException;
+ import org.apache.nutch.indexer.NutchDocument;
  import org.apache.nutch.parse.Parse;
  
  import org.apache.hadoop.conf.Configuration;
@@ -241, +233 @@

  import org.apache.nutch.crawl.Inlinks;
  
  // Lucene imports
+ import org.apache.nutch.indexer.lucene.LuceneWriter;
+ import org.apache.nutch.indexer.lucene.LuceneWriter.INDEX;
+ import org.apache.nutch.indexer.lucene.LuceneWriter.STORE;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.Document;
  
  public class RecommendedIndexer implements IndexingFilter {
-     
+ 
    public static final Log LOG = 
LogFactory.getLog(RecommendedIndexer.class.getName());
-   
+ 
    private Configuration conf;
-   
+ 
    public RecommendedIndexer() {
    }
  
-   public Document filter(Document doc, Parse parse, Text url, 
+   public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
+       CrawlDatum datum, Inlinks inlinks) throws IndexingException {
-     CrawlDatum datum, Inlinks inlinks)
-     throws IndexingException {
  
      String recommendation = parse.getData().getMeta("Recommended");
  
-         if (recommendation != null) {
+     if (recommendation != null) {
-             Field recommendedField = 
+         //Field recommendedField =
-                 new Field("recommended", recommendation, 
+         //   new Field("recommended", recommendation,
-                     Field.Store.YES, Field.Index.UN_TOKENIZED);
+         //        Field.Store.YES, Field.Index.UN_TOKENIZED);
-             recommendedField.setBoost(5.0f);
+         //recommendedField.setBoost(5.0f);
-             doc.add(recommendedField);
+         doc.add("recommended", recommendation);
-             LOG.info("Added " + recommendation + " to the recommended Field");
+         LOG.info("Added " + recommendation + " to the recommended Field");
-         }
+     }
  
      return doc;
-   }
+ 
-   
+   }
+ 
    public void setConf(Configuration conf) {
      this.conf = conf;
    }
  
    public Configuration getConf() {
      return this.conf;
-   }  
+   }
+ 
+ 
+ 
+   public void addIndexBackendOptions(Configuration conf)
+   {
+     LuceneWriter.addFieldOptions(
+         "recommended", STORE.YES, INDEX.UNTOKENIZED, conf);
+   }
+ 
  }
  }}}
- 
- Note that the field is UN_TOKENIZED because we don't want the recommended tag 
to be cut up by a tokenizer. Change to TOKENIZED if you want to be able to 
search on parts of the tag, for example to put multiple recommended terms in 
one tag.  
+ Note that the field is UN_TOKENIZED because we don't want the recommended tag 
to be cut up by a tokenizer. Change to TOKENIZED if you want to be able to 
search on parts of the tag, for example to put multiple recommended terms in 
one tag.
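
To make the tokenization trade-off concrete, here is a small standalone sketch. It does not use Lucene: the class TokenizationSketch and both methods are hypothetical, and lowercasing plus splitting on whitespace only approximates what a real analyzer does. It shows why a tag holding several terms can only be matched term by term when the field is tokenized.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical illustration; real matching is done by Lucene, not this class.
class TokenizationSketch {

  // Untokenized: the whole field value is a single term, so only an
  // exact match on the complete value hits.
  static boolean matchesUntokenized(String fieldValue, String queryTerm) {
    return fieldValue.equals(queryTerm);
  }

  // Tokenized (approximation): lowercase, split on whitespace, and let
  // any single token match the query term.
  static boolean matchesTokenized(String fieldValue, String queryTerm) {
    String needle = queryTerm.toLowerCase(Locale.ROOT);
    String[] tokens = fieldValue.toLowerCase(Locale.ROOT).trim().split("\\s+");
    return Arrays.asList(tokens).contains(needle);
  }

  public static void main(String[] args) {
    String tag = "plugins extensions"; // two recommended terms in one tag
    System.out.println(matchesUntokenized(tag, "plugins")); // false
    System.out.println(matchesTokenized(tag, "plugins"));   // true
  }
}
```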
  
  == The QueryFilter ==
- 
  The QueryFilter gets called when the user does a search.  We're bumping up 
the boost for the recommended field in order to increase its influence on the 
search results.
  
  {{{
@@ -306, +308 @@

          super("recommended", 5f);
          LOG.info("Added a recommended query");
      }
-   
+ 
  }
  }}}
- 
  == Compiling the plugin ==
- 
  To install ant on Windows, see the installation manual for 
[[http://ant.apache.org/manual/install.html|ant]]
  
  In order to build the plugin - or Nutch itself - you'll need ant.  If you're 
using Mac OS you can easily get it via [[http://fink.sourceforge.net/|fink]].  
Let's get junit while we're at it.
@@ -319, +319 @@

  {{{
  fink install ant ant-junit junit
  }}}
- 
  In order to build it, change to your plugin's directory where you saved the 
build.xml file (probably [!YourCheckoutDir]/src/plugin/recommended), and simply 
type
  
  {{{
  ant
  }}}
- 
  Hopefully you'll get a long string of text, followed by a message telling you 
of a successful build.
  
  === Getting Ant to Compile Your Plugin ===
- 
  In order for ant to compile and deploy your plugin on the global build you 
need to edit the src/plugin/build.xml file (NOT the build.xml in the root of 
your checkout directory). You'll see a number of lines that look like
+ 
  {{{
    <ant dir="[plugin-name]" target="deploy" />
  }}}
- 
- Edit this block to add a line for your plugin before the </target> tag. 
+ Edit this block to add a line for your plugin before the </target> tag.
  
  {{{
    <ant dir="recommended" target="deploy" />
  }}}
- 
  Running 'ant' in the root of your checkout directory should get everything 
compiled and packaged into jars.  The next time you run a crawl your parser and index 
filter should get used.
  
  You'll need to run 'ant war' to compile a new ROOT.war file.  Once you've 
deployed that, your query filter should get used when searches are performed.
  
  == Unit testing ==
- 
  We'll need to create two files for unit testing:  a page we'll do the testing 
against, and a class to do the testing with.  Again, let's assume your plugin 
directory is [!YourCheckoutDir]/src/plugin and that your test plugin is under 
that directory.  Create directory recommended/data, and under it make a new 
file called recommended.html
  
  {{{
@@ -366, +361 @@

  </body>
  </html>
  }}}
- 
  This file contains the meta tag we're currently parsing for, with the value 
'''recommended-content'''.  After that gratuitous bit of free publicity for my 
current favorite editor, let's move on to the testing class.
  
  Create a new tree structure, this time for the test code, for example 
recommended/src/test/org/apache/nutch/parse/recommended/[Test_Source_Here].  
There you'll create a file called TestRecommendedParser.java.
@@ -376, +370 @@

  
  import org.apache.nutch.metadata.Metadata;
  import org.apache.nutch.parse.Parse;
+ import org.apache.nutch.parse.ParseResult;
  import org.apache.nutch.parse.ParseUtil;
  import org.apache.nutch.protocol.Content;
  import org.apache.hadoop.conf.Configuration;
@@ -388, +383 @@

  import junit.framework.TestCase;
  
  /*
-  * Loads test page recommended.html and verifies that the recommended 
+  * Loads test page recommended.html and verifies that the recommended
   * meta tag has recommended-content as its value.
   *
   */
  public class TestRecommendedParser extends TestCase {
  
    private static final File testDir =
-     new File(System.getProperty("test.data"));
+     //new File(System.getProperty("test.data"));
+     new File("/work/nutch-1.2/src/plugin/recommended/data");
  
    public void testPages() throws Exception {
      pageTest(new File(testDir, "recommended.html"), "http://foo.com/",
@@ -421, +417 @@

  
      Content content =
        new Content(url, url, bytes, contentType, new Metadata(), conf);
-     Parse parse = new 
ParseUtil(conf).parseByExtensionId("parse-html",content);
+     ParseResult parseResult = new 
ParseUtil(conf).parseByExtensionId("parse-html",content);
- 
-     Metadata metadata = parse.getData().getContentMeta();
+     Metadata metadata = parseResult.get(url).getData().getContentMeta();
      assertEquals(recommendation, metadata.get("Recommended"));
      assertFalse("somesillycontent".equals(metadata.get("Recommended")));
    }
  }
  }}}
- 
- As you can see, this code first parses the document, looks for the 
'''Recommended''' item in the object contentMeta - which we saved on 
RecommendedParser - and verifies that it's set to value 
'''recommended-content'''.  
+ As you can see, this code first parses the document, looks for the 
'''Recommended''' item in the object contentMeta - which we saved on 
RecommendedParser - and verifies that it's set to value 
'''recommended-content'''.
  
  Now add some lines to the build.xml file located in 
[!YourCheckoutDir]/src/plugin/recommended directory, so that at a minimum its 
contents are:
+ 
  {{{
  <?xml version="1.0"?>
  
@@ -448, +443 @@

  }}}
  These lines will copy the test data to the proper directory for testing.
  
- To run the test case, simply move back to your plugin's root directory and 
execute
+ To run the test case, move back to your plugin's root directory (under 
src/plugin) and execute:
  
  {{{
  ant test
  }}}
- 
+ To debug this code in Eclipse, you must create a recommended directory 
under the main plugins directory of your Nutch installation and put the 
plugin.xml file there (to run the plugin, the .jar file is needed there too).
  
  == Getting Nutch to Use Your Plugin ==
- 
  In order to get Nutch to use your plugin, you need to edit your 
conf/nutch-site.xml file and add in a block like this:
  
  {{{
@@ -471, +465 @@

    </description>
  </property>
  }}}
- 
  You'll want to edit the regular expression so that it includes the id of your 
plugin.
  
  {{{
    
<value>recommended|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  }}}
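
Before starting a crawl, the edited expression can be checked against the plugin ids you expect to load. The sketch below assumes that Nutch tests each plugin id against the expression as a whole-string match; that behaviour is an assumption here, and the class PluginIncludesSketch and the method isIncluded are hypothetical names, not Nutch API.

```java
import java.util.regex.Pattern;

// Hypothetical check for illustration; not part of Nutch.
class PluginIncludesSketch {

  // True if the whole plugin id matches the plugin.includes expression.
  static boolean isIncluded(String pluginIncludes, String pluginId) {
    return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
  }

  public static void main(String[] args) {
    String includes = "recommended|protocol-http|urlfilter-regex"
        + "|parse-(text|html|js)|index-basic|query-(basic|site|url)"
        + "|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)";
    for (String id : new String[] {"recommended", "parse-html", "parse-pdf"}) {
      System.out.println(id + " included: " + isIncluded(includes, id));
    }
  }
}
```

With the value shown above, recommended and parse-html are included and parse-pdf is not.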
- 
- 
  <<< See also: HowToContribute
  
  <<< PluginCentral
