Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "WritingPluginExample" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/WritingPluginExample?action=diff&rev1=15&rev2=16

  This plugin example focuses on the urlmeta plugin which is packaged 
with Nutch-1.3. It aims to provide a comprehensive introduction to plugin 
development for Apache Nutch.
  
  == The Example ==
- Consider this as a plugin example: We want to be able to recommend specific 
web pages for given search terms.  For this example we'll assume we're crawling 
this site with Nutch and indexing it with Apache Solr. As you may have noticed, 
there are a number of pages that talk about plugins. If someone searches for 
the term "plugin", we want the first hit returned to be the Nutch PluginCentral 
page, however we also want to return all the normal hits in the expected 
ranking.
+ Consider this as a plugin example: We want to be able to recommend specific 
web pages for given search terms.  For this example we'll assume we're crawling 
this site with Nutch and indexing it with Apache Solr. As you may have noticed, 
there are a number of pages that talk about plugins. If someone searches for 
the term "plugin", we want the first hit returned to be the Nutch PluginCentral 
page, however we also want to return all the normal hits in the expected 
ranking. 
+ 
+ This is where we find a use for the urlmeta plugin. It is designed to enhance 
the original [[https://issues.apache.org/jira/browse/NUTCH-655|NUTCH-655 
patch]], by doing two things: 
+  1. Meta Tags that are supplied with your Crawl URLs, during injection, will 
be propagated throughout the out-links of those Crawl URLs 
+  2. When you index your URLs, the meta tags that you specified with your URLs 
will be indexed alongside those URLs--and can be directly queried, assuming you 
have done everything else correctly.
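
As a rough illustration of the first point, the sketch below splits a seed line into its URL and the meta tags supplied with it. It assumes the tags follow the URL as tab-separated key=value pairs; the class name SeedLineParser, the Seed holder and the sample key "recommended" are hypothetical names chosen for this example. It is plain Java and does not call any Nutch API, so treat it as a picture of the data involved rather than of how the injector or urlmeta is implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SeedLineParser {

    /** Hypothetical holder for a seed URL and the meta tags supplied with it. */
    public static final class Seed {
        public final String url;
        public final Map<String, String> meta;

        Seed(String url, Map<String, String> meta) {
            this.url = url;
            this.meta = meta;
        }
    }

    /**
     * Splits a seed line of the assumed form
     * "url<TAB>key=value<TAB>key=value" into its URL and meta tags.
     */
    public static Seed parse(String line) {
        String[] parts = line.split("\t");
        String url = parts[0].trim();
        Map<String, String> meta = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq <= 0) {
                continue; // skip fields that are not key=value pairs
            }
            String key = parts[i].substring(0, eq).trim();
            String value = parts[i].substring(eq + 1).trim();
            meta.put(key, value);
        }
        return new Seed(url, meta);
    }

    public static void main(String[] args) {
        Seed seed = parse("http://wiki.apache.org/nutch/PluginCentral\trecommended=plugins");
        // prints: http://wiki.apache.org/nutch/PluginCentral -> {recommended=plugins}
        System.out.println(seed.url + " -> " + seed.meta);
    }
}
```

Each key/value pair recovered this way corresponds to one meta tag that would travel with the URL and, per the second point, end up as a queryable field in the index.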
  
  In order to do this we go through our site and add meta-tags to pages that 
list what terms they should be recommended for. The tags look something like 
this:
  
  {{{
  <meta name="recommended" content="plugins" />
  }}}
+ 
  In order to do this we need to write a plugin that extends 3 different 
extension points.  We need to extend the HTMLParser (which in turn extends the 
[[http://nutch.apache.org/apidocs-1.3/org/apache/nutch/parse/Parser.html|Parser]]
 class) in order to get the recommended terms out of the meta tags.  The 
[[http://nutch.apache.org/apidocs-1.3/org/apache/nutch/indexer/IndexingFilter.html|IndexingFilter]]
 will need to be extended to add a recommended field to the index.  Finally we 
need to add the new field to our Nutch schema.xml which will add the ability to 
search against the new field in the index.
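
To make the parsing step more concrete, here is a minimal sketch of pulling the recommended terms out of a page's meta tags. It is an assumption-laden stand-in: a real plugin would work on the parsed document that Nutch hands to the HTML parser extension, whereas this example uses a regular expression on raw HTML purely to stay self-contained. The class name RecommendedMetaExtractor and the method extract are hypothetical, and the pattern only handles tags written with name before content, as in the example tag above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RecommendedMetaExtractor {

    // Matches <meta name="recommended" content="..."/> with the attributes in
    // exactly this order; other attribute orders are not handled by this sketch.
    private static final Pattern META = Pattern.compile(
            "<meta\\s+name=\"recommended\"\\s+content=\"([^\"]*)\"\\s*/?>",
            Pattern.CASE_INSENSITIVE);

    /** Returns the content value of every "recommended" meta tag found in the HTML. */
    public static List<String> extract(String html) {
        List<String> terms = new ArrayList<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            String term = m.group(1).trim();
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<meta name=\"recommended\" content=\"plugins\" />"
                + "</head><body>Plugin docs</body></html>";
        System.out.println(extract(html)); // prints [plugins]
    }
}
```

The terms collected here are what the indexing side would then write into the recommended field, so that a search for "plugins" can match this page through that field.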
  
  == Setup ==
  Start by 
[[http://svn.apache.org/repos/asf/nutch/tags/release-1.3/|downloading]] the 
Nutch-1.3 source code.  Once you've got that make sure it compiles as is before 
you decide to make any changes.  You should be able to get it to compile by 
running ant from the directory you downloaded the source to.  If you have 
trouble you can write to one of the [[Mailing|Mailing Lists]].
  
- Use the source code for the plugins distrubuted with Nutch as a reference.  
They're in [!YourCheckoutDir]/src/plugin.
+ Use the source code for the plugins distributed with Nutch as a reference. 
They're in $NUTCH_HOME/src/plugin. In particular we focus on the urlmeta plugin 
within this example.
  
- For the example we're going to assume that this plugin is something we want 
to contribute back to the Nutch community, so we're going to use the 
directory/package structure of "org/apache/nutch".  If you're writing a plugin 
solely for the use of your organization you'd want to replace that with 
something like "org/my_organization/nutch".
+ For the example we're going to assume that this plugin is something we want 
to contribute back to the Nutch community, so we're going to use the 
directory/package structure of "org/apache/nutch".  If you're writing a plugin 
solely for the use of your organisation you'd want to replace that with 
something like "org/my_organization/nutch". If you look at the structure of the 
urlmeta plugin you will see it follows this convention, e.g. 
org.apache.nutch.indexer and org.apache.nutch.scoring.
  
  == Required Files ==
- You're going to need to create a directory inside of the plugin directory 
with the name of your plugin ('recommended' in this case) and inside that 
directory you need the following:
+ This section covers the integral components required to develop and use a 
plugin. As you can see inside the $NUTCH_HOME/src/plugin directory, the plugin 
folder urlmeta contains the following:
  
-  * A plugin.xml file that tells nutch about your plugin.
+  * A plugin.xml file that tells Nutch about your plugin.
   * A build.xml file that tells ant how to build your plugin.
   * The source code of your plugin in the directory structure 
recommended/src/java/org/apache/nutch/parse/recommended/[Source_Here].
  
