[Nutch Wiki] Update of "WritingPlugins" by JakeVanderdray

Apache Wiki Sun, 29 Jan 2006 10:06:26 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The following page has been changed by JakeVanderdray:
http://wiki.apache.org/nutch/WritingPlugins

------------------------------------------------------------------------------
   * 
[http://lucene.apache.org/nutch/apidocs/org/apache/nutch/net/URLFilter.html 
URLFilter] -- URLFilter implementations limit the URLs that nutch attempts to 
fetch.  The 
[http://lucene.apache.org/nutch/apidocs/org/apache/nutch/net/RegexURLFilter.html
 RegexURLFilter] distributed with Nutch provides a great deal of control over 
which URLs Nutch crawls.  However, if you have very complicated rules about 
which URLs you want to crawl, you can write your own implementation.
   * 
[http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/analysis/NutchAnalyzer.java?view=markup
 NutchAnalyzer] -- An extension point that provides some language specific 
analyzers (see MultiLingualSupport proposal). ''Since it is in development 
stage, it is not in released javadoc''.
  
- == The Example ==
- 
- Consider this as a plugin example: We want to be able to recommend specific 
web pages for given search terms.  For this example we'll assume we're indexing 
this site.  As you may have noticed, there are a number of pages that talk 
about plugins.  What we want to do is have it so that if someone searches for 
the term "plugin" we recommend that they start at the PluginCentral page, but 
we also want to return all the normal hits in the expected ranking.  We'll 
separate the search results page into a section of recommendations and then a 
section with the normal search results.
- 
- You go through your site and add meta-tags to pages that list what terms they 
should be recommended for.  The tags look something like this:
- 
- {{{
- <meta name="recommended" content="plugins" />
- }}}
- 
- In order to do this we need to write a plugin that extends 3 different 
extension points.  We need to extend the HTMLParser in order to get the 
recommended terms out of the meta tags.  The !IndexingFilter will need to be 
extended to add a recommended field to the index.  The !QueryFilter needs to be 
extended to add the ability to search against the new field in the index.
- 
  == Setup ==
  
  Start by [http://www.apache.org/dev/version-control.html#anon-svn 
downloading] the Nutch source code.  Once you've got that make sure it compiles 
as is before you make any changes.  You should be able to get it to compile by 
running ant from the directory you downloaded the source to.  If you have 
trouble you can write to one of the [wiki:Mailing Mailing Lists].
  
  Use the source code for the plugins distributed with Nutch as a reference.  
They're in [!YourCheckoutDir]/src/plugin.
  
- For the example we're going to assume that this plugin is something we want 
to contribute back to the Nutch community, so we're going to use the 
directory/package structure of "org/apache/nutch".  If you're writing a plugin 
solely for the use of your organization you'd want to replace that with 
something like "org/my_organization/nutch".
- 
  == Required Files ==
  
- You're going to need to create a directory inside of the plugin directory 
with the name of your plugin ('recommended' in this case) and inside that 
directory you need the following:
+ You're going to need to create a directory inside of the plugin directory 
with the name of your plugin. Inside that directory you need the following:
  
   * A plugin.xml file that tells nutch about your plugin.
   * A build.xml file that tells ant how to build your plugin.
-  * The source code of your plugin in the directory structure 
recommended/src/java/org/apache/nutch/parse/recommended/[Source_Here].
+  * The source code of your plugin.
  
  == Plugin.xml ==
  
+ The plugin.xml file describes your plugin, including the names of your 
extensions and specifically what they're extending.
- Your plugin.xml file should look something like this:
- 
- {{{
- <?xml version="1.0" encoding="UTF-8"?>
- <plugin
-    id="recommended"
-    name="Recommended Parser/Filter"
-    version="0.0.1"
-    provider-name="nutch.org">
- 
-     <runtime>
-       <library name="recommended.jar">
-          <export name="*"/>
-       </library>
-    </runtime>
- 
-    <extension id="org.apache.nutch.parse.recommended.recommendedfilter"
-               name="Recommended Parser"
-               point="org.apache.nutch.parse.HtmlParseFilter">
-       <implementation id="RecommendedParser"
-                       
class="org.apache.nutch.parse.recommended.RecommendedParser"/>
-    </extension>
- 
-    <extension id="org.apache.nutch.parse.recommended.recommendedindexer"
-               name="Recommended identifier filter"
-               point="org.apache.nutch.indexer.IndexingFilter">
-       <implementation id="RecommendedIndexer"
-                       
class="org.apache.nutch.parse.recommended.RecommendedIndexer"/>
-    </extension>
- 
-    <extension id="org.apache.nutch.parse.recommended.recommendedSearcher"
-               name="Recommended Search Query Filter"
-               point="org.apache.nutch.searcher.QueryFilter">
-       <implementation id="RecommendedQueryFilter"
-                       
class="org.apache.nutch.parse.recommended.RecommendedQueryFilter"
-                       raw-fields="recommended"/>
-    </extension>
- 
- </plugin>
- }}}
  
  == Build.xml ==
  
+ The build.xml file tells ant how to build your plugin.
- In its simplest form:
- 
- {{{
- <?xml version="1.0"?>
- 
- <project name="recommended" default="jar">
- 
-   <import file="../build-plugin.xml"/>
- 
- </project>
- }}}
- 
- == The HTML Parser Extension ==
- 
- This is the source code for the HTML Parser extension.  It tries to grab the 
contents of the recommended meta tag and add them to the document being parsed.
- 
- {{{
- package org.apache.nutch.parse.recommended;
- 
- // JDK imports
- import java.util.Enumeration;
- import java.util.Properties;
- import java.util.logging.Logger;
- 
- // Nutch imports
- import org.apache.nutch.parse.HTMLMetaTags;
- import org.apache.nutch.parse.Parse;
- import org.apache.nutch.parse.HtmlParseFilter;
- import org.apache.nutch.protocol.Content;
- import org.apache.nutch.util.LogFormatter;
- 
- // DOM import
- import org.w3c.dom.DocumentFragment;
- 
- public class RecommendedParser implements HtmlParseFilter {
- 
-   private static final Logger LOG = LogFormatter
-     .getLogger(RecommendedParser.class.getName());
- 
-   /** The Recommended meta data attribute name */
-   public static final String META_RECOMMENDED_NAME="Recommended";
- 
-   /**
-    * Scan the HTML document looking for a recommended meta tag.
-    */
-   public Parse filter(Content content, Parse parse, HTMLMetaTags metaTags,
-                       DocumentFragment doc) {
-     // Try to find the document's recommended term
-     String recommendation = null;
- 
-     Properties generalMetaTags = metaTags.getGeneralTags();
- 
-     for (Enumeration tagNames = generalMetaTags.propertyNames();
-          tagNames.hasMoreElements(); ) {
-       if (tagNames.nextElement().equals("recommended")) {
-         recommendation = generalMetaTags.getProperty("recommended");
-         LOG.info("Found a Recommendation for " + recommendation);
-       }
-     }
- 
-     if (recommendation == null) {
-       LOG.info("No Recommendation");
-     } else {
-       LOG.info("Adding Recommendation for " + recommendation);
-       parse.getData().getMetadata().put(META_RECOMMENDED_NAME,
-                                         recommendation);
-     }
- 
-     return parse;
-   }
- }
- }}}
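The tag lookup in the parser above can be tried outside Nutch. This is a minimal sketch, assuming only what the listing shows: that the general meta tags arrive as a java.util.Properties keyed by the lower-case tag name. The class RecommendedLookup and the method findRecommendation are hypothetical names used for illustration and are not part of Nutch.

```java
import java.util.Properties;

public class RecommendedLookup {

    // Hypothetical helper, not a Nutch API: returns the recommended term,
    // or null when no "recommended" meta tag was collected.
    static String findRecommendation(Properties generalMetaTags) {
        // getProperty returns null for a missing key, so scanning
        // propertyNames() first is not needed to get the same result.
        return generalMetaTags.getProperty("recommended");
    }

    public static void main(String[] args) {
        // Stand-in for the tags parsed from
        // <meta name="recommended" content="plugins" />
        Properties tags = new Properties();
        tags.setProperty("recommended", "plugins");

        System.out.println(findRecommendation(tags));
        System.out.println(findRecommendation(new Properties()));
    }
}
```

A direct getProperty call gives the same value as the enumeration loop in the listing, with null standing for "no recommendation".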
- 
- == The Indexer Extension ==
- 
- The following is the code for the Indexing Filter extension.  If the document 
being indexed had a recommended meta tag this extension adds a lucene text 
field to the index called "recommended" with the content of that meta tag.
- 
- {{{
- package org.apache.nutch.parse.recommended;
- 
- // JDK import
- import java.util.logging.Logger;
- 
- // Nutch imports
- import org.apache.nutch.util.LogFormatter;
- import org.apache.nutch.fetcher.FetcherOutput;
- import org.apache.nutch.indexer.IndexingFilter;
- import org.apache.nutch.indexer.IndexingException;
- import org.apache.nutch.parse.Parse;
- 
- // Lucene imports
- import org.apache.lucene.document.Field;
- import org.apache.lucene.document.Document;
- 
- public class RecommendedIndexer implements IndexingFilter {
-   public static final Logger LOG
-     = LogFormatter.getLogger(RecommendedIndexer.class.getName());
-   
-   public RecommendedIndexer() {
-   }
- 
-   public Document filter(Document doc, Parse parse, FetcherOutput fo)
-     throws IndexingException {
- 
-     String recommendation = parse.getData().get("Recommended");
- 
-     if (recommendation != null) {
-       Field recommendedField = Field.Text("recommended", recommendation);
-       recommendedField.setBoost(5.0f);
-       doc.add(recommendedField);
-       LOG.info("Added " + recommendation + " to the recommended Field");
-     }
- 
-     return doc;
-   }
- }
- }}}
- 
- == The QueryFilter ==
- 
- [Needs to be added]
  
  == Getting Nutch to Use Your Plugin ==
  
- In order to get Nutch to use your plugin, you need to edit your 
conf/nutch-site.xml file and add in a block like this:
+ In order to get Nutch to use your plugin, you need to edit your 
conf/nutch-site.xml file and add the name of your plugin to the list of 
plugin.includes.
  
+ == Compiling ==
- {{{
- <property>
-   <name>plugin.includes</name>
-   
<value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
-   <description>Regular expression naming plugin directory names to
-   include.  Any plugin not matching this expression is excluded.
-   In any case you need at least include the nutch-extensionpoints plugin. By
-   default Nutch includes crawling just HTML and plain text via HTTP,
-   and basic indexing and search plugins.
-   </description>
- </property>
- }}}
- 
- You'll want to edit the regular expression so that it includes the name of 
your plugin.
- 
- {{{
-   
<value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|recommended</value>
- }}}
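Before running a crawl, the edited expression can be checked with plain java.util.regex. This is a rough sketch, assuming that a plugin is included when its whole directory name matches the expression, as the property description suggests. The class PluginIncludesCheck and the method isIncluded are hypothetical names for illustration, not Nutch code.

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {

    // The plugin.includes value from the example above, with "recommended" appended.
    static final String INCLUDES =
        "nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)"
        + "|index-basic|query-(basic|site|url)|recommended";

    // Assumption: inclusion means the entire plugin name matches the expression.
    static boolean isIncluded(String includes, String pluginName) {
        return Pattern.compile(includes).matcher(pluginName).matches();
    }

    public static void main(String[] args) {
        String[] names = {"recommended", "parse-html", "parse-pdf"};
        for (String name : names) {
            System.out.println(name + " -> " + isIncluded(INCLUDES, name));
        }
    }
}
```

Under that assumption, "recommended" and "parse-html" match while "parse-pdf" does not, because each alternative has to match the full name.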
  
  Running 'ant' in the root of your checkout directory should get everything 
compiled and packaged into jar files.  The next time you run a crawl your 
parser and index filter should get used.
+ 
+ == Deploying ==
  
  You'll need to run 'ant war' to compile a new ROOT.war file.  Once you've 
deployed that, your query filter should get used when searches are performed.
