Follow these steps for nutch-0.7.2:

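(1) Edit the plugin.includes property in nutch-default.xml.
For example, to enable a parser plugin for the "doc" file type, change the
parse entry in the <value> node to "parse-(text|html|doc)", as shown below.
The expression is matched against plugin directory names, so "doc" here
selects a plugin directory named parse-doc.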

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|doc)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
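
Since the original question was about conf/nutch-site.xml: the same property can be overridden there instead of editing nutch-default.xml. This is only a rough sketch. I am assuming that 0.7.x uses <nutch-conf> as the root element (copy whatever root element your own nutch-default.xml uses) and that your plugins/ directory contains parse-msword and parse-pdf; check the directory names before relying on them.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: values set here override nutch-default.xml -->
<!-- Assumption: <nutch-conf> is the root element in 0.7.x; use the same
     root element as your nutch-default.xml if it differs. -->
<nutch-conf>
  <property>
    <name>plugin.includes</name>
    <!-- Assumption: plugins/parse-msword and plugins/parse-pdf exist.
         Each name in the alternation must match a plugin directory. -->
    <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
  </property>
</nutch-conf>
```

Only the properties you want to change need to appear in this file; everything else keeps its value from nutch-default.xml.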

(2) Next, develop a parser plugin for the file type in question. The
parser class needs to implement the "Parser" interface from the
org.apache.nutch.parse package in Nutch.

More details can be found at the following link:
http://wiki.apache.org/nutch/WritingPluginExample

(3) Modify the plugin.xml. The link above describes everything in detail.
Here is an example plugin.xml I wrote for an XHTML parser. Note the
"contentType" attribute, which must match the content (MIME) type of the
files you are trying to parse.

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="parse-xhtml" name="Xhtml Parse Plug-in" version="1.0.0"
provider-name="dessci.com">

   <runtime>
      <library name="parse-xhtml.jar">
         <export name="*"/>
      </library>
      <library name="nekohtml-0.9.4.jar"/>
      <library name="tagsoup-1.0rc3.jar"/>
   </runtime>

   <extension id="com.dessci.search.nutch.parse.xhtml"
              name="XhtmlParse"
              point="org.apache.nutch.parse.Parser">

      <implementation id="com.dessci.search.nutch.parse.xhtml.XhtmlParser"
                      class="com.dessci.search.nutch.parse.xhtml.XhtmlParser"
                      contentType="application/xhtml+xml"
                      pathSuffix=""/>

   </extension>

</plugin>
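
For the "doc" case from step (1), the same layout might look roughly like the sketch below. Everything named here is a placeholder of mine, not something that ships with Nutch: the plugin id parse-doc, the jar name, the provider, and the org.example class names are all hypothetical. The only fixed piece is application/msword, which is the standard MIME type for Word documents. Setting pathSuffix to "doc" is my assumption about how you would want files matched by extension.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical plugin: id, jar name, provider and class names are placeholders -->
<plugin id="parse-doc" name="Doc Parse Plug-in" version="1.0.0"
        provider-name="example.org">

   <runtime>
      <library name="parse-doc.jar">
         <export name="*"/>
      </library>
   </runtime>

   <extension id="org.example.nutch.parse.doc"
              name="DocParse"
              point="org.apache.nutch.parse.Parser">

      <!-- contentType is the MIME type this parser handles -->
      <implementation id="org.example.nutch.parse.doc.DocParser"
                      class="org.example.nutch.parse.doc.DocParser"
                      contentType="application/msword"
                      pathSuffix="doc"/>

   </extension>

</plugin>
```

The plugin id matches the "doc" entry in the plugin.includes expression, which is what lets the plugin.includes setting pick it up.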



Hope this helps,

--Rajesh Munavalli
On 4/11/06, bob knob <[EMAIL PROTECTED]> wrote:
>
> Hi, it's me again,
>
> If I'm going to use Nutch, I need xls, ppt, & doc file
> types to be searchable if at all possible. The wiki
> says most file types are disabled by default, but they
> can be turned on by changing conf/nutch-site.xml.
> Unfortunately there is no documentation that I can
> find for this file... any ideas how to do it, or
> sample xml that somebody could send over?
>
> Thanks,
> Bob
>
