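Follow these steps for nutch-0.7.2:

(1) Edit the plugin.includes property in nutch-default.xml (or, better, copy the property into conf/nutch-site.xml, which overrides the defaults). For example, to include the "doc" file type, change the parse entry in the <value> node to "parse-(text|html|doc)", as shown below. Each entry has to match the name of a plugin directory.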
<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|doc)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

(2) The next step is to develop the appropriate plugin for the particular file type. The parser needs to implement the "Parser" interface (org.apache.nutch.parse) in Nutch. More details can be found at the following link:
http://wiki.apache.org/nutch/WritingPluginExample

(3) Modify the plugin.xml. The link above describes everything in detail. Here is an example plugin.xml I wrote for an XHTML parser. Note the "contentType" attribute, which matches the file type you are trying to parse.

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="parse-xhtml"
        name="Xhtml Parse Plug-in"
        version="1.0.0"
        provider-name="dessci.com">

  <runtime>
    <library name="parse-xhtml.jar">
      <export name="*"/>
    </library>
    <library name="nekohtml-0.9.4.jar"/>
    <library name="tagsoup-1.0rc3.jar"/>
  </runtime>

  <extension id="com.dessci.search.nutch.parse.xhtml"
             name="XhtmlParse"
             point="org.apache.nutch.parse.Parser">
    <implementation id="com.dessci.search.nutch.parse.xhtml.XhtmlParser"
                    class="com.dessci.search.nutch.parse.xhtml.XhtmlParser"
                    contentType="application/xhtml+xml"
                    pathSuffix=""/>
  </extension>

</plugin>

Hope this helps,
--Rajesh Munavalli

On 4/11/06, bob knob <[EMAIL PROTECTED]> wrote:
>
> Hi, it's me again,
>
> If I'm going to use Nutch, I need xls, ppt, & doc file
> types to be searchable if at all possible. The wiki
> says most file types are disabled by default, but they
> can be turned on by changing conf/nutch-site.xml.
> Unfortunately there is no documentation that I can
> find for this file... any ideas how to do it, or
> sample xml that somebody could send over?
>
> Thanks,
> Bob
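
P.S. A quick way to sanity-check the plugin.includes expression from step (1) before restarting a crawl is to test plugin names against it in plain Java. This is only a rough sketch: the class name PluginIncludesCheck and the method isIncluded are made up for illustration, and it assumes Nutch applies the expression as a full match against each plugin name, which is worth verifying against your Nutch version.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: checks names against the
// plugin.includes value quoted in step (1).
public class PluginIncludesCheck {

    // The <value> from step (1), copied verbatim.
    static final String PLUGIN_INCLUDES =
        "nutch-extensionpoints|protocol-http|urlfilter-regex"
        + "|parse-(text|html|doc)|index-basic|query-(basic|site|url)";

    // Assumption: the whole plugin name must match the expression.
    public static boolean isIncluded(String pluginName) {
        return Pattern.compile(PLUGIN_INCLUDES).matcher(pluginName).matches();
    }

    public static void main(String[] args) {
        String[] names = {"parse-doc", "parse-html", "parse-pdf", "parse-xhtml"};
        for (String name : names) {
            System.out.println(name + " -> "
                + (isIncluded(name) ? "included" : "excluded"));
        }
    }
}
```

Under that full-match assumption, parse-xhtml comes out as excluded, so a custom plugin like the XHTML one in step (3) would also need its own entry in the expression, for example parse-(text|html|doc|xhtml).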
