You set that up in your nutch-site.xml file. Open the
nutch-default.xml file (located in <NUTCH_INSTALL_DIR>/conf) and look
for this element:

<property>
 <name>plugin.includes</name>
<value>protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
 <description>Regular expression naming plugin directory names to
 include.  Any plugin not matching this expression is excluded.
 In any case you need at least include the nutch-extensionpoints plugin. By
 default Nutch includes crawling just HTML and plain text via HTTP,
 and basic indexing and search plugins. In order to use HTTPS please enable
 protocol-httpclient, but be aware of possible intermittent problems with the
 underlying commons-httpclient library.
 </description>
</property>


You'll notice the "parse" plugins use the regex
"parse-(text|html|pdf|msword|rss)".  You remove/add the available
parsers here. So, if you only wanted PDFs, you'd use only the pdf
parser: "parse-(pdf)" or just "parse-pdf".
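For example, a PDF-only value might look something like this (a sketch, not tested; keep whatever protocol, indexing, and query plugins your setup still needs):

```xml
<property>
 <name>plugin.includes</name>
 <!-- Only parse-pdf remains in the parse group; other plugin groups kept as in the default -->
 <value>protocol-httpclient|urlfilter-regex|nutch-extensionpoints|parse-pdf|index-basic|query-(basic|site|url)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```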

Don't edit the nutch-default.xml file. Create a new nutch-site.xml file
for your customizations.  So, basically copy the nutch-default.xml
file, remove everything you don't need to override, and there ya go.
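A minimal nutch-site.xml along those lines might look like this (a sketch; the point is that any property you list here overrides the same-named property in nutch-default.xml, and everything else keeps its default):

```xml
<?xml version="1.0"?>
<configuration>
 <!-- Overrides plugin.includes from nutch-default.xml; all other defaults still apply -->
 <property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|nutch-extensionpoints|parse-pdf|index-basic|query-(basic|site|url)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
 </property>
</configuration>
```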

I believe that is the correct way.


On 6/6/07, Martin Kammerlander <[EMAIL PROTECTED]> wrote:


hi!

I have a question. Say I have some seed urls and do a crawl based on
those seeds. If I then want to index only pages that contain, for example, pdf
documents, how can I do that?

cheers
martin





--
"Conscious decisions by conscious minds are what make reality real"
