Hi,

I am using Nutch to crawl and index a local filesystem.

*My URL file is:*
file:///hm/vineetg/url/libraries/
file:///hm/vineetg/url/share/doc/

*crawl-urlfilter:*

# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.
# skip http:, ftp:, & mailto: urls
#-^(http|ftp|mailto):
+^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(css|jpg|JPG|png|PNG|ico|ICO|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^file*:///hm/vineetg/url/libraries/([a-zA-Z0-9]*\.)
+^file*:///hm/vineetg/url/share/doc/([a-zA-Z0-9]*\.)
# skip everything else
-.
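
The comments in the filter file say that the first matching pattern decides whether a URL is included. A minimal sketch of that rule, applied to the patterns posted above, may help show which line decides for a given URL. This is not Nutch code: Nutch is written in Java, and the sketch assumes each pattern is applied as an unanchored regex search. The function name `first_match` is hypothetical, and the rule under the "probable queries" comment is left out.

```python
import re

# Rules copied from the crawl-urlfilter above, in file order, as (sign, pattern).
# The rule under the "probable queries" comment is left out of this sketch.
RULES = [
    ("+", r"^(file|ftp|mailto):"),
    ("-", r"\.(css|jpg|JPG|png|PNG|ico|ICO|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm"
          r"|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("+", r"^file*:///hm/vineetg/url/libraries/([a-zA-Z0-9]*\.)"),
    ("+", r"^file*:///hm/vineetg/url/share/doc/([a-zA-Z0-9]*\.)"),
    ("-", r"."),
]


def first_match(url, rules=RULES):
    """Return (accepted, pattern) for the first rule whose pattern is found in url.

    Assumption: patterns are searched anywhere in the URL unless anchored.
    If no rule matches, the URL is ignored, as the file comments describe.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+", pattern
    return False, None


if __name__ == "__main__":
    # A parent directory URL is decided by the very first "+" line.
    print(first_match("file:///hm/vineetg/"))  # (True, '^(file|ftp|mailto):')
    # Without that first line, the same URL falls through to the final "-." rule.
    print(first_match("file:///hm/vineetg/", RULES[1:]))  # (False, '.')
```

Under these assumptions, every `file:` URL is decided by the `+^(file|ftp|mailto):` line before the directory-specific lines are reached.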


*nutch-site.xml:*
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
  <name>http.agent.name</name>
  <value>ESL</value>
  <description></description>
</property>
<property>
  <name>http.agent.description</name>
  <value>MyDescription</value>
  <description></description>
</property>
<property>
  <name>http.agent.url</name>
  <value>myurlcom</value>
  <description></description>
</property>
<property>
  <name>http.agent.email</name>
  <value>[EMAIL PROTECTED]</value>
  <description></description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|urlfilter-suffix|parse-(html|gif)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description></description>
</property>
<property>
  <name>plugin.folders</name>
  <value>/hm/vineetg/nutch-0.9/plugins</value>
  <description></description>
</property>
</configuration>



*Problems:*
1. Nutch is fetching and indexing files in the vineetg and url directories too, i.e. it is crawling the parent directories. I don't want Nutch to crawl the parent directories; I want only the child directories to be crawled.
2. Nutch is throwing a ParserException while parsing .gif files. Do I have to include a parser for this? If yes, how do I include it?

Does anybody know how to solve these problems?

Regards,
Vineet Garg

