Nutch fetching skipped files

Vineet Garg Wed, 02 Apr 2008 04:34:39 -0700

Hi,

I am using Nutch to crawl the local file system. I run the crawl with:

bin/nutch crawl urls -dir crawl -depth 5 -topN 500 >& crawl.log

But Nutch is fetching files, e.g. .css or .png files, which I have set to be skipped in the crawl-urlfilter.txt file, and it throws errors while parsing them:


fetching file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden
fetching file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css
fetching file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden
fetching file:/hm/vineetg/SPD38/share/doc/spwlibcomm/spwlibcommPreface_2.html
fetching file:/hm/vineetg/SPD38/share/doc/spweul/images/
fetching file:/hm/vineetg/SPD38/share/doc/spwfds/wwhdata/
fetching file:/hm/vineetg/SPD38/share/doc/spw2xml/DBMigrator_methodology_5.html
fetching file:/hm/vineetg/SPD38/share/doc/spwtutorial_advanced/spwtutorial_advancedPreface_4.html
fetching file:/hm/vineetg/SPD38/share/doc/spwcsc/title_1.html
fetching file:/hm/vineetg/SPD38/share/doc/spwveis136/images/
Error parsing: file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType= url=file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden
fetching file:/hm/vineetg/SPD38/share/doc/spwlibwlan/chap1_6.html
Error parsing: file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType= url=file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden
fetching file:/hm/vineetg/SPD38/libraries/cdma_rtl/fir/
Error parsing: file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType=text/css url=file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css



My crawl-urlfilter.txt file is:

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip http:, ftp:, & mailto: urls
#-^(http|ftp|mailto):
+^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(css|gif|GIF|jpg|JPG|png|PNG|ico|ICO|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops

-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^file*:///hm/vineetg/SPD38/libraries/([a-zA-Z0-9]*\.)
+^file*:///hm/vineetg/SPD38/share/doc/([a-zA-Z0-9]*\.)
# skip everything else
-.
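
To make the first-match rule described in the filter file's comments concrete, here is a minimal Python sketch (not Nutch code) that walks the rules in file order. It assumes each pattern is applied as a partial, find-style match, which is an assumption about the regex URL filter rather than something stated in the file. The names RULES, REORDERED and first_match are hypothetical, and the redacted [EMAIL PROTECTED] rule is left out.

```python
import re

# Rules copied from the filter file above, in file order.
# The redacted "[EMAIL PROTECTED]" rule is omitted on purpose.
RULES = [
    r"+^(file|ftp|mailto):",
    r"-\.(css|gif|GIF|jpg|JPG|png|PNG|ico|ICO|sit|eps|wmf|zip|ppt|mpg|xls"
    r"|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$",
    r"-.*(/.+?)/.*?\1/.*?\1/",
    r"+^file*:///hm/vineetg/SPD38/libraries/([a-zA-Z0-9]*\.)",
    r"+^file*:///hm/vineetg/SPD38/share/doc/([a-zA-Z0-9]*\.)",
    r"-.",
]

# Hypothetical variant: the same rules with the first two swapped,
# so the suffix rule is consulted before the protocol rule.
REORDERED = [RULES[1], RULES[0]] + RULES[2:]


def first_match(url, rules=RULES):
    """Return (accepted, rule) for the first rule whose pattern is found in url.

    Assumption: patterns are searched anywhere in the URL (re.search),
    and a URL that matches no rule is ignored.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+", rule
    return False, None


if __name__ == "__main__":
    css = "file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css"
    print(first_match(css))             # rules in the order shown above
    print(first_match(css, REORDERED))  # suffix rule checked first
```

Under those assumptions, the catalog.css URL from the log is decided by the +^(file|ftp|mailto): line before the suffix rule is ever consulted, while the reordered list rejects it at the suffix rule.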

My nutch-site.xml is:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>http.agent.name</name>
  <value>ESL</value>
  <description></description>
</property>

<property>
  <name>http.agent.description</name>
  <value>MyDescription</value>
  <description></description>
</property>

<property>
  <name>http.agent.url</name>
  <value>myurlcom</value>
  <description></description>
</property>

<property>
  <name>http.agent.email</name>
  <value>[EMAIL PROTECTED]</value>
  <description></description>
</property>

<property>
  <name>plugin.includes</name>

  <description></description>
</property>

<property>
  <name>plugin.folders</name>
  <value>/hm/vineetg/nutch-0.9/plugins</value>
  <description></description>
</property>

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>the length for downloaded content</description>
</property>

</configuration>


What could be the reason?

Regards,
Vineet
