Hello Vineet,

Try using regex-urlfilter instead of crawl-urlfilter.
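
One thing worth checking in either file is rule order. The header comment in your filter file says the first matching pattern decides, and your file lists +^(file|ftp|mailto): above the suffix rule, so a file: URL ending in .css may be accepted before the exclusion is ever tried. Below is a minimal Python sketch of that first-match rule. It is not Nutch code: the `accepts` helper and the shortened rule lists are made up for illustration, and it assumes a pattern may match anywhere in the URL.

```python
import re

# Rules in file order as (sign, pattern) pairs.
# Shortened, illustrative version of the quoted crawl-urlfilter.txt.
QUOTED_ORDER = [
    ("+", r"^(file|ftp|mailto):"),
    ("-", r"\.(css|gif|jpg|png)$"),
    ("-", r"."),
]

# Same rules with the suffix exclusion moved to the top.
SUFFIX_FIRST = [
    ("-", r"\.(css|gif|jpg|png)$"),
    ("+", r"^(file|ftp|mailto):"),
    ("-", r"."),
]


def accepts(rules, url):
    """Hypothetical helper: the first rule whose pattern is found in url decides."""
    for sign, pattern in rules:
        # Assumption: a rule matches if its pattern occurs anywhere in the URL.
        if re.search(pattern, url):
            return sign == "+"
    return False  # no pattern matched, so the URL is ignored


url = "file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css"
print(accepts(QUOTED_ORDER, url))  # True: the '+' rule matches first
print(accepts(SUFFIX_FIRST, url))  # False: the suffix rule matches first
```

If that matches what you see, moving the suffix rule above the + line would be the thing to try.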

Regards,

Arkadi

> -----Original Message-----
> From: Vineet Garg [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, April 02, 2008 10:34 PM
> To: [email protected]
> Subject: Nutch fetching skipped files
> 
> Hi,
> I am using Nutch to crawl the local file system. I am crawling with:
>
>   bin/nutch crawl urls -dir crawl -depth 5 -topN 500 >& crawl.log
>
> But Nutch is fetching files, e.g. .css or .png files, which I have set
> to be skipped in the crawl-urlfilter.txt file, and it throws errors
> while parsing:
> 
> fetching file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden
> fetching file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css
> fetching file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden
> fetching file:/hm/vineetg/SPD38/share/doc/spwlibcomm/spwlibcommPreface_2.html
> fetching file:/hm/vineetg/SPD38/share/doc/spweul/images/
> fetching file:/hm/vineetg/SPD38/share/doc/spwfds/wwhdata/
> fetching file:/hm/vineetg/SPD38/share/doc/spw2xml/DBMigrator_methodology_5.html
> fetching file:/hm/vineetg/SPD38/share/doc/spwtutorial_advanced/spwtutorial_advancedPreface_4.html
> fetching file:/hm/vineetg/SPD38/share/doc/spwcsc/title_1.html
> fetching file:/hm/vineetg/SPD38/share/doc/spwveis136/images/
> Error parsing: file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType= url=file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden
> fetching file:/hm/vineetg/SPD38/share/doc/spwlibwlan/chap1_6.html
> Error parsing: file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType= url=file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden
> fetching file:/hm/vineetg/SPD38/libraries/cdma_rtl/fir/
> Error parsing: file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType=text/css url=file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css
> 
> 
> My crawl-urlfilter.txt file is:
> 
> # The url filter file used by the crawl command.
> 
> 
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
> 
> 
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
> 
> 
> # skip http:, ftp:, & mailto: urls
> #-^(http|ftp|mailto):
> +^(file|ftp|mailto):
> 
> 
> 
> 
> 
> 
> # skip image and other suffixes we can't yet parse
> -\.(css|gif|GIF|jpg|JPG|png|PNG|ico|ICO|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> 
> 
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
> 
> 
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
> 
> 
> # accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> +^file*:///hm/vineetg/SPD38/libraries/([a-zA-Z0-9]*\.)
> +^file*:///hm/vineetg/SPD38/share/doc/([a-zA-Z0-9]*\.)
> # skip everything else
> -.
> 
> nutch-site.xml :
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> 
> 
> <!-- Put site-specific property overrides in this file. -->
> 
> 
> <configuration>
> 
> 
> <property>
>         <name>http.agent.name</name>
>         <value>ESL</value>
>         <description></description>
> </property>
> 
> 
> 
> 
> 
> 
> 
> 
> <property>
>   <name>http.agent.description</name>
>   <value>MyDescription</value>
>   <description></description>
> </property>
> 
> 
> 
> 
> 
> 
> <property>
>    <name>http.agent.url</name>
>    <value>myurlcom</value>
>    <description></description>
> </property>
> 
> 
> 
> 
> 
> 
> <property>
>           <name>http.agent.email</name>
>           <value>[EMAIL PROTECTED]</value>
>           <description></description>
> </property>
> 
> 
> <property>
>         <name>plugin.includes</name>
> 
>         <value>protocol-http|protocol-file|regex-urlfilter|parse-html|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>         <description></description>
> </property>
> 
> 
> 
> 
> 
> 
> <property>
>         <name>plugin.folders</name>
>         <value>/hm/vineetg/nutch-0.9/plugins</value>
>         <description></description>
> </property>
> 
> 
> 
> 
> 
> 
> <property>
>    <name>file.content.limit</name>
>    <value>-1</value>
>    <description>the length for downloaded content</description>
> </property>
> 
> 
> </configuration>
> 
> 
> What could be the reason?
> 
> Regards,
> Vineet

