Re: Nutch fetching skipped files

Susam Pal Thu, 03 Apr 2008 10:05:17 -0700

Find my reply inline.

On Wed, Apr 2, 2008 at 5:04 PM, Vineet Garg <[EMAIL PROTECTED]> wrote:
> Hi,
>  I am using Nutch to crawl local file system. I am crawling by  bin/nutch
> crawl urls -dir crawl -depth 5 -topN 500 > & crawl.log.
>  But nutch is fetching files e.g. .css or .png files which i have set to be
> skipped in crawl-urlfilter.txt file and throwing error while parsing:
>
>  fetching file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden
>  fetching file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css
>  fetching file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden
>  fetching
> file:/hm/vineetg/SPD38/share/doc/spwlibcomm/spwlibcommPreface_2.html
>  fetching file:/hm/vineetg/SPD38/share/doc/spweul/images/
>  fetching file:/hm/vineetg/SPD38/share/doc/spwfds/wwhdata/
>  fetching
> file:/hm/vineetg/SPD38/share/doc/spw2xml/DBMigrator_methodology_5.html
>  fetching
> file:/hm/vineetg/SPD38/share/doc/spwtutorial_advanced/spwtutorial_advancedPreface_4.html
>  fetching file:/hm/vineetg/SPD38/share/doc/spwcsc/title_1.html
>  fetching file:/hm/vineetg/SPD38/share/doc/spwveis136/images/
>  Error parsing: file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden:
> failed(2,200): org.apache.nutch.parse.ParseException: parser not found for
> contentType= url=file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden
>  fetching file:/hm/vineetg/SPD38/share/doc/spwlibwlan/chap1_6.html
>  Error parsing: file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden:
> failed(2,200): org.apache.nutch.parse.ParseException: parser not found for
> contentType= url=file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden
>  fetching file:/hm/vineetg/SPD38/libraries/cdma_rtl/fir/
>  Error parsing: file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css:
> failed(2,200): org.apache.nutch.parse.ParseException: parser not found for
> contentType=text/css
> url=file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css
>
>
>  my crawl-urlfilter file is:
>
>  # The url filter file used by the crawl command.
>
>  # Better for intranet crawling.
>  # Be sure to change MY.DOMAIN.NAME to your domain name.
>
>  # Each non-comment, non-blank line contains a regular expression
>  # prefixed by '+' or '-'.  The first matching pattern in the file
>  # determines whether a URL is included or ignored.  If no pattern
>  # matches, the URL is ignored.
>
>  # skip http:, ftp:, & mailto: urls
>  #-^(http|ftp|mailto):
>  +^(file|ftp|mailto):


You have allowed URLs beginning with "file:". Because this is the first
regular expression that matches the URLs being crawled, the rest of
crawl-urlfilter.txt is never consulted for them. The comments in the
file say as much: "The first matching pattern in the file determines
whether a URL is included or ignored."
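
To make the first-match rule concrete, here is a small Python sketch of
the behaviour described in the file's comments. It is not Nutch code:
the function name first_match_decision and the shortened suffix list
are made up for illustration, and it assumes each pattern is tested
against the URL with a "find anywhere" search.

```python
import re

def first_match_decision(rules, url):
    """Return True if the URL is included, False if it is ignored.

    rules is a list of (sign, pattern) pairs in file order.
    Hypothetical helper that only mimics the documented semantics.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            # The first matching pattern decides; later rules are not read.
            return sign == "+"
    # "If no pattern matches, the URL is ignored."
    return False

# Shortened suffix list, for illustration only.
SKIP_SUFFIXES = r"\.(css|gif|jpg|png)$"
ALLOW_FILE = r"^(file|ftp|mailto):"

# Order as in the posted file: the '+' rule comes first.
posted_order = [("+", ALLOW_FILE), ("-", SKIP_SUFFIXES)]
# Same rules with the '-' rule moved above the '+' rule.
reordered = [("-", SKIP_SUFFIXES), ("+", ALLOW_FILE)]

css_url = "file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css"
html_url = "file:/hm/vineetg/SPD38/share/doc/spwcsc/title_1.html"

print(first_match_decision(posted_order, css_url))  # True: .css is fetched
print(first_match_decision(reordered, css_url))     # False: .css is skipped
print(first_match_decision(reordered, html_url))    # True: .html still fetched
```

If the sketch reflects what Nutch does, moving the "-\.(css|...)$" line
above the "+^(file|ftp|mailto):" line in crawl-urlfilter.txt should make
the suffix rule take effect.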

Hope this helps.

Regards,
Susam Pal

>
>
>
>  # skip image and other suffixes we can't yet parse
>
> -\.(css|gif|GIF|jpg|JPG|png|PNG|ico|ICO|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
>  What could be the reason??
>
>  Regards,
>  Vineet
>
