The regex filter only filters URLs, not content types. Because the URL ends
with .asp, it does not match any of the prohibited URL patterns. The problem
is that Nutch follows img/@src, so it downloads images. There is a patch for
this at http://issues.apache.org/jira/browse/Nutch-488 which allows selecting
which tags are used for outlinks.
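
To illustrate why the URL gets through, here is a minimal sketch using plain
java.util.regex rather than Nutch's own filter plugin. The class name
SuffixFilterDemo, the helper isExcluded, and the example.com URL are made up
for illustration, and the suffix list is a shortened copy of the rule from
regex-urlfilter.txt quoted below.

```java
import java.util.regex.Pattern;

public class SuffixFilterDemo {
    // Shortened copy of the exclusion rule from the quoted regex-urlfilter.txt.
    // Hypothetical stand-in: this is NOT the Nutch plugin code itself.
    static final Pattern SKIP =
        Pattern.compile("\\.(gif|GIF|jpg|JPG|png|PNG|jpeg|JPEG|bmp|BMP|swf)$");

    // True when the URL text matches the exclusion pattern.
    // Only the URL string is inspected; the served content type is never seen.
    static boolean isExcluded(String url) {
        return SKIP.matcher(url).find();
    }

    public static void main(String[] args) {
        // Ends with .asp, so the suffix rule does not apply even though
        // the server answers with contentType=image/bmp.
        System.out.println(isExcluded("http://0086jia.com/include/validCode.asp"));
        // Ends with .bmp, so the suffix rule applies.
        System.out.println(isExcluded("http://example.com/logo.bmp"));
    }
}
```

The first call prints false and the second prints true: a URL-based rule
cannot reject a page whose name hides the image type.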

See more in this thread:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg06961.html

Regards,
Marcin


On 15 October 2007 11:18, "eyal edri" <[EMAIL PROTECTED]> wrote:

> Hello,
> 
> During a fetch, the fetcher failed to retrieve a certain page with the
> following exception:
> 
> // url is masked ****
> Error parsing: http://*********/validCode.asp:
> org.apache.nutch.parse.ParseException: parser not found for
> contentType=image/bmp url=http://0086jia.com/include/validCode.asp
>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:81)
>         at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:349)
>         at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:194)
> 
> I've configured both regex-urlfilter.txt:
> 
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|wmv|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|jpeg|JPEG|bmp|BMP|swf)$
> 
> and suffix-urlfilter.txt:
> 
> ### prohibit these
> # pictures
> .gif
> .jpg
> .jpeg
> .bmp
> .png
> .tif
> .tiff
> 
> Both plugins are in the nutch-site "plugin.includes" property:
> 
> 
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|urlfilter-suffix|parse-(text|html|js|zip)|query-(basic|site|url)|index-basic|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
> 
> 
> and my crawling is done by running: nutch inject/generate/fetch loops.
> 
> Am I missing some property I should configure in order to avoid
> fetching/crawling content types I don't want to? (The same goes for
> xml/jpeg... and other file types.)
> 
> Thanks!
> 
> Eyal.
> 
