Marcin is correct about the .asp extension and the regex filter, but
Nutch is not downloading this as an image src. The page itself,
http://0086jia.com/include/validCode.asp, returns an image with a content
type of image/bmp. It looks like a simple captcha to me. Since Nutch has no
parser for this content type, it throws an error and moves on. It
shouldn't stop the fetching process; it should just log the error and
continue. AFAIK there is currently no way to filter by content type,
although that might be an interesting addition.
Dennis Kubes
Marcin Okraszewski wrote:
The regex filter only filters URLs, not content types. As the URL ends with .asp,
it does not match any of the prohibited URL patterns. The problem is that Nutch
follows img/@src, so it downloads images. There is a patch for this under
http://issues.apache.org/jira/browse/Nutch-488 which allows selecting which tags
are used for outlinks.
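
To illustrate why the URL pattern does not catch this page, here is a minimal standalone sketch. It is not Nutch code: the class name `SuffixFilterDemo` and the helper `isSkipped` are made up for illustration, and it assumes the rule is applied as an unanchored search over the URL string. The pattern is the suffix rule quoted in the original question, without the leading `-` that marks it as an exclusion. The `.bmp` URL is a hypothetical one, included only for contrast.

```java
import java.util.regex.Pattern;

public class SuffixFilterDemo {
    // Suffix rule from the regex-urlfilter.txt excerpt in this thread,
    // minus the leading '-' (which only marks the rule as "exclude").
    static final Pattern SKIP = Pattern.compile(
        "\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|wmv|ppt|mpg|xls|gz"
        + "|rpm|tgz|mov|MOV|jpeg|JPEG|bmp|BMP|swf)$");

    // Hypothetical helper: true when the URL string ends with an excluded suffix.
    // Assumption: the filter searches the URL text only; it never sees the
    // Content-Type header returned by the server.
    static boolean isSkipped(String url) {
        return SKIP.matcher(url).find();
    }

    public static void main(String[] args) {
        // The URL from the error message ends in .asp, so the rule does not match.
        System.out.println(isSkipped("http://0086jia.com/include/validCode.asp"));
        // A hypothetical URL ending in .bmp would be excluded.
        System.out.println(isSkipped("http://0086jia.com/include/validCode.bmp"));
    }
}
```

The first call prints `false` and the second prints `true`: the decision depends only on the URL text, so a page that ends in .asp but serves image/bmp passes the filter and is fetched.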
See more in this thread:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg06961.html
Regards,
Marcin
On 15 October 2007 11:18, "eyal edri" <[EMAIL PROTECTED]> wrote:
Hello,
During a fetch, the fetcher failed to retrieve a certain page with the
following exception:
// url is masked ****
Error parsing: http://*********/validCode.asp:
org.apache.nutch.parse.ParseException: parser not found for contentType=image/bmp url=http://0086jia.com/include/validCode.asp
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:81)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:349)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:194)
I've configured both regex-urlfilter.txt:
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|wmv|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|jpeg|JPEG|bmp|BMP|swf)$
and suffix-urlfilter.txt:
### prohibit these
# pictures
.gif
.jpg
.jpeg
.bmp
.png
.tif
.tiff
Both plugins are listed in the "plugin.includes" property in nutch-site.xml:
plugin.includes
protocol-http|urlfilter-regex|urlfilter-suffix|parse-(text|html|js|zip)|query-(basic|site|url)|index-basic|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
My crawling is done by running nutch inject/generate/fetch loops.
Am I missing some property I should configure in order to avoid
fetching/crawling content types I don't want (the same goes for xml/jpeg and
other file types)?
Thanks!
Eyal.