Re: ParseException: parser not found for contentType=image/bmp [or how to disallow certain contentTypes from fetching]

Dennis Kubes Mon, 15 Oct 2007 05:13:38 -0700

Marcin is correct about the .asp extension and the regex filter, but nutch is not downloading this as an image src. The page itself, http://0086jia.com/include/validCode.asp, returns an image with content type of bmp. It looks like a simple captcha to me. Since nutch can't parse this type of content it throws an error and moves on. It shouldn't stop the fetching process, it should just log the error and continue. AFAIK there is no way currently to filter content types, although that might be an interesting addition.
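
For illustration only, a minimal sketch of what such a content-type check might look like in plain Java. The class name, method name, and allow-list below are hypothetical and are not part of Nutch; the sketch only shows the idea of deciding from the Content-Type header rather than from the URL.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class ContentTypeFilterSketch {
    // Hypothetical allow-list of types we have a parser for; not a Nutch setting.
    private static final Set<String> PARSEABLE =
        new HashSet<>(Arrays.asList("text/html", "text/plain"));

    // Returns true if the Content-Type header names a type on the allow-list.
    // Parameters such as "; charset=UTF-8" are dropped and case is ignored.
    static boolean shouldParse(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return false;
        }
        String mime = contentTypeHeader.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return PARSEABLE.contains(mime);
    }

    public static void main(String[] args) {
        System.out.println(shouldParse("image/bmp"));                // false
        System.out.println(shouldParse("text/html; charset=UTF-8")); // true
    }
}
```

With a check like this, the validCode.asp response would be skipped because of its image/bmp header, even though its URL passes the suffix filters.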


Dennis Kubes


Marcin Okraszewski wrote:

The regex filter just filters URL, not content types. As the URL ends with .asp 
it does not fall into the prohibited URL patterns. The problem is that Nutch 
follows img/@src, so it downloads images. There is a patch for this under 
http://issues.apache.org/jira/browse/Nutch-488 which allows selecting tags to 
take for outlinks.
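
To make the first point concrete, here is a small standalone sketch using java.util.regex. It applies the suffix rule from the regex-urlfilter.txt quoted further down in this thread (with the leading "-" dropped) to the failing URL. The class and method names are made up for the example, and it does not call any Nutch code.

```java
import java.util.regex.Pattern;

public class SuffixFilterDemo {
    // Suffix rule copied from the regex-urlfilter.txt quoted in this thread,
    // without the leading '-' that marks it as an exclusion rule.
    private static final Pattern SKIP = Pattern.compile(
        "\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|wmv|ppt|mpg|xls|gz|rpm|tgz"
        + "|mov|MOV|jpeg|JPEG|bmp|BMP|swf)$");

    // True if the URL ends with one of the excluded suffixes.
    static boolean isRejected(String url) {
        return SKIP.matcher(url).find();
    }

    public static void main(String[] args) {
        // The URL ends in .asp, so the rule does not match it.
        System.out.println(isRejected("http://0086jia.com/include/validCode.asp")); // false
        // A URL that really ends in .bmp is matched.
        System.out.println(isRejected("http://example.com/logo.bmp"));              // true
    }
}
```

The rule only ever sees the URL string, so a page that serves image/bmp from a .asp address is not caught by it.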

See more in this thread:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg06961.html

Regards,
Marcin


On 15 October 2007 11:18, "eyal edri" <[EMAIL PROTECTED]> wrote:

Hello,

During a fetch, the fetcher failed to retrieve a certain page with the
following exception:

// url is masked ****
Error parsing: http://*********/validCode.asp:
org.apache.nutch.parse.ParseException: parser not found for contentType=image/bmp url=http://0086jia.com/include/validCode.asp
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:81)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:349)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:194)

I've configured both regex-urlfilter.txt:

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|wmv|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|jpeg|JPEG|bmp|BMP|swf)$

and suffix-urlfilter.txt:

### prohibit these
# pictures
.gif
.jpg
.jpeg
.bmp
.png
.tif
.tiff

Both plugins are in the nutch-site.xml "plugin.includes" property:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|urlfilter-suffix|parse-(text|html|js|zip)|query-(basic|site|url)|index-basic|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>


My crawling is done by running nutch inject/generate/fetch loops.

Am I missing some property I should configure in order to avoid
fetching/crawling content types I don't want? (The same goes for xml, jpeg,
and other file types.)

Thanks!

Eyal.
