Re: crawl problems (a bug/patch)

Jérôme Charron Thu, 20 Oct 2005 14:23:04 -0700

> By investigating further, I've found that for parse-html, the links are
> extracted differently: the links are returned by
> DOMContentUtils.getOutlinks based upon Neko, which therefore makes me
> wonder how you get to extract links with OutlinkExtractor instead...


Earl,

Which Nutch version do you use?
If your links are extracted by the OutlinkExtractor, it seems that your
document is not being parsed by the HtmlParser, but by the TextParser
(the default one) instead.
There was a bug concerning content-types that carry parameters, such as
"text/html; charset=iso-8859-1".
Your site returns such a content-type, so the ParserFactory fails to find
the right parser (HtmlParser) and falls back to the default one.
This issue is fixed in trunk and mapred.
For further details, see the thread on the nutch-dev mailing list:
http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00791.html
or the NUTCH-88 issue: http://issues.apache.org/jira/browse/NUTCH-88
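
To illustrate the kind of mismatch described above, here is a minimal
sketch in plain Java. It is not the actual Nutch code or the NUTCH-88 patch:
the class ContentTypeLookup, its parser registry, and the method names are
all hypothetical, and it only assumes that parsers are looked up by the bare
MIME type.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; this is not Nutch's ParserFactory.
public class ContentTypeLookup {

    // Assumed registry: bare MIME type -> parser name.
    private static final Map<String, String> PARSERS = new HashMap<>();
    static {
        PARSERS.put("text/html", "HtmlParser");
    }

    // Fallback used when no parser is registered for the type.
    private static final String DEFAULT_PARSER = "TextParser";

    // Drop any parameters (everything from the first ';') and normalize case.
    static String baseType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Looks up the raw header value, so a "; charset=..." suffix causes a miss.
    static String naiveLookup(String contentType) {
        return PARSERS.getOrDefault(contentType, DEFAULT_PARSER);
    }

    // Strips the parameters before looking up the parser.
    static String normalizedLookup(String contentType) {
        return PARSERS.getOrDefault(baseType(contentType), DEFAULT_PARSER);
    }

    public static void main(String[] args) {
        String header = "text/html; charset=iso-8859-1";
        System.out.println(naiveLookup(header));      // prints "TextParser"
        System.out.println(normalizedLookup(header)); // prints "HtmlParser"
    }
}
```

With the raw header as the lookup key, the document silently falls through
to the default parser, which would match the symptom of seeing links coming
from the OutlinkExtractor.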

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
