> By investing further, I've found that for parse-html, the links are
> extracted differently: the links are returned by
> DOMContentUtils.getOutlinks based upon Neko, which therefore makes me
> wonder how you get to extract links with OutlinkExtractor instead...

Earl, which Nutch version do you use?

If your links are extracted by the OutlinkExtractor, it seems that your document is not being parsed by the HtmlParser, but by the TextParser (the default one) instead.

There was a bug concerning content-types with parameters, such as "text/html; charset=iso-8859-1". Your site returns such a content-type, so the ParserFactory fails to find the right parser (HtmlParser) and falls back to the default one. This issue is fixed in trunk and mapred.

For further details, see the thread on the nutch-dev mailing list:
http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00791.html
or the NUTCH-88 issue:
http://issues.apache.org/jira/browse/NUTCH-88

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
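
To illustrate the content-type problem described above, here is a minimal sketch. It is not the actual Nutch code: the class ContentTypeLookup, the PARSERS table, and the methods naiveLookup, fixedLookup and baseType are all hypothetical names made up for this example. It only shows why a lookup keyed on the raw header value misses "text/html" when a charset parameter is attached, and how stripping the parameters first avoids that.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; the real ParserFactory is driven by plugin configuration.
class ContentTypeLookup {
    // Assumed mapping from bare MIME type to parser name.
    static final Map<String, String> PARSERS = new HashMap<>();
    static final String DEFAULT_PARSER = "TextParser";

    static {
        PARSERS.put("text/html", "HtmlParser");
        PARSERS.put("text/plain", "TextParser");
    }

    // Drops any parameters (everything from the first ';') and normalizes case.
    static String baseType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Buggy behaviour: the raw header value is used as the lookup key.
    static String naiveLookup(String contentType) {
        return PARSERS.getOrDefault(contentType, DEFAULT_PARSER);
    }

    // Fixed behaviour: parameters are stripped before the lookup.
    static String fixedLookup(String contentType) {
        return PARSERS.getOrDefault(baseType(contentType), DEFAULT_PARSER);
    }
}
```

With the header "text/html; charset=iso-8859-1", naiveLookup falls back to the default TextParser, while fixedLookup selects HtmlParser.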
