Hi Sébastien,
Yahoo! just hosed my message, glad I had it elsewhere.
As you probably saw in the OutlinkExtractor class, the links are
extracted with a regexp.
Ahh, didn't see it before, but I now see URL_PATTERN.
I know it's minor, but if you later apply
By investigating further, I've found that for parse-html, the links are
extracted differently: the links are returned by
DOMContentUtils.getOutlinks based upon Neko, which therefore makes me
wonder how you get to extract links with OutlinkExtractor instead...
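For reference, the regexp approach mentioned above can be sketched roughly as follows. This is only an illustration: the class name RegexLinkSketch and the pattern SIMPLE_URL are made up for this example, and the pattern is a deliberately simplified stand-in, not the actual URL_PATTERN from OutlinkExtractor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not Nutch code: pulls absolute http(s) URLs out of raw text.
final class RegexLinkSketch {

    // Simplified pattern (an assumption for this example, NOT Nutch's URL_PATTERN):
    // "http://" or "https://" followed by anything up to whitespace, a quote or an angle bracket.
    private static final Pattern SIMPLE_URL =
        Pattern.compile("https?://[^\\s\"'<>]+", Pattern.CASE_INSENSITIVE);

    // Returns every match in document order; an empty list for null or link-free input.
    static List<String> extract(String text) {
        List<String> links = new ArrayList<>();
        if (text == null) {
            return links;
        }
        Matcher m = SIMPLE_URL.matcher(text);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }
}
```

A regexp like this sees the page as flat text, so it also picks up URLs that appear outside of anchor tags and misses relative links, which is one way its results can differ from a DOM-based extractor.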
Earl,
Jérôme,
which Nutch version do you use?
Kind of gave up on mapred for a while, so I am using trunk.
There was a bug concerning content types with parameters such as
text/html; charset=iso-8859-1.
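To make the parameter issue concrete, here is a minimal sketch of separating the media type from its parameters before matching it against a parser. The class ContentTypeSketch and its methods are hypothetical names for this example and are not taken from the Nutch source; how Nutch actually handles this may differ.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch: splits a Content-Type header value
// into its media type and its optional charset parameter.
final class ContentTypeSketch {

    // Returns the media type with any parameters removed, trimmed and lower-cased.
    static String baseType(String header) {
        if (header == null) {
            return null;
        }
        int semi = header.indexOf(';');
        String type = (semi >= 0) ? header.substring(0, semi) : header;
        return type.trim().toLowerCase(Locale.ROOT);
    }

    // Returns the value of the charset parameter, or null when there is none.
    static String charset(String header) {
        if (header == null) {
            return null;
        }
        String[] parts = header.split(";");
        // parts[0] is the media type; the parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                String value = param.substring(eq + 1).trim();
                // Drop optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }
}
```

With this split, a lookup keyed on plain "text/html" would still match when the server appends a charset or leaves a trailing semicolon.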
Yeah, when I telnet in and GET / from shopthar.com, I get
Content-Type: text/html;