Re: crawl problems (a bug/patch)

2005-10-20 Thread Earl Cahill

Hi Sébastien, Yahoo! just hosed my message, glad I had it elsewhere. As you probably saw in the OutlinkExtractor class, the links are extracted with a Regexp. Ahh, didn't see it before, but I now see URL_PATTERN. I know it's minor, but if you later apply
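For readers following the thread, here is a minimal sketch of what regex-based link extraction looks like in general. The pattern below is a hypothetical stand-in written for illustration; it is not the URL_PATTERN from Nutch's OutlinkExtractor, and the class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLinkSketch {
    // Hypothetical pattern for illustration only; NOT Nutch's URL_PATTERN.
    // Matches http/https URLs up to the next whitespace, quote, or angle bracket.
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    /** Collects every substring of the text that matches the URL pattern. */
    static List<String> extract(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String sample = "see <a href=\"http://example.com/a\">x</a> and http://example.org/b end";
        System.out.println(extract(sample));
    }
}
```

A plain-text pattern like this has no notion of markup, which is why a DOM-based extractor can return different links for the same HTML page.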

Re: crawl problems (a bug/patch)

2005-10-20 Thread Jérôme Charron

By investigating further, I've found that for parse-html, the links are extracted differently: the links are returned by DOMContentUtils.getOutlinks based upon Neko, which therefore makes me wonder how you get to extract links with OutlinkExtractor instead... Earl, which Nutch version do you

Re: crawl problems (a bug/patch)

2005-10-20 Thread Earl Cahill

Jérôme, which Nutch version do you use? Kind of gave up on mapred for a while, so I am using trunk. There was a bug concerning the content-types with parameters such as text/html; charset=iso-8859-1. Yeah, when I telnet in to GET / shopthar.com, I get Content-Type: text/html;
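To illustrate the kind of issue being discussed, here is a minimal sketch of reducing a Content-Type header with parameters to its bare media type before choosing a parser. This is not the actual Nutch fix; the class and method names are hypothetical, and the lookup that would consume the result is assumed.

```java
import java.util.Locale;

class ContentTypeSketch {
    /**
     * Returns the bare media type of a Content-Type header value,
     * dropping any parameters after the first ';' (for example a charset).
     * Hypothetical helper for illustration, not a Nutch API.
     */
    static String baseType(String header) {
        if (header == null) {
            return null;
        }
        int semi = header.indexOf(';');
        String type = (semi >= 0) ? header.substring(0, semi) : header;
        return type.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // The header value mentioned in the thread.
        System.out.println(baseType("text/html; charset=iso-8859-1"));
    }
}
```

With the parameters stripped, a lookup keyed on "text/html" would match whether or not the server appends a charset.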