Hi - sites such as nytimes.com are hard to crawl. The only way to work around the redirect problem is to identify why the site redirects and then have Nutch send the appropriate HTTP headers so it won't. The trigger may be a missing cookie, or a non-browser user-agent string. AFAIK Nutch has no facility yet to send arbitrary HTTP headers, and certainly not a per-host set of headers.
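[For the user-agent part at least, Nutch's standard agent properties in conf/nutch-site.xml can be set to a browser-like string. A minimal sketch - the property name is standard Nutch configuration, but whether this satisfies nytimes.com is untested, and cookies still cannot be sent this way:

    <!-- conf/nutch-site.xml: make the User-Agent header browser-like.
         Nutch builds the header from the http.agent.* properties;
         this does not help if the site requires a cookie.
         The crawler name here is a placeholder. -->
    <property>
      <name>http.agent.name</name>
      <value>Mozilla/5.0 (compatible; MyCrawler)</value>
    </property>
]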
Markus

-----Original message-----
From: Yann Levreau <[email protected]>
Sent: Monday 16th June 2014 19:18
To: [email protected]
Subject: Re: nutch elpais.com

You're right, I need to clean these config files. I think these plugins came from Nutch 1.7 (bad copy/paste :) )

I have news on my issue. Actually there were two issues:

1) Outlinks are not set in the WebPage. In ParseUtil.java (line 195) we have:

    if (ParseStatusUtils.isSuccess(pstatus)) {
      if (pstatus.getMinorCode() == ParseStatusCodes.SUCCESS_REDIRECT) {
        String newUrl = ParseStatusUtils.getMessage(pstatus);
        int refreshTime = Integer.parseInt(ParseStatusUtils.getArg(pstatus, 1));

When the minor code is ParseStatusCodes.SUCCESS_REDIRECT (100), outlinks are not set into the WebPage even though they are present in the Parse. This is due to line 219 in HtmlParser.java:

    ParseStatus status = new ParseStatus();
    status.setMajorCode(ParseStatusCodes.SUCCESS);
    if (metaTags.getRefresh()) {
      status.setMinorCode(ParseStatusCodes.SUCCESS_REDIRECT);   <-----
      status.addToArgs(new Utf8(metaTags.getRefreshHref().toString()));
      status.addToArgs(new Utf8(Integer.toString(metaTags.getRefreshTime())));
    }

Replacing ParseStatusCodes.SUCCESS_REDIRECT with ParseStatusCodes.SUCCESS corrects the behavior of ParseUtil.java. But maybe I'm wrong to do this: ParseStatusCodes.SUCCESS_REDIRECT is probably there for a good reason.

2) With www.nytimes.com, web pages are redirects. For example, http://www.nytimes.com/2014/06/17/world/middleeast/iraq.html leads to http://www.nytimes.com/glogin?URI=http%3A%2F%2Fwww.nytimes.com%2F2014%2F06%2F17%2Fworld%2Fmiddleeast%2Firaq.html%3F_r%3D0, which leads to http://www.nytimes.com/2014/06/17/world/middleeast/iraq.html?_r=0, and so on.
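[One untested way to break a ?_r=0 loop like the one above is a rule in conf/regex-normalize.xml, the rule file used by the urlnormalizer-regex plugin, that strips the _r parameter so the redirect target normalizes back to the URL the crawl already knows. A sketch:

    <!-- regex-normalize.xml rule (sketch): drop a trailing _r tracking
         parameter so .../iraq.html?_r=0 normalizes to .../iraq.html.
         Only covers _r as the last query parameter, as in the chain above. -->
    <regex>
      <pattern>[?&amp;]_r=\d+$</pattern>
      <substitution></substitution>
    </regex>
]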
We never get the content of this page. But maybe this is by design and there is a better way to crawl this site... I'm sorry to send this to the mailing list, maybe this is the wrong place. This is just in case some of you had the same issue.

Thanks a lot!
Yann

2014-06-16 2:13 GMT-07:00 Julien Nioche <[email protected]>:

Hi Yann,

Not really answering your question, but where did you get this config from? Some of its elements have long been deprecated (query-*, response-*, summary-*).

Julien

On 15 June 2014 10:20, Yann Levreau <[email protected]> wrote:

Hi everyone! I'm sorry to disturb you, but I need some assistance getting the outlinks of http://elpais.com. I use Nutch 2.2.1. The web page is parsed correctly; in debug I see all the outlinks in the Parse object. I use these basic plugins:

    protocol-http|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)

But outlinks are never injected into HBase (with http://elpais.com or http://www.elpais.com). If I try to parse www.nytimes.com, outlinks are injected normally and added to the fetch list. Any idea? Thanks

Yann

==> I have the same issue with http://www.lemonde.fr

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
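[Regarding the deprecated plugins Julien points out, a plugin.includes value for Nutch 2.x without query-*, response-* and summary-* might look like the following. A sketch only - which plugins to keep depends on the crawl, and urlfilter-regex is added here as an assumption, not taken from the thread:

    <!-- nutch-site.xml (sketch): plugin set without the deprecated
         query-*, response-* and summary-* plugins -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>
]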

