Hi - sites such as nytimes.com are hard to crawl. The only way to work around the redirect problem is to identify why the site redirects and then have Nutch send the appropriate HTTP headers so it won't. The trigger may be a missing cookie, or a non-browser user-agent string. AFAIK Nutch has no facility yet to send arbitrary HTTP headers, and certainly not a per-host set of headers.
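[For the user-agent part at least, Nutch's standard agent properties in conf/nutch-site.xml can be set to a browser-like string. A minimal sketch - the property name is standard Nutch configuration, but whether this satisfies nytimes.com is untested, and cookies still cannot be sent this way:

    <!-- conf/nutch-site.xml: make the User-Agent header browser-like.
         Nutch builds the header from the http.agent.* properties;
         this does not help if the site requires a cookie.
         The crawler name here is a placeholder. -->
    <property>
      <name>http.agent.name</name>
      <value>Mozilla/5.0 (compatible; MyCrawler)</value>
    </property>
]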
Markus

-----Original message-----
From: Yann Levreau <[email protected]>
Sent: Monday 16th June 2014 19:18
To: [email protected]
Subject: Re: nutch elpais.com

You're right, I need to clean these config files. I think these plugins came from Nutch 1.7 (bad copy/paste :) )

I have news on my issue. Actually there were two issues:

1) Outlinks are not set in the WebPage. In ParseUtil.java (line 195) we have:

    if (ParseStatusUtils.isSuccess(pstatus)) {
      if (pstatus.getMinorCode() == ParseStatusCodes.SUCCESS_REDIRECT) {
        String newUrl = ParseStatusUtils.getMessage(pstatus);
        int refreshTime = Integer.parseInt(ParseStatusUtils.getArg(pstatus, 1));

When the minor code is ParseStatusCodes.SUCCESS_REDIRECT (100), outlinks are not set into the WebPage even though they are present in the Parse. This is due to line 219 in HtmlParser.java:

    ParseStatus status = new ParseStatus();
    status.setMajorCode(ParseStatusCodes.SUCCESS);
    if (metaTags.getRefresh()) {
      status.setMinorCode(ParseStatusCodes.SUCCESS_REDIRECT);   <-----
      status.addToArgs(new Utf8(metaTags.getRefreshHref().toString()));
      status.addToArgs(new Utf8(Integer.toString(metaTags.getRefreshTime())));
    }

Replacing ParseStatusCodes.SUCCESS_REDIRECT with ParseStatusCodes.SUCCESS corrects the behavior of ParseUtil.java. But maybe I'm wrong to do this: ParseStatusCodes.SUCCESS_REDIRECT is probably there for a good reason.

2) With www.nytimes.com, web pages are redirects. For example, http://www.nytimes.com/2014/06/17/world/middleeast/iraq.html leads to http://www.nytimes.com/glogin?URI=http%3A%2F%2Fwww.nytimes.com%2F2014%2F06%2F17%2Fworld%2Fmiddleeast%2Firaq.html%3F_r%3D0, which leads to http://www.nytimes.com/2014/06/17/world/middleeast/iraq.html?_r=0, and so on.
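[One untested way to break a ?_r=0 loop like the one above is a rule in conf/regex-normalize.xml, the rule file used by the urlnormalizer-regex plugin, that strips the _r parameter so the redirect target normalizes back to the URL the crawl already knows. A sketch:

    <!-- regex-normalize.xml rule (sketch): drop a trailing _r tracking
         parameter so .../iraq.html?_r=0 normalizes to .../iraq.html.
         Only covers _r as the last query parameter, as in the chain above. -->
    <regex>
      <pattern>[?&amp;]_r=\d+$</pattern>
      <substitution></substitution>
    </regex>
]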
We never get the content of this page. But maybe this is by design and there is a better way to crawl this site... I'm sorry to send this to the mailing list, maybe this is the wrong place. This is just in case some of you had the same issue.

Thanks a lot!
Yann

2014-06-16 2:13 GMT-07:00 Julien Nioche <[email protected]>:

Hi Yann,

Not really answering your question, but where did you get this config from? Some of its elements have long been deprecated (query-*, response-*, summary-*).

Julien

On 15 June 2014 10:20, Yann Levreau <[email protected]> wrote:

Hi everyone! I'm sorry to disturb you, but I need some assistance getting the outlinks of http://elpais.com. I use Nutch 2.2.1. The web page is parsed correctly; in debug I see all the outlinks in the Parse object. I use these basic plugins:

    protocol-http|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)

But outlinks are never injected into HBase (with http://elpais.com or http://www.elpais.com). If I try to parse www.nytimes.com, outlinks are injected normally and added to the fetch list. Any idea? Thanks

Yann

==> I have the same issue with http://www.lemonde.fr

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
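[Regarding the deprecated plugins Julien points out, a plugin.includes value for Nutch 2.x without query-*, response-* and summary-* might look like the following. A sketch only - which plugins to keep depends on the crawl, and urlfilter-regex is added here as an assumption, not taken from the thread:

    <!-- nutch-site.xml (sketch): plugin set without the deprecated
         query-*, response-* and summary-* plugins -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>
]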

