Alexandre Demeyer created NUTCH-2075:
----------------------------------------
Summary: Generate will not choose URL marker distance NULL
Key: NUTCH-2075
URL: https://issues.apache.org/jira/browse/NUTCH-2075
Project: Nutch
Issue Type: Bug
Components: generator
Affects Versions: 2.3
Environment: Using HBase as back-end Storage
Reporter: Alexandre Demeyer
Priority: Minor
It appears that there is a bug with certain links where Nutch erases all
markers: not only the inject, generate, fetch, parse, and update markers, but
also the distance marker.
The Nutch Generator does not check the validity of the distance marker (that
is, whether it is null), so the GeneratorMapper keeps these broken links (the
ones without a distance marker).
I think this is related to the problem mentioned here:
[NUTCH-1930|https://issues.apache.org/jira/browse/NUTCH-1930].
That does not solve the underlying problem, which is that all markers are
erased (apparently for no reason).
To keep the crawl from stopping on problematic URLs, I propose a simple
workaround: skip the URL in the Generator when its distance marker is NULL.
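The proposed check could look roughly like the sketch below. It is a
standalone illustration, not the actual GeneratorMapper code: the class name
DistanceMarkerFilter, the method shouldGenerate, the marker key "dist", and
the use of a plain Map for the markers are all assumptions made for the
example. Treating a negative maxDistance as "no limit" and rejecting
non-numeric distance values are also assumptions.

```java
import java.util.Map;

/** Hypothetical helper; stands in for the check inside GeneratorMapper. */
final class DistanceMarkerFilter {

    // Assumed marker key, used here for illustration only.
    static final String DISTANCE_KEY = "dist";

    private DistanceMarkerFilter() {
    }

    /**
     * Decides whether a URL should be kept for generation.
     *
     * @param markers     the page's markers (assumed to be a simple map here)
     * @param maxDistance maximum allowed distance; a negative value is
     *                    assumed to mean "no limit"
     * @return false if the distance marker is missing, malformed, or above
     *         the limit; true otherwise
     */
    static boolean shouldGenerate(Map<String, String> markers, int maxDistance) {
        String distance = markers.get(DISTANCE_KEY);
        if (distance == null) {
            // Proposed workaround: skip URLs whose distance marker was erased.
            return false;
        }
        final int parsed;
        try {
            parsed = Integer.parseInt(distance.trim());
        } catch (NumberFormatException e) {
            // Assumption: a non-numeric distance is treated like a missing one.
            return false;
        }
        // Keep the URL when no limit is set or the distance is within it.
        return maxDistance < 0 || parsed <= maxDistance;
    }
}
```

In a mapper, a false result would translate to returning early without
emitting the URL. This only hides the affected rows from the Generator; it
does not explain why the markers are erased in the first place.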
Example of a link where the problem appears (set http.content.limit higher
than the PDF's Content-Length):
http://www.annales.org/archives/x/marchal2.pdf