[
https://issues.apache.org/jira/browse/NUTCH-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-2075:
-----------------------------------
Fix Version/s: 2.5
> Generate will not choose URL without distance marker
> ----------------------------------------------------
>
> Key: NUTCH-2075
> URL: https://issues.apache.org/jira/browse/NUTCH-2075
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 2.3
> Environment: Using HBase as back-end Storage
> Reporter: Alexandre Demeyer
> Priority: Minor
> Labels: newbie, patch, performance
> Fix For: 2.5
>
>
> It appears that there is a bug with certain links where Nutch erases all
> markers: not only the inject, generate, fetch, parse and update markers but
> also the distance marker.
> The problem is that the Nutch Generator doesn't check the validity of the
> distance marker (i.e. whether it is null) and keeps bad links (those without
> the distance marker) in the GeneratorMapper. When the distance filter is
> activated, GeneratorMapper also chooses URLs without markers, and so it
> doesn't respect the limit.
> I think it is related to the problem mentioned here:
> [NUTCH-1930|https://issues.apache.org/jira/browse/NUTCH-1930].
> This patch doesn't solve the underlying problem, which is that all markers
> are erased (for no apparent reason). But it does allow the crawl to stop.
> To let the crawl stop despite the problematic URLs, I propose this
> solution, which is simply to skip the URL when the distance marker is
> null.
> (Sorry for putting the code here.)
> {code:title=crawl/GeneratorMapper.java (initial code)|borderStyle=solid}
> // filter on distance
> if (maxDistance > -1) {
>   CharSequence distanceUtf8 =
>       page.getMarkers().get(DbUpdaterJob.DISTANCE);
>   if (distanceUtf8 != null) {
>     int distance = Integer.parseInt(distanceUtf8.toString());
>     if (distance > maxDistance) {
>       return;
>     }
>   }
> }
> {code}
> {code:title=crawl/GeneratorMapper.java (patch code)|borderStyle=solid}
> // filter on distance
> if (maxDistance > -1) {
>   CharSequence distanceUtf8 =
>       page.getMarkers().get(DbUpdaterJob.DISTANCE);
>   if (distanceUtf8 != null) {
>     int distance = Integer.parseInt(distanceUtf8.toString());
>     if (distance > maxDistance) {
>       return;
>     }
>   } else {
>     // No distance marker, URL problem
>     return;
>   }
> }
> {code}
> Example of a link where the problem appears (set http.content.limit higher
> than the PDF's Content-Length):
> http://www.annales.org/archives/x/marchal2.pdf
> Hope it can help ...
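
For anyone who wants to try the proposed check outside of a running crawl, here is a minimal standalone sketch of the filter logic from the patch above. It is not the actual GeneratorMapper code: the class name, the shouldSkip helper and the "dist" key are assumptions made for illustration, and a plain Map stands in for page.getMarkers().

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch of the patched distance filter; not the real Nutch class.
class DistanceFilterSketch {

  // Hypothetical stand-in for DbUpdaterJob.DISTANCE (key name is assumed).
  static final String DISTANCE = "dist";

  /**
   * Returns true when the generator should skip the URL.
   * A maxDistance of -1 or lower disables the filter, as in the quoted code.
   */
  static boolean shouldSkip(Map<String, CharSequence> markers, int maxDistance) {
    if (maxDistance <= -1) {
      return false; // distance filter not activated
    }
    CharSequence distanceUtf8 = markers.get(DISTANCE);
    if (distanceUtf8 == null) {
      return true; // no distance marker: skip the URL, as the patch proposes
    }
    int distance = Integer.parseInt(distanceUtf8.toString());
    return distance > maxDistance; // skip URLs beyond the configured limit
  }

  public static void main(String[] args) {
    Map<String, CharSequence> markers = new HashMap<>();
    // Page whose markers were all erased, filter active.
    System.out.println(shouldSkip(markers, 3));
    // Same page once a distance marker within the limit is present.
    markers.put(DISTANCE, "2");
    System.out.println(shouldSkip(markers, 3));
  }
}
```

With the original code, the first case (no marker at all) would fall through and the URL would still be generated; with the else branch it is skipped.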
--
This message was sent by Atlassian Jira
(v8.3.4#803005)