[
https://issues.apache.org/jira/browse/NUTCH-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexandre Demeyer updated NUTCH-2075:
-------------------------------------
Description:
It appears there is a bug with certain links where Nutch erases all markers: not only the inject, generate, fetch, parse, and update markers, but also the distance marker.
The problem is that the Nutch Generator does not check the validity of the distance marker (i.e. whether it is null) and keeps invalid links (those without the distance marker) in the GeneratorMapper. When the distance filter is activated, the GeneratorMapper also selects URLs without markers, so it does not respect the limit.
I think this is related to the problem mentioned in [NUTCH-1930|https://issues.apache.org/jira/browse/NUTCH-1930].
This patch does not solve the underlying problem (all markers being erased, apparently without reason), but it does allow the crawl to stop.
As a way to stop the crawl on problematic URLs, I propose simply skipping a URL when its distance marker is NULL.
(Apologies for pasting the code here.)
{code:title=crawl/GeneratorMapper.java (initial code)|borderStyle=solid}
// filter on distance
if (maxDistance > -1) {
  CharSequence distanceUtf8 = page.getMarkers().get(DbUpdaterJob.DISTANCE);
  if (distanceUtf8 != null) {
    int distance = Integer.parseInt(distanceUtf8.toString());
    if (distance > maxDistance) {
      return;
    }
  }
}
{code}
{code:title=crawl/GeneratorMapper.java (patch code)|borderStyle=solid}
// filter on distance
if (maxDistance > -1) {
  CharSequence distanceUtf8 = page.getMarkers().get(DbUpdaterJob.DISTANCE);
  if (distanceUtf8 != null) {
    int distance = Integer.parseInt(distanceUtf8.toString());
    if (distance > maxDistance) {
      return;
    }
  } else {
    // No distance marker: problematic URL, skip it
    return;
  }
}
{code}
Example of a link where the problem appears (set http.content.limit higher than the PDF's Content-Length):
http://www.annales.org/archives/x/marchal2.pdf
Hope this helps.
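The filter decision above can be sketched as a standalone method. This is a minimal illustration only, not Nutch code: it uses a plain Map of strings in place of Nutch's WebPage markers and a hypothetical DISTANCE key standing in for DbUpdaterJob.DISTANCE, and it returns a boolean instead of returning early from a mapper.

```java
import java.util.HashMap;
import java.util.Map;

public class DistanceFilterSketch {

    // Hypothetical stand-in for the DbUpdaterJob.DISTANCE marker key.
    static final String DISTANCE = "dist";

    // Mirrors the patched logic: true means "generate this URL".
    // Pages whose distance marker is missing are now skipped instead of kept.
    static boolean accept(Map<String, String> markers, int maxDistance) {
        if (maxDistance > -1) {
            String distanceUtf8 = markers.get(DISTANCE);
            if (distanceUtf8 == null) {
                return false; // no distance marker: problematic URL, skip it
            }
            if (Integer.parseInt(distanceUtf8) > maxDistance) {
                return false; // beyond the configured crawl distance
            }
        }
        return true; // filter disabled, or distance within the limit
    }

    public static void main(String[] args) {
        Map<String, String> withMarker = new HashMap<>();
        withMarker.put(DISTANCE, "2");
        Map<String, String> noMarker = new HashMap<>();

        System.out.println(accept(withMarker, 3));  // within limit: accepted
        System.out.println(accept(withMarker, 1));  // exceeds limit: rejected
        System.out.println(accept(noMarker, 3));    // no marker: rejected (the fix)
        System.out.println(accept(noMarker, -1));   // filter disabled: accepted
    }
}
```

With the original code, the third case would have been accepted, which is how marker-less URLs escaped the distance limit.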
> Generate will not choose URL marker distance NULL
> -------------------------------------------------
>
> Key: NUTCH-2075
> URL: https://issues.apache.org/jira/browse/NUTCH-2075
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 2.3
> Environment: Using HBase as back-end Storage
> Reporter: Alexandre Demeyer
> Priority: Minor
> Labels: newbie, patch, performance
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)