[ http://issues.apache.org/jira/browse/NUTCH-182?page=all ]
Matt Kangas updated NUTCH-182:
------------------------------
Attachment: ParseData.java.patch
LinkDb.java.patch
Two patches are attached for nutch/trunk (0.8-dev).
LinkDb.java.patch adds two new LOG.info() statements:
* "Exceeded db.max.anchor.length for URL <url>"
* "Exceeded db.max.inlinks for URL <url>"
ParseData.java.patch adds a private static LOG variable, plus one LOG.info()
statement:
* "Exceeded db.max.outlinks.per.page"
I would have preferred to include the URL in the latter message as well, but
it is not available in the method where the cutoff is performed (afaik).
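The ParseData side is the same idea. Again only a sketch, with an assumed
maxOutlinksPerPage constant and java.util.logging standing in for whatever
logging setup the class ends up with:

    import java.util.logging.Logger;

    public class OutlinkLimitSketch {
      // The patch adds a private static LOG so the class can log at all.
      private static final Logger LOG = Logger.getLogger("OutlinkLimitSketch");

      // Hypothetical stand-in for db.max.outlinks.per.page.
      private static final int maxOutlinksPerPage = 100;

      // Outlink cutoff; the page URL is not in scope here, hence the
      // shorter message.
      static String[] capOutlinks(String[] outlinkUrls) {
        if (outlinkUrls.length > maxOutlinksPerPage) {
          LOG.info("Exceeded db.max.outlinks.per.page");
          String[] capped = new String[maxOutlinksPerPage];
          System.arraycopy(outlinkUrls, 0, capped, 0, maxOutlinksPerPage);
          return capped;
        }
        return outlinkUrls;
      }
    }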
> Log when db.max configuration limits reached
> --------------------------------------------
>
> Key: NUTCH-182
> URL: http://issues.apache.org/jira/browse/NUTCH-182
> Project: Nutch
> Type: Improvement
> Components: fetcher
> Versions: 0.8-dev
> Reporter: Matt Kangas
> Priority: Trivial
> Attachments: LinkDb.java.patch, ParseData.java.patch
>
> Followup to http://www.nabble.com/Re%3A-Can%27t-index-some-pages-p2480833.html
> There are three "db.max" parameters currently in nutch-default.xml:
> * db.max.outlinks.per.page
> * db.max.anchor.length
> * db.max.inlinks
> Having values that are too low can result in a site being under-crawled.
> However, currently there is nothing written to the log when these limits are
> hit, so users have to guess when they need to raise these values.
> I suggest that we add three new log messages at the appropriate points:
> * "Exceeded db.max.outlinks.per.page for URL "
> * "Exceeded db.max.anchor.length for URL "
> * "Exceeded db.max.inlinks for URL "