[ http://issues.apache.org/jira/browse/NUTCH-182?page=all ]

Matt Kangas updated NUTCH-182:
------------------------------

    Attachment: ParseData.java.patch
                LinkDb.java.patch

Two patches are attached for nutch/trunk (0.8-dev).

LinkDb.java.patch adds two new LOG.info() statements:
 * "Exceeded db.max.anchor.length for URL <url>"
 * "Exceeded db.max.inlinks for URL <url>"

ParseData.java.patch adds a private static LOG variable, plus one LOG.info() 
statement:
 * "Exceeded db.max.outlinks.per.page"

I would have preferred to print the URL too on the latter, but it's not 
available in the method where the cutoff is performed (afaik).
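
For readers without the patches at hand, here is a minimal sketch of the 
kind of check described above. It is not the attached patch: the class 
name OutlinkLimitSketch and the method truncateOutlinks are hypothetical, 
and java.util.logging stands in for whatever logger ParseData actually 
uses. Only the log message text comes from this issue.

```java
import java.util.Arrays;
import java.util.logging.Logger;

// Hypothetical stand-in class; not part of Nutch.
class OutlinkLimitSketch {
    private static final Logger LOG =
        Logger.getLogger(OutlinkLimitSketch.class.getName());

    /**
     * Returns at most maxOutlinks entries and logs when the cutoff applies.
     * Assumes maxOutlinks is non-negative. The page URL is not a parameter
     * here, which mirrors the limitation noted above.
     */
    static String[] truncateOutlinks(String[] outlinks, int maxOutlinks) {
        if (outlinks.length <= maxOutlinks) {
            return outlinks; // under the limit: nothing to log
        }
        LOG.info("Exceeded db.max.outlinks.per.page");
        return Arrays.copyOf(outlinks, maxOutlinks);
    }

    public static void main(String[] args) {
        String[] links = {
            "http://a.example/", "http://b.example/", "http://c.example/"
        };
        String[] kept = truncateOutlinks(links, 2);
        System.out.println("kept " + kept.length + " of " + links.length);
    }
}
```

Passing the URL into such a method would let the message match the 
"for URL <url>" form used in LinkDb.java.patch.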

> Log when db.max configuration limits reached
> --------------------------------------------
>
>          Key: NUTCH-182
>          URL: http://issues.apache.org/jira/browse/NUTCH-182
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.8-dev
>     Reporter: Matt Kangas
>     Priority: Trivial
>  Attachments: LinkDb.java.patch, ParseData.java.patch
>
> Followup to http://www.nabble.com/Re%3A-Can%27t-index-some-pages-p2480833.html
> There are three "db.max" parameters currently in nutch-default.xml:
>  * db.max.outlinks.per.page
>  * db.max.anchor.length
>  * db.max.inlinks
> Having values that are too low can result in a site being under-crawled. 
> However, currently there is nothing written to the log when these limits are 
> hit, so users have to guess when they need to raise these values.
> I suggest that we add three new log messages at the appropriate points:
>  * "Exceeded db.max.outlinks.per.page for URL "
>  * "Exceeded db.max.anchor.length for URL "
>  * "Exceeded db.max.inlinks for URL "

