[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters
[ https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846402#comment-17846402 ]

Hudson commented on NUTCH-3043:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #161 (See [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/161/])
NUTCH-3043 Generator: count URLs rejected by URL filters (#814) (github: [https://github.com/apache/nutch/commit/5f1330a03d136440a167a85da6cfe8ac4b3f61b9])
* (edit) src/java/org/apache/nutch/crawl/Generator.java

> Generator: count URLs rejected by URL filters
> -
>
>                 Key: NUTCH-3043
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3043
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator
>    Affects Versions: 1.20
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.21
>
>
> Generator already counts URLs rejected by the (re)fetch scheduler, by fetch
> interval or status. It should also count the number of URLs rejected by URL
> filters.
> See also [Generator
> metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator].

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters
[ https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846357#comment-17846357 ]

ASF GitHub Bot commented on NUTCH-3043:
---

sebastian-nagel commented on PR #814:
URL: https://github.com/apache/nutch/pull/814#issuecomment-2110558876

Thanks, @lewismc! The metrics wiki page was updated.
[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters
[ https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846355#comment-17846355 ]

ASF GitHub Bot commented on NUTCH-3043:
---

sebastian-nagel merged PR #814:
URL: https://github.com/apache/nutch/pull/814
[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters
[ https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841681#comment-17841681 ]

ASF GitHub Bot commented on NUTCH-3043:
---

lewismc commented on PR #814:
URL: https://github.com/apache/nutch/pull/814#issuecomment-2081563229

Excellent @sebastian-nagel
[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters
[ https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841472#comment-17841472 ]

ASF GitHub Bot commented on NUTCH-3043:
---

sebastian-nagel commented on PR #814:
URL: https://github.com/apache/nutch/pull/814#issuecomment-2080634329

Hi @lewismc:
- "use parameterized logging": done
- "augment the [metrics documentation](https://cwiki.apache.org/confluence/display/NUTCH/Metrics) once this is merged.": will do
- "we could also [create a test for the counters](https://cwiki.apache.org/confluence/display/MRUNIT/MRUnit+Tutorial#MRUnitTutorial-TestingCounters).": for now, TestGenerator is not based on MRUnit. The various Generator::generate(...) methods return the number of generated segments without a way to access the counters (they're logged, however). I'd prefer to track this in a separate issue, because reading the counters would require too many code changes.
[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters
[ https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840892#comment-17840892 ]

ASF GitHub Bot commented on NUTCH-3043:
---

lewismc commented on code in PR #814:
URL: https://github.com/apache/nutch/pull/814#discussion_r1579883313

## src/java/org/apache/nutch/crawl/Generator.java: ##

@@ -253,10 +256,7 @@ public void map(Text key, CrawlDatum value, Context context)
       try {
         sort = scfilters.generatorSortValue(key, crawlDatum, sort);
       } catch (ScoringFilterException sfe) {
-        if (LOG.isWarnEnabled()) {
-          LOG.warn(
-              "Couldn't filter generatorSortValue for " + key + ": " + sfe);
-        }
+        LOG.warn("Couldn't filter generatorSortValue for " + key + ": " + sfe);

Review Comment:
Please use parameterized logging.
```
LOG.warn("Couldn't filter generatorSortValue for {}: {}", key, sfe);
```
[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters
[ https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840845#comment-17840845 ]

ASF GitHub Bot commented on NUTCH-3043:
---

sebastian-nagel opened a new pull request, #814:
URL: https://github.com/apache/nutch/pull/814

- add counters URL_FILTERS_REJECTED and URL_FILTER_EXCEPTION
- simplify logging statement
- remove unnecessary cast
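
For readers unfamiliar with the idea, the two counters named in the pull request description can be illustrated with a minimal, self-contained sketch. This is not the code from Generator.java: the `UrlFilter` interface, the `countFiltered` method and the sample URLs are hypothetical stand-ins, and a plain map replaces the Hadoop job counters. The sketch assumes that a filter signals rejection by returning null and signals a failure by throwing; what the real Generator does with a URL after a filter exception is not covered here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UrlFilterCounterSketch {

  /** Hypothetical stand-in for a URL filter chain: returns null to reject a URL. */
  public interface UrlFilter {
    String filter(String url) throws Exception;
  }

  /**
   * Runs every URL through the filter and tallies rejections and filter
   * failures. The counter names are taken from the PR description; the
   * map is an assumption replacing the real job counters.
   */
  public static Map<String, Long> countFiltered(List<String> urls, UrlFilter filter) {
    Map<String, Long> counters = new LinkedHashMap<>();
    counters.put("URL_FILTERS_REJECTED", 0L);
    counters.put("URL_FILTER_EXCEPTION", 0L);
    for (String url : urls) {
      try {
        if (filter.filter(url) == null) {
          // The filter rejected this URL.
          counters.merge("URL_FILTERS_REJECTED", 1L, Long::sum);
        }
      } catch (Exception e) {
        // The filter itself failed on this URL.
        counters.merge("URL_FILTER_EXCEPTION", 1L, Long::sum);
      }
    }
    return counters;
  }

  public static void main(String[] args) {
    // Hypothetical filter: accepts only https URLs, fails on empty input.
    UrlFilter httpsOnly = url -> {
      if (url.isEmpty()) {
        throw new IllegalArgumentException("empty URL");
      }
      return url.startsWith("https://") ? url : null;
    };
    List<String> urls = List.of("https://example.org/", "ftp://example.org/", "");
    // prints {URL_FILTERS_REJECTED=1, URL_FILTER_EXCEPTION=1}
    System.out.println(countFiltered(urls, httpsOnly));
  }
}
```

Keeping rejections and exceptions in separate counters makes it possible to tell URLs that were deliberately filtered out from URLs on which a filter failed.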