[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-05-14 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846402#comment-17846402
 ] 

Hudson commented on NUTCH-3043:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #161 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/161/])
NUTCH-3043 Generator: count URLs rejected by URL filters (#814) (github: 
[https://github.com/apache/nutch/commit/5f1330a03d136440a167a85da6cfe8ac4b3f61b9])
* (edit) src/java/org/apache/nutch/crawl/Generator.java


> Generator: count URLs rejected by URL filters
> -
>
> Key: NUTCH-3043
> URL: https://issues.apache.org/jira/browse/NUTCH-3043
> Project: Nutch
> Issue Type: Improvement
> Components: generator
> Affects Versions: 1.20
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Minor
> Fix For: 1.21
>
>
> Generator already counts URLs rejected by the (re)fetch scheduler, by fetch 
> interval or status. It should also count the number of URLs rejected by URL 
> filters.
> See also [Generator 
> metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-05-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846357#comment-17846357
 ] 

ASF GitHub Bot commented on NUTCH-3043:
---

sebastian-nagel commented on PR #814:
URL: https://github.com/apache/nutch/pull/814#issuecomment-2110558876

   Thanks, @lewismc! The metrics wiki page was updated.






[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-05-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846355#comment-17846355
 ] 

ASF GitHub Bot commented on NUTCH-3043:
---

sebastian-nagel merged PR #814:
URL: https://github.com/apache/nutch/pull/814






[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841681#comment-17841681
 ] 

ASF GitHub Bot commented on NUTCH-3043:
---

lewismc commented on PR #814:
URL: https://github.com/apache/nutch/pull/814#issuecomment-2081563229

   Excellent @sebastian-nagel 






[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-04-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841472#comment-17841472
 ] 

ASF GitHub Bot commented on NUTCH-3043:
---

sebastian-nagel commented on PR #814:
URL: https://github.com/apache/nutch/pull/814#issuecomment-2080634329

   Hi @lewismc:
   - "use parameterized logging": done
   - "augment the [metrics documentation](https://cwiki.apache.org/confluence/display/NUTCH/Metrics) once this is merged.": will do
   - "we could also [create a test for the counters](https://cwiki.apache.org/confluence/display/MRUNIT/MRUnit+Tutorial#MRUnitTutorial-TestingCounters).": for now, TestGenerator is not based on MRUNIT. The various Generator::generate(...) methods return the number of generated segments without a way to access the counters (they're logged, however). I'd prefer to track this in a separate issue, because reading the counters would require too many code changes.






[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-04-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840892#comment-17840892
 ] 

ASF GitHub Bot commented on NUTCH-3043:
---

lewismc commented on code in PR #814:
URL: https://github.com/apache/nutch/pull/814#discussion_r1579883313


##
src/java/org/apache/nutch/crawl/Generator.java:
##
@@ -253,10 +256,7 @@ public void map(Text key, CrawlDatum value, Context context)
       try {
         sort = scfilters.generatorSortValue(key, crawlDatum, sort);
       } catch (ScoringFilterException sfe) {
-        if (LOG.isWarnEnabled()) {
-          LOG.warn(
-              "Couldn't filter generatorSortValue for " + key + ": " + sfe);
-        }
+        LOG.warn("Couldn't filter generatorSortValue for " + key + ": " + sfe);

Review Comment:
   Please use parameterized logging.
   ```
   LOG.warn("Couldn't filter generatorSortValue for {}: {}", key, sfe);
   ```
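
To illustrate why the parameterized form makes the removed `isWarnEnabled()` guard unnecessary, here is a minimal sketch. `MiniLogger` and `CountingKey` are hypothetical stand-ins written for this example, not the SLF4J API and not Nutch code. The sketch assumes a logger that substitutes `{}` placeholders only after checking the level, so arguments are never converted to strings when WARN is disabled.

```java
/** Hypothetical stand-in for an SLF4J-style logger; not the real SLF4J API. */
final class MiniLogger {
  private final boolean warnEnabled;
  private final StringBuilder out = new StringBuilder();

  MiniLogger(boolean warnEnabled) {
    this.warnEnabled = warnEnabled;
  }

  /** Substitutes each "{}" with the next argument, but only if WARN is enabled. */
  void warn(String format, Object... args) {
    if (!warnEnabled) {
      return; // arguments are never converted to strings
    }
    StringBuilder sb = new StringBuilder();
    int from = 0;
    for (Object arg : args) {
      int idx = format.indexOf("{}", from);
      if (idx < 0) {
        break; // more arguments than placeholders: ignore the rest
      }
      sb.append(format, from, idx).append(arg); // append(Object) calls toString()
      from = idx + 2;
    }
    sb.append(format.substring(from));
    out.append(sb).append('\n');
  }

  String output() {
    return out.toString();
  }
}

/** Hypothetical key that counts toString() calls to make the formatting cost visible. */
final class CountingKey {
  int toStringCalls = 0;
  private final String url;

  CountingKey(String url) {
    this.url = url;
  }

  @Override
  public String toString() {
    toStringCalls++;
    return url;
  }
}

final class DeferredLoggingSketch {
  public static void main(String[] args) {
    CountingKey key = new CountingKey("http://example.org/");

    MiniLogger disabled = new MiniLogger(false);
    disabled.warn("Couldn't filter generatorSortValue for {}: {}", key, "some error");
    System.out.println("toString calls with WARN disabled: " + key.toStringCalls);

    MiniLogger enabled = new MiniLogger(true);
    enabled.warn("Couldn't filter generatorSortValue for {}: {}", key, "some error");
    System.out.print(enabled.output());
    System.out.println("toString calls with WARN enabled: " + key.toStringCalls);
  }
}
```

With string concatenation, by contrast, the message is built before the logger is called, which is the cost the original explicit guard was avoiding.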







[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-04-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840845#comment-17840845
 ] 

ASF GitHub Bot commented on NUTCH-3043:
---

sebastian-nagel opened a new pull request, #814:
URL: https://github.com/apache/nutch/pull/814

   - add counters URL_FILTERS_REJECTED and URL_FILTER_EXCEPTION
   - simplify logging statement
   - remove unnecessary cast
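
The counting described in the first bullet can be sketched as follows. This is an illustration under stated assumptions, not the code from the pull request: `UrlFilterSketch`, `FilterCounterSketch`, and `demoFilter` are hypothetical names, a plain `Map` stands in for Hadoop job counters, and a filter is assumed to return `null` for a rejected URL. Only the two counter names come from the description above. Dropping a URL whose filter throws is a choice made for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical filter contract: returns null to reject a URL, may throw on failure. */
interface UrlFilterSketch {
  String filter(String url) throws Exception;
}

final class FilterCounterSketch {
  // Counter names taken from the pull request description.
  static final String REJECTED = "URL_FILTERS_REJECTED";
  static final String EXCEPTION = "URL_FILTER_EXCEPTION";

  /** Plain map standing in for Hadoop job counters (an assumption of this sketch). */
  final Map<String, Long> counters = new LinkedHashMap<>();

  /** Returns true if the URL passes the filter; otherwise increments a counter. */
  boolean accept(UrlFilterSketch filter, String url) {
    try {
      if (filter.filter(url) == null) {
        counters.merge(REJECTED, 1L, Long::sum);
        return false;
      }
      return true;
    } catch (Exception e) {
      // Sketch-only choice: a URL whose filter throws is counted and dropped.
      counters.merge(EXCEPTION, 1L, Long::sum);
      return false;
    }
  }

  /** Hypothetical filter: throws on whitespace, rejects anything not starting with "http". */
  static UrlFilterSketch demoFilter() {
    return url -> {
      if (url.contains(" ")) {
        throw new IllegalArgumentException("malformed URL: " + url);
      }
      return url.startsWith("http") ? url : null;
    };
  }

  public static void main(String[] args) {
    FilterCounterSketch sketch = new FilterCounterSketch();
    UrlFilterSketch filter = demoFilter();
    for (String url : List.of("http://example.org/", "ftp://example.org/", "http://bad url/")) {
      System.out.println(url + " accepted: " + sketch.accept(filter, url));
    }
    System.out.println(sketch.counters);
  }
}
```

Keeping rejections and exceptions in separate counters makes it possible to tell a restrictive filter configuration apart from a failing filter.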



