[jira] [Updated] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-11-13 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1370:
---

Attachment: NUTCH-1370-2.x-v3.patch

Hi Lewis, yes, the 1.x patch is not easily ported to 2.x because of the 
different (old vs. new) MapReduce APIs. Here is an attempt...
One question: the logged line "number of urls attempting to inject" suggests 
that there is a third count, "urls successfully injected" or similar. What is 
the intention behind "attempting"?


> Expose exact number of urls injected @runtime 
> --
>
> Key: NUTCH-1370
> URL: https://issues.apache.org/jira/browse/NUTCH-1370
> Project: Nutch
> Issue Type: Improvement
> Components: injector
> Affects Versions: nutchgora, 1.5
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.6, 2.2
>
> Attachments: NUTCH-1370-1.x.patch, NUTCH-1370-2.x.patch, 
> NUTCH-1370-2.x-v2.patch, NUTCH-1370-2.x-v3.patch
>
>
> Example: When using trunk, currently we see 
> {code}
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: starting at 
> 2012-05-22 09:04:00
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: crawlDb: 
> crawl/crawldb
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: urlDir: urls
> 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Converting injected 
> urls to crawl db entries.
> 2012-05-22 09:04:00,955 INFO  plugin.PluginRepository - Plugins: looking in:
> {code}
> I would like to see
> {code}
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: starting at 
> 2012-05-22 09:04:00
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: crawlDb: 
> crawl/crawldb
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: urlDir: urls
> 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Injected N urls to 
> crawl/crawldb
> 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Converting injected 
> urls to crawl db entries.
> 2012-05-22 09:04:00,955 INFO  plugin.PluginRepository - Plugins: looking in:
> {code}
> This would make debugging easier and would help those who end up getting 
> {code}
> 2012-05-22 09:04:04,850 WARN  crawl.Generator - Generator: 0 records selected 
> for fetching, exiting ...
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-11-13 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1370:


Attachment: NUTCH-1370-2.x-v2.patch

Second WIP patch for 2.x. I'm having difficulty implementing JobClient#runJob 
correctly, because the currentJob parameter is not correct:
{code}
RunningJob mapJob = JobClient.runJob(currentJob);
{code}

@Seb,
Regarding your patch: it looks great and is much cleaner than my proposal. I've 
tested it and I'm +1 for committing.

> Expose exact number of urls injected @runtime 
> --
>
> Key: NUTCH-1370
> URL: https://issues.apache.org/jira/browse/NUTCH-1370
> Project: Nutch
> Issue Type: Improvement
> Components: injector
> Affects Versions: nutchgora, 1.5
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.6, 2.2
>
> Attachments: NUTCH-1370-1.x.patch, NUTCH-1370-2.x.patch, 
> NUTCH-1370-2.x-v2.patch
>
>


[jira] [Updated] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-11-13 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1370:


Patch Info: Patch Available

> Expose exact number of urls injected @runtime 
> --
>
> Key: NUTCH-1370
> URL: https://issues.apache.org/jira/browse/NUTCH-1370
> Project: Nutch
> Issue Type: Improvement
> Components: injector
> Affects Versions: nutchgora, 1.5
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.6, 2.2
>
> Attachments: NUTCH-1370-1.x.patch, NUTCH-1370-2.x.patch, 
> NUTCH-1370-2.x-v2.patch
>
>


[jira] [Updated] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-11-12 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1370:
---

Attachment: NUTCH-1370-1.x.patch

Ferdy is right: custom counters are more transparent.
Patch for 1.x
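
To illustrate the custom-counter idea, here is a minimal sketch in plain Java. 
It is not the code from the attached patch and does not use the Hadoop 
Counters API; the class name, the counter names (URLS_INJECTED, 
URLS_FILTERED) and the accept() filter are hypothetical stand-ins. It only 
shows why one counter per outcome makes the "attempted" vs. "injected" 
numbers unambiguous.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

public class InjectorCounterSketch {

    // Hypothetical counter names; the real patch may name them differently.
    enum InjectCounter { URLS_INJECTED, URLS_FILTERED }

    // Stand-in for a counter group: one running total per named counter.
    static final class Counters {
        private final Map<InjectCounter, Long> values =
                new EnumMap<>(InjectCounter.class);

        void increment(InjectCounter counter) {
            values.merge(counter, 1L, Long::sum);
        }

        long get(InjectCounter counter) {
            return values.getOrDefault(counter, 0L);
        }
    }

    // Hypothetical filter: accept only http(s) URLs, reject everything else.
    static boolean accept(String line) {
        String url = line.trim();
        return url.startsWith("http://") || url.startsWith("https://");
    }

    // Plays the role of the map step: each seed line bumps exactly one counter,
    // so injected + filtered always equals the number of lines read.
    static List<String> inject(List<String> seedLines, Counters counters) {
        List<String> injected = new ArrayList<>();
        for (String line : seedLines) {
            if (accept(line)) {
                injected.add(line.trim());
                counters.increment(InjectCounter.URLS_INJECTED);
            } else {
                counters.increment(InjectCounter.URLS_FILTERED);
            }
        }
        return injected;
    }

    public static void main(String[] args) {
        Counters counters = new Counters();
        List<String> seeds = Arrays.asList(
                "http://nutch.apache.org/",
                "# a comment line",
                "ftp://example.org/",
                "https://example.com/");
        inject(seeds, counters);

        // Message wording follows the request in the issue description;
        // the second message is made up for this sketch.
        System.out.println("Injector: Injected "
                + counters.get(InjectCounter.URLS_INJECTED)
                + " urls to crawl/crawldb");
        System.out.println("Injector: rejected "
                + counters.get(InjectCounter.URLS_FILTERED) + " urls");
    }
}
```

Because every input line increments exactly one of the two counters, the 
number of urls read is simply their sum, and no separate "attempting" count 
is needed.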


> Expose exact number of urls injected @runtime 
> --
>
> Key: NUTCH-1370
> URL: https://issues.apache.org/jira/browse/NUTCH-1370
> Project: Nutch
> Issue Type: Improvement
> Components: injector
> Affects Versions: nutchgora, 1.5
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.6, 2.2
>
> Attachments: NUTCH-1370-1.x.patch, NUTCH-1370-2.x.patch
>
>


[jira] [Updated] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-11-06 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1370:


Attachment: NUTCH-1370-2.x.patch

WIP patch for 2.x. I am convinced that I'm not using the Counters, Counter, or 
Job API correctly here. I've spent a bit of time working my way through the 
various classes and methods, but I am not getting accurate values for the map 
input and output counters. If someone could take a look and correct me here, 
it would make my day. I will cook up the 1.x patch once I learn the right way.

> Expose exact number of urls injected @runtime 
> --
>
> Key: NUTCH-1370
> URL: https://issues.apache.org/jira/browse/NUTCH-1370
> Project: Nutch
> Issue Type: Improvement
> Components: injector
> Affects Versions: nutchgora, 1.5
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.6, 2.2
>
> Attachments: NUTCH-1370-2.x.patch
>
>


[jira] [Updated] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1370:


Fix Version/s: 2.2 (was: 2.1)

> Expose exact number of urls injected @runtime 
> --
>
> Key: NUTCH-1370
> URL: https://issues.apache.org/jira/browse/NUTCH-1370
> Project: Nutch
> Issue Type: Improvement
> Components: injector
> Affects Versions: nutchgora, 1.5
> Reporter: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.6, 2.2
>
>


[jira] [Updated] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-06-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1370:
-

Affects Version/s: 1.5 (was: 1.4)
Fix Version/s: 1.6 (was: 1.5)

> Expose exact number of urls injected @runtime 
> --
>
> Key: NUTCH-1370
> URL: https://issues.apache.org/jira/browse/NUTCH-1370
> Project: Nutch
> Issue Type: Improvement
> Components: injector
> Affects Versions: nutchgora, 1.5
> Reporter: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.6, 2.1
>
>


[jira] [Updated] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-05-22 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1370:
-

Priority: Minor (was: Major)

Running in pseudo-distributed mode gives you more information if you look at 
the Hadoop web interface: you get the number of items passed to the mappers and 
reducers, etc. You can of course add a message like this to the logs; it won't 
do any harm :-)

> Expose exact number of urls injected @runtime 
> --
>
> Key: NUTCH-1370
> URL: https://issues.apache.org/jira/browse/NUTCH-1370
> Project: Nutch
> Issue Type: Improvement
> Components: injector
> Affects Versions: 1.4, nutchgora
> Reporter: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.5, 2.1
>
>