[jira] [Commented] (NUTCH-2614) NPE in CrawlDbReader

2018-07-04 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532446#comment-16532446
 ] 

Markus Jelsma commented on NUTCH-2614:
--

Really? In that case my patch for NUTCH-2612 is probably wrong!

> NPE in CrawlDbReader
> --------------------
>
>                 Key: NUTCH-2614
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2614
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.14, 1.15
>            Reporter: Markus Jelsma
>            Priority: Major
>             Fix For: 1.15
>
>
> Got this in master:
> {code}
> Exception in thread "main" java.lang.NullPointerException
>     at org.apache.nutch.crawl.CrawlDbReader.processStatJob(CrawlDbReader.java:555)
>     at org.apache.nutch.crawl.CrawlDbReader.run(CrawlDbReader.java:914)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>     at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:980)
> {code}
> Not sure why it happens or which commit caused the problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (NUTCH-2614) NPE in CrawlDbReader

2018-07-04 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532446#comment-16532446
 ] 

Markus Jelsma edited comment on NUTCH-2614 at 7/4/18 9:26 AM:
--

-Really? In that case my patch for NUTCH-2612 is probably wrong!-

Never mind, that was due to strict parsing!


was (Author: markus17):
Really? In that case my patch for NUTCH-2612 is probably wrong!



[jira] [Commented] (NUTCH-2612) Support for sitemap processing by hostname

2018-07-04 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532688#comment-16532688
 ] 

Sebastian Nagel commented on NUTCH-2612:


Hi [~markus17], shouldn't it be {{key.toString().startsWith("http://")}} 
instead of {{"http://".equals(key.toString())}}?
{code}
 else if (value instanceof Text) {
  // Input can be sitemap URL or hostname
  if ("http://".equals(key.toString()) || ...
{code}
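
The difference between the two checks can be sketched as follows. This is a minimal, self-contained illustration, not the code from the patch: the class name, the helper isUrl, the example inputs, and the "https://" branch are all assumptions made for the example.

```java
public class SitemapInputCheck {

  // Hypothetical helper: true when the key looks like a full URL
  // rather than a bare hostname. The https branch is an assumption.
  public static boolean isUrl(String key) {
    return key.startsWith("http://") || key.startsWith("https://");
  }

  public static void main(String[] args) {
    String url = "http://example.org/sitemap.xml";
    String host = "example.org";

    // equals() matches only the literal string "http://",
    // so it is false for any real URL.
    System.out.println("http://".equals(url)); // false

    // startsWith() tests the scheme prefix, which is what is intended.
    System.out.println(isUrl(url));  // true
    System.out.println(isUrl(host)); // false
  }
}
```

With equals(), every real sitemap URL would fail the check and fall through to whatever branch handles hostnames.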

> Support for sitemap processing by hostname
> ------------------------------------------
>
>                 Key: NUTCH-2612
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2612
>             Project: Nutch
>          Issue Type: Improvement
>          Components: sitemap
>    Affects Versions: 1.14
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Major
>             Fix For: 1.16
>
>         Attachments: NUTCH-2612.patch
>
>
> Add support to the sitemap processor for processing just hostnames. Similar to 
> the mapper consuming sitemap URLs, but with BaseRobotRules finding the 
> sitemap URLs itself.
> Will upload patch soon.





[jira] [Commented] (NUTCH-2612) Support for sitemap processing by hostname

2018-07-04 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532691#comment-16532691
 ] 

Markus Jelsma commented on NUTCH-2612:
--

Yes of course! Will upload new patch!



[jira] [Commented] (NUTCH-2614) NPE in CrawlDbReader

2018-07-04 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532693#comment-16532693
 ] 

Sebastian Nagel commented on NUTCH-2614:


Just to confirm: your CrawlDb was also empty?



[jira] [Commented] (NUTCH-2614) NPE in CrawlDbReader

2018-07-04 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532699#comment-16532699
 ] 

Markus Jelsma commented on NUTCH-2614:
--

Yes!
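
A minimal sketch of how an empty CrawlDb could lead to such an NPE, and how a guard would avoid it. This rests on an assumption the thread does not confirm: that the statistics code dereferences an entry that is simply absent when no records were read. The class, the method, and the "T" key are hypothetical stand-ins, not Nutch's actual code.

```java
import java.util.HashMap;
import java.util.Map;

public class EmptyStatsGuard {

  // Hypothetical stand-in for a stats lookup. With an empty input no
  // entry exists, so get() returns null and unboxing it would throw a
  // NullPointerException. The null check returns 0 instead.
  public static long totalCount(Map<String, Long> stats) {
    Long total = stats.get("T"); // "T" as the total key is an assumption
    return total == null ? 0L : total;
  }

  public static void main(String[] args) {
    Map<String, Long> emptyStats = new HashMap<>();
    System.out.println(totalCount(emptyStats)); // prints 0 rather than throwing

    Map<String, Long> stats = new HashMap<>();
    stats.put("T", 42L);
    System.out.println(totalCount(stats)); // prints 42
  }
}
```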



[jira] [Updated] (NUTCH-2612) Support for sitemap processing by hostname

2018-07-04 Thread Markus Jelsma (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2612:
-
Attachment: NUTCH-2612.patch

> Support for sitemap processing by hostname
> ------------------------------------------
>
>                 Key: NUTCH-2612
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2612
>             Project: Nutch
>          Issue Type: Improvement
>          Components: sitemap
>    Affects Versions: 1.14
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Major
>             Fix For: 1.16
>
>         Attachments: NUTCH-2612.patch, NUTCH-2612.patch
>
>
> Add support to the sitemap processor for processing just hostnames. Similar to 
> the mapper consuming sitemap URLs, but with BaseRobotRules finding the 
> sitemap URLs itself.
> Will upload patch soon.





[jira] [Commented] (NUTCH-2612) Support for sitemap processing by hostname

2018-07-04 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532712#comment-16532712
 ] 

Markus Jelsma commented on NUTCH-2612:
--

New patch! 



[jira] [Created] (NUTCH-2615) Publisher for Telegram

2018-07-04 Thread JIRA
Roannel Fernández Hernández created NUTCH-2615:
--

             Summary: Publisher for Telegram
                 Key: NUTCH-2615
                 URL: https://issues.apache.org/jira/browse/NUTCH-2615
             Project: Nutch
          Issue Type: New Feature
          Components: publisher
    Affects Versions: 1.15
            Reporter: Roannel Fernández Hernández
             Fix For: 1.16


Publisher plugin for [Telegram|https://telegram.org/]





[jira] [Commented] (NUTCH-1541) Indexer plugin to write CSV

2018-07-04 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16533015#comment-16533015
 ] 

ASF GitHub Bot commented on NUTCH-1541:
---

r0ann3l commented on issue #294: NUTCH-1541 Indexer plugin to write CSV
URL: https://github.com/apache/nutch/pull/294#issuecomment-402545528
 
 
   Hi @sebastian-nagel, good job!!! After merging master there are some 
conflicts associated with changes made by 
[NUTCH-1480](https://issues.apache.org/jira/browse/NUTCH-1480). I've tried to 
fix these issues 
[here](https://github.com/r0ann3l/nutch/commits/NUTCH-1541). You can pick 
them from there.
   
   In addition, the options of this plugin should be migrated from the file 
nutch-site.xml to index-writers.xml as proposed by 
[NUTCH-1480](https://issues.apache.org/jira/browse/NUTCH-1480) and perhaps the 
prefix "indexer.csv." could be removed, because the index-writers.xml file 
avoids ambiguity across the index writers.
   
   As a recommendation, I believe the static attributes you're using should be 
moved to an interface, as we have been doing in other indexers. I can 
contribute if you agree.
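
   A rough sketch of what moving the static attributes to an interface could look like. All names below (`CsvConstants`, `SEPARATOR`, `FIELDS`, `CsvWriterSketch`) are hypothetical and chosen only for illustration; they are not taken from the pull request or from the other indexers.

```java
public class ConstantsInterfaceSketch {

  // Hypothetical constants interface: fields declared in an interface
  // are implicitly public, static and final.
  public interface CsvConstants {
    String SEPARATOR = "separator";
    String FIELDS = "fields";
  }

  // Hypothetical writer that picks up the parameter names by
  // implementing the interface instead of declaring its own statics.
  public static class CsvWriterSketch implements CsvConstants {
    public String describe() {
      return "params: " + SEPARATOR + ", " + FIELDS;
    }
  }

  public static void main(String[] args) {
    System.out.println(new CsvWriterSketch().describe());
  }
}
```

   The parameter names here carry no "indexer.csv." prefix, in line with the suggestion above that a per-writer configuration file already scopes them.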


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Indexer plugin to write CSV
> ---------------------------
>
>                 Key: NUTCH-1541
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1541
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 1.7
>            Reporter: Sebastian Nagel
>            Priority: Minor
>         Attachments: NUTCH-1541-v1.patch, NUTCH-1541-v2.patch
>
>
> With the new pluggable indexer a simple plugin would be handy to write 
> configurable fields into a CSV file - for further analysis or just for export.


