[jira] [Commented] (NUTCH-2614) NPE in CrawlDbReader
[ https://issues.apache.org/jira/browse/NUTCH-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532446#comment-16532446 ]

Markus Jelsma commented on NUTCH-2614:
--------------------------------------

Really? In that case my patch for NUTCH-2612 is probably wrong!

> NPE in CrawlDbReader
> --------------------
>
> Key: NUTCH-2614
> URL: https://issues.apache.org/jira/browse/NUTCH-2614
> Project: Nutch
> Issue Type: Bug
> Components: crawldb
> Affects Versions: 1.14, 1.15
> Reporter: Markus Jelsma
> Priority: Major
> Fix For: 1.15
>
>
> Got this in master:
> {code}
> Exception in thread "main" java.lang.NullPointerException
>   at org.apache.nutch.crawl.CrawlDbReader.processStatJob(CrawlDbReader.java:555)
>   at org.apache.nutch.crawl.CrawlDbReader.run(CrawlDbReader.java:914)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:980)
> {code}
> Not sure why it happens or which commit caused the problem.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (NUTCH-2614) NPE in CrawlDbReader
[ https://issues.apache.org/jira/browse/NUTCH-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532446#comment-16532446 ]

Markus Jelsma edited comment on NUTCH-2614 at 7/4/18 9:26 AM:
--------------------------------------------------------------

-Really? In that case my patch for NUTCH-2612 is probably wrong!-

Never mind, that was due to strict parsing!

was (Author: markus17):
Really? In that case my patch for NUTCH-2612 is probably wrong!
[jira] [Commented] (NUTCH-2612) Support for sitemap processing by hostname
[ https://issues.apache.org/jira/browse/NUTCH-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532688#comment-16532688 ]

Sebastian Nagel commented on NUTCH-2612:
----------------------------------------

Hi [~markus17],

shouldn't it be {{key.toString().startsWith("http://")}} instead of {{"http://".equals(key.toString())}}?

{code}
else if (value instanceof Text) {
  // Input can be sitemap URL or hostname
  if ("http://".equals(key.toString()) || ...
{code}

> Support for sitemap processing by hostname
> ------------------------------------------
>
> Key: NUTCH-2612
> URL: https://issues.apache.org/jira/browse/NUTCH-2612
> Project: Nutch
> Issue Type: Improvement
> Components: sitemap
> Affects Versions: 1.14
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Major
> Fix For: 1.16
>
> Attachments: NUTCH-2612.patch
>
>
> Add support to sitemap processor for processing just hostnames. Similar to
> the mapper eating sitemap URL's, but then with BaseRobotRules finding the
> sitemap URL's itself.
> Will upload patch soon.
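To illustrate the point raised in the comment above, here is a minimal, self-contained sketch of why `equals` cannot distinguish a sitemap URL from a hostname while `startsWith` can. The class and method names are hypothetical and not taken from the patch, and the `https://` branch is an assumption, since the quoted snippet elides the rest of the condition.

```java
// Hypothetical illustration only; not code from NUTCH-2612.patch.
class SitemapInputCheck {

  // Prefix test: true for any input that begins with a URL scheme.
  // The https:// branch is an assumption (the quoted snippet elides it).
  static boolean looksLikeUrl(String input) {
    return input.startsWith("http://") || input.startsWith("https://");
  }

  // Equality test as in the quoted snippet: true only when the whole
  // input is exactly the scheme string, which no real URL ever is.
  static boolean equalsScheme(String input) {
    return "http://".equals(input) || "https://".equals(input);
  }

  public static void main(String[] args) {
    String[] inputs = { "http://example.com/sitemap.xml", "example.com" };
    for (String input : inputs) {
      System.out.println(input + " startsWith=" + looksLikeUrl(input)
          + " equals=" + equalsScheme(input));
    }
  }
}
```

With the equality test, a full sitemap URL is never recognized as a URL, so every input would presumably be handled as a hostname; the prefix test separates the two cases.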
[jira] [Commented] (NUTCH-2612) Support for sitemap processing by hostname
[ https://issues.apache.org/jira/browse/NUTCH-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532691#comment-16532691 ]

Markus Jelsma commented on NUTCH-2612:
--------------------------------------

Yes of course! Will upload new patch!
[jira] [Commented] (NUTCH-2614) NPE in CrawlDbReader
[ https://issues.apache.org/jira/browse/NUTCH-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532693#comment-16532693 ]

Sebastian Nagel commented on NUTCH-2614:
----------------------------------------

Just to confirm: your CrawlDb was also empty?
[jira] [Commented] (NUTCH-2614) NPE in CrawlDbReader
[ https://issues.apache.org/jira/browse/NUTCH-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532699#comment-16532699 ]

Markus Jelsma commented on NUTCH-2614:
--------------------------------------

Yes!
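The thread does not say what is null at CrawlDbReader.java:555, so the following is only a generic sketch of how an empty input can surface as a NullPointerException, not a description of the Nutch code. It assumes a hypothetical statistics map in which a total is stored under a key; with no records processed the key is absent, and auto-unboxing the missing value fails.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the CrawlDbReader implementation.
class EmptyStatsSketch {

  // Hypothetical key under which a total count would be stored.
  static final String TOTAL_KEY = "T";

  // Unguarded: auto-unboxing a missing (null) entry throws
  // NullPointerException.
  static long totalUnsafe(Map<String, Long> stats) {
    return stats.get(TOTAL_KEY);
  }

  // Guarded: fall back to 0 when no entry was collected.
  static long totalOrZero(Map<String, Long> stats) {
    Long total = stats.get(TOTAL_KEY);
    return total == null ? 0L : total;
  }

  public static void main(String[] args) {
    Map<String, Long> emptyStats = new HashMap<>();
    System.out.println("guarded total = " + totalOrZero(emptyStats));
    try {
      totalUnsafe(emptyStats);
    } catch (NullPointerException e) {
      System.out.println("unguarded lookup threw NullPointerException");
    }
  }
}
```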
[jira] [Updated] (NUTCH-2612) Support for sitemap processing by hostname
[ https://issues.apache.org/jira/browse/NUTCH-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2612:
---------------------------------
    Attachment: NUTCH-2612.patch

> Support for sitemap processing by hostname
> ------------------------------------------
>
> Key: NUTCH-2612
> URL: https://issues.apache.org/jira/browse/NUTCH-2612
> Project: Nutch
> Issue Type: Improvement
> Components: sitemap
> Affects Versions: 1.14
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Major
> Fix For: 1.16
>
> Attachments: NUTCH-2612.patch, NUTCH-2612.patch
>
>
> Add support to sitemap processor for processing just hostnames. Similar to
> the mapper eating sitemap URL's, but then with BaseRobotRules finding the
> sitemap URL's itself.
> Will upload patch soon.
[jira] [Commented] (NUTCH-2612) Support for sitemap processing by hostname
[ https://issues.apache.org/jira/browse/NUTCH-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532712#comment-16532712 ]

Markus Jelsma commented on NUTCH-2612:
--------------------------------------

New patch!
[jira] [Created] (NUTCH-2615) Publisher for Telegram
Roannel Fernández Hernández created NUTCH-2615:
--------------------------------------------------

Summary: Publisher for Telegram
Key: NUTCH-2615
URL: https://issues.apache.org/jira/browse/NUTCH-2615
Project: Nutch
Issue Type: New Feature
Components: publisher
Affects Versions: 1.15
Reporter: Roannel Fernández Hernández
Fix For: 1.16

Publisher plugin for [Telegram|https://telegram.org/]
[jira] [Commented] (NUTCH-1541) Indexer plugin to write CSV
[ https://issues.apache.org/jira/browse/NUTCH-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16533015#comment-16533015 ]

ASF GitHub Bot commented on NUTCH-1541:
---------------------------------------

r0ann3l commented on issue #294: NUTCH-1541 Indexer plugin to write CSV
URL: https://github.com/apache/nutch/pull/294#issuecomment-402545528

Hi @sebastian-nagel, good job!!!

After merging master there are some conflicts associated with changes made by [NUTCH-1480](https://issues.apache.org/jira/browse/NUTCH-1480). I've tried to fix these issues [here](https://github.com/r0ann3l/nutch/commits/NUTCH-1541). You can pick them from there.

In addition, the options of this plugin should be migrated from the file nutch-site.xml to index-writers.xml as proposed by [NUTCH-1480](https://issues.apache.org/jira/browse/NUTCH-1480), and perhaps the prefix "indexer.csv." could be removed, because the index-writers.xml file avoids ambiguity across the index writers.

As a recommendation, I believe the static attributes you're using should be moved to an interface, as we have been doing in other indexers.

I can contribute if you agree with me.

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Indexer plugin to write CSV
> ---------------------------
>
> Key: NUTCH-1541
> URL: https://issues.apache.org/jira/browse/NUTCH-1541
> Project: Nutch
> Issue Type: New Feature
> Components: indexer
> Affects Versions: 1.7
> Reporter: Sebastian Nagel
> Priority: Minor
> Attachments: NUTCH-1541-v1.patch, NUTCH-1541-v2.patch
>
>
> With the new pluggable indexer a simple plugin would be handy to write
> configurable fields into a CSV file - for further analysis or just for export.
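The recommendation above about moving static attributes to an interface could look roughly like the sketch below. All names here (`CsvConstants`, `CsvWriterSketch`, the `separator` key and its default) are hypothetical and chosen for illustration; they are not the plugin's actual option names. The key is written without the "indexer.csv." prefix on the assumption that per-writer parameters no longer need one.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical constants interface; names and defaults are illustrative only.
interface CsvConstants {
  // Parameter key without the "indexer.csv." prefix (assumption).
  String SEPARATOR = "separator";
  String DEFAULT_SEPARATOR = ",";
}

// Hypothetical consumer that reads its options from a per-writer
// parameter map instead of a global configuration.
class CsvWriterSketch {

  static String separator(Map<String, String> params) {
    return params.getOrDefault(CsvConstants.SEPARATOR,
        CsvConstants.DEFAULT_SEPARATOR);
  }

  public static void main(String[] args) {
    Map<String, String> params = new HashMap<>();
    System.out.println("default separator: " + separator(params));
    params.put(CsvConstants.SEPARATOR, ";");
    System.out.println("configured separator: " + separator(params));
  }
}
```

Keeping the keys in one interface means the writer and any documentation or tests refer to the same constants rather than to repeated string literals.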