[jira] [Commented] (NUTCH-2375) Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
[ https://issues.apache.org/jira/browse/NUTCH-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186077#comment-16186077 ]

ASF GitHub Bot commented on NUTCH-2375:
---

Omkar20895 commented on issue #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
URL: https://github.com/apache/nutch/pull/221#issuecomment-333180101

@lewismc @sebastian-nagel you are right, I will start testing it in pseudo-distributed mode first. Thanks.

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
> ---
>
> Key: NUTCH-2375
> URL: https://issues.apache.org/jira/browse/NUTCH-2375
> Project: Nutch
> Issue Type: Improvement
> Components: deployment
> Reporter: Omkar Reddy
>
> Nutch is still using the deprecated org.apache.hadoop.mapred dependency.
> It needs to be updated to the org.apache.hadoop.mapreduce dependency.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2435) New configuration allowing to choose whether to store 'parse_text' directory or not.
[ https://issues.apache.org/jira/browse/NUTCH-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185800#comment-16185800 ]

ASF GitHub Bot commented on NUTCH-2435:
---

sebastian-nagel commented on issue #225: NUTCH-2435 - New parameter "parser.store.text"
URL: https://github.com/apache/nutch/pull/225#issuecomment-333122656

+1

> New configuration allowing to choose whether to store 'parse_text' directory or not.
> ---
>
> Key: NUTCH-2435
> URL: https://issues.apache.org/jira/browse/NUTCH-2435
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.13
> Environment: Apache Nutch 1.13
> Reporter: Marcos Bori
>
> Whenever a page is parsed, one of the outputs is the directory 'parse_text'.
> It is intended to be used at the indexing phase so the page can be searched
> from a search engine such as Solr.
> In my special crawling case, I don't need to index the page contents.
> Therefore, creating and filling the 'parse_text' directory is not required for me.
> To optimize performance, I don't want the crawler to store this information
> to the filesystem.
> I propose a new parameter "parser.store.text" that controls whether the
> 'parse_text' directory is stored. Its default value, of course, is "true".
[jira] [Commented] (NUTCH-2433) Html Parser: keep htmltag where the outlinks are found
[ https://issues.apache.org/jira/browse/NUTCH-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185784#comment-16185784 ]

Hudson commented on NUTCH-2433:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3457 (See [https://builds.apache.org/job/Nutch-trunk/3457/])
NUTCH-2433 / Html Parser: keep htmltag where the outlinks are found (marcos: [https://github.com/apache/nutch/commit/7db11734f25a53cda15634071a47ff524a06002e])
* (edit) conf/nutch-default.xml
* (edit) src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java

> Html Parser: keep htmltag where the outlinks are found
> ---
>
> Key: NUTCH-2433
> URL: https://issues.apache.org/jira/browse/NUTCH-2433
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.13
> Environment: Apache Nutch release 1.13.
> Reporter: Marcos Bori
> Labels: html, outlink
> Fix For: 1.14
>
> When parsing HTML pages, I need to know in which HTML tag the outlinks were
> found (for example, 'a', 'script', 'img', etc).
> I propose to add a new configuration value,
> "parser.html.outlinks.htmlnode_metadata_name".
> If this configuration property is not empty, every outlink found will be
> assigned a metadata entry whose name is the value of this property and whose
> value is the name of the HTML tag where the outlink was found.
> I will now send the pull request with my code implementation.
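For readers following NUTCH-2433, the behaviour described in the issue can be sketched roughly as below. This is not the code committed to DOMContentUtils; it is a minimal illustration using plain Java collections. The property name comes from the issue text, while the Outlink stand-in class, the Map-based configuration, the makeOutlink helper, and the metadata name "htmltag" are assumptions made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

class OutlinkTagSketch {
    // Property name as given in the issue description.
    static final String PROP = "parser.html.outlinks.htmlnode_metadata_name";

    /** Hypothetical stand-in for an outlink; not Nutch's own Outlink class. */
    static final class Outlink {
        final String url;
        final Map<String, String> metadata = new HashMap<>();

        Outlink(String url) {
            this.url = url;
        }
    }

    /**
     * Builds an outlink and, only if the property is set to a non-empty name,
     * records the HTML tag in which the link was found under that name.
     */
    static Outlink makeOutlink(Map<String, String> conf, String tagName, String url) {
        Outlink link = new Outlink(url);
        String metadataName = conf.getOrDefault(PROP, "");
        if (!metadataName.isEmpty()) {
            link.metadata.put(metadataName, tagName);
        }
        return link;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(PROP, "htmltag"); // "htmltag" is an arbitrary example value
        Outlink link = makeOutlink(conf, "img", "http://example.com/logo.png");
        System.out.println(link.url + " " + link.metadata);
    }
}
```

With the property left unset or empty, makeOutlink attaches no metadata, which corresponds to the "if this configuration property is not empty" condition in the proposal.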
[jira] [Commented] (NUTCH-2435) New configuration allowing to choose whether to store 'parse_text' directory or not.
[ https://issues.apache.org/jira/browse/NUTCH-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185755#comment-16185755 ]

ASF GitHub Bot commented on NUTCH-2435:
---

maborec commented on a change in pull request #225: NUTCH-2435 - New parameter "parser.store.text"
URL: https://github.com/apache/nutch/pull/225#discussion_r141856440

## File path: src/java/org/apache/nutch/parse/ParseOutputFormat.java ##

@@ -128,13 +131,18 @@ public void checkOutputSpecs(FileSystem fs, JobConf job) throws IOException {
         .split(" *, *");

     // textOut Options
-    Option tKeyClassOpt = (Option) MapFile.Writer.keyClass(Text.class);
-    org.apache.hadoop.io.SequenceFile.Writer.Option tValClassOpt = SequenceFile.Writer.valueClass(ParseText.class);
-    org.apache.hadoop.io.SequenceFile.Writer.Option tProgressOpt = SequenceFile.Writer.progressable(progress);
-    org.apache.hadoop.io.SequenceFile.Writer.Option tCompOpt = SequenceFile.Writer.compression(CompressionType.RECORD);
+    final MapFile.Writer textOut;
+    if (storeText) {
+      Option tKeyClassOpt = (Option) MapFile.Writer.keyClass(Text.class);
+      org.apache.hadoop.io.SequenceFile.Writer.Option tValClassOpt = SequenceFile.Writer.valueClass(ParseText.class);
+      org.apache.hadoop.io.SequenceFile.Writer.Option tProgressOpt = SequenceFile.Writer.progressable(progress);
+      org.apache.hadoop.io.SequenceFile.Writer.Option tCompOpt = SequenceFile.Writer.compression(CompressionType.RECORD);

-    final MapFile.Writer textOut = new MapFile.Writer(job, text,
+      textOut = new MapFile.Writer(job, text,

Review comment:
Format applied
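The pattern in the diff above (open the 'parse_text' writer only when the flag is set, and skip writes otherwise) can be illustrated without Hadoop on the classpath. The sketch below is an assumption-laden simplification, not the ParseOutputFormat code: TextSink stands in for MapFile.Writer, the configuration is a plain Map, and the helper names storeText, openTextSink and write are invented for the example. Only the property name "parser.store.text" and its default of "true" come from the issue.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StoreTextSketch {
    /** Hypothetical stand-in for the writer that fills 'parse_text'. */
    static final class TextSink {
        final List<String> records = new ArrayList<>();

        void append(String url, String text) {
            records.add(url + "\t" + text);
        }
    }

    /** Reads "parser.store.text"; per the issue, the default is "true". */
    static boolean storeText(Map<String, String> conf) {
        return Boolean.parseBoolean(conf.getOrDefault("parser.store.text", "true"));
    }

    /** Opens the sink only when text storage is enabled, otherwise returns null. */
    static TextSink openTextSink(Map<String, String> conf) {
        return storeText(conf) ? new TextSink() : null;
    }

    /** Writes the parsed text only if a sink was opened. */
    static void write(TextSink textOut, String url, String text) {
        if (textOut != null) {
            textOut.append(url, text);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("parser.store.text", "false");
        TextSink textOut = openTextSink(conf);
        write(textOut, "http://example.com/", "page text"); // skipped: no sink
        System.out.println("sink opened: " + (textOut != null));
    }
}
```

Guarding every write with a null check mirrors the idea that later code must also tolerate a missing writer once its creation becomes conditional.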
[jira] [Resolved] (NUTCH-2433) Html Parser: keep htmltag where the outlinks are found
[ https://issues.apache.org/jira/browse/NUTCH-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-2433.
---
Resolution: Fixed
Fix Version/s: 1.14

Thanks, committed to 1.x, [777e759|https://github.com/apache/nutch/commit/777e759ada24eac84072a5f1722938442432eadc].
[jira] [Commented] (NUTCH-2268) SolrIndexerJob: java.lang.RuntimeException
[ https://issues.apache.org/jira/browse/NUTCH-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185580#comment-16185580 ]

Ronan commented on NUTCH-2268:
---

I have the exact same error. Does anyone know how to solve it?

> SolrIndexerJob: java.lang.RuntimeException
> ---
>
> Key: NUTCH-2268
> URL: https://issues.apache.org/jira/browse/NUTCH-2268
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 2.3.1
> Environment: I am using
> HBase: hbase-0.98.19-hadoop2
> Solr: 6.0.0
> Nutch: 2.3.1
> Java: 8
> Reporter: narendra
> Labels: indexing
> Original Estimate: 12h
> Remaining Estimate: 12h
>
> Could you please help me with this error:
> SolrIndexerJob: java.lang.RuntimeException: job failed:name=apache-nutch-2.3.1.jar
> It occurs when I run this command:
> local/bin/nutch solrindex http://localhost:8983/solr/ -all
> I also tried with Solr 4.10.3 but I get the same error.