[jira] [Commented] (NUTCH-2375) Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce

2017-09-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186077#comment-16186077
 ] 

ASF GitHub Bot commented on NUTCH-2375:
---

Omkar20895 commented on issue #221: NUTCH-2375 Upgrading nutch to use 
org.apache.hadoop.mapreduce
URL: https://github.com/apache/nutch/pull/221#issuecomment-333180101
 
 
   @lewismc @sebastian-nagel you are right, I will start testing it in 
pseudo-distributed mode first. Thanks. 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Upgrade the code base from org.apache.hadoop.mapred to 
> org.apache.hadoop.mapreduce
> --
>
>                 Key: NUTCH-2375
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2375
>             Project: Nutch
>          Issue Type: Improvement
>          Components: deployment
>            Reporter: Omkar Reddy
>
> Nutch still uses the org.apache.hadoop.mapred dependency, which has been 
> deprecated. It needs to be updated to the org.apache.hadoop.mapreduce 
> dependency.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2435) New configuration allowing to choose whether to store 'parse_text' directory or not.

2017-09-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185800#comment-16185800
 ] 

ASF GitHub Bot commented on NUTCH-2435:
---

sebastian-nagel commented on issue #225: NUTCH-2435 - New parameter 
"parser.store.text"
URL: https://github.com/apache/nutch/pull/225#issuecomment-333122656
 
 
   +1
 



> New configuration allowing to choose whether to store 'parse_text' directory 
> or not.
> 
>
>                 Key: NUTCH-2435
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2435
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.13
>         Environment: Apache Nutch 1.13
>            Reporter: Marcos Bori
>
> Whenever a page is parsed, one of the outputs is the directory 'parse_text'.
> It is intended to be used in the indexing phase so that the page can be 
> searched from a search engine such as Solr.
> In my particular crawling case, I don't need to index the page contents. 
> Therefore, creating and filling the 'parse_text' directory is not required 
> for me. To optimize performance, I don't want the crawler to store this 
> information on the filesystem.
> I propose a new parameter "parser.store.text" that controls whether the 
> 'parse_text' directory is stored. Its default value, of course, is "true".
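
To illustrate the proposal, here is a minimal sketch of how such a flag could gate the 'parse_text' output. It is not the code from the pull request: the class StoreTextSketch and its methods are hypothetical, java.util.Properties stands in for the Nutch/Hadoop configuration object, and 'parse_data' is listed only as an example of an output that is always written. Only the property name "parser.store.text" and its default of "true" come from the issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch; Properties stands in for the real configuration class.
class StoreTextSketch {

    /** Reads the flag; a missing property falls back to "true", as the issue proposes. */
    static boolean storeText(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty("parser.store.text", "true"));
    }

    /** Lists the output directories a parse step would write for this configuration. */
    static List<String> outputDirs(Properties conf) {
        List<String> dirs = new ArrayList<>();
        dirs.add("parse_data");          // always written in this sketch
        if (storeText(conf)) {
            dirs.add("parse_text");      // skipped when the flag is "false"
        }
        return dirs;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(outputDirs(conf));            // flag unset: default applies
        conf.setProperty("parser.store.text", "false");
        System.out.println(outputDirs(conf));            // flag disabled
    }
}
```

Keeping the default at "true" means existing crawls behave as before unless the property is set explicitly.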





[jira] [Commented] (NUTCH-2433) Html Parser: keep htmltag where the outlinks are found

2017-09-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185784#comment-16185784
 ] 

Hudson commented on NUTCH-2433:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3457 (See 
[https://builds.apache.org/job/Nutch-trunk/3457/])
NUTCH-2433 / Html Parser: keep htmltag where the outlinks are found (marcos: 
[https://github.com/apache/nutch/commit/7db11734f25a53cda15634071a47ff524a06002e])
* (edit) conf/nutch-default.xml
* (edit) src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java


> Html Parser: keep htmltag where the outlinks are found
> --
>
>                 Key: NUTCH-2433
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2433
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.13
>         Environment: Apache Nutch release 1.13.
>            Reporter: Marcos Bori
>              Labels: html, outlink
>             Fix For: 1.14
>
>
> When parsing HTML pages, I need to know in which HTML tag each outlink was 
> found (for example, 'a', 'script', 'img', etc.).
> I propose to add a new configuration value, 
> "parser.html.outlinks.htmlnode_metadata_name".
> If this configuration property is not empty, every outlink found will be 
> given a metadata entry whose name is the value of this property and whose 
> value is the name of the HTML tag in which the outlink was found.
> I will now send the pull request with my code implementation.
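
The behaviour described above could look roughly like the following sketch. It is not the committed DOMContentUtils change: the class OutlinkTagSketch, the Link holder, and the tag-to-attribute table are hypothetical, and the JDK's XML parser is used on well-formed input in place of Nutch's HTML parser. The metadataName argument plays the role of the value of "parser.html.outlinks.htmlnode_metadata_name".

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch; names below are illustrative, not Nutch API.
class OutlinkTagSketch {

    /** Illustrative outlink holder: a URL plus free-form metadata. */
    static final class Link {
        final String url;
        final Map<String, String> metadata = new HashMap<>();
        Link(String url) { this.url = url; }
    }

    /** Tag name -> attribute carrying the link target (small illustrative subset). */
    static final Map<String, String> LINK_ATTRS =
        Map.of("a", "href", "img", "src", "script", "src");

    /** Depth-first walk; records the tag name only when metadataName is non-empty. */
    static void collect(Node node, String metadataName, List<Link> out) {
        if (node instanceof Element) {
            Element el = (Element) node;
            String tag = el.getTagName().toLowerCase(Locale.ROOT);
            String attr = LINK_ATTRS.get(tag);
            if (attr != null && el.hasAttribute(attr)) {
                Link link = new Link(el.getAttribute(attr));
                if (metadataName != null && !metadataName.isEmpty()) {
                    link.metadata.put(metadataName, tag);
                }
                out.add(link);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), metadataName, out);
        }
    }

    /** Parses well-formed markup and returns the outlinks in document order. */
    static List<Link> extract(String xhtml, String metadataName) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            List<Link> links = new ArrayList<>();
            collect(root, metadataName, links);
            return links;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"http://example.org/a\">x</a>"
            + "<img src=\"http://example.org/i.png\"/></body></html>";
        for (Link l : extract(page, "htmltag")) {
            System.out.println(l.url + " " + l.metadata);
        }
    }
}
```

Passing an empty metadataName leaves the outlinks without the extra entry, which mirrors the "if this configuration property is not empty" condition in the description.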





[jira] [Commented] (NUTCH-2435) New configuration allowing to choose whether to store 'parse_text' directory or not.

2017-09-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185755#comment-16185755
 ] 

ASF GitHub Bot commented on NUTCH-2435:
---

maborec commented on a change in pull request #225: NUTCH-2435 - New parameter 
"parser.store.text"
URL: https://github.com/apache/nutch/pull/225#discussion_r141856440
 
 

 ##
 File path: src/java/org/apache/nutch/parse/ParseOutputFormat.java
 ##
 @@ -128,13 +131,18 @@ public void checkOutputSpecs(FileSystem fs, JobConf job) throws IOException {
 .split(" *, *");
 
 // textOut Options
-Option tKeyClassOpt = (Option) MapFile.Writer.keyClass(Text.class);
-org.apache.hadoop.io.SequenceFile.Writer.Option tValClassOpt = SequenceFile.Writer.valueClass(ParseText.class);
-org.apache.hadoop.io.SequenceFile.Writer.Option tProgressOpt = SequenceFile.Writer.progressable(progress);
-org.apache.hadoop.io.SequenceFile.Writer.Option tCompOpt = SequenceFile.Writer.compression(CompressionType.RECORD);
+final MapFile.Writer textOut;
+if (storeText) {
+  Option tKeyClassOpt = (Option) MapFile.Writer.keyClass(Text.class);
+  org.apache.hadoop.io.SequenceFile.Writer.Option tValClassOpt = SequenceFile.Writer.valueClass(ParseText.class);
+  org.apache.hadoop.io.SequenceFile.Writer.Option tProgressOpt = SequenceFile.Writer.progressable(progress);
+  org.apache.hadoop.io.SequenceFile.Writer.Option tCompOpt = SequenceFile.Writer.compression(CompressionType.RECORD);
 
-final MapFile.Writer textOut = new MapFile.Writer(job, text,
+  textOut = new MapFile.Writer(job, text,
 
 Review comment:
   Format applied
 








[jira] [Resolved] (NUTCH-2433) Html Parser: keep htmltag where the outlinks are found

2017-09-29 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2433.

   Resolution: Fixed
Fix Version/s: 1.14

Thanks, committed to 1.x, 
[777e759|https://github.com/apache/nutch/commit/777e759ada24eac84072a5f1722938442432eadc].






[jira] [Commented] (NUTCH-2268) SolrIndexerJob: java.lang.RuntimeException

2017-09-29 Thread Ronan (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185580#comment-16185580
 ] 

Ronan commented on NUTCH-2268:
--

I have the exact same error.
Does anyone know how to solve it?

> SolrIndexerJob: java.lang.RuntimeException
> --
>
>                 Key: NUTCH-2268
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2268
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 2.3.1
>         Environment: I am using
> HBase: hbase-0.98.19-hadoop2
> Solr: 6.0.0
> Nutch: 2.3.1
> Java: 8
>            Reporter: narendra
>              Labels: indexing
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Could you please help me with this error:
> SolrIndexerJob: java.lang.RuntimeException: job failed:name=apache-nutch-2.3.1.jar
> It occurs when I run this command:
> local/bin/nutch solrindex http://localhost:8983/solr/ -all
> I also tried Solr 4.10.3 but I get the same error.


