[jira] [Comment Edited] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-25 Thread Tien Nguyen Manh (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116772#comment-15116772
 ] 

Tien Nguyen Manh edited comment on NUTCH-961 at 1/26/16 6:57 AM:
-

Ah yes. Could you explain why we need to parse it twice? With NUTCH-1233, 
could we get by with just one parse?


was (Author: tiennm):
Ah yes. Could you explain why we need to parse it twice?

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler, which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-25 Thread Tien Nguyen Manh (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116772#comment-15116772
 ] 

Tien Nguyen Manh commented on NUTCH-961:


Ah yes. Could you explain why we need to parse it twice?

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler, which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.





[jira] [Updated] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2016-01-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2184:

Attachment: NUTCH-2184v2.patch

Updated patch for trunk. [~markus17], I am working to address your comments 
now. Thanks for the response; I must have missed them.

> Enable IndexingJob to function with no crawldb
> --
>
> Key: NUTCH-2184
> URL: https://issues.apache.org/jira/browse/NUTCH-2184
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.12
>
> Attachments: NUTCH-2184.patch, NUTCH-2184v2.patch
>
>
> Sometimes when working with distributed team(s), we have found that we can 
> 'lose' data structures which are currently considered critical, e.g. 
> crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no 
> accompanying crawldb or linkdb. 
> Absence of the latter is OK as linkdb is optional; however, currently in 
> [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java]
>  crawldb is mandatory. 
> This ticket should enhance the IndexerMapReduce code to support the use case 
> where you ONLY have segments and want to force an index for every record 
> present.





[jira] [Created] (NUTCH-2206) Provide example scoring.similarity.stopword.file

2016-01-25 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2206:
---

 Summary: Provide example scoring.similarity.stopword.file
 Key: NUTCH-2206
 URL: https://issues.apache.org/jira/browse/NUTCH-2206
 Project: Nutch
  Issue Type: Bug
  Components: plugin, scoring
Affects Versions: 1.11
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.12


The scoring-similarity plugin does not provide an example file for the property 
scoring.similarity.stopword.file.
This is an issue for a number of reasons, namely 
 * A user does not know what it is meant to look like, and
 * We always check for this file and will [throw an exception if it is not 
found|https://github.com/apache/nutch/blob/trunk/src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/cosine/DocumentVector.java#L79-L80],
 which may not be picked up by the user until much later.

I suggest a simple fix here: include the [standard English stop words 
taken from Lucene's 
StopAnalyzer|https://github.com/apache/lucene-solr/blob/3f38aba02ce37c6422875d8824ee034d42d635b9/solr/contrib/morphlines-core/src/test-files/solr/collection1/conf/lang/stopwords_en.txt].
 The comments will help people to easily customize the list to whatever they 
require. 
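
To make the "what it is meant to look like" point concrete, here is a minimal sketch of a stop word file and a reader for it. It assumes one word per line with '#' comment lines, in the style of the Lucene list linked above. The class and method names are hypothetical, and this is not the parsing code used by the scoring-similarity plugin, which may treat the file differently.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration; not part of the scoring-similarity plugin. */
final class StopwordFileSketch {

  /**
   * Parses stop word file content: one word per line, blank lines and
   * lines starting with '#' are skipped (assumed format, see lead-in).
   */
  static Set<String> parse(String content) {
    Set<String> stopwords = new LinkedHashSet<>();
    for (String line : content.split("\\R")) {
      String word = line.trim();
      if (word.isEmpty() || word.startsWith("#")) {
        continue; // skip blank lines and comments
      }
      stopwords.add(word.toLowerCase(Locale.ROOT));
    }
    return stopwords;
  }

  public static void main(String[] args) {
    // Example file content; a user would customise this list as required.
    String example = "# English stop words, one per line\n"
        + "a\n"
        + "an\n"
        + "and\n"
        + "\n"
        + "# add domain-specific words below\n"
        + "The\n";
    System.out.println(parse(example));
  }
}
```

Shipping a commented file like the example string above would let users see the expected layout before the job fails on a missing file.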





[jira] [Commented] (NUTCH-2206) Provide example scoring.similarity.stopword.file

2016-01-25 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116491#comment-15116491
 ] 

Lewis John McGibbney commented on NUTCH-2206:
-

CC [~sujenshah]

> Provide example scoring.similarity.stopword.file
> 
>
> Key: NUTCH-2206
> URL: https://issues.apache.org/jira/browse/NUTCH-2206
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, scoring
>Affects Versions: 1.11
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.12
>
>
> The scoring-similarity plugin does not provide an example file for the 
> property scoring.similarity.stopword.file.
> This is an issue for a number of reasons, namely 
>  * A user does not know what it is meant to look like, and
>  * We always check for this file and will [throw an exception if it is not 
> found|https://github.com/apache/nutch/blob/trunk/src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/cosine/DocumentVector.java#L79-L80],
>  which may not be picked up by the user until much later.
> I suggest a simple fix here: include the [standard English stop words 
> taken from Lucene's 
> StopAnalyzer|https://github.com/apache/lucene-solr/blob/3f38aba02ce37c6422875d8824ee034d42d635b9/solr/contrib/morphlines-core/src/test-files/solr/collection1/conf/lang/stopwords_en.txt].
>  The comments will help people to easily customize the list to whatever they 
> require. 





[jira] [Created] (NUTCH-2207) Remove class duplication and smarten-up scoring-similarity plugin

2016-01-25 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2207:
---

 Summary: Remove class duplication and smarten-up 
scoring-similarity plugin
 Key: NUTCH-2207
 URL: https://issues.apache.org/jira/browse/NUTCH-2207
 Project: Nutch
  Issue Type: Improvement
  Components: plugin, scoring
Affects Versions: 1.11
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.12


Right now it appears that DocumentVector.java is duplicated. There is also no 
license header on 
[ScoringFilterModel.java|https://github.com/apache/nutch/blob/trunk/src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/ScoringFilterModel.java].
 I think I've also spotted a number of places where imports are not being used. 
Finally, Javadoc is virtually non-existent for the scoring-similarity plugin. 
It would help to add some documentation. 
It would be very helpful if the [SimilarityScoringFilter wiki 
page|https://wiki.apache.org/nutch/SimilarityScoringFilter] was cited.
We could also do with visiting the wiki page to ensure that all references are 
present.
CC [~sujenshah]





[jira] [Created] (NUTCH-2205) Nutch solrdedup error in solrcloud for doc

2016-01-25 Thread VictorHu (JIRA)
VictorHu created NUTCH-2205:
---

 Summary: Nutch solrdedup error in solrcloud for doc
 Key: NUTCH-2205
 URL: https://issues.apache.org/jira/browse/NUTCH-2205
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Reporter: VictorHu








[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-25 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114989#comment-15114989
 ] 

Markus Jelsma commented on NUTCH-961:
-

That is probably due to the patch parsing twice. Once with BP for text, and 
once without for link extraction. 
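
For comparison, here is a minimal sketch of the single-parse idea, using only the JDK's SAX parser rather than Tika or Boilerpipe: one pass over the markup feeds both a text collector and a link collector. The class and method names are hypothetical, the text collector merely stands in for a boilerplate-aware handler, and this is not the code from any of the attached patches.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical illustration: collect text and links in a single SAX pass. */
final class SinglePassSketch {
  final StringBuilder text = new StringBuilder();
  final List<String> links = new ArrayList<>();

  /** Parses well-formed XHTML once and fills both collectors. */
  static SinglePassSketch parseOnce(String xhtml) {
    SinglePassSketch result = new SinglePassSketch();
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void startElement(String uri, String localName, String qName,
          Attributes atts) {
        // Link collector: sees every anchor, independent of text filtering.
        if ("a".equals(qName) && atts.getValue("href") != null) {
          result.links.add(atts.getValue("href"));
        }
      }

      @Override
      public void characters(char[] ch, int start, int length) {
        // Text collector: a real setup would filter boilerplate here.
        result.text.append(ch, start, length);
      }
    };
    try {
      SAXParserFactory.newInstance().newSAXParser().parse(
          new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)),
          handler);
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse input", e);
    }
    return result;
  }

  public static void main(String[] args) {
    SinglePassSketch r = parseOnce(
        "<html><body><p>Hello <a href=\"http://example.org/\">link</a></p></body></html>");
    System.out.println("text:  " + r.text);
    System.out.println("links: " + r.links);
  }
}
```

Whether the same can be done with the Boilerpipe handler depends on how it consumes and re-emits SAX events, which this sketch does not model.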

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler, which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.





[jira] [Commented] (NUTCH-2205) Nutch solrdedup error in solrcloud for larger docs

2016-01-25 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114991#comment-15114991
 ] 

Markus Jelsma commented on NUTCH-2205:
--

This looks like your cluster was down, not a Nutch error.
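
One way to check this outside of Nutch is to ping each replica URL from the error message directly. The sketch below assumes the cores expose Solr's ping handler at /admin/ping, which depends on the solrconfig.xml in use. The class and method names are hypothetical, and only JDK classes are used.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

/** Hypothetical helper: checks whether Solr core URLs answer a ping request. */
final class SolrPingSketch {

  /** Builds the ping URL for a core URL (assumes the /admin/ping handler is configured). */
  static String pingUrl(String coreUrl) {
    return coreUrl.replaceAll("/+$", "") + "/admin/ping";
  }

  /** Returns true if the core answers the ping with HTTP 200 within the timeouts. */
  static boolean isAlive(String coreUrl) {
    try {
      HttpURLConnection conn =
          (HttpURLConnection) new URL(pingUrl(coreUrl)).openConnection();
      conn.setConnectTimeout(2000);
      conn.setReadTimeout(2000);
      conn.setRequestMethod("GET");
      try {
        return conn.getResponseCode() == HttpURLConnection.HTTP_OK;
      } finally {
        conn.disconnect();
      }
    } catch (IOException e) {
      // Connection refused, timeout, malformed URL: treat as not alive.
      return false;
    }
  }

  public static void main(String[] args) {
    // Pass the replica URLs from the stack trace as arguments.
    for (String coreUrl : args) {
      System.out.println(coreUrl + " alive: " + isAlive(coreUrl));
    }
  }
}
```

If every replica reports as not alive from the node running the map task, the problem lies with the cluster or the network rather than with the solrdedup job.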

> Nutch solrdedup error in solrcloud for larger docs 
> ---
>
> Key: NUTCH-2205
> URL: https://issues.apache.org/jira/browse/NUTCH-2205
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 2.3
> Environment: CentOS 6.5, JDK 1.7.0_75, Tomcat 8.0.9, Hadoop 
> 2.5.2, ZooKeeper 3.4.6, HBase 0.98.8, Solr 4.8.1, Nutch 2.3.1
>Reporter: VictorHu
> Fix For: 2.4
>
>
> When the number of Solr docs is larger than 9000, the Nutch solrdedup job 
> breaks. This is the log: 
> http://10.192.1.100:8080/solr/myEnterpriseCollection_shard2_replica2
> 16/01/25 17:02:38 INFO solr.SolrDeleteDuplicates: SolrDeleteDuplicates: 
> starting...
> 16/01/25 17:02:38 INFO solr.SolrDeleteDuplicates: SolrDeleteDuplicates: Solr 
> url: http://10.192.1.100:8080/solr/myEnterpriseCollection_shard2_replica2
> 16/01/25 17:02:39 INFO client.RMProxy: Connecting to ResourceManager at 
> master.Itble/10.192.1.100:8032
> 16/01/25 17:02:43 INFO mapreduce.JobSubmitter: number of splits:1
> 16/01/25 17:02:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: 
> job_1453104806095_0162
> 16/01/25 17:02:44 INFO impl.YarnClientImpl: Submitted application 
> application_1453104806095_0162
> 16/01/25 17:02:44 INFO mapreduce.Job: The url to track the job: 
> http://master.Itble:8088/proxy/application_1453104806095_0162/
> 16/01/25 17:02:44 INFO mapreduce.Job: Running job: job_1453104806095_0162
> 16/01/25 17:02:54 INFO mapreduce.Job: Job job_1453104806095_0162 running in 
> uber mode : false
> 16/01/25 17:02:54 INFO mapreduce.Job:  map 0% reduce 0%
> 16/01/25 17:03:02 INFO mapreduce.Job: Task Id : 
> attempt_1453104806095_0162_m_00_0, Status : FAILED
> Error: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: 
> org.apache.solr.client.solrj.SolrServerException: No live SolrServers 
> available to handle this 
> request:[http://10.192.1.100:8080/solr/myEnterpriseCollection_shard2_replica2,
>  http://10.192.1.101:8080/solr/myEnterpriseCollection_shard1_replica2, 
> http://10.192.1.103:8080/solr/myEnterpriseCollection_shard2_replica1]
> at 
> org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:554)
> at 
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
> at 
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
> at 
> org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91)
> at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
> at 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.createRecordReader(SolrDeleteDuplicates.java:291)
> at 
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.<init>(MapTask.java:492)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:735)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> 16/01/25 17:03:12 INFO mapreduce.Job: Task Id : 
> attempt_1453104806095_0162_m_00_1, Status : FAILED
> Error: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: 
> org.apache.solr.client.solrj.SolrServerException: No live SolrServers 
> available to handle this 
> request:[http://10.192.1.100:8080/solr/myEnterpriseCollection_shard2_replica2,
>  http://10.192.1.101:8080/solr/myEnterpriseCollection_shard1_replica2, 
> http://10.192.1.103:8080/solr/myEnterpriseCollection_shard2_replica1, 
> http://10.192.1.102:8080/solr/myEnterpriseCollection_shard1_replica1]
> at 
> org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:554)
> at 
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
> at 
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
> at 
> org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91)
> at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
> at 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.createRecordReader(SolrDeleteDuplicates.java:291)
> at 
> 

[jira] [Updated] (NUTCH-2205) Nutch solrdedup error in solrcloud for larger docs

2016-01-25 Thread VictorHu (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

VictorHu updated NUTCH-2205:

Affects Version/s: 2.3
  Environment: CentOS 6.5, JDK 1.7.0_75, Tomcat 8.0.9, Hadoop 
2.5.2, ZooKeeper 3.4.6, HBase 0.98.8, Solr 4.8.1, Nutch 2.3.1
Fix Version/s: 2.4
  Description: 
When the number of Solr docs is larger than 9000, the Nutch solrdedup job 
breaks. This is the log: 


http://10.192.1.100:8080/solr/myEnterpriseCollection_shard2_replica2
16/01/25 17:02:38 INFO solr.SolrDeleteDuplicates: SolrDeleteDuplicates: 
starting...
16/01/25 17:02:38 INFO solr.SolrDeleteDuplicates: SolrDeleteDuplicates: Solr 
url: http://10.192.1.100:8080/solr/myEnterpriseCollection_shard2_replica2
16/01/25 17:02:39 INFO client.RMProxy: Connecting to ResourceManager at 
master.Itble/10.192.1.100:8032
16/01/25 17:02:43 INFO mapreduce.JobSubmitter: number of splits:1
16/01/25 17:02:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: 
job_1453104806095_0162
16/01/25 17:02:44 INFO impl.YarnClientImpl: Submitted application 
application_1453104806095_0162
16/01/25 17:02:44 INFO mapreduce.Job: The url to track the job: 
http://master.Itble:8088/proxy/application_1453104806095_0162/
16/01/25 17:02:44 INFO mapreduce.Job: Running job: job_1453104806095_0162
16/01/25 17:02:54 INFO mapreduce.Job: Job job_1453104806095_0162 running in 
uber mode : false
16/01/25 17:02:54 INFO mapreduce.Job:  map 0% reduce 0%
16/01/25 17:03:02 INFO mapreduce.Job: Task Id : 
attempt_1453104806095_0162_m_00_0, Status : FAILED
Error: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: 
org.apache.solr.client.solrj.SolrServerException: No live SolrServers available 
to handle this 
request:[http://10.192.1.100:8080/solr/myEnterpriseCollection_shard2_replica2, 
http://10.192.1.101:8080/solr/myEnterpriseCollection_shard1_replica2, 
http://10.192.1.103:8080/solr/myEnterpriseCollection_shard2_replica1]
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:554)
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at 
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
at 
org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.createRecordReader(SolrDeleteDuplicates.java:291)
at 
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.<init>(MapTask.java:492)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:735)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

16/01/25 17:03:12 INFO mapreduce.Job: Task Id : 
attempt_1453104806095_0162_m_00_1, Status : FAILED
Error: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: 
org.apache.solr.client.solrj.SolrServerException: No live SolrServers available 
to handle this 
request:[http://10.192.1.100:8080/solr/myEnterpriseCollection_shard2_replica2, 
http://10.192.1.101:8080/solr/myEnterpriseCollection_shard1_replica2, 
http://10.192.1.103:8080/solr/myEnterpriseCollection_shard2_replica1, 
http://10.192.1.102:8080/solr/myEnterpriseCollection_shard1_replica1]
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:554)
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at 
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
at 
org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.createRecordReader(SolrDeleteDuplicates.java:291)
at 
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.<init>(MapTask.java:492)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:735)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

16/01/25 17:03:22 INFO