Re: [VOTE] Release Apache Nutch 2.3.1rc2

2016-01-20 Thread Lewis John Mcgibbney
Hi user@, dev@,
PING on the Nutch 2.3.1 RC#2
Would really appreciate anyone who is able to review this release
candidate. It would mean a lot for our 2.X user base.
Thank you
Lewis

On Sun, Jan 10, 2016 at 7:01 AM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Hi Folks,
>
> A second candidate for the Nutch 2.3.1 release is available at:
>
> https://dist.apache.org/repos/dist/dev/nutch/2.3.1rc2/
>
> The release candidate is a zip and tar.gz sources archive of the sources
> in:
>
> http://svn.apache.org/repos/asf/nutch/tags/release-2.3.1rc2/
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachenutch-1008/
>
> Please vote on releasing this package as Apache Nutch 2.3.1.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 2.3.1.
> [ ] -1 Do not release this package because…
>
> Cheers,
> Lewis
>
> P.S. Here is my +1.
>
> --
> *Lewis*
>
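
For anyone tallying the result, the pass rule quoted above can be written as a small check. This is only a minimal sketch in Python: reading "majority" as "more +1 than -1 votes" is an assumption, and the function name is made up for illustration.

```python
def vote_passes(plus_ones: int, minus_ones: int) -> bool:
    """Return True when a release vote meets the rule quoted in this thread.

    Both arguments count Nutch PMC votes only.
    Assumption: "majority" means +1 votes outnumber -1 votes.
    """
    # At least three +1 PMC votes are required ...
    has_quorum = plus_ones >= 3
    # ... and the +1 votes must outnumber the -1 votes.
    has_majority = plus_ones > minus_ones
    return has_quorum and has_majority
```

With three +1 votes and no -1 votes the function returns True; with only two +1 votes it returns False.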



-- 
*Lewis*


Re: [VOTE] Release Apache Nutch 2.3.1rc2

2016-01-20 Thread Mattmann, Chris A (3980)
My bad, I said I would do this!

Here you go, it's done:

+1

SIGS, checksums check out:

[chipotle:~/tmp/apache-nutch-2.3.1-rc2] mattmann%
$HOME/bin/stage_apache_rc apache-nutch 2.3.1-src
https://dist.apache.org/repos/dist/dev/nutch/2.3.1rc2/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 5134k  100 5134k    0     0  1468k      0  0:00:03  0:00:03 --:--:-- 1468k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   819  100   819    0     0   2803      0 --:--:-- --:--:-- --:--:--  2795
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    69  100    69    0     0    213      0 --:--:-- --:--:-- --:--:--   213
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    78  100    78    0     0    236      0 --:--:-- --:--:-- --:--:--   236
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 7411k  100 7411k    0     0  1487k      0  0:00:04  0:00:04 --:--:-- 1603k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   819  100   819    0     0   2918      0 --:--:-- --:--:-- --:--:--  2914
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    66  100    66    0     0    226      0 --:--:-- --:--:-- --:--:--   226
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    75  100    75    0     0    252      0 --:--:-- --:--:-- --:--:--   253
[chipotle:~/tmp/apache-nutch-2.3.1-rc2] mattmann%
$HOME/bin/verify_gpg_sigs
Verifying Signature for file apache-nutch-2.3.1-src.tar.gz.asc
gpg: Signature made Sun Jan 10 07:00:20 2016 PST using RSA key ID 48BAEBF6
gpg: Good signature from "Lewis John McGibbney (CODE SIGNING KEY)"
gpg: WARNING: This key is not certified with a trusted signature!
gpg:  There is no indication that the signature belongs to the owner.
Primary key fingerprint: DB7B 5199 121C 08A5 C8F4  052B 3A47 17F0 48BA EBF6
Verifying Signature for file apache-nutch-2.3.1-src.zip.asc
gpg: Signature made Sun Jan 10 07:00:24 2016 PST using RSA key ID 48BAEBF6
gpg: Good signature from "Lewis John McGibbney (CODE SIGNING KEY)"
gpg: WARNING: This key is not certified with a trusted signature!
gpg:  There is no indication that the signature belongs to the owner.
Primary key fingerprint: DB7B 5199 121C 08A5 C8F4  052B 3A47 17F0 48BA EBF6
[chipotle:~/tmp/apache-nutch-2.3.1-rc2] mattmann%
$HOME/bin/verify_md5_checksums
md5sum: stat '*.bz2': No such file or directory
md5sum: stat '*.tgz': No such file or directory
apache-nutch-2.3.1-src.tar.gz: OK
apache-nutch-2.3.1-src.zip: OK
[chipotle:~/tmp/apache-nutch-2.3.1-rc2] mattmann%
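
For reviewers who do not have the helper scripts used above, the checksum step can be approximated as follows. This is a minimal sketch in Python, not a copy of verify_md5_checksums: it assumes each artifact sits next to a file named <artifact>.md5 that contains the digest as one unbroken 32-character hex token. The function names are illustrative.

```python
import hashlib
import re
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(md5_file: Path) -> str:
    """Extract the first 32-character hex token from a .md5 file.

    Assumption: the digest is written without embedded spaces.
    """
    match = re.search(r"\b[0-9a-fA-F]{32}\b", md5_file.read_text())
    if match is None:
        raise ValueError(f"no MD5 digest found in {md5_file}")
    return match.group(0).lower()


def verify_md5(artifact: Path) -> bool:
    """Compare an artifact against its sibling <artifact>.md5 file."""
    md5_file = artifact.with_name(artifact.name + ".md5")
    return md5_of(artifact) == expected_md5(md5_file)
```

Calling verify_md5(Path("apache-nutch-2.3.1-src.zip")) in the download directory returns True when the digests match. Signature checking is a separate step and still needs gpg.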



++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





-Original Message-
From: Lewis John Mcgibbney 
Reply-To: "dev@nutch.apache.org" 
Date: Wednesday, January 20, 2016 at 7:14 PM
To: "u...@nutch.apache.org" ,
"dev@nutch.apache.org" 
Subject: Re: [VOTE] Release Apache Nutch 2.3.1rc2

>Hi user@, dev@,
>
>PING on the Nutch 2.3.1 RC#2
>
>Would really appreciate anyone who is able to review this release
>candidate. It would mean a lot for our 2.X user base.
>
>Thank you
>
>Lewis
>
>
>On Sun, Jan 10, 2016 at 7:01 AM, Lewis John Mcgibbney
> wrote:
>
>Hi Folks,
>
>A second candidate for the Nutch 2.3.1 release is available at:
>
>https://dist.apache.org/repos/dist/dev/nutch/2.3.1rc2/
>
>The release candidate is a zip and tar.gz sources archive of the sources
>in:
>
>http://svn.apache.org/repos/asf/nutch/tags/release-2.3.1rc2/
>
>In addition, a staged maven repository is available here:
>
>https://repository.apache.org/content/repositories/orgapachenutch-1008/
>

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-20 Thread Tien Nguyen Manh (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110217#comment-15110217 ]

Tien Nguyen Manh commented on NUTCH-961:


I'm using the patch NUTCH-961-1.11-1.patch. It works fine when run from 
Eclipse and when run in Hadoop, but it has a problem when I run in local mode: 
it throws the exception "Can't retrieve Tika parser for mime-type text/html". 
The problem is not with parse-plugins.xml. It seems to be with the TikaConfig 
constructor TikaConfig(ClassLoader loader), which fails to load some config 
via the class loader when run in local mode.

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [MASSMAIL]Re: Nutch/Solr communication problem

2016-01-20 Thread Roannel Fernández Hernández
Hi 

Can you share the Solr logs too? 

Regards 

> From: "Zara Parst" 
> To: dev@nutch.apache.org
> Sent: Wednesday, January 20, 2016 4:52:58 AM
> Subject: Re: [MASSMAIL]Re: Nutch/Solr communication problem

> Hi,

> Everyone, if you check the log file it does mention the error. Here is a
> recap of the problem.

> 1. Solr without any authentication => Nutch works successfully and populates
> the Solr core (say, abc).
> 2. Solr with protection and Nutch solr.auth=false => unauthorized access,
> which makes sense.
> 3. Solr with protection and Nutch solr.auth=true with the correct id and
> password in the config file => it spits out the error; I have attached the
> log at the bottom of this email.

> When I use authentication, Nutch is not able to insert data. However, the
> problem is not on the Solr side, because if I try to populate data with Solr
> requiring an id and password and Nutch set to solr.auth=false, it prints
> unauthorized access, and that makes sense. Now, with solr.auth=true and the
> id and password in nutch-default, Nutch is not able to insert data, and the
> error log is below. My guess is that it may depend on a user role such as
> admin or content-admin in Solr? I tried that too, with all kinds of users,
> and always got the same error. Could someone try and see whether they can
> push data to a protected Solr? If you do not get the error, please tell me
> in detail what configuration you are using. Treat me like a novice and tell
> me how to do it, because I have tried every permutation of configuration on
> both the Solr and Nutch sides without any luck. Please do help me; this is a
> genuine request. I understand you are all pretty busy with your work, and I
> am not bothering you without having done my homework.

> Please see the log

> 2016-01-20 07:02:15,658 INFO indexer.IndexWriters - Adding
> org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 2016-01-20 07:04:36,366 WARN util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
> 2016-01-20 07:04:36,656 INFO segment.SegmentChecker - Segment dir is
> complete:
> file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119163402.
> 2016-01-20 07:04:36,658 INFO segment.SegmentChecker - Segment dir is
> complete:
> file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119163656.
> 2016-01-20 07:04:36,673 INFO segment.SegmentChecker - Segment dir is
> complete:
> file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119164952.
> 2016-01-20 07:04:36,674 INFO indexer.IndexingJob - Indexer: starting at
> 2016-01-20 07:04:36
> 2016-01-20 07:04:36,676 INFO indexer.IndexingJob - Indexer: deleting gone
> documents: false
> 2016-01-20 07:04:36,676 INFO indexer.IndexingJob - Indexer: URL filtering:
> false
> 2016-01-20 07:04:36,676 INFO indexer.IndexingJob - Indexer: URL normalizing:
> false
> 2016-01-20 07:04:37,036 INFO indexer.IndexWriters - Adding
> org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 2016-01-20 07:04:37,036 INFO indexer.IndexingJob - Active IndexWriters :
> SolrIndexWriter
> solr.server.type : Type of SolrServer to communicate with (default 'http'
> however options include 'cloud', 'lb' and 'concurrent')
> solr.server.url : URL of the Solr instance (mandatory)
> solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud' value for
> solr.server.type)
> solr.loadbalance.urls : Comma-separated string of Solr server strings to be
> used (madatory if 'lb' value for solr.server.type)
> solr.mapping.file : name of the mapping file for fields (default
> solrindex-mapping.xml)
> solr.commit.size : buffer size when sending to Solr (default 1000)
> solr.auth : use authentication (default false)
> solr.auth.username : username for authentication
> solr.auth.password : password for authentication

> 2016-01-20 07:04:37,039 INFO indexer.IndexerMapReduce - IndexerMapReduce:
> crawldb: yahCrawl/crawldb
> 2016-01-20 07:04:37,039 INFO indexer.IndexerMapReduce - IndexerMapReduce:
> linkdb: yahCrawl/linkdb
> 2016-01-20 07:04:37,039 INFO indexer.IndexerMapReduce - IndexerMapReduces:
> adding segment:
> file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119163402
> 2016-01-20 07:04:37,045 INFO indexer.IndexerMapReduce - IndexerMapReduces:
> adding segment:
> file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119163656
> 2016-01-20 07:04:37,046 INFO indexer.IndexerMapReduce - IndexerMapReduces:
> adding segment:
> file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119164952
> 2016-01-20 07:04:37,047 WARN indexer.IndexerMapReduce - Ignoring linkDb for
> indexing, no linkDb found in path: yahCrawl/linkdb
> 2016-01-20 07:04:38,151 WARN conf.Configuration -
> file:/tmp/hadoop-rakesh/mapred/staging/rakesh1643615475/.staging/job_local1643615475_0001/job.xml:an
> attempt to override final parameter:
> mapreduce.job.end-notification.max.retry.interval; 

Re: [MASSMAIL]Re: Nutch/Solr communication problem

2016-01-20 Thread Zara Parst
Hi,

Everyone, if you check the log file it does mention the error. Here is a
recap of the problem.

1. Solr without any authentication => Nutch works successfully and populates
the Solr core (say, abc).
2. Solr with protection and Nutch solr.auth=false => unauthorized access,
which makes sense.
3. Solr with protection and Nutch solr.auth=true with the correct id and
password in the config file => it spits out the error; I have attached the
log at the bottom of this email.
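
One way to narrow down case 3 is to send an authenticated update to Solr without Nutch in the loop. The sketch below, in Python, only builds such a request. It assumes the Solr instance is protected with HTTP Basic authentication and that the core accepts a JSON document with an "id" field; the URL, core name, and credentials in the usage note are placeholders, not values from this thread.

```python
import base64
import json
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def build_update_request(core_url: str, username: str,
                         password: str) -> urllib.request.Request:
    """Prepare a POST that adds one probe document to a Solr core.

    Assumption: the core's schema has an "id" field.
    """
    body = json.dumps([{"id": "auth-probe"}]).encode("utf-8")
    return urllib.request.Request(
        url=core_url.rstrip("/") + "/update?commit=true",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": basic_auth_header(username, password),
        },
        method="POST",
    )
```

Passing the result of build_update_request("http://localhost:8983/solr/abc", "user", "pass") to urllib.request.urlopen raises urllib.error.HTTPError if Solr rejects the request, and the status code (for example 401) shows whether the credentials themselves are the problem.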

When I use authentication, Nutch is not able to insert data. However, the
problem is not on the Solr side, because if I try to populate data with Solr
requiring an id and password and Nutch set to solr.auth=false, it prints
unauthorized access, and that makes sense. Now, with solr.auth=true and the
id and password in nutch-default, Nutch is not able to insert data, and the
error log is below. My guess is that it may depend on a user role such as
admin or content-admin in Solr? I tried that too, with all kinds of users,
and always got the same error. Could someone try and see whether they can
push data to a protected Solr? If you do not get the error, please tell me
in detail what configuration you are using. Treat me like a novice and tell
me how to do it, because I have tried every permutation of configuration on
both the Solr and Nutch sides without any luck. Please do help me; this is a
genuine request. I understand you are all pretty busy with your work, and I
am not bothering you without having done my homework.
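
Before comparing configurations, it can help to confirm what the config file actually contains. The following is a minimal sketch in Python that reads the solr.auth* properties from a Hadoop-style configuration file (the <configuration>/<property>/<name>/<value> layout used by nutch-default.xml and nutch-site.xml) and flags obvious mistakes. The function names and the checks themselves are illustrative assumptions, not part of Nutch.

```python
import xml.etree.ElementTree as ET


def solr_auth_settings(config_xml: str) -> dict:
    """Collect every property whose name starts with 'solr.auth'."""
    root = ET.fromstring(config_xml)
    settings = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name.startswith("solr.auth"):
            settings[name] = (prop.findtext("value") or "").strip()
    return settings


def auth_problems(settings: dict) -> list:
    """Return human-readable problems found in the solr.auth* settings."""
    problems = []
    flag = settings.get("solr.auth", "false").lower()
    # A misspelled boolean is easy to miss by eye.
    if flag not in ("true", "false"):
        problems.append(f"solr.auth should be 'true' or 'false', got {flag!r}")
    # Credentials only matter once authentication is switched on.
    if flag == "true":
        for key in ("solr.auth.username", "solr.auth.password"):
            if not settings.get(key):
                problems.append(f"{key} is empty or missing")
    return problems
```

Running auth_problems(solr_auth_settings(open("conf/nutch-site.xml").read())) prints nothing suspicious when it returns an empty list; the file path here is only an example.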


Please see the log

2016-01-20 07:02:15,658 INFO  indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-01-20 07:04:36,366 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2016-01-20 07:04:36,656 INFO  segment.SegmentChecker - Segment dir is
complete:
file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119163402.
2016-01-20 07:04:36,658 INFO  segment.SegmentChecker - Segment dir is
complete:
file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119163656.
2016-01-20 07:04:36,673 INFO  segment.SegmentChecker - Segment dir is
complete:
file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119164952.
2016-01-20 07:04:36,674 INFO  indexer.IndexingJob - Indexer: starting at
2016-01-20 07:04:36
2016-01-20 07:04:36,676 INFO  indexer.IndexingJob - Indexer: deleting gone
documents: false
2016-01-20 07:04:36,676 INFO  indexer.IndexingJob - Indexer: URL filtering:
false
2016-01-20 07:04:36,676 INFO  indexer.IndexingJob - Indexer: URL
normalizing: false
2016-01-20 07:04:37,036 INFO  indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-01-20 07:04:37,036 INFO  indexer.IndexingJob - Active IndexWriters :
SolrIndexWriter
solr.server.type : Type of SolrServer to communicate with (default 'http'
however options include 'cloud', 'lb' and 'concurrent')
solr.server.url : URL of the Solr instance (mandatory)
solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud' value
for solr.server.type)
solr.loadbalance.urls : Comma-separated string of Solr server strings to be
used (madatory if 'lb' value for solr.server.type)
solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
solr.commit.size : buffer size when sending to Solr (default 1000)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication


2016-01-20 07:04:37,039 INFO  indexer.IndexerMapReduce - IndexerMapReduce:
crawldb: yahCrawl/crawldb
2016-01-20 07:04:37,039 INFO  indexer.IndexerMapReduce - IndexerMapReduce:
linkdb: yahCrawl/linkdb
2016-01-20 07:04:37,039 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
adding segment:
file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119163402
2016-01-20 07:04:37,045 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
adding segment:
file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119163656
2016-01-20 07:04:37,046 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
adding segment:
file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119164952
2016-01-20 07:04:37,047 WARN  indexer.IndexerMapReduce - Ignoring linkDb
for indexing, no linkDb found in path: yahCrawl/linkdb
2016-01-20 07:04:38,151 WARN  conf.Configuration -
file:/tmp/hadoop-rakesh/mapred/staging/rakesh1643615475/.staging/job_local1643615475_0001/job.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2016-01-20 07:04:38,153 WARN  conf.Configuration -
file:/tmp/hadoop-rakesh/mapred/staging/rakesh1643615475/.staging/job_local1643615475_0001/job.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.attempts;  Ignoring.
2016-01-20 07:04:38,312 WARN  conf.Configuration -