Re: [VOTE] Release Apache Nutch 2.3.1rc2
Hi user@, dev@,

PING on the Nutch 2.3.1 RC#2

Would really appreciate anyone who is able to review this release candidate. It would mean a lot for our 2.X user base.

Thank you
Lewis

On Sun, Jan 10, 2016 at 7:01 AM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:

> Hi Folks,
>
> A second candidate for the Nutch 2.3.1 release is available at:
>
> https://dist.apache.org/repos/dist/dev/nutch/2.3.1rc2/
>
> The release candidate is a zip and tar.gz sources archive of the sources
> in:
>
> http://svn.apache.org/repos/asf/nutch/tags/release-2.3.1rc2/
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachenutch-1008/
>
> Please vote on releasing this package as Apache Nutch 2.3.1.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 2.3.1.
> [ ] -1 Do not release this package because…
>
> Cheers,
> Lewis
>
> P.S. Here is my +1.
>
> --
> *Lewis*

--
*Lewis*
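The passing rule quoted above (at least three +1 PMC votes, forming a majority) can be written out as a small check. This is only a sketch: the function name is hypothetical, and reading "majority" as "more binding +1 than -1 votes" is an assumption rather than something stated in this thread.

```python
def vote_passes(binding_votes, minimum_plus_ones=3):
    """Return True if a release vote passes under the quoted rule.

    `binding_votes` is an iterable of integers cast by PMC members:
    +1 (approve), -1 (reject), or 0 (abstain).
    Assumption: "majority" means more +1 than -1 among binding votes.
    """
    votes = list(binding_votes)
    plus = sum(1 for vote in votes if vote > 0)
    minus = sum(1 for vote in votes if vote < 0)
    return plus >= minimum_plus_ones and plus > minus
```

Under these assumptions, three +1 votes with no -1 would pass, while two +1 votes would not, however long the vote stays open.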
Re: [VOTE] Release Apache Nutch 2.3.1rc2
My bad, I said I would do this! Here you go, it’s done:

+1

SIGS, checksums check out:

[chipotle:~/tmp/apache-nutch-2.3.1-rc2] mattmann% $HOME/bin/stage_apache_rc apache-nutch 2.3.1-src https://dist.apache.org/repos/dist/dev/nutch/2.3.1rc2/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 5134k  100 5134k    0     0  1468k      0  0:00:03  0:00:03 --:--:-- 1468k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   819  100   819    0     0   2803      0 --:--:-- --:--:-- --:--:--  2795
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    69  100    69    0     0    213      0 --:--:-- --:--:-- --:--:--   213
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    78  100    78    0     0    236      0 --:--:-- --:--:-- --:--:--   236
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 7411k  100 7411k    0     0  1487k      0  0:00:04  0:00:04 --:--:-- 1603k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   819  100   819    0     0   2918      0 --:--:-- --:--:-- --:--:--  2914
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    66  100    66    0     0    226      0 --:--:-- --:--:-- --:--:--   226
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    75  100    75    0     0    252      0 --:--:-- --:--:-- --:--:--   253

[chipotle:~/tmp/apache-nutch-2.3.1-rc2] mattmann% $HOME/bin/verify_gpg_sigs
Verifying Signature for file apache-nutch-2.3.1-src.tar.gz.asc
gpg: Signature made Sun Jan 10 07:00:20 2016 PST using RSA key ID 48BAEBF6
gpg: Good signature from "Lewis John McGibbney (CODE SIGNING KEY)"
gpg: WARNING: This key is not certified with a trusted signature!
gpg:          There is no indication that the signature belongs to the owner.
Primary key fingerprint: DB7B 5199 121C 08A5 C8F4 052B 3A47 17F0 48BA EBF6
Verifying Signature for file apache-nutch-2.3.1-src.zip.asc
gpg: Signature made Sun Jan 10 07:00:24 2016 PST using RSA key ID 48BAEBF6
gpg: Good signature from "Lewis John McGibbney (CODE SIGNING KEY)"
gpg: WARNING: This key is not certified with a trusted signature!
gpg:          There is no indication that the signature belongs to the owner.
Primary key fingerprint: DB7B 5199 121C 08A5 C8F4 052B 3A47 17F0 48BA EBF6

[chipotle:~/tmp/apache-nutch-2.3.1-rc2] mattmann% $HOME/bin/verify_md5_checksums
md5sum: stat '*.bz2': No such file or directory
md5sum: stat '*.tgz': No such file or directory
apache-nutch-2.3.1-src.tar.gz: OK
apache-nutch-2.3.1-src.zip: OK
[chipotle:~/tmp/apache-nutch-2.3.1-rc2] mattmann%

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-----Original Message-----
From: Lewis John Mcgibbney
Reply-To: "dev@nutch.apache.org"
Date: Wednesday, January 20, 2016 at 7:14 PM
To: "u...@nutch.apache.org", "dev@nutch.apache.org"
Subject: Re: [VOTE] Release Apache Nutch 2.3.1rc2

>Hi user@, dev@,
>
>PING on the Nutch 2.3.1 RC#2
>
>Would really appreciate anyone who is able to review this release
>candidate. It would mean a lot for our 2.X user base.
>
>Thank you
>
>Lewis
>
>
>On Sun, Jan 10, 2016 at 7:01 AM, Lewis John Mcgibbney wrote:
>
>Hi Folks,
>
>A second candidate for the Nutch 2.3.1 release is available at:
>
>https://dist.apache.org/repos/dist/dev/nutch/2.3.1rc2/
>
>The release candidate is a zip and tar.gz sources archive of the sources
>in:
>
>http://svn.apache.org/repos/asf/nutch/tags/release-2.3.1rc2/
>
>In addition, a staged maven repository is available here:
>
>https://repository.apache.org/content/repositories/orgapachenutch-1008/
>
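The checksum step in the message above relies on local helper scripts whose contents are not shown in this thread. For reviewers without them, the same check can be sketched in Python. This is a minimal sketch, assuming each artifact has a sibling `<name>.md5` file whose first whitespace-separated token is the hex digest (as in `md5sum` output); that file layout and the function names are assumptions, not something the thread confirms.

```python
import hashlib
from pathlib import Path


def md5_hex(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(artifact, checksum_file=None):
    """Compare an artifact against its recorded MD5 digest.

    Assumption (hypothetical layout): the checksum file starts with the
    hex digest, e.g. "<digest>  <filename>". Defaults to "<artifact>.md5".
    """
    artifact = Path(artifact)
    if checksum_file is None:
        checksum_file = Path(str(artifact) + ".md5")
    expected = Path(checksum_file).read_text().split()[0].lower()
    return md5_hex(artifact) == expected
```

Calling `verify_md5("apache-nutch-2.3.1-src.tar.gz")` in the download directory would return `True` when the digests match. A matching MD5 only shows the download is intact; it is the GPG signature check that ties the artifact to the release manager's key.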
[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110217#comment-15110217 ]

Tien Nguyen Manh commented on NUTCH-961:

I'm using the patch NUTCH-961-1.11-1.patch. It works fine when run from Eclipse and when run in Hadoop, but it has a problem when I run in local mode: it throws the exception "Can't retrieve Tika parser for mime-type text/html". It is not a problem with parse-plugins.xml. It seems to be a problem with the TikaConfig constructor TikaConfig(ClassLoader loader), which fails to load some config via the class loader when run in local mode.

> Expose Tika's boilerpipe support
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java,
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch,
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch,
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch,
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch,
> NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to
> extract boilerplate content from HTML pages. We should see how we can expose
> Boilerplate in the Nutch configuration.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [MASSMAIL]Re: Nutch/Solr communication problem
Hi,

Can you share the Solr logs too?

Regards

> From: "Zara Parst"
> To: dev@nutch.apache.org
> Sent: Wednesday, January 20, 2016 4:52:58 AM
> Subject: Re: [MASSMAIL]Re: Nutch/Solr communication problem
>
> Hi,
>
> Everyone, if you check the log file, it does talk about the error. Here is a
> re-briefing of the problem.
>
> 1. Solr without any authentication => Nutch works successfully and populates
> the Solr core, say (abc).
> 2. Solr with protection and Nutch solr.auth=false => unauthorized access,
> which makes sense.
> 3. Solr with protection and Nutch solr.auth=true and the correct id and
> password in the config file => it spits out the error; I have attached the
> log at the bottom of this email.
>
> When I use authentication, Nutch is not able to insert data. However, the
> problem is not related to Solr, because if I try to populate data with Solr
> having an id and password and Nutch with solr.auth=false, it prints
> unauthorized access, and that makes sense. Now, with solr.auth=true and the
> id and password in nutch-default, Nutch is not able to insert data, and below
> is the error log. I guess there might be some user right like admin or
> content-admin in Solr? I tried that too, with all kinds of users, and always
> got the same error. Could someone try and see if they can push data to a
> protected Solr? If you are not getting the error, then please tell me in
> detail what configuration you are using. Treat me like a novice and tell me
> how to do it, because I have tried all kinds of permutations of configuration
> on both the Solr and Nutch sides without any luck. Please do help me; this is
> a genuine request. I do understand you guys are pretty busy with your work;
> it's not that I am just bothering you without doing my homework.
>
> Please see the log:
>
> 2016-01-20 07:02:15,658 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 2016-01-20 07:04:36,366 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2016-01-20 07:04:36,656 INFO segment.SegmentChecker - Segment dir is complete: file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119163402.
> 2016-01-20 07:04:36,658 INFO segment.SegmentChecker - Segment dir is complete: file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119163656.
> 2016-01-20 07:04:36,673 INFO segment.SegmentChecker - Segment dir is complete: file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119164952.
> 2016-01-20 07:04:36,674 INFO indexer.IndexingJob - Indexer: starting at 2016-01-20 07:04:36
> 2016-01-20 07:04:36,676 INFO indexer.IndexingJob - Indexer: deleting gone documents: false
> 2016-01-20 07:04:36,676 INFO indexer.IndexingJob - Indexer: URL filtering: false
> 2016-01-20 07:04:36,676 INFO indexer.IndexingJob - Indexer: URL normalizing: false
> 2016-01-20 07:04:37,036 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 2016-01-20 07:04:37,036 INFO indexer.IndexingJob - Active IndexWriters :
> SolrIndexWriter
>   solr.server.type : Type of SolrServer to communicate with (default 'http' however options include 'cloud', 'lb' and 'concurrent')
>   solr.server.url : URL of the Solr instance (mandatory)
>   solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud' value for solr.server.type)
>   solr.loadbalance.urls : Comma-separated string of Solr server strings to be used (madatory if 'lb' value for solr.server.type)
>   solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>   solr.commit.size : buffer size when sending to Solr (default 1000)
>   solr.auth : use authentication (default false)
>   solr.auth.username : username for authentication
>   solr.auth.password : password for authentication
> 2016-01-20 07:04:37,039 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: yahCrawl/crawldb
> 2016-01-20 07:04:37,039 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: yahCrawl/linkdb
> 2016-01-20 07:04:37,039 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119163402
> 2016-01-20 07:04:37,045 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119163656
> 2016-01-20 07:04:37,046 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119164952
> 2016-01-20 07:04:37,047 WARN indexer.IndexerMapReduce - Ignoring linkDb for indexing, no linkDb found in path: yahCrawl/linkdb
> 2016-01-20 07:04:38,151 WARN conf.Configuration - file:/tmp/hadoop-rakesh/mapred/staging/rakesh1643615475/.staging/job_local1643615475_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;
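One way to separate a Solr-side problem from a Nutch-side one in the three scenarios quoted above is to send a request to Solr directly, outside Nutch, with and without credentials. The following is a minimal Python sketch, assuming the protected Solr uses HTTP Basic authentication (the thread does not say which scheme is in use); the function names are hypothetical.

```python
import base64
import urllib.error
import urllib.request


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic `Authorization` header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def probe(url, username=None, password=None, timeout=10):
    """Return the HTTP status code of a GET on `url`.

    Credentials are sent only when `username` is given. Assumes the
    server uses HTTP Basic authentication.
    """
    request = urllib.request.Request(url)
    if username is not None:
        request.add_header("Authorization", basic_auth_header(username, password))
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # For example 401 when credentials are missing or rejected.
        return error.code
```

For example, `probe("http://localhost:8983/solr/abc/select?q=*:*&rows=0")` followed by the same call with a username and password (the host, port, and core path here are placeholders) shows whether Solr itself accepts the credentials. If the first call returns 401 and the second returns 200, the protection is working and the remaining suspects are on the Nutch side, such as where `solr.auth`, `solr.auth.username`, and `solr.auth.password` are set; overrides conventionally go in nutch-site.xml rather than nutch-default.xml.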
Re: [MASSMAIL]Re: Nutch/Solr communication problem
Hi,

Everyone, if you check the log file, it does talk about the error. Here is a re-briefing of the problem.

1. Solr without any authentication => Nutch works successfully and populates the Solr core, say (abc).
2. Solr with protection and Nutch solr.auth=false => unauthorized access, which makes sense.
3. Solr with protection and Nutch solr.auth=true and the correct id and password in the config file => it spits out the error; I have attached the log at the bottom of this email.

When I use authentication, Nutch is not able to insert data. However, the problem is not related to Solr, because if I try to populate data with Solr having an id and password and Nutch with solr.auth=false, it prints unauthorized access, and that makes sense. Now, with solr.auth=true and the id and password in nutch-default, Nutch is not able to insert data, and below is the error log. I guess there might be some user right like admin or content-admin in Solr? I tried that too, with all kinds of users, and always got the same error. Could someone try and see if they can push data to a protected Solr? If you are not getting the error, then please tell me in detail what configuration you are using. Treat me like a novice and tell me how to do it, because I have tried all kinds of permutations of configuration on both the Solr and Nutch sides without any luck. Please do help me; this is a genuine request. I do understand you guys are pretty busy with your work; it's not that I am just bothering you without doing my homework.

Please see the log:

2016-01-20 07:02:15,658 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-01-20 07:04:36,366 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-01-20 07:04:36,656 INFO segment.SegmentChecker - Segment dir is complete: file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119163402.
2016-01-20 07:04:36,658 INFO segment.SegmentChecker - Segment dir is complete: file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119163656.
2016-01-20 07:04:36,673 INFO segment.SegmentChecker - Segment dir is complete: file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119164952.
2016-01-20 07:04:36,674 INFO indexer.IndexingJob - Indexer: starting at 2016-01-20 07:04:36
2016-01-20 07:04:36,676 INFO indexer.IndexingJob - Indexer: deleting gone documents: false
2016-01-20 07:04:36,676 INFO indexer.IndexingJob - Indexer: URL filtering: false
2016-01-20 07:04:36,676 INFO indexer.IndexingJob - Indexer: URL normalizing: false
2016-01-20 07:04:37,036 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-01-20 07:04:37,036 INFO indexer.IndexingJob - Active IndexWriters :
SolrIndexWriter
  solr.server.type : Type of SolrServer to communicate with (default 'http' however options include 'cloud', 'lb' and 'concurrent')
  solr.server.url : URL of the Solr instance (mandatory)
  solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud' value for solr.server.type)
  solr.loadbalance.urls : Comma-separated string of Solr server strings to be used (madatory if 'lb' value for solr.server.type)
  solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
  solr.commit.size : buffer size when sending to Solr (default 1000)
  solr.auth : use authentication (default false)
  solr.auth.username : username for authentication
  solr.auth.password : password for authentication
2016-01-20 07:04:37,039 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: yahCrawl/crawldb
2016-01-20 07:04:37,039 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: yahCrawl/linkdb
2016-01-20 07:04:37,039 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119163402
2016-01-20 07:04:37,045 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119163656
2016-01-20 07:04:37,046 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/home/rakesh/Desktop/arima/nutch/runtime/local/yahCrawl/segements/20160119164952
2016-01-20 07:04:37,047 WARN indexer.IndexerMapReduce - Ignoring linkDb for indexing, no linkDb found in path: yahCrawl/linkdb
2016-01-20 07:04:38,151 WARN conf.Configuration - file:/tmp/hadoop-rakesh/mapred/staging/rakesh1643615475/.staging/job_local1643615475_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-01-20 07:04:38,153 WARN conf.Configuration - file:/tmp/hadoop-rakesh/mapred/staging/rakesh1643615475/.staging/job_local1643615475_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-01-20 07:04:38,312 WARN conf.Configuration -