[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1

2022-08-09 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577447#comment-17577447
 ] 

Markus Jelsma commented on NUTCH-2959:
--

Nice, thanks to NUTCH-2669 I can get past the issue by using:
ant -Dpackaging.type=jar clean runtime test


Everything now builds, except that I am stopped by the indexer-elastic plugin. 
It is the same error that I ran into some time ago.

 
{code:java}
    [javac] /home/markus/projects/apache/nutch/nutch/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39: error: package org.apache.http.impl.nio.client does not exist
    [javac] import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
{code}
I disabled the plugin; the tests seem to pass except for 
TestRobotsMetaProcessor, which complains about Any23.

> Upgrade to Apache Tika 2.4.1
> 
>
> Key: NUTCH-2959
> URL: https://issues.apache.org/jira/browse/NUTCH-2959
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.19
>
> Attachments: NUTCH-2959.patch
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1

2022-08-09 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577420#comment-17577420
 ] 

Markus Jelsma commented on NUTCH-2959:
--

Here's a patch. It does not include the change in plugin.xml for any23. 
It is also untested because, for some reason, I cannot build Nutch anymore, again :(
{code:java}
[ivy:resolve]   [FAILED ] javax.ws.rs#javax.ws.rs-api;2.1.1!javax.ws.rs-api.${packaging.type}:  (0ms)
[ivy:resolve]    local: tried
[ivy:resolve]     /home/markus/.ivy2/local/javax.ws.rs/javax.ws.rs-api/2.1.1/${packaging.type}s/javax.ws.rs-api.${packaging.type}
[ivy:resolve]    maven2: tried
[ivy:resolve]     https://repo1.maven.org/maven2/javax/ws/rs/javax.ws.rs-api/2.1.1/javax.ws.rs-api-2.1.1.${packaging.type}
[ivy:resolve]    apache-snapshot: tried
[ivy:resolve]     https://repository.apache.org/content/repositories/snapshots/javax/ws/rs/javax.ws.rs-api/2.1.1/javax.ws.rs-api-2.1.1.${packaging.type}
[ivy:resolve]    sonatype: tried
[ivy:resolve]     https://oss.sonatype.org/content/repositories/releases/javax/ws/rs/javax.ws.rs-api/2.1.1/javax.ws.rs-api-2.1.1.${packaging.type}
[ivy:resolve]   ::
[ivy:resolve]   ::  FAILED DOWNLOADS    ::
[ivy:resolve]   :: ^ see resolution messages for details  ^ ::
[ivy:resolve]   ::
[ivy:resolve]   :: javax.ws.rs#javax.ws.rs-api;2.1.1!javax.ws.rs-api.${packaging.type}
[ivy:resolve]   ::
{code}
I cleared my Ivy cache and created a clean checkout. Some other build error 
mysteriously resolved itself, and now we see this one. I haven't seen this error in a 
long time.



[jira] [Updated] (NUTCH-2959) Upgrade to Apache Tika 2.4.1

2022-08-09 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2959:
-
Attachment: NUTCH-2959.patch



[jira] [Created] (NUTCH-2959) Upgrade to Apache Tika 2.4.1

2022-08-09 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-2959:


 Summary: Upgrade to Apache Tika 2.4.1
 Key: NUTCH-2959
 URL: https://issues.apache.org/jira/browse/NUTCH-2959
 Project: Nutch
  Issue Type: Task
Reporter: Markus Jelsma
 Fix For: 1.19








Re: [DISCUSS] Release 1.19 ?

2022-08-09 Thread Markus Jelsma
Sounds good!

I see we're still at Tika 2.3.0; I'll submit a patch to upgrade to the
current 2.4.1.

Thanks!
Markus

On Tue, 9 Aug 2022 at 09:11, Sebastian Nagel wrote:

> Hi all,
>
> more than 60 issues are done for Nutch 1.19
>
>   https://issues.apache.org/jira/projects/NUTCH/versions/12349580
>
> including
>  - important dependency upgrades
>    - Hadoop 3.3.3
>    - Any23 2.7
>    - Tika 2.3.0
>  - plugin-specific URL stream handlers (NUTCH-2429)
>  - migration
>    - from Java/JDK 8 to 11
>    - from Log4j 1 to Log4j 2
>
> ... and various other fixes and improvements.
>
> The last release (1.18) happened in January 2021, so it's definitely high
> time to release 1.19. As usual, we'll check all remaining issues whether they
> should be fixed now or can be done in a later release.
>
> I would be ready to push a release candidate during the next two weeks and
> will start to work through the remaining issues and also check for dependency
> upgrades required to address potential vulnerabilities. Please, comment on
> issues you want to get fixed already in 1.19! Reviews of open pull requests
> and patches are also welcome!
>
> Thanks,
> Sebastian
>


[jira] [Commented] (NUTCH-2861) Remove parse-swf

2022-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577341#comment-17577341
 ] 

ASF GitHub Bot commented on NUTCH-2861:
---

sebastian-nagel opened a new pull request, #742:
URL: https://github.com/apache/nutch/pull/742

   Removes the plugin parse-swf and associated references in LICENSE.txt and 
NOTICE.txt




> Remove parse-swf
> 
>
> Key: NUTCH-2861
> URL: https://issues.apache.org/jira/browse/NUTCH-2861
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser, plugin
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
>  Labels: help-wanted
> Fix For: 1.19
>
>
> We should consider removing the Shockwave Flash parser plugin 
> ([parse-swf|https://github.com/apache/nutch/tree/master/src/plugin/parse-swf]):
> - Shockwave/[Adobe Flash| https://en.wikipedia.org/wiki/Adobe_Flash] reached 
> [end-of-life|https://helpx.adobe.com/shockwave/shockwave-end-of-life-faq.html]
> - major browsers now block playing Flash content
> - the plugin is based on a 15-year-old library 
> ([javaswf|https://github.com/apache/nutch/tree/master/src/plugin/parse-swf/lib]),
>  not maintained anymore and not available on Maven repository
> - it's shipped in binary form also in the source package which contradicts 
> the [Apache release 
> policy|https://www.apache.org/legal/release-policy.html#source-packages]
> Notes:
> - should place a notice about the removal in the release notes, as parse-tika 
> is not able to extract textual content from *.swf files
> - do not forget to unregister the plugin in 
> [parse-plugins.xml|https://github.com/apache/nutch/blob/6c02da053d8ce65e0283a144ab59586e563608b8/conf/parse-plugins.xml.template#L54]





[GitHub] [nutch] sebastian-nagel opened a new pull request, #742: NUTCH-2861 Remove parse-swf

2022-08-09 Thread GitBox


sebastian-nagel opened a new pull request, #742:
URL: https://github.com/apache/nutch/pull/742

   Removes the plugin parse-swf and associated references in LICENSE.txt and 
NOTICE.txt


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (NUTCH-2861) Remove parse-swf

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2861:
--

Assignee: Sebastian Nagel



[jira] [Assigned] (NUTCH-2821) Deduplicate licenses in LICENSE.txt file

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2821:
--

Assignee: Sebastian Nagel

> Deduplicate licenses in LICENSE.txt file
> 
>
> Key: NUTCH-2821
> URL: https://issues.apache.org/jira/browse/NUTCH-2821
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.17
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.19
>
>
> The LICENSE.txt contains duplicate licenses (esp. the Apache license) which 
> should be removed. Thanks @jmclean for the hint. Cf. NUTCH-723 which already 
> discussed the topic.





[jira] [Assigned] (NUTCH-2290) Update licenses of bundled libraries

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2290:
--

Assignee: Sebastian Nagel

> Update licenses of bundled libraries
> 
>
> Key: NUTCH-2290
> URL: https://issues.apache.org/jira/browse/NUTCH-2290
> Project: Nutch
>  Issue Type: Bug
>  Components: deployment
>Affects Versions: 2.3.1, 1.12
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
>  Labels: help-wanted
> Fix For: 1.19
>
> Attachments: 3rd-party-license-report.sh, 
> 3rd-party-licenses-nutch-1.15.txt, apache_nutch_1.17_3rd_party_licenses.txt
>
>
> The files LICENSE.txt and NOTICE.txt were last edited 5 years ago and should 
> be updated to include all licenses of dependencies (and their dependencies) 
> in accordance with the [Assembling LICENSE and NOTICE 
> HOWTO|http://www.apache.org/dev/licensing-howto.html]:
> # check for missing or obsolete licenses due to added or removed dependencies
> # update year in NOTICE.txt -- should be a range according to the licensing 
> HOWTO
> # bundled libraries are referenced with path and version number, e.g. 
> {{lib/icu4j-4_0_1.jar}}. This would require updating the LICENSE.txt with 
> every dependency upgrade. A more generic reference ("ICU4J") would be easier 
> to maintain, but the HOWTO requires one to "specify the version of the dependency 
> as licenses are sometimes changed".
> # try to reduce the size of LICENSE.txt (currently 5800 lines). Mainly, 
> according to the HOWTO there is no need to repeat the Apache license again 
> and again.





[jira] [Commented] (NUTCH-2956) index-geoip: dependency upgrades and improvements

2022-08-09 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577334#comment-17577334
 ] 

Hudson commented on NUTCH-2956:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #79 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/79/])
NUTCH-2956 index-geoip: dependency upgrades and improvements (snagel: 
[https://github.com/apache/nutch/commit/8fc4f17acc5da28c22ef4e77c2316e20e5976f02])
* (edit) src/plugin/index-geoip/plugin.xml
* (edit) 
src/plugin/index-geoip/src/java/org/apache/nutch/indexer/geoip/GeoIPDocumentCreator.java
* (edit) src/plugin/indexer-solr/schema.xml
* (edit) 
src/plugin/index-geoip/src/java/org/apache/nutch/indexer/geoip/GeoIPIndexingFilter.java
* (edit) conf/nutch-default.xml
* (edit) src/plugin/index-geoip/ivy.xml


> index-geoip: dependency upgrades and improvements
> -
>
> Key: NUTCH-2956
> URL: https://issues.apache.org/jira/browse/NUTCH-2956
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, plugin
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.19
>
>
> Upgrades and improvements to the index-geoip plugin:
> - upgrade the geoip2 dependencies
> - exclude transitive dependencies (jackson libs) also provided by Nutch core 
> deps
> - allow reading {{GeoLite2-\*.mmdb}} files without the need to rename them to 
> {{GeoIP2-\*.mmdb}}
> - review index field names in plugin and Nutch Solr schema:
>   -* fix typos in field names
>   -* remove unused fields from schema





[jira] [Commented] (NUTCH-2952) Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)

2022-08-09 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577286#comment-17577286
 ] 

Hudson commented on NUTCH-2952:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #78 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/78/])
NUTCH-2952 Upgrade core dependencies (snagel: 
[https://github.com/apache/nutch/commit/e71841fd0f1777ece6dde2115ea7c5b036bb13f1])
* (edit) src/plugin/publish-rabbitmq/ivy.xml
* (edit) ivy/ivy.xml


> Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)
> --
>
> Key: NUTCH-2952
> URL: https://issues.apache.org/jira/browse/NUTCH-2952
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.19
>
>
> Upgrade the core dependencies to Hadoop 3.3.3 and log4j 2.17.2 - and some 
> more.
> - [Hadoop 3.3.3|https://hadoop.apache.org/docs/r3.3.3/index.html] announces 
> full support for Java 11 and ARM architectures





[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used

2022-08-09 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577288#comment-17577288
 ] 

Hudson commented on NUTCH-2936:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #78 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/78/])
NUTCH-2936 Early registration of URL stream handlers provided by plugins may 
fail Hadoop jobs running in distributed mode if protocol-okhttp is used 
(snagel: 
[https://github.com/apache/nutch/commit/03e0ffda4e0c7a31c033541e937a742fe798608a])
* (edit) 
src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/OkHttp.java
NUTCH-2936 Early registration of URL stream handlers provided by plugins may 
fail Hadoop jobs running in distributed mode if protocol-okhttp is used 
(snagel: 
[https://github.com/apache/nutch/commit/1f5f3e4d42b8dfb8bf741b11c9f39cc8bcd34091])
* (edit) src/java/org/apache/nutch/plugin/Extension.java
* (edit) src/java/org/apache/nutch/plugin/PluginRepository.java
* (edit) src/java/org/apache/nutch/plugin/Plugin.java
* (edit) src/java/org/apache/nutch/plugin/URLStreamHandlerFactory.java
NUTCH-2936 Early registration of URL stream handlers provided by plugins may 
fail Hadoop jobs (snagel: 
[https://github.com/apache/nutch/commit/487110b07a8b085c5546b58a2157268b3d21cb19])
* (edit) src/java/org/apache/nutch/plugin/PluginRepository.java
* (edit) src/java/org/apache/nutch/plugin/URLStreamHandlerFactory.java


> Early registration of URL stream handlers provided by plugins may fail Hadoop 
> jobs running in distributed mode if protocol-okhttp is used
> -
>
> Key: NUTCH-2936
> URL: https://issues.apache.org/jira/browse/NUTCH-2936
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 1.19
>
>
> After merging NUTCH-2429 I've observed that Nutch jobs running in distributed 
> mode may fail early with the following dubious error:
> {noformat}
> 2022-01-14 13:11:45,751 ERROR crawl.DedupRedirectsJob: DeduplicationJob: 
> java.io.IOException: Error generating shuffle secret key
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:182)
> at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1565)
> at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1562)
> at java.base/java.security.AccessController.doPrivileged(Native 
> Method)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1562)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1583)
> at 
> org.apache.nutch.crawl.DedupRedirectsJob.run(DedupRedirectsJob.java:301)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at 
> org.apache.nutch.crawl.DedupRedirectsJob.main(DedupRedirectsJob.java:379)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
> Caused by: java.security.NoSuchAlgorithmException: HmacSHA1 KeyGenerator not 
> available
> at java.base/javax.crypto.KeyGenerator.<init>(KeyGenerator.java:177)
> at 
> java.base/javax.crypto.KeyGenerator.getInstance(KeyGenerator.java:244)
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:179)
> ... 16 more
> {noformat}
> After removing the early registration of URL stream handlers (see NUTCH-2429) 
> in NutchJob and NutchTool, the job starts without errors.
> Notes:
> - the job in which this error was observed is a [custom de-duplication 
> job|https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/DedupRedirectsJob.java]
>  to flag redirects pointing to the same target URL. But I'll try to reproduce 
> it with a standard Nutch job and in pseudo-distributed mode.
> - should also verify whether registering URL stream handlers works at all in 
> distributed mode. Tasks are launched differently, not as NutchJob or 
> NutchTool.
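
The root cause in the trace above is `KeyGenerator.getInstance("HmacSHA1")` failing inside Hadoop's `JobSubmitter`. As a minimal sketch, assuming one only wants to check whether a given JVM can still provide that key generator, a standalone probe could look like the following. The class name `HmacSha1Check` is hypothetical and not part of Nutch or Hadoop.

```java
import java.security.NoSuchAlgorithmException;
import javax.crypto.KeyGenerator;

// Hypothetical diagnostic helper; not part of Nutch or Hadoop.
final class HmacSha1Check {

  /** Returns true if the JVM can provide a KeyGenerator for the algorithm. */
  static boolean isKeyGeneratorAvailable(String algorithm) {
    try {
      KeyGenerator.getInstance(algorithm);
      return true;
    } catch (NoSuchAlgorithmException e) {
      // Same exception type as the "Caused by" in the stack trace.
      return false;
    }
  }

  public static void main(String[] args) {
    // "HmacSHA1" is the algorithm named in the stack trace.
    System.out.println("HmacSHA1 KeyGenerator available: "
        + isKeyGeneratorAvailable("HmacSHA1"));
  }
}
```

Run on its own, this only tells whether the JDK ships the algorithm. Calling the same check before and after the URL stream handlers are registered might help narrow down where the provider lookup starts to fail, but that is an assumption, not something verified in this issue.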





[jira] [Commented] (NUTCH-2953) Indexer Elastic to ignore SSL issues

2022-08-09 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577287#comment-17577287
 ] 

Hudson commented on NUTCH-2953:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #78 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/78/])
NUTCH-2953 Indexer Elastic to ignore SSL issues (snagel: 
[https://github.com/apache/nutch/commit/01ab00b6cd8dbba8abbf1d3840a09bab929c6af0])
* (edit) 
src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java


> Indexer Elastic to ignore SSL issues
> 
>
> Key: NUTCH-2953
> URL: https://issues.apache.org/jira/browse/NUTCH-2953
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.18
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
>  Labels: PatchAvailable, patch-available
> Fix For: 1.19
>
> Attachments: NUTCH-2953-1.patch, NUTCH-2953-2.patch, NUTCH-2953.patch
>
>
> IndexerElastic (in 1.18) has no support for transporting over HTTPS, but 1.19 
> does. However, 1.19 has no support for ignoring SSL issues that are common with 
> self-signed certificates.
> This patch is for 1.18 only and was made without knowing SSL support was 
> already there in master. Hence, the difference in config property naming, 
> protocol (1.18/patch)  == scheme (1.19).
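
For readers unfamiliar with what "ignoring SSL issues" usually means in Java, here is a minimal sketch using only the JDK. It is not taken from the NUTCH-2953 patches, and the class name `TrustAllSslContext` is hypothetical. It builds an `SSLContext` whose trust manager accepts any certificate chain, which is what makes self-signed certificates pass. This disables certificate validation entirely and is only appropriate for test or closed environments.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical helper; not the code from the NUTCH-2953 patches.
final class TrustAllSslContext {

  /** Builds an SSLContext whose trust manager accepts every certificate chain. */
  static SSLContext create() {
    TrustManager trustAll = new X509TrustManager() {
      @Override
      public void checkClientTrusted(X509Certificate[] chain, String authType) {
        // Intentionally empty: no validation is performed.
      }

      @Override
      public void checkServerTrusted(X509Certificate[] chain, String authType) {
        // Intentionally empty: self-signed certificates are accepted.
      }

      @Override
      public X509Certificate[] getAcceptedIssuers() {
        return new X509Certificate[0];
      }
    };
    try {
      SSLContext context = SSLContext.getInstance("TLS");
      // null key managers and null SecureRandom select the JDK defaults.
      context.init(null, new TrustManager[] { trustAll }, null);
      return context;
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException("Could not create SSLContext", e);
    }
  }
}
```

Such a context would typically be handed to whatever HTTP client the index writer uses. Hostname verification is a separate check and is not affected by this sketch.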





[jira] [Resolved] (NUTCH-2956) index-geoip: dependency upgrades and improvements

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2956.

Resolution: Implemented

Merged. Thanks, [~markus17]!



[jira] [Commented] (NUTCH-2956) index-geoip: dependency upgrades and improvements

2022-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577278#comment-17577278
 ] 

ASF GitHub Bot commented on NUTCH-2956:
---

sebastian-nagel merged PR #738:
URL: https://github.com/apache/nutch/pull/738






[GitHub] [nutch] sebastian-nagel merged pull request #738: NUTCH-2956 index-geoip: dependency upgrades and improvements

2022-08-09 Thread GitBox


sebastian-nagel merged PR #738:
URL: https://github.com/apache/nutch/pull/738





[jira] [Commented] (NUTCH-2956) index-geoip: dependency upgrades and improvements

2022-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577240#comment-17577240
 ] 

ASF GitHub Bot commented on NUTCH-2956:
---

sebastian-nagel commented on PR #738:
URL: https://github.com/apache/nutch/pull/738#issuecomment-1209049781

   Resolved conflicts with the current master after merging #734 which already 
includes an upgrade of Jackson dependencies.






[GitHub] [nutch] sebastian-nagel commented on pull request #738: NUTCH-2956 index-geoip: dependency upgrades and improvements

2022-08-09 Thread GitBox


sebastian-nagel commented on PR #738:
URL: https://github.com/apache/nutch/pull/738#issuecomment-1209049781

   Resolved conflicts with the current master after merging #734 which already 
includes an upgrade of Jackson dependencies.





[jira] [Resolved] (NUTCH-2953) Indexer Elastic to ignore SSL issues

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2953.

Resolution: Implemented

Merged PR. Thanks, [~markus17] !



[jira] [Commented] (NUTCH-2953) Indexer Elastic to ignore SSL issues

2022-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577228#comment-17577228
 ] 

ASF GitHub Bot commented on NUTCH-2953:
---

sebastian-nagel merged PR #741:
URL: https://github.com/apache/nutch/pull/741






[GitHub] [nutch] sebastian-nagel merged pull request #741: NUTCH-2953 Indexer Elastic to ignore SSL issues

2022-08-09 Thread GitBox


sebastian-nagel merged PR #741:
URL: https://github.com/apache/nutch/pull/741





[jira] [Resolved] (NUTCH-2949) Tasks of a multi-threaded map runner may fail because of slow creation of URL stream handlers

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2949.

  Assignee: Sebastian Nagel
Resolution: Fixed

Fixed via NUTCH-2936 / [PR#733|https://github.com/apache/nutch/pull/733].
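
The thread dumps in the issue description quoted below show many mapper threads blocked on a single lock while one thread creates the handler. As a rough illustration of the general idea of paying that cost only once per protocol, here is a minimal sketch of a caching `URLStreamHandlerFactory`. It is not the code merged in PR #733, whose approach may differ; `CachingHandlerFactory` and the `slowLookup` function are hypothetical names standing in for the plugin-based lookup.

```java
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical illustration; not the code merged in PR #733.
final class CachingHandlerFactory implements URLStreamHandlerFactory {

  private final Map<String, URLStreamHandler> cache = new ConcurrentHashMap<>();

  // Stand-in for the expensive, plugin-based handler creation.
  private final Function<String, URLStreamHandler> slowLookup;

  CachingHandlerFactory(Function<String, URLStreamHandler> slowLookup) {
    this.slowLookup = slowLookup;
  }

  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    // The lookup runs once for each protocol that yields a handler; later
    // calls read the cached instance without taking a shared lock.
    // A null result (protocol not handled here) is not cached, so such
    // protocols are looked up again on every call.
    return cache.computeIfAbsent(protocol, slowLookup);
  }
}
```

Because null results are not stored, a real implementation would probably also want to remember the protocols it does not handle; that is left out here to keep the sketch short.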

> Tasks of a multi-threaded map runner may fail because of slow creation of URL 
> stream handlers
> -
>
> Key: NUTCH-2949
> URL: https://issues.apache.org/jira/browse/NUTCH-2949
> Project: Nutch
>  Issue Type: Bug
>  Components: net, plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Blocker
> Fix For: 1.19
>
>
> While running a custom Nutch job ([code 
> here|https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/SitemapInjector.java]),
>  many but not all tasks failed, exceeding the Hadoop task time-out 
> (`mapreduce.task.timeout`) without generating any "heartbeat" (output, 
> counter increments, log messages). Hadoop logs the stacks of all threads of 
> the timed-out task. That's the basis for the excerpts below.
> The job runs a MultithreadedMapper - most of the mapper threads (48 in total) 
> are waiting for the URLStreamHandler in order to construct a java.net.URL 
> object:
> {noformat}
> "Thread-11" #27 prio=5 os_prio=0 cpu=243.78ms elapsed=647.25s 
> tid=0x7f3eb5b0f800 nid=0x8e651 waiting for monitor entry  
> [0x7f3e84ef9000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at 
> org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
>         - waiting to lock <0x0006a1bc0630> (a java.lang.String)
>         at 
> org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:597)
>         at 
> org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95)
>         at java.net.URL.getURLStreamHandler(java.base@11.0.15/URL.java:1432)
>         at java.net.URL.<init>(java.base@11.0.15/URL.java:651)
>         at java.net.URL.<init>(java.base@11.0.15/URL.java:541)
>         at java.net.URL.<init>(java.base@11.0.15/URL.java:488)
>         at 
> org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:179)
>         at 
> org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:318)
>         at 
> org.apache.nutch.crawl.Injector$InjectMapper.filterNormalize(Injector.java:157)
>         at 
> org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper$SitemapProcessor.getContent(SitemapInjector.java:670)
>         at 
> org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper$SitemapProcessor.process(SitemapInjector.java:439)
>         at 
> org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper.map(SitemapInjector.java:325)
>         at 
> org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper.map(SitemapInjector.java:145)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>         at 
> org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper$MapRunner.run(MultithreadedMapper.java:274)
> {noformat}
> Only a single mapper thread is active:
> {noformat}
> "Thread-23" #39 prio=5 os_prio=0 cpu=5830.17ms elapsed=647.09s 
> tid=0x7f3eb5b42800 nid=0x8e661 in Object.wait()  [0x7f3e842ec000]
>java.lang.Thread.State: RUNNABLE
> at 
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(java.base@11.0.15/Native
>  Method)
> at 
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(java.base@11.0.15/NativeConstructorAccessorImpl.java:62)
> at 
> jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(java.base@11.0.15/DelegatingConstructorAccessorImpl.java:45)
> at 
> java.lang.reflect.Constructor.newInstance(java.base@11.0.15/Constructor.java:490)
> at 
> org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:170)
> - locked <0x0006a1bc0630> (a java.lang.String)
> at 
> org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:597)
> at 
> org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95)
> at java.net.URL.getURLStreamHandler(java.base@11.0.15/URL.java:1432)
> at java.net.URL.<init>(java.base@11.0.15/URL.java:651)
> at java.net.URL.<init>(java.base@11.0.15/URL.java:541)
> at java.net.URL.<init>(java.base@11.0.15/URL.java:488)
> at 
> org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:179)
> at 
> org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:318)
> at 
> 
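The thread dumps above show 48 mapper threads queueing on one monitor while a single thread instantiates the extension. One common way to reduce this kind of contention is to create each handler once per protocol and serve later lookups from a concurrent map. The following is a minimal sketch under that assumption; `HandlerCache` and its factory argument are hypothetical names for illustration, not Nutch classes, and this is not necessarily the change that was made for this issue.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical helper, not part of Nutch: creates one handler per protocol. */
final class HandlerCache<H> {
  private final ConcurrentMap<String, H> cache = new ConcurrentHashMap<>();
  private final Function<String, H> slowFactory;

  HandlerCache(Function<String, H> slowFactory) {
    this.slowFactory = slowFactory;
  }

  /** Returns the cached handler; the factory runs at most once per protocol. */
  H get(String protocol) {
    return cache.computeIfAbsent(protocol, slowFactory);
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicInteger created = new AtomicInteger();
    HandlerCache<Object> handlers = new HandlerCache<>(protocol -> {
      created.incrementAndGet(); // stands in for the slow reflective instantiation
      return new Object();
    });

    // 48 threads, as in the report, all asking for the same protocol.
    Thread[] threads = new Thread[48];
    for (int i = 0; i < threads.length; i++) {
      threads[i] = new Thread(() -> handlers.get("http"));
      threads[i].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    System.out.println("instances created: " + created.get());
  }
}
```

With this layout only the first lookup for a protocol pays the instantiation cost; later lookups read the map without taking a shared lock. A factory that returns null for a protocol would be called again on every lookup, so a real implementation would need to handle that case separately.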

[jira] [Commented] (NUTCH-2945) Solr Index Writer pluging schema.xml missing a copyToField

2022-08-09 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577220#comment-17577220
 ] 

Sebastian Nagel commented on NUTCH-2945:


NUTCH-2957 should fix this issue.

> Solr Index Writer pluging schema.xml missing a copyToField
> --
>
> Key: NUTCH-2945
> URL: https://issues.apache.org/jira/browse/NUTCH-2945
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.19
> Environment: Solr 8.5.1
> OpenJDK 11
> Ubuntu 22.04
>  
>Reporter: Danielle Fisla
>Assignee: Sebastian Nagel
>Priority: Blocker
> Fix For: 1.19
>
> Attachments: schema.xml
>
>
> Solr Index Writer plugin schema.xml missing a copyToField
>  
> java.lang.Exception: 
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
> from server at http://localhost:8983/solr/nutch: copyField dest 
> :'description_str' is not an explicit field and doesn't match a dynamicField.
>         at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492) 
> ~[hadoop-mapreduce-client-common-3.1.3.jar:?]
>         at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:559) 
> [hadoop-mapreduce-client-common-3.1.3.jar:?]
> Caused by: 
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
> from server at http://localhost:8983/solr/nutch: copyField dest 
> :'description_str' is not an explicit field and doesn't match a dynamicField.
>         at 
> org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:665)
>  ~[?:?]
>         at 
> org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265)
>  ~[?:?]
>         at 
> org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
>  ~[?:?]
>         at 
> org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290) ~[?:?]
>         at 
> org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:250)
>  ~[?:?]
>         at 
> org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:219)
>  ~[?:?]
>         at 
> org.apache.nutch.indexer.IndexWriters.commit(IndexWriters.java:264) 
> ~[apache-nutch-1.19-SNAPSHOT.jar:?]
>         at 
> org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:54)
>  ~[apache-nutch-1.19-SNAPSHOT.jar:?]
>         at 
> org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:551)
>  ~[hadoop-mapreduce-client-core-3.1.3.jar:?]
>         at 
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:630) 
> ~[hadoop-mapreduce-client-core-3.1.3.jar:?]
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390) 
> ~[hadoop-mapreduce-client-core-3.1.3.jar:?]
>         at 
> org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:347)
>  ~[hadoop-mapreduce-client-common-3.1.3.jar:?]
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
>         at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>  ~[?:?]
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>  ~[?:?]
>         at java.lang.Thread.run(Thread.java:833) ~[?:?]





[jira] [Assigned] (NUTCH-2945) Solr Index Writer pluging schema.xml missing a copyToField

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2945:
--

Assignee: Sebastian Nagel



[jira] [Resolved] (NUTCH-2952) Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2952.

Resolution: Implemented

> Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)
> --
>
> Key: NUTCH-2952
> URL: https://issues.apache.org/jira/browse/NUTCH-2952
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.19
>
>
> Upgrade the core dependencies to Hadoop 3.3.3 and log4j 2.17.2 - and some 
> more.
> - [Hadoop 3.3.3|https://hadoop.apache.org/docs/r3.3.3/index.html] announces 
> full support for Java 11 and ARM architectures





[jira] [Commented] (NUTCH-2952) Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)

2022-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577219#comment-17577219
 ] 

ASF GitHub Bot commented on NUTCH-2952:
---

sebastian-nagel merged PR #734:
URL: https://github.com/apache/nutch/pull/734






[GitHub] [nutch] sebastian-nagel merged pull request #734: NUTCH-2952 Upgrade core dependencies

2022-08-09 Thread GitBox


sebastian-nagel merged PR #734:
URL: https://github.com/apache/nutch/pull/734





[jira] [Resolved] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2936.

Resolution: Fixed

Thanks, everybody!

> Early registration of URL stream handlers provided by plugins may fail Hadoop 
> jobs running in distributed mode if protocol-okhttp is used
> -
>
> Key: NUTCH-2936
> URL: https://issues.apache.org/jira/browse/NUTCH-2936
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 1.19
>
>
> After merging NUTCH-2429 I've observed that Nutch jobs running in distributed 
> mode may fail early with the following dubious error:
> {noformat}
> 2022-01-14 13:11:45,751 ERROR crawl.DedupRedirectsJob: DeduplicationJob: 
> java.io.IOException: Error generating shuffle secret key
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:182)
> at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1565)
> at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1562)
> at java.base/java.security.AccessController.doPrivileged(Native 
> Method)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1562)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1583)
> at 
> org.apache.nutch.crawl.DedupRedirectsJob.run(DedupRedirectsJob.java:301)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at 
> org.apache.nutch.crawl.DedupRedirectsJob.main(DedupRedirectsJob.java:379)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
> Caused by: java.security.NoSuchAlgorithmException: HmacSHA1 KeyGenerator not 
> available
> at java.base/javax.crypto.KeyGenerator.<init>(KeyGenerator.java:177)
> at 
> java.base/javax.crypto.KeyGenerator.getInstance(KeyGenerator.java:244)
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:179)
> ... 16 more
> {noformat}
> After removing the early registration of URL stream handlers (see NUTCH-2429) 
> in NutchJob and NutchTool, the job starts without errors.
> Notes:
> - the job in which this error was observed is a [custom de-duplication 
> job|https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/DedupRedirectsJob.java]
>  to flag redirects pointing to the same target URL. But I'll try to reproduce 
> it with a standard Nutch job and in pseudo-distributed mode.
> - should also verify whether registering URL stream handlers works at all in 
> distributed mode. Tasks are launched differently, not as NutchJob or 
> NutchTool.
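To narrow down an error like the one above, it can help to check whether the JVM can supply the key generator named in the trace when neither Nutch nor Hadoop is involved. The following is a minimal sketch; the class name `ShuffleKeyCheck` and the 64-bit key length are assumptions for illustration, and only the algorithm name `HmacSHA1` is taken from the stack trace.

```java
import java.security.NoSuchAlgorithmException;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

/** Hypothetical diagnostic, not part of Nutch or Hadoop. */
final class ShuffleKeyCheck {
  // Algorithm name as reported in the stack trace.
  static final String ALGORITHM = "HmacSHA1";
  // Assumed key length in bits, chosen for illustration only.
  static final int KEY_LENGTH_BITS = 64;

  /** Returns true if a key for ALGORITHM can be generated in this JVM. */
  static boolean canGenerate() {
    try {
      KeyGenerator keyGen = KeyGenerator.getInstance(ALGORITHM);
      keyGen.init(KEY_LENGTH_BITS);
      SecretKey key = keyGen.generateKey();
      return key != null;
    } catch (NoSuchAlgorithmException e) {
      // Same exception type as the "Caused by" entry in the trace.
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(ALGORITHM + " key generator available: " + canGenerate());
  }
}
```

If this check succeeds on its own but the job still fails, that would point towards something done earlier in the same JVM, such as the early handler registration discussed above, rather than towards a missing security provider.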





[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used

2022-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577217#comment-17577217
 ] 

ASF GitHub Bot commented on NUTCH-2936:
---

sebastian-nagel merged PR #733:
URL: https://github.com/apache/nutch/pull/733






[GitHub] [nutch] sebastian-nagel merged pull request #733: NUTCH-2936 / NUTCH-2949 URLStreamHandler may fail jobs in distributed mode

2022-08-09 Thread GitBox


sebastian-nagel merged PR #733:
URL: https://github.com/apache/nutch/pull/733





[jira] [Updated] (NUTCH-2877) fireant upgrade dependency t-digest in ivy/ivy.xml from 3.2 to 3.3

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2877:
---
Fix Version/s: (was: 1.19)

> fireant upgrade dependency t-digest in ivy/ivy.xml from 3.2 to 3.3
> --
>
> Key: NUTCH-2877
> URL: https://issues.apache.org/jira/browse/NUTCH-2877
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, fireant
>Reporter: Lewis John McGibbney
>Priority: Major
>
> fireant upgrade dependency t-digest in ivy/ivy.xml from 3.2 to 3.3





[jira] [Closed] (NUTCH-2878) fireant upgrade dependency hadoop-common in ivy/ivy.xml from 3.1.3 to 3.3.1

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-2878.
--
Resolution: Invalid

> fireant upgrade dependency hadoop-common in ivy/ivy.xml from 3.1.3 to 3.3.1
> ---
>
> Key: NUTCH-2878
> URL: https://issues.apache.org/jira/browse/NUTCH-2878
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, fireant
>Reporter: Lewis John McGibbney
>Priority: Major
>
> fireant upgrade dependency hadoop-common in ivy/ivy.xml from 3.1.3 to 3.3.1





[jira] [Updated] (NUTCH-2878) fireant upgrade dependency hadoop-common in ivy/ivy.xml from 3.1.3 to 3.3.1

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2878:
---
Fix Version/s: (was: 1.19)

> fireant upgrade dependency hadoop-common in ivy/ivy.xml from 3.1.3 to 3.3.1
> ---
>
> Key: NUTCH-2878
> URL: https://issues.apache.org/jira/browse/NUTCH-2878
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, fireant
>Reporter: Lewis John McGibbney
>Priority: Major
>
> fireant upgrade dependency hadoop-common in ivy/ivy.xml from 3.1.3 to 3.3.1





[jira] [Reopened] (NUTCH-2877) fireant upgrade dependency t-digest in ivy/ivy.xml from 3.2 to 3.3

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reopened NUTCH-2877:


> fireant upgrade dependency t-digest in ivy/ivy.xml from 3.2 to 3.3
> --
>
> Key: NUTCH-2877
> URL: https://issues.apache.org/jira/browse/NUTCH-2877
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, fireant
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 1.19
>
>
> fireant upgrade dependency t-digest in ivy/ivy.xml from 3.2 to 3.3





[jira] [Closed] (NUTCH-2877) fireant upgrade dependency t-digest in ivy/ivy.xml from 3.2 to 3.3

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-2877.
--
Resolution: Invalid

> fireant upgrade dependency t-digest in ivy/ivy.xml from 3.2 to 3.3
> --
>
> Key: NUTCH-2877
> URL: https://issues.apache.org/jira/browse/NUTCH-2877
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, fireant
>Reporter: Lewis John McGibbney
>Priority: Major
>
> fireant upgrade dependency t-digest in ivy/ivy.xml from 3.2 to 3.3





[jira] [Reopened] (NUTCH-2878) fireant upgrade dependency hadoop-common in ivy/ivy.xml from 3.1.3 to 3.3.1

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reopened NUTCH-2878:


> fireant upgrade dependency hadoop-common in ivy/ivy.xml from 3.1.3 to 3.3.1
> ---
>
> Key: NUTCH-2878
> URL: https://issues.apache.org/jira/browse/NUTCH-2878
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, fireant
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 1.19
>
>
> fireant upgrade dependency hadoop-common in ivy/ivy.xml from 3.1.3 to 3.3.1





[DISCUSS] Release 1.19 ?

2022-08-09 Thread Sebastian Nagel
Hi all,

more than 60 issues are done for Nutch 1.19

  https://issues.apache.org/jira/projects/NUTCH/versions/12349580

including
 - important dependency upgrades
   - Hadoop 3.3.3
   - Any23 2.7
   - Tika 2.3.0
 - plugin-specific URL stream handlers (NUTCH-2429)
 - migration
   - from Java/JDK 8 to 11
   - from Log4j 1 to Log4j 2

... and various other fixes and improvements.

The last release (1.18) happened in January 2021, so it's definitely high time
to release 1.19. As usual, we'll check for each remaining issue whether it should
be fixed now or can be deferred to a later release.

I would be ready to push a release candidate during the next two weeks and
will start to work through the remaining issues and also check for dependency
upgrades required to address potential vulnerabilities. Please comment on
issues you want fixed in 1.19! Reviews of open pull requests and
patches are also welcome!

Thanks,
Sebastian


[jira] [Updated] (NUTCH-2244) Publish Protocol-Interactiveselenium to central maven repo

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2244:
---
Fix Version/s: (was: 1.19)

> Publish Protocol-Interactiveselenium to central maven repo
> --
>
> Key: NUTCH-2244
> URL: https://issues.apache.org/jira/browse/NUTCH-2244
> Project: Nutch
>  Issue Type: Bug
>Reporter: Raghav Bharadwaj Jayasimha Rao
>Priority: Major
>






[jira] [Closed] (NUTCH-2876) TEST

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-2876.
--
Resolution: Invalid

> TEST
> 
>
> Key: NUTCH-2876
> URL: https://issues.apache.org/jira/browse/NUTCH-2876
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, fireant
>Reporter: Lewis John McGibbney
>Priority: Major
>
> TEST





[jira] [Reopened] (NUTCH-2876) TEST

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reopened NUTCH-2876:


> TEST
> 
>
> Key: NUTCH-2876
> URL: https://issues.apache.org/jira/browse/NUTCH-2876
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, fireant
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 1.19
>
>
> TEST





[jira] [Updated] (NUTCH-2876) TEST

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2876:
---
Fix Version/s: (was: 1.19)

> TEST
> 
>
> Key: NUTCH-2876
> URL: https://issues.apache.org/jira/browse/NUTCH-2876
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, fireant
>Reporter: Lewis John McGibbney
>Priority: Major
>
> TEST





[jira] [Closed] (NUTCH-2875) TEST

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-2875.
--
Resolution: Invalid

> TEST
> 
>
> Key: NUTCH-2875
> URL: https://issues.apache.org/jira/browse/NUTCH-2875
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, fireant
>Reporter: Lewis John McGibbney
>Priority: Major
>
> TEST





[jira] [Closed] (NUTCH-2874) TEST

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-2874.
--
Resolution: Invalid

> TEST
> 
>
> Key: NUTCH-2874
> URL: https://issues.apache.org/jira/browse/NUTCH-2874
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, fireant
>Reporter: Lewis John McGibbney
>Priority: Major
>
> this is a Fireant test





[jira] [Updated] (NUTCH-2875) TEST

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2875:
---
Fix Version/s: (was: 1.19)

> TEST
> 
>
> Key: NUTCH-2875
> URL: https://issues.apache.org/jira/browse/NUTCH-2875
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, fireant
>Reporter: Lewis John McGibbney
>Priority: Major
>
> TEST





[jira] [Reopened] (NUTCH-2875) TEST

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reopened NUTCH-2875:


> TEST
> 
>
> Key: NUTCH-2875
> URL: https://issues.apache.org/jira/browse/NUTCH-2875
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, fireant
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 1.19
>
>
> TEST



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-2874) TEST

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2874:
---
Fix Version/s: (was: 1.19)

> TEST
> 
>
> Key: NUTCH-2874
> URL: https://issues.apache.org/jira/browse/NUTCH-2874
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, fireant
>Reporter: Lewis John McGibbney
>Priority: Major
>
> this is a Fireant test





[jira] [Reopened] (NUTCH-2874) TEST

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reopened NUTCH-2874:


> TEST
> 
>
> Key: NUTCH-2874
> URL: https://issues.apache.org/jira/browse/NUTCH-2874
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, fireant
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 1.19
>
>
> this is a Fireant test





[jira] [Updated] (NUTCH-2293) Make the unit tests which requires "plugin.folders" as integration tests

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2293:
---
Fix Version/s: (was: 1.19)

> Make the unit tests which requires "plugin.folders" as integration tests
> 
>
> Key: NUTCH-2293
> URL: https://issues.apache.org/jira/browse/NUTCH-2293
> Project: Nutch
>  Issue Type: Sub-task
>  Components: build
>Affects Versions: 1.15
>Reporter: Thamme Gowda
>Priority: Major
>
> The system property "plugin.folders" is heavily used in the unit tests of 
> nutch-core. 
> Some of the utilities used by the plugin tests also require this 
> property to be set.
> These tests ought to be run after the package goal is executed, so configure 
> the build to defer them until post-package (one solution is to make them 
> integration tests rather than unit tests).
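As an illustration of the dependency described above, a test could first check whether "plugin.folders" points at an existing directory and skip or fail fast otherwise. This is a minimal sketch only; `PluginFoldersCheck` is a hypothetical helper, and treating the property value as a comma-separated list of directories is an assumption.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper, not part of Nutch. */
final class PluginFoldersCheck {
  static final String PROPERTY = "plugin.folders";

  /** True if the value names at least one existing directory (comma-separated list assumed). */
  static boolean pluginFoldersAvailable(String value) {
    if (value == null || value.trim().isEmpty()) {
      return false;
    }
    for (String folder : value.split(",")) {
      String trimmed = folder.trim();
      if (!trimmed.isEmpty() && Files.isDirectory(Paths.get(trimmed))) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    String value = System.getProperty(PROPERTY);
    if (pluginFoldersAvailable(value)) {
      System.out.println(PROPERTY + " is usable: " + value);
    } else {
      System.out.println(PROPERTY + " is not set or points to no existing directory");
    }
  }
}
```

Run with `-Dplugin.folders=<path>` to try it. A guard like this only makes the dependency visible; it does not replace moving the affected tests to a later build phase as the issue proposes.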





[jira] [Updated] (NUTCH-2292) Mavenize the build for nutch-core and nutch-plugins

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2292:
---
Fix Version/s: (was: 1.19)

> Mavenize the build for nutch-core and nutch-plugins
> ---
>
> Key: NUTCH-2292
> URL: https://issues.apache.org/jira/browse/NUTCH-2292
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Reporter: Thamme Gowda
>Assignee: Thamme Gowda
>Priority: Major
>  Labels: gsoc2019
>
> Convert the build system of  nutch-core as well as plugins to Apache Maven.
> *Plan :*
> Create multi-module maven project with the following structure
> {code}
> nutch-parent
>   |-- pom.xml (POM)
>   |-- nutch-core
>   |   |-- pom.xml (JAR)
>   |   |-- src: sources
>   |-- nutch-plugins
>       |-- pom.xml (POM)
>       |-- plugin1
>       |   |-- pom.xml (JAR)
>       |   .
>       |-- pluginN
>           |-- pom.xml (JAR)
> {code}
> NOTE: watch out for cyclic dependencies between nutch-core and plugins, 
> introduce another POM to break the cycle if required.
>  





[jira] [Updated] (NUTCH-2638) Publish plugins in Maven

2022-08-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2638:
---
Fix Version/s: (was: 1.19)

> Publish plugins in Maven
> 
>
> Key: NUTCH-2638
> URL: https://issues.apache.org/jira/browse/NUTCH-2638
> Project: Nutch
>  Issue Type: Task
>Reporter: Rustam Abdullaev
>Priority: Major
>  Labels: maven, plugins
>
> The Nutch core is available in [Maven 
> central|https://search.maven.org/search?q=g:org.apache.nutch], but its 
> plugins aren't.
> Please publish the plugins in Maven as well.


