[jira] [Closed] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-05-15 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3041.
---

> Address confusing logging in o.a.n.net.URLExemptionFilters 
> ---
>
> Key: NUTCH-3041
> URL: https://issues.apache.org/jira/browse/NUTCH-3041
> Project: Nutch
>  Issue Type: Task
>  Components: net
>Affects Versions: 1.19, 1.20
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> URLExemptionFilter implementations are used to allow exemptions to external 
> domain resources by overriding the {{db.ignore.external.links}} configuration 
> setting. This is useful when the crawl is focused on a domain but resources 
> like images are hosted on a CDN.
> Currently 
> [URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
> provides the following logging
> {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] 
> Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> {quote}
> I find this confusing. It would be better to log *only* if a 
> URLExemptionFilter implementation is actually configured to be used at 
> runtime.
> I will provide a patch for this.
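
A minimal sketch of the behaviour requested in NUTCH-3041, assuming the fix amounts to guarding the log call with a check on the number of configured filters. The class and method names are hypothetical and are not the Nutch API; java.util.logging is used only to keep the example self-contained (Nutch itself logs through SLF4J).

```java
import java.util.List;
import java.util.logging.Logger;

/** Hypothetical stand-in for a URLExemptionFilters-style holder; not the Nutch class. */
final class ExemptionFilterLoggingSketch {

  private static final Logger LOG =
      Logger.getLogger(ExemptionFilterLoggingSketch.class.getName());

  /**
   * Builds the message to log for the configured filters, or returns null
   * when no filter is configured and there is nothing worth reporting.
   */
  static String describeFilters(List<String> filterClassNames) {
    if (filterClassNames == null || filterClassNames.isEmpty()) {
      return null; // stay silent instead of reporting "Found 0 extensions"
    }
    return "Found " + filterClassNames.size()
        + " URLExemptionFilter implementation(s): "
        + String.join(", ", filterClassNames);
  }

  /** Logs at INFO level only when at least one filter is configured. */
  static void logConfiguredFilters(List<String> filterClassNames) {
    String message = describeFilters(filterClassNames);
    if (message != null) {
      LOG.info(message);
    }
  }

  public static void main(String[] args) {
    logConfiguredFilters(List.of()); // logs nothing
    logConfiguredFilters(List.of("org.example.MyExemptionFilter")); // logs one line
  }
}
```

Separating the message construction from the log call keeps the "nothing configured" case easy to check without inspecting log output.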



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Work stopped] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-05-15 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3041 stopped by Lewis John McGibbney.
---
> Address confusing logging in o.a.n.net.URLExemptionFilters 
> ---
>
> Key: NUTCH-3041
> URL: https://issues.apache.org/jira/browse/NUTCH-3041
> Project: Nutch
>  Issue Type: Task
>  Components: net
>Affects Versions: 1.19, 1.20
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> URLExemptionFilter implementations are used to allow exemptions to external 
> domain resources by overriding the {{db.ignore.external.links}} configuration 
> setting. This is useful when the crawl is focused on a domain but resources 
> like images are hosted on a CDN.
> Currently 
> [URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
> provides the following logging
> {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] 
> Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> {quote}
> I find this confusing. It would be better to log *only* if a 
> URLExemptionFilter implementation is actually configured to be used at 
> runtime.
> I will provide a patch for this.





[jira] [Resolved] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-05-15 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3041.
-
Resolution: Fixed

> Address confusing logging in o.a.n.net.URLExemptionFilters 
> ---
>
> Key: NUTCH-3041
> URL: https://issues.apache.org/jira/browse/NUTCH-3041
> Project: Nutch
>  Issue Type: Task
>  Components: net
>Affects Versions: 1.19, 1.20
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> URLExemptionFilter implementations are used to allow exemptions to external 
> domain resources by overriding the {{db.ignore.external.links}} configuration 
> setting. This is useful when the crawl is focused on a domain but resources 
> like images are hosted on a CDN.
> Currently 
> [URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
> provides the following logging
> {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] 
> Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> {quote}
> I find this confusing. It would be better to log *only* if a 
> URLExemptionFilter implementation is actually configured to be used at 
> runtime.
> I will provide a patch for this.





[jira] [Closed] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3054.
---

> Address deprecation of Node16 for all GitHub Actions
> 
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
>  Issue Type: Task
>  Components: ci/cd
>Affects Versions: 1.20
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> See 
> [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in  
> [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>  
> Patch coming up





[jira] [Resolved] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3054.
-
Resolution: Fixed

> Address deprecation of Node16 for all GitHub Actions
> 
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
>  Issue Type: Task
>  Components: ci/cd
>Affects Versions: 1.20
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> See 
> [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in  
> [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>  
> Patch coming up





[jira] [Updated] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-29 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3054:

Affects Version/s: 1.20

> Address deprecation of Node16 for all GitHub Actions
> 
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
>  Issue Type: Task
>  Components: ci/cd
>Affects Versions: 1.20
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> See 
> [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in  
> [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>  
> Patch coming up





[jira] [Created] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3054:
---

 Summary: Address deprecation of Node16 for all GitHub Actions
 Key: NUTCH-3054
 URL: https://issues.apache.org/jira/browse/NUTCH-3054
 Project: Nutch
  Issue Type: Task
  Components: ci/cd
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.21


See 
[https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]

We need to upgrade the setup-java action in  
[https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
 

Patch coming up





[jira] [Work started] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-29 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3054 started by Lewis John McGibbney.
---
> Address deprecation of Node16 for all GitHub Actions
> 
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
>  Issue Type: Task
>  Components: ci/cd
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> See 
> [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in  
> [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>  
> Patch coming up





[jira] [Commented] (NUTCH-3049) Investigate using Records

2024-04-29 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842208#comment-17842208
 ] 

Lewis John McGibbney commented on NUTCH-3049:
-

I think that each of the Writable classes mentioned in NutchWritable may be 
fair game:

{{        org.apache.nutch.crawl.CrawlDatum.class,}}
{{        org.apache.nutch.crawl.Inlink.class,}}
{{        org.apache.nutch.crawl.Inlinks.class,}}
{{        org.apache.nutch.indexer.NutchIndexAction.class,}}
{{        org.apache.nutch.metadata.Metadata.class,}}
{{        org.apache.nutch.parse.Outlink.class,}}
{{        org.apache.nutch.parse.ParseText.class,}}
{{        org.apache.nutch.parse.ParseData.class,}}
{{        org.apache.nutch.parse.ParseImpl.class,}}
{{        org.apache.nutch.parse.ParseStatus.class,}}
{{        org.apache.nutch.protocol.Content.class,}}
{{        org.apache.nutch.protocol.ProtocolStatus.class,}}
{{        org.apache.nutch.scoring.webgraph.LinkDatum.class,}}
{{        org.apache.nutch.hostdb.HostDatum.class}}

> Investigate using Records
> -
>
> Key: NUTCH-3049
> URL: https://issues.apache.org/jira/browse/NUTCH-3049
> Project: Nutch
>  Issue Type: Sub-task
>        Reporter: Lewis John McGibbney
>Priority: Major
>
> Guidance at [https://www.baeldung.com/java-migrate-8-to-17#records]
> I think there are multiple areas where we could use Records. This ticket will 
> document the opportunities and structure that work.





Re: [DISCUSS] Consolidating Nutch Continuous Integration

2024-04-29 Thread Lewis John McGibbney
Hi Sebastian,
Understood. If it ain’t broke don’t fix it.
Thanks for the input.

On 2024/04/28 12:08:27 Sebastian Nagel wrote:
> 
>  From my side: no. It may not harm to have both.
> 
> Best,
> Sebastian


[jira] [Created] (NUTCH-3053) Upgrade build and CI to JDK17

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3053:
---

 Summary: Upgrade build and CI to JDK17
 Key: NUTCH-3053
 URL: https://issues.apache.org/jira/browse/NUTCH-3053
 Project: Nutch
  Issue Type: Sub-task
  Components: build, ci/cd
Reporter: Lewis John McGibbney


This will involve changes to
 * [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
 * [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/]
 * [https://github.com/apache/nutch/blob/master/default.properties#L46]
 * [https://github.com/apache/nutch/blob/master/default.properties#L57]
 * We should also investigate any deprecation notices in the build output
 * [https://github.com/apache/nutch/blob/master/ivy/mvn.template#L128-L129]





[jira] [Created] (NUTCH-3052) Investigate using sealed classes

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3052:
---

 Summary: Investigate using sealed classes
 Key: NUTCH-3052
 URL: https://issues.apache.org/jira/browse/NUTCH-3052
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Guidance available at 
[https://www.baeldung.com/java-migrate-8-to-17#sealed-classes]

First document if and where sealed classes would add value.
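
As a rough illustration of what a sealed hierarchy looks like (Java 17), here is a sketch with hypothetical types. None of these names exist in Nutch; they only show how a sealed interface restricts its implementations to a known set.

```java
/** Demo holder for the sketch; all type names in this file are hypothetical. */
final class SealedSketchDemo {

  /** Handles every permitted subtype of the sealed interface. */
  static String describe(FetchOutcomeSketch outcome) {
    if (outcome instanceof FetchedSketch f) {
      return "fetched " + f.contentLength() + " bytes";
    } else if (outcome instanceof RedirectedSketch r) {
      return "redirected to " + r.location();
    } else if (outcome instanceof FailedSketch x) {
      return "failed: " + x.reason();
    }
    throw new IllegalStateException("unexpected outcome: " + outcome);
  }

  public static void main(String[] args) {
    System.out.println(describe(new FetchedSketch(1024)));
    System.out.println(describe(new RedirectedSketch("https://example.org/")));
    System.out.println(describe(new FailedSketch("timeout")));
  }
}

/** Only the three listed types may implement this interface. */
sealed interface FetchOutcomeSketch
    permits FetchedSketch, RedirectedSketch, FailedSketch {}

record FetchedSketch(int contentLength) implements FetchOutcomeSketch {}

record RedirectedSketch(String location) implements FetchOutcomeSketch {}

record FailedSketch(String reason) implements FetchOutcomeSketch {}
```

Any class outside the permits list that tries to implement FetchOutcomeSketch is rejected at compile time, which is the property worth weighing when documenting candidate hierarchies.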





[jira] [Created] (NUTCH-3051) Investigate using new pattern matching syntax in switch expressions

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3051:
---

 Summary: Investigate using new pattern matching syntax in switch 
expressions
 Key: NUTCH-3051
 URL: https://issues.apache.org/jira/browse/NUTCH-3051
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Guidance available at 
[https://www.baeldung.com/java-migrate-8-to-17#2-switch-expressions]

Apparently we use switch in 35 files

[https://github.com/search?q=repo%3Aapache%2Fnutch+switch+language%3AJava=code=Java]
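
For orientation, a sketch of the kind of rewrite this sub-task would look for, using a hypothetical enum rather than any real Nutch status constants. It shows a classic switch statement next to the equivalent switch expression (final since Java 14). Pattern matching in switch is still a preview feature in Java 17, so the sketch does not rely on it.

```java
final class SwitchExpressionSketch {

  /** Hypothetical status values for illustration; not Nutch's real constants. */
  enum StatusSketch { FETCHED, REDIRECTED, GONE, RETRY }

  /** Classic switch statement: needs a mutable local and explicit breaks. */
  static String describeOld(StatusSketch status) {
    String text;
    switch (status) {
      case FETCHED:
        text = "ok";
        break;
      case REDIRECTED:
        text = "follow";
        break;
      case GONE:
      case RETRY:
        text = "skip";
        break;
      default:
        text = "unknown";
    }
    return text;
  }

  /** Switch expression: yields a value directly and has no fall-through. */
  static String describeNew(StatusSketch status) {
    return switch (status) {
      case FETCHED -> "ok";
      case REDIRECTED -> "follow";
      case GONE, RETRY -> "skip";
    };
  }

  public static void main(String[] args) {
    for (StatusSketch status : StatusSketch.values()) {
      System.out.println(status + ": " + describeOld(status) + " / " + describeNew(status));
    }
  }
}
```

Because the switch expression covers every enum constant, no default branch is needed, and adding a constant later produces a compile error until the new case is handled.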





[jira] [Created] (NUTCH-3050) Investigate use of the enhanced instanceof operator

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3050:
---

 Summary: Investigate use of the enhanced instanceof operator
 Key: NUTCH-3050
 URL: https://issues.apache.org/jira/browse/NUTCH-3050
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Guidance at 
[https://www.baeldung.com/java-migrate-8-to-17#1-enhanced-instanceof-operator]

Apparently we use the instanceof operator in 50 files

[https://github.com/search?q=repo%3Aapache%2Fnutch%20instanceof=code]
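
A small before/after sketch of the change this sub-task is about, using a hypothetical helper rather than code taken from Nutch. The enhanced form (pattern matching for instanceof, final since Java 16) binds the cast result in the same expression.

```java
final class InstanceofSketch {

  /** Pre-Java 16 style: type test followed by an explicit cast. */
  static int lengthOld(Object value) {
    if (value instanceof CharSequence) {
      CharSequence text = (CharSequence) value;
      return text.length();
    }
    return -1;
  }

  /** Enhanced instanceof: the binding variable replaces the cast. */
  static int lengthNew(Object value) {
    if (value instanceof CharSequence text) {
      return text.length();
    }
    return -1;
  }

  public static void main(String[] args) {
    System.out.println(lengthOld("nutch") + " " + lengthNew("nutch"));
    System.out.println(lengthOld(42) + " " + lengthNew(42));
  }
}
```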





[jira] [Created] (NUTCH-3049) Investigate using Records

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3049:
---

 Summary: Investigate using Records
 Key: NUTCH-3049
 URL: https://issues.apache.org/jira/browse/NUTCH-3049
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Guidance at [https://www.baeldung.com/java-migrate-8-to-17#records]

I think there are multiple areas where we could use Records. This ticket will 
document the opportunities and structure that work.
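
To make the idea concrete, here is a sketch of a record (Java 16 or later) for a hypothetical immutable value type. LinkSketch is not an existing Nutch class; it only illustrates what a record generates and how validation fits in.

```java
/** Hypothetical value type for illustration; not an existing Nutch class. */
record LinkSketch(String toUrl, String anchor) {

  /** Compact constructor: validates and normalises the components once. */
  LinkSketch {
    if (toUrl == null || toUrl.isBlank()) {
      throw new IllegalArgumentException("toUrl must not be blank");
    }
    anchor = (anchor == null) ? "" : anchor.strip();
  }

  public static void main(String[] args) {
    LinkSketch a = new LinkSketch("https://example.org/", "  Example  ");
    LinkSketch b = new LinkSketch("https://example.org/", "Example");
    // equals(), hashCode() and toString() are generated from the components
    System.out.println(a.equals(b));
    System.out.println(a);
  }
}
```

Records are immutable, so they fit best where a class is a plain carrier of values that never change after construction.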





[jira] [Created] (NUTCH-3048) Investigate where/if new string utility methods could be used

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3048:
---

 Summary: Investigate where/if new string utility methods could be 
used
 Key: NUTCH-3048
 URL: https://issues.apache.org/jira/browse/NUTCH-3048
 Project: Nutch
  Issue Type: Sub-task
  Components: util
Reporter: Lewis John McGibbney


Guidance at [https://www.baeldung.com/java-migrate-8-to-17#3-new-string-methods]

We may also be able to revisit our usage of common-* libraries with the goal of 
using native methods from the JDK.
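
A short sketch of some String methods added in Java 11 (isBlank, strip, lines, repeat) that could stand in for small helper utilities. The helper names are hypothetical and are not taken from Nutch or from any commons library.

```java
import java.util.List;
import java.util.stream.Collectors;

final class StringMethodsSketch {

  /** Null-safe blank check built on String.isBlank(). */
  static boolean isBlank(String s) {
    return s == null || s.isBlank();
  }

  /** Splits text into stripped, non-blank lines using lines() and strip(). */
  static List<String> nonBlankLines(String text) {
    return text.lines()
        .map(String::strip)
        .filter(line -> !line.isBlank())
        .collect(Collectors.toList());
  }

  /** Indents a string by two spaces per level using repeat(). */
  static String indent(int depth, String s) {
    return "  ".repeat(depth) + s;
  }

  public static void main(String[] args) {
    System.out.println(isBlank("   "));
    System.out.println(nonBlankLines("  a \n\n b\n"));
    System.out.println(indent(2, "x"));
  }
}
```

One thing to check during the investigation is null handling: String.isBlank() is an instance method, so any replacement for a null-tolerant utility needs an explicit null check like the one above.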





[jira] [Created] (NUTCH-3047) Use multi-line text blocks

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3047:
---

 Summary: Use multi-line text blocks
 Key: NUTCH-3047
 URL: https://issues.apache.org/jira/browse/NUTCH-3047
 Project: Nutch
  Issue Type: Sub-task
  Components: CLI
Reporter: Lewis John McGibbney


Guidance available at 
[https://www.baeldung.com/java-migrate-8-to-17#2-text-block]

This will help to clean up our CLI *usage()* messages at a bare minimum.
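
A sketch of a usage message written as a text block (final since Java 15). The tool name and options are made up for illustration and do not correspond to an actual Nutch command.

```java
final class UsageTextBlockSketch {

  /** Hypothetical usage message; the tool name and options are invented. */
  static final String USAGE = """
      Usage: ExampleTool <input_dir> [-threads <n>] [-verbose]
        <input_dir>    directory to read from
        -threads <n>   number of worker threads
        -verbose       print progress details
      """;

  public static void main(String[] args) {
    System.err.print(USAGE);
  }
}
```

The compiler strips the common leading indentation, so the message can be indented with the surrounding code while the relative indentation of the option lines is kept.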





[jira] [Updated] (NUTCH-3046) Use compact strings

2024-04-29 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3046:

Description: 
Follow the guidance at 
[https://www.baeldung.com/java-migrate-8-to-17#1-compact-string]

It looks like there are 9 instances where we use _*char []*_:

[https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D=code]

  was:
Follow the guidance at 
[https://www.baeldung.com/java-migrate-8-to-17#1-compact-string]

It looks like there are [9 instances where we use 
char[]|[https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D=code]].


> Use compact strings
> ---
>
> Key: NUTCH-3046
> URL: https://issues.apache.org/jira/browse/NUTCH-3046
> Project: Nutch
>  Issue Type: Sub-task
>        Reporter: Lewis John McGibbney
>Priority: Major
>
> Follow the guidance at 
> [https://www.baeldung.com/java-migrate-8-to-17#1-compact-string]
> It looks like there are 9 instances where we use _*char []*_:
> [https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D=code]





[jira] [Created] (NUTCH-3046) Use compact strings

2024-04-28 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3046:
---

 Summary: Use compact strings
 Key: NUTCH-3046
 URL: https://issues.apache.org/jira/browse/NUTCH-3046
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Follow the guidance at 
[https://www.baeldung.com/java-migrate-8-to-17#1-compact-string]

It looks like there are [9 instances where we use 
char[]|[https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D=code]].





[jira] [Created] (NUTCH-3045) Upgrade from Java 11 to 17

2024-04-28 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3045:
---

 Summary: Upgrade from Java 11 to 17
 Key: NUTCH-3045
 URL: https://issues.apache.org/jira/browse/NUTCH-3045
 Project: Nutch
  Issue Type: Task
  Components: build, ci/cd
Reporter: Lewis John McGibbney
 Fix For: 1.21


This parent issue will track and organize work pertaining to upgrading Nutch to 
JDK 17.

Premier support for Oracle JDK 11 ended 7 months ago (30 Sep 2023).





[ANNOUNCE] Apache Nutch 1.20 Release

2024-04-28 Thread lewis john mcgibbney
The Apache Nutch Project https://nutch.apache.org/download/

Please verify signatures using the KEYS file
https://raw.githubusercontent.com/apache/nutch/master/KEYS when downloading
the release.

This release includes more than 60 bug fixes and improvements. The full
list of changes can be seen in the Jira release report
https://s.apache.org/ovjf3

Thanks to everyone who contributed to this release!

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: [DISCUSS] Consolidating Nutch Continuous Integration

2024-04-25 Thread Lewis John McGibbney
A better reference for the GitHub Actions can be found at 
https://github.com/apache/nutch/actions

lewismc

On 2024/04/25 14:40:35 lewis john mcgibbney wrote:
> Hi dev@,
> 
> We currently maintain a combination of Jenkins [0] and GitHub Actions [1]
> for CI.
> 
> For the longest time, we relied solely on Jenkins. This was really useful
> particularly when committers were pulling build artifacts from Jenkins
> nightly and relied on SVN trunk being stable. The Jenkins job used to be
> run nightly but no longer is. It is not clear exactly when nightly SNAPSHOT
> builds were turned off.
> 
> In 2020 we accepted a pull request [2] which established GitHub Actions and
> since then have gradually added small but important updates to the GitHub
> Actions workflow [3].
> 
> I can elaborate on the details of what each CI workflow does (it is not
> overly complex) but before I do that, is there any preference on choosing
> one (Jenkins Vs GitHub Actions) over the other?
> 
> Thanks
> 
> lewismc
> 
> [0] https://ci-builds.apache.org/job/Nutch/
> [1]
> https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml
> [2]
> https://github.com/apache/nutch/commit/e33aaa14739c7c02f4121ac1d8d0e7860f329e06
> [3]
> https://github.com/apache/nutch/commits/master/.github/workflows/master-build.yml
> 
> -- 
> http://home.apache.org/~lewismc/
> http://people.apache.org/keys/committer/lewismc
> 


[DISCUSS] Consolidating Nutch Continuous Integration

2024-04-25 Thread lewis john mcgibbney
Hi dev@,

We currently maintain a combination of Jenkins [0] and GitHub Actions [1]
for CI.

For the longest time, we relied solely on Jenkins. This was really useful
particularly when committers were pulling build artifacts from Jenkins
nightly and relied on SVN trunk being stable. The Jenkins job used to be
run nightly but no longer is. It is not clear exactly when nightly SNAPSHOT
builds were turned off.

In 2020 we accepted a pull request [2] which established GitHub Actions and
since then have gradually added small but important updates to the GitHub
Actions workflow [3].

I can elaborate on the details of what each CI workflow does (it is not
overly complex) but before I do that, is there any preference on choosing
one (Jenkins Vs GitHub Actions) over the other?

Thanks

lewismc

[0] https://ci-builds.apache.org/job/Nutch/
[1]
https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml
[2]
https://github.com/apache/nutch/commit/e33aaa14739c7c02f4121ac1d8d0e7860f329e06
[3]
https://github.com/apache/nutch/commits/master/.github/workflows/master-build.yml

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


[RESULT] WAS Re: [VOTE] Apache Nutch 1.20 Release

2024-04-24 Thread lewis john mcgibbney
Hi user@ & dev@,
I’m glad to conclude the Nutch 1.20 release candidate VOTE thread with the
following RESULT.

[5] +1 Release this package as Apache Nutch 1.20
snagel*
balakuntala*
blackice*
Joe Gilvary
lewismc*

[ ] -1 Do not release this package because…

*Nutch Project Management Committee-binding

The Nutch 1.20 release candidate has passed the community VOTE. I will
therefore promote this release candidate.

Thanks for VOTE’ing and to everyone who contributed to the Apache Nutch
1.20 release.

lewismc

On Tue, Apr 9, 2024 at 2:28 PM lewis john mcgibbney 
wrote:

> Hi Folks,
>
> A first candidate for the Nutch 1.20 release is available at [0] where
> accompanying SHA512 and ASC signatures can also be found.
> Information on verifying releases can be found at [1].
>
> The release candidate comprises a .zip and tar.gz archive of the sources
> at [2] and complementary binary distributions. In addition, a staged maven
> repository is available at [3].
>
> The Nutch 1.20 release report is available at [4].
>
> Please vote on releasing this package as Apache Nutch 1.20. The vote is
> open for at least the next 72 hours and passes if a majority of at least
> three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 1.20.
>
> [ ] -1 Do not release this package because…
>
> Cheers,
> lewismc
> P.S. Here is my +1.
>
> [0] https://dist.apache.org/repos/dist/dev/nutch/1.20
> [1] http://nutch.apache.org/downloads.html#verify
> [2] https://github.com/apache/nutch/tree/release-1.20
> [3]
> https://repository.apache.org/content/repositories/orgapachenutch-1021/
> [4] https://s.apache.org/ovjf3
>
> --
> http://home.apache.org/~lewismc/
> http://people.apache.org/keys/committer/lewismc
>


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


[jira] [Updated] (NUTCH-3042) Use GitHub cache action to improve CI execution time

2024-04-19 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3042:

Description: 
With the Ant+Ivy build architecture, the current GitHub actions workflow can 
and regularly does take over 20 minutes to complete. Dependency retrieval takes 
a significant amount of time.

I think we can address the above issue and dramatically reduce the CI runtime 
by utilizing the official [GitHub cache 
action|https://github.com/actions/cache].

It appears however that the action does not support the Apache Ivy cache. Both 
Maven and Gradle are supported. I [created a 
discussion|https://github.com/actions/cache/discussions/1381] to get 
confirmation.

In the case that we cannot implement a cache for the Ivy build system then we 
will need to come back to this issue once we migrate to Gradle.

  was:
With the Ant+Ivy build architecture, the current GitHub actions workflow can 
and regularly does take over 20 minutes to complete. Dependency retrieval takes 
a significant amount of time.

I think we can address the above issue and dramatically reduce the CI runtime 
by utilizing the official [GitHub cache 
action|https://github.com/actions/cache].

It appears however that the action does not support the Apache Ivy cache. Both 
Maven and Gradle are supported. I created a discussion to confirm whether 
this is the case.

In the case that we cannot implement a cache for the Ivy build system then we 
will need to come back to this issue once we migrate to Gradle.


> Use GitHub cache action to improve CI execution time
> 
>
> Key: NUTCH-3042
> URL: https://issues.apache.org/jira/browse/NUTCH-3042
> Project: Nutch
>  Issue Type: Task
>  Components: ci/cd
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.21
>
>
> With the Ant+Ivy build architecture, the current GitHub actions workflow can 
> and regularly does take over 20 minutes to complete. Dependency retrieval 
> takes a significant amount of time.
> I think we can address the above issue and dramatically reduce the CI runtime 
> by utilizing the official [GitHub cache 
> action|https://github.com/actions/cache].
> It appears however that the action does not support the Apache Ivy cache. 
> Both Maven and Gradle are supported. I [created a 
> discussion|https://github.com/actions/cache/discussions/1381] to get 
> confirmation.
> In the case that we cannot implement a cache for the Ivy build system then we 
> will need to come back to this issue once we migrate to Gradle.





[jira] [Created] (NUTCH-3042) Use GitHub cache action to improve CI execution time

2024-04-19 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3042:
---

 Summary: Use GitHub cache action to improve CI execution time
 Key: NUTCH-3042
 URL: https://issues.apache.org/jira/browse/NUTCH-3042
 Project: Nutch
  Issue Type: Task
  Components: ci/cd
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.21


With the Ant+Ivy build architecture, the current GitHub actions workflow can 
and regularly does take over 20 minutes to complete. Dependency retrieval takes 
a significant amount of time.

I think we can address the above issue and dramatically reduce the CI runtime 
by utilizing the official [GitHub cache 
action|https://github.com/actions/cache].

It appears however that the action does not support the Apache Ivy cache. Both 
Maven and Gradle are supported. I created a discussion to confirm whether 
this is the case.

In the case that we cannot implement a cache for the Ivy build system then we 
will need to come back to this issue once we migrate to Gradle.





[jira] [Work started] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-04-19 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3041 started by Lewis John McGibbney.
---
> Address confusing logging in o.a.n.net.URLExemptionFilters 
> ---
>
> Key: NUTCH-3041
> URL: https://issues.apache.org/jira/browse/NUTCH-3041
> Project: Nutch
>  Issue Type: Task
>  Components: net
>Affects Versions: 1.19, 1.20
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> URLExemptionFilter implementations are used to allow exemptions to external 
> domain resources by overriding the {{db.ignore.external.links}} configuration 
> setting. This is useful when the crawl is focused on a domain but resources 
> like images are hosted on a CDN.
> Currently 
> [URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
> provides the following logging
> {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] 
> Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> {quote}
> I find this confusing. It would be better to log *only* if a 
> URLExemptionFilter implementation is actually configured to be used at 
> runtime.
> I will provide a patch for this.





[jira] [Updated] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-04-19 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3041:

Description: 
URLExemptionFilter implementations are used to allow exemptions to external 
domain resources by overriding the {{db.ignore.external.links}} configuration 
setting. This is useful when the crawl is focused on a domain but resources 
like images are hosted on a CDN.

Currently 
[URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
provides the following logging
{quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] 
Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
{quote}
I find this confusing. It would be better to log *only* if a 
URLExemptionFilter implementation is actually configured to be used at runtime.

I will provide a patch for this.

  was:
URLExemptionFilter implementations are used to allow exemptions to external 
domain resources by overriding the {{db.ignore.external.links}} configuration 
setting. This is useful when the crawl is focused on a domain but resources 
like images are hosted on a CDN.

Currently 
[URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
provides the following logging
{quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] 
Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
{quote}
I find this confusing. It would be better to log *only* if a 
URLExemptionFilter implementation actually exists for a given URL.

I will provide a patch for this.


> Address confusing logging in o.a.n.net.URLExemptionFilters 
> ---
>
> Key: NUTCH-3041
> URL: https://issues.apache.org/jira/browse/NUTCH-3041
> Project: Nutch
>  Issue Type: Task
>  Components: net
>Affects Versions: 1.19, 1.20
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> URLExemptionFilter implementations are used to allow exemptions to external 
> domain resources by overriding the {{db.ignore.external.links}} configuration 
> setting. This is useful when the crawl is focused on a domain but resources 
> like images are hosted on a CDN.
> Currently 
> [URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
> provides the following logging
> {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] 
> Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> {quote}
> I find this confusing. It would be better to log *only* if a 
> URLExemptionFilter implementation is actually configured to be used at 
> runtime.
> I will provide a patch for this.





[jira] [Updated] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-04-19 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3041:

Description: 
URLExemptionFilter implementations are used to allow exemptions to external 
domain resources by overriding the {{db.ignore.external.links}} configuration 
setting. This is useful when the crawl is focused on a domain but resources 
like images are hosted on a CDN.

Currently 
[URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
provides the following logging
{quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] 
Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
{quote}
I find this confusing. It would be better to log *only* if a 
URLExemptionFilter implementation actually exists for a given URL.

I will provide a patch for this.

  was:
URLExemptionFilter implementations are used to allow exemptions to external 
domain resources by overriding the {{db.ignore.external.links}} configuration 
setting. This is useful when the crawl is focused on a domain but resources 
like images are hosted on a CDN.

Currently 
[URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
 provides some confusing INFO-level logging
{quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] 
Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
{quote}
I find this confusing. It would be better to log *only* if an 
URLExemptionFilter implementation actually exists for a given URL.

I will provide a patch for this.


> Address confusing logging in o.a.n.net.URLExemptionFilters 
> ---
>
> Key: NUTCH-3041
> URL: https://issues.apache.org/jira/browse/NUTCH-3041
> Project: Nutch
> Issue Type: Task
> Components: net
> Affects Versions: 1.19, 1.20
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.21
>
>
> URLExemptionFilter implementations are used to allow exemptions to external 
> domain resources by overriding the {{db.ignore.external.links}} configuration 
> setting. This is useful when the crawl is focused on a domain but resources 
> like images are hosted on a CDN.
> Currently [URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48] provides the following logging
> {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor 
> #0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> {quote}
> I find this confusing. It would be better to log *only* if an 
> URLExemptionFilter implementation actually exists for a given URL.
> I will provide a patch for this.





[jira] [Created] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-04-19 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3041:
---

Summary: Address confusing logging in o.a.n.net.URLExemptionFilters
Key: NUTCH-3041
URL: https://issues.apache.org/jira/browse/NUTCH-3041
Project: Nutch
Issue Type: Task
Components: net
Affects Versions: 1.19, 1.20
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Fix For: 1.21


URLExemptionFilter implementations are used to allow exemptions to external 
domain resources by overriding the {{db.ignore.external.links}} configuration 
setting. This is useful when the crawl is focused on a domain but resources 
like images are hosted on a CDN.

Currently 
[URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
 provides some confusing INFO-level logging
{quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] 
Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
{quote}
I find this confusing. It would be better to log *only* if an 
URLExemptionFilter implementation actually exists for a given URL.

I will provide a patch for this.
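
To make the intent concrete, here is a minimal Java sketch of the behaviour described above, assuming the message is only produced when at least one exemption filter is configured. The class ExemptionFilterLogSketch and the method activationMessage are hypothetical names used for illustration; this is not the actual patch and not the Nutch API.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical illustration only; not the real URLExemptionFilters code.
final class ExemptionFilterLogSketch {

  /**
   * Returns the message to log, or an empty Optional when no
   * exemption filter is configured (in which case nothing is logged).
   */
  static Optional<String> activationMessage(List<String> filterClassNames) {
    if (filterClassNames == null || filterClassNames.isEmpty()) {
      return Optional.empty();
    }
    return Optional.of("Found " + filterClassNames.size()
        + " extensions at point:'org.apache.nutch.net.URLExemptionFilter'");
  }

  public static void main(String[] args) {
    // No filters configured: nothing is printed.
    activationMessage(List.of()).ifPresent(System.out::println);
    // One (made-up) filter configured: the message is printed once.
    activationMessage(List.of("org.example.HypotheticalExemptionFilter"))
        .ifPresent(System.out::println);
  }
}
```

Returning an Optional keeps the decision about whether to log separate from the logger itself, which makes the rule easy to check in isolation.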





Re: [VOTE] Apache Nutch 1.20 Release

2024-04-16 Thread lewis john mcgibbney
Hi user@, dev@,
Please consider reviewing the Nutch 1.20 release candidate. This is a
critical prerequisite for us making releases of software at the ASF.
Thank you
lewismc

On Tue, Apr 9, 2024 at 2:28 PM lewis john mcgibbney 
wrote:

> Hi Folks,
>
> A first candidate for the Nutch 1.20 release is available at [0] where
> accompanying SHA512 and ASC signatures can also be found.
> Information on verifying releases can be found at [1].
>
> The release candidate comprises a .zip and tar.gz archive of the sources
> at [2] and complementary binary distributions. In addition, a staged maven
> repository is available at [3].
>
> The Nutch 1.20 release report is available at [4].
>
> Please vote on releasing this package as Apache Nutch 1.20. The vote is
> open for at least the next 72 hours and passes if a majority of at least
> three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 1.20.
>
> [ ] -1 Do not release this package because…
>
> Cheers,
> lewismc
> P.S. Here is my +1.
>
> [0] https://dist.apache.org/repos/dist/dev/nutch/1.20
> [1] http://nutch.apache.org/downloads.html#verify
> [2] https://github.com/apache/nutch/tree/release-1.20
> [3]
> https://repository.apache.org/content/repositories/orgapachenutch-1021/
> [4] https://s.apache.org/ovjf3
>
> --
> http://home.apache.org/~lewismc/
> http://people.apache.org/keys/committer/lewismc
>


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: [VOTE] Apache Nutch 1.20 Release

2024-04-11 Thread Lewis John McGibbney
Hi Seb,

On 2024/04/11 13:30:53 Sebastian Nagel wrote:
> 
> https://github.com/sebastian-nagel/nutch-test-single-node-cluster/

I think we should make this into an integration test suite and run it as part 
of CI. I’ve been meaning and wanting to do this for the __longest__ time…!

> 
> One note about the CHANGES.md: it's now a mixture of HTML and plain text.
> It does not use the potential of markdown, e.g. sections / headlines for
> the releases to make the change log navigable via a table of contents.
> The embedded HTML makes it less readable if viewed in a text editor.
> The rendering on GitHub [5] is acceptable with only minor glitches,
> mostly the placement of multiple lines in a single paragraph:
>   https://github.com/apache/nutch/blob/branch-1.20/CHANGES.md
> We also have a change log on Jira:
>   https://s.apache.org/ovjf3
> That's why I wouldn't call the CHANGES.md a "blocker". We should
> update the formatting after the release to make it again easily
> readable in source code and improve the document structure utilizing
> the markdown markup.

Excellent suggestion. I was focusing on including the hyperlinks and clearly 
compromised other change log benefits. I will address this after the release. 
Thank you


Re: Mentor request for lewismc

2024-04-09 Thread Lewis John McGibbney
Please resend, Sanyam. I am not in receipt of the invitation yet.
Thank you
lewismc

On 2024/04/07 21:28:05 Sanyam Goel wrote:
> Hi
> 
> Invitation Sent,
> 
> Regards,
> Sanyam Goel
> 
> On Sun, Apr 7, 2024 at 11:17 PM Furkan KAMACI 
> wrote:
> 
> > Hi,
> >
> > ACK!
> >
> > Kind regards,
> > Furkan Kamaci
> >
> > On Sun, Apr 7, 2024 at 8:45 PM lewis john mcgibbney 
> > wrote:
> >
> > > Hi Nutch PMC,
> > > Please acknowledge and approve my request to mentor this year's GSoC
> > > program.
> > > An ACK is sufficient.
> > > Thank you
> > > lewismc
> > >
> >
> 


[VOTE] Apache Nutch 1.20 Release

2024-04-09 Thread lewis john mcgibbney
Hi Folks,

A first candidate for the Nutch 1.20 release is available at [0] where
accompanying SHA512 and ASC signatures can also be found.
Information on verifying releases can be found at [1].
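
As a rough illustration of the checksum part of that verification, the following Java sketch computes a SHA-512 digest and compares it with a published value. The class Sha512CheckSketch and its methods are hypothetical names, and the sketch assumes the published value is the bare hex digest; it does not cover the ASC (GPG) signature check, for which the instructions at [1] remain the reference.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch only; class and method names are hypothetical.
final class Sha512CheckSketch {

  /** Computes the lowercase hex SHA-512 digest of the given bytes. */
  static String sha512Hex(byte[] data) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-512");
      StringBuilder hex = new StringBuilder();
      for (byte b : md.digest(data)) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-512 is a standard algorithm name, so this is not expected.
      throw new IllegalStateException(e);
    }
  }

  /** Compares digests, ignoring case and any whitespace in the published value. */
  static boolean matches(String computedHex, String publishedHex) {
    return computedHex.equalsIgnoreCase(publishedHex.replaceAll("\\s+", ""));
  }

  public static void main(String[] args) throws Exception {
    // args[0]: path to the downloaded archive; args[1]: published SHA-512 hex value.
    // Reads the whole file into memory, which is fine for a sketch.
    byte[] archive = Files.readAllBytes(Path.of(args[0]));
    System.out.println(matches(sha512Hex(archive), args[1])
        ? "checksum OK" : "checksum MISMATCH");
  }
}
```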

The release candidate comprises a .zip and tar.gz archive of the sources at
[2] and complementary binary distributions. In addition, a staged maven
repository is available at [3].

The Nutch 1.20 release report is available at [4].

Please vote on releasing this package as Apache Nutch 1.20. The vote is
open for at least the next 72 hours and passes if a majority of at least
three +1 Nutch PMC votes are cast.

[ ] +1 Release this package as Apache Nutch 1.20.

[ ] -1 Do not release this package because…
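
One possible reading of the passing rule stated above can be sketched in Java as follows. It assumes that "majority" means more binding +1 than binding -1 votes, in addition to the minimum of three +1 PMC votes; the class and method names are hypothetical.

```java
// Illustrative sketch only; names are hypothetical.
final class ReleaseVoteSketch {

  /**
   * A vote passes when at least three binding (PMC) +1 votes are cast
   * and binding +1 votes outnumber binding -1 votes.
   */
  static boolean passes(int bindingPlusOnes, int bindingMinusOnes) {
    return bindingPlusOnes >= 3 && bindingPlusOnes > bindingMinusOnes;
  }

  public static void main(String[] args) {
    System.out.println(passes(3, 0)); // three +1 votes, no -1 votes
    System.out.println(passes(2, 0)); // fewer than three +1 votes
    System.out.println(passes(3, 3)); // no majority
  }
}
```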

Cheers,
lewismc
P.S. Here is my +1.

[0] https://dist.apache.org/repos/dist/dev/nutch/1.20
[1] http://nutch.apache.org/downloads.html#verify
[2] https://github.com/apache/nutch/tree/release-1.20
[3] https://repository.apache.org/content/repositories/orgapachenutch-1021/
[4] https://s.apache.org/ovjf3

--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


[jira] [Resolved] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-08 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3038.
-
Resolution: Fixed

> Address issues discovered during 1.20 release management dryrun
> ---
>
> Key: NUTCH-3038
> URL: https://issues.apache.org/jira/browse/NUTCH-3038
> Project: Nutch
> Issue Type: Task
> Components: build, docker
> Affects Versions: 1.20
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Blocker
> Fix For: 1.20
>
>
> During the 1.20 release management dryrun I discovered the following issues 
> which I think should be addressed in order to be satisfied with the release 
> candidate
>  # Update docker/README to remove broken badge
>  # Upgrade alpine base image in docker/Dockerfile
>  # Migrate CHANGES.txt to CHANGES.md
>  # Upgrade apache parent pom version from 23 to 31
>  # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
>  # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
> ivy/mvn.template
>  # Remove miredot plugin usage from ivy/mvn.template





[jira] [Closed] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-08 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3038.
---

Thanks [~snagel] 

> Address issues discovered during 1.20 release management dryrun
> ---
>
> Key: NUTCH-3038
> URL: https://issues.apache.org/jira/browse/NUTCH-3038
> Project: Nutch
> Issue Type: Task
> Components: build, docker
> Affects Versions: 1.20
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Blocker
> Fix For: 1.20
>
>
> During the 1.20 release management dryrun I discovered the following issues 
> which I think should be addressed in order to be satisfied with the release 
> candidate
>  # Update docker/README to remove broken badge
>  # Upgrade alpine base image in docker/Dockerfile
>  # Migrate CHANGES.txt to CHANGES.md
>  # Upgrade apache parent pom version from 23 to 31
>  # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
>  # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
> ivy/mvn.template
>  # Remove miredot plugin usage from ivy/mvn.template





[jira] [Work stopped] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-08 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3038 stopped by Lewis John McGibbney.
---
> Address issues discovered during 1.20 release management dryrun
> ---
>
> Key: NUTCH-3038
> URL: https://issues.apache.org/jira/browse/NUTCH-3038
> Project: Nutch
> Issue Type: Task
> Components: build, docker
> Affects Versions: 1.20
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Blocker
> Fix For: 1.20
>
>
> During the 1.20 release management dryrun I discovered the following issues 
> which I think should be addressed in order to be satisfied with the release 
> candidate
>  # Update docker/README to remove broken badge
>  # Upgrade alpine base image in docker/Dockerfile
>  # Migrate CHANGES.txt to CHANGES.md
>  # Upgrade apache parent pom version from 23 to 31
>  # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
>  # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
> ivy/mvn.template
>  # Remove miredot plugin usage from ivy/mvn.template





Mentor request for lewismc

2024-04-07 Thread lewis john mcgibbney
Hi Nutch PMC,
Please acknowledge and approve my request to mentor this year's GSoC program.
An ACK is sufficient.
Thank you
lewismc


[jira] [Work started] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-05 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3038 started by Lewis John McGibbney.
---
> Address issues discovered during 1.20 release management dryrun
> ---
>
> Key: NUTCH-3038
> URL: https://issues.apache.org/jira/browse/NUTCH-3038
> Project: Nutch
> Issue Type: Task
> Components: build, docker
> Affects Versions: 1.20
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Blocker
> Fix For: 1.20
>
>
> During the 1.20 release management dryrun I discovered the following issues 
> which I think should be addressed in order to be satisfied with the release 
> candidate
>  # Update docker/README to remove broken badge
>  # Upgrade alpine base image in docker/Dockerfile
>  # Migrate CHANGES.txt to CHANGES.md
>  # Upgrade apache parent pom version from 23 to 31
>  # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
>  # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
> ivy/mvn.template
>  # Remove miredot plugin usage from ivy/mvn.template





[jira] [Updated] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-05 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3038:

Description: 
During the 1.20 release management dryrun I discovered the following issues 
which I think should be addressed in order to be satisfied with the release 
candidate
 # Update docker/README to remove broken badge
 # Upgrade alpine base image in docker/Dockerfile
 # Migrate CHANGES.txt to CHANGES.md
 # Upgrade apache parent pom version from 23 to 31
 # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
 # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
ivy/mvn.template
 # Remove miredot plugin usage from ivy/mvn.template

  was:
During the 1.20 release management dryrun I discovered the following issues 
which I think should be addressed in order to be satisfied with the release 
candidate
 # Update docker/README to remove broken badge
 # Upgrade alpine base image in docker/Dockerfile
 # Migrate CHANGES.txt to CHANGES.md
 # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
 # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
ivy/mvn.template
 # Remove miredot plugin usage from ivy/mvn.template


> Address issues discovered during 1.20 release management dryrun
> ---
>
> Key: NUTCH-3038
> URL: https://issues.apache.org/jira/browse/NUTCH-3038
> Project: Nutch
> Issue Type: Task
> Components: build, docker
> Affects Versions: 1.20
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Blocker
> Fix For: 1.20
>
>
> During the 1.20 release management dryrun I discovered the following issues 
> which I think should be addressed in order to be satisfied with the release 
> candidate
>  # Update docker/README to remove broken badge
>  # Upgrade alpine base image in docker/Dockerfile
>  # Migrate CHANGES.txt to CHANGES.md
>  # Upgrade apache parent pom version from 23 to 31
>  # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
>  # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
> ivy/mvn.template
>  # Remove miredot plugin usage from ivy/mvn.template





[jira] [Created] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-05 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3038:
---

Summary: Address issues discovered during 1.20 release management dryrun
Key: NUTCH-3038
URL: https://issues.apache.org/jira/browse/NUTCH-3038
Project: Nutch
Issue Type: Task
Components: build, docker
Affects Versions: 1.20
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Fix For: 1.20


During the 1.20 release management dryrun I discovered the following issues 
which I think should be addressed in order to be satisfied with the release 
candidate
 # Update docker/README to remove broken badge
 # Upgrade alpine base image in docker/Dockerfile
 # Migrate CHANGES.txt to CHANGES.md
 # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
 # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
ivy/mvn.template
 # Remove miredot plugin usage from ivy/mvn.template





[jira] [Closed] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-04-04 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3032.
---

Thanks [~jglvary] and congratulations on your first contribution to Apache 
Nutch :)

> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Reporter: Joe Gilvary
> Assignee: Joe Gilvary
> Priority: Major
> Labels: indexing
> Fix For: 1.20
>
> Attachments: NUTCH-3032.patch
>
>
> It could be helpful to let end users manipulate information at indexing time 
> with their own code without the need for writing their own indexing plugin. I 
> mentioned this on the dev mailing list 
> (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some 
> description of my work in progress.
> One potential use is to address some of the same concerns that NUTCH-585 
> discusses regarding an alternative approach to picking and choosing which 
> content to index, but this approach would allow making index time decisions, 
> rather than setting the configuration for all content at the start of the 
> indexing run.
>  





[jira] [Updated] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3032:

Fix Version/s: 1.20

> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Reporter: Joe Gilvary
> Assignee: Joe Gilvary
> Priority: Major
> Labels: indexing
> Fix For: 1.20
>
> Attachments: NUTCH-3032.patch
>
>
> It could be helpful to let end users manipulate information at indexing time 
> with their own code without the need for writing their own indexing plugin. I 
> mentioned this on the dev mailing list 
> (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some 
> description of my work in progress.
> One potential use is to address some of the same concerns that NUTCH-585 
> discusses regarding an alternative approach to picking and choosing which 
> content to index, but this approach would allow making index time decisions, 
> rather than setting the configuration for all content at the start of the 
> indexing run.
>  





[jira] [Assigned] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-3032:
---

Assignee: Joe Gilvary

> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Reporter: Joe Gilvary
> Assignee: Joe Gilvary
> Priority: Major
> Labels: indexing
> Attachments: NUTCH-3032.patch
>
>
> It could be helpful to let end users manipulate information at indexing time 
> with their own code without the need for writing their own indexing plugin. I 
> mentioned this on the dev mailing list 
> (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some 
> description of my work in progress.
> One potential use is to address some of the same concerns that NUTCH-585 
> discusses regarding an alternative approach to picking and choosing which 
> content to index, but this approach would allow making index time decisions, 
> rather than setting the configuration for all content at the start of the 
> indexing run.
>  





[jira] [Work stopped] (NUTCH-2856) Implement a protocol-smb plugin based on hierynomus/smbj

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2856 stopped by Lewis John McGibbney.
---
> Implement a protocol-smb plugin based on hierynomus/smbj
> 
>
> Key: NUTCH-2856
> URL: https://issues.apache.org/jira/browse/NUTCH-2856
> Project: Nutch
> Issue Type: New Feature
> Components: external, plugin, protocol
> Reporter: Hiran Chaudhuri
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.20
>
>
> The plugin protocol-smb advertised on 
> [https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral] actually 
> refers to the JCIFS library. According to this library's homepage 
> [https://www.jcifs.org/]:
> _If you're looking for the latest and greatest open source Java SMB library, 
> this is not it. JCIFS has been in maintenance-mode-only for several years and 
> although what it does support works fine (SMB1, NTLMv2, midlc, MSRPC and 
> various utility classes), jCIFS does not support the newer SMB2/3 variants of 
> the SMB protocol which is slowly becoming required (Windows 10 requires 
> SMB2/3). JCIFS only supports SMB1 but Microsoft has deprecated SMB1 in their 
> products. *So if SMB1 is disabled on your network, JCIFS' file related 
> operations will NOT work.*_
> Looking at 
> [https://en.wikipedia.org/wiki/Server_Message_Block#SMB_/_CIFS_/_SMB1]:
> _Microsoft added SMB1 to the Windows Server 2012 R2 deprecation list in June 
> 2013. Windows Server 2016 and some versions of Windows 10 Fall Creators 
> Update do not have SMB1 installed by default._
> In conclusion, the chances that the SMB1 protocol is installed and/or 
> configured are shrinking rapidly. Therefore some migration towards 
> SMB2/3 is required. Luckily the JCIFS homepage lists alternatives:
>  * [jcifs-codelibs|https://github.com/codelibs/jcifs]
>  * [jcifs-ng|https://github.com/AgNO3/jcifs-ng]
>  * [smbj|https://github.com/hierynomus/smbj]





[jira] [Work stopped] (NUTCH-2887) Migrate to JUnit 5 Jupiter

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2887 stopped by Lewis John McGibbney.
---
> Migrate to JUnit 5 Jupiter
> --
>
> Key: NUTCH-2887
> URL: https://issues.apache.org/jira/browse/NUTCH-2887
> Project: Nutch
> Issue Type: Improvement
> Components: test
> Environment: Migrate
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.20
>
>
> This effort is a bit of a beast. See the [JUnit migration 
> tips|https://junit.org/junit5/docs/current/user-guide/#migrating-from-junit4-tips]
>  for general guidance. A general grep for junit in src produces the following
> {code:bash}
> ./test/nutch-site.xml
> ./test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java
> ./test/org/apache/nutch/net/TestURLNormalizers.java
> ./test/org/apache/nutch/net/protocols/TestHttpDateFormat.java
> ./test/org/apache/nutch/net/TestURLFilters.java
> ./test/org/apache/nutch/util/TestStringUtil.java
> ./test/org/apache/nutch/util/TestSuffixStringMatcher.java
> ./test/org/apache/nutch/util/TestEncodingDetector.java
> ./test/org/apache/nutch/util/TestMimeUtil.java
> ./test/org/apache/nutch/util/TestPrefixStringMatcher.java
> ./test/org/apache/nutch/util/DumpFileUtilTest.java
> ./test/org/apache/nutch/util/TestNodeWalker.java
> ./test/org/apache/nutch/util/WritableTestUtils.java
> ./test/org/apache/nutch/util/TestTableUtil.java
> ./test/org/apache/nutch/util/TestURLUtil.java
> ./test/org/apache/nutch/util/TestGZIPUtils.java
> ./test/org/apache/nutch/parse/TestParseText.java
> ./test/org/apache/nutch/parse/TestOutlinks.java
> ./test/org/apache/nutch/parse/TestParseData.java
> ./test/org/apache/nutch/parse/TestOutlinkExtractor.java
> ./test/org/apache/nutch/parse/TestParserFactory.java
> ./test/org/apache/nutch/segment/TestSegmentMerger.java
> ./test/org/apache/nutch/segment/TestSegmentMergerCrawlDatums.java
> ./test/org/apache/nutch/plugin/TestPluginSystem.java
> ./test/org/apache/nutch/fetcher/TestFetcher.java
> ./test/org/apache/nutch/protocol/TestProtocolFactory.java
> ./test/org/apache/nutch/protocol/TestContent.java
> ./test/org/apache/nutch/protocol/AbstractHttpProtocolPluginTest.java
> ./test/org/apache/nutch/crawl/TestCrawlDbFilter.java
> ./test/org/apache/nutch/crawl/TestTextProfileSignature.java
> ./test/org/apache/nutch/crawl/TestCrawlDbStates.java
> ./test/org/apache/nutch/crawl/TestGenerator.java
> ./test/org/apache/nutch/crawl/TestAdaptiveFetchSchedule.java
> ./test/org/apache/nutch/crawl/TODOTestCrawlDbStates.java
> ./test/org/apache/nutch/crawl/TestSignatureFactory.java
> ./test/org/apache/nutch/crawl/ContinuousCrawlTestUtil.java
> ./test/org/apache/nutch/crawl/TestInjector.java
> ./test/org/apache/nutch/crawl/TestLinkDbMerger.java
> ./test/org/apache/nutch/crawl/TestCrawlDbMerger.java
> ./test/org/apache/nutch/service/TestNutchServer.java
> ./test/org/apache/nutch/metadata/TestMetadata.java
> ./test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java
> ./test/org/apache/nutch/indexer/TestIndexingFilters.java
> ./test/org/apache/nutch/indexer/TestIndexerMapReduce.java
> ./bin/nutch
> ./plugin/scoring-orphan/src/test/org/apache/nutch/scoring/orphan/TestOrphanScoringFilter.java
> ./plugin/index-basic/src/test/org/apache/nutch/indexer/basic/TestBasicIndexingFilter.java
> ./plugin/urlfilter-domaindenylist/build.xml
> ./plugin/urlfilter-domaindenylist/src/test/org/apache/nutch/urlfilter/domaindenylist/TestDomainDenylistURLFilter.java
> ./plugin/protocol-imaps/plugin.xml
> ./plugin/protocol-imaps/ivy.xml
> ./plugin/protocol-imaps/lib/junit-4.13.jar
> ./plugin/protocol-imaps/lib/greenmail-junit4-1.6.0.jar
> ./plugin/protocol-imaps/lib/greenmail-1.6.0.jar
> ./plugin/protocol-imaps/src/test/org/apache/nutch/protocol/imaps/TestImaps.java
> ./plugin/protocol-file/build.xml
> ./plugin/protocol-file/src/test/org/apache/nutch/protocol/file/TestProtocolFile.java
> ./plugin/urlnormalizer-regex/build.xml
> ./plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java
> ./plugin/build-plugin.xml
> ./plugin/creativecommons/src/test/org/creativecommons/nutch/TestCCParseFilter.java
> ./plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic/TestBasicURLNormalizer.java
> ./plugin/urlnormalizer-protocol/build.xml
> ./plugin/urlnormalizer-protocol/src/test/org/apache/nutch/net/urlnormalizer/protocol/TestProtocolURLNormalizer.java
> ./plugin/urlfilter-prefix/src/test/org/apache/nutch/urlfilter/prefi

[jira] [Closed] (NUTCH-2832) Create tutorial on sending Nutch logs to Elasticsearch

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2832.
---

> Create tutorial on sending Nutch logs to Elasticsearch
> --
>
> Key: NUTCH-2832
> URL: https://issues.apache.org/jira/browse/NUTCH-2832
> Project: Nutch
> Issue Type: New Feature
> Components: configuration, deployment
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.20
>
>
> A while back I used to use [Chukwa|https://chukwa.apache.org/] for log 
> aggregation and analysis. Chukwa is now retired. 
> I did a bit of research into logging directly from Log4j2 to Elasticsearch and came 
> across 
> [log4j2-elasticsearch|https://github.com/rfoltyns/log4j2-elasticsearch] which 
> looks pretty simple.
> I'm going to have a crack at implementing this functionality as a 
> configuration option. 





[jira] [Resolved] (NUTCH-2832) Create tutorial on sending Nutch logs to Elasticsearch

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2832.
-
Resolution: Won't Fix

Given the license changes to the backend concerned, I no longer have any interest 
in implementing this.

> Create tutorial on sending Nutch logs to Elasticsearch
> --
>
> Key: NUTCH-2832
> URL: https://issues.apache.org/jira/browse/NUTCH-2832
> Project: Nutch
> Issue Type: New Feature
> Components: configuration, deployment
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.20
>
>
> A while back I used to use [Chukwa|https://chukwa.apache.org/] for log 
> aggregation and analysis. Chukwa is now retired. 
> I did a bit of research into logging directly from Log4j2 to Elasticsearch and came 
> across 
> [log4j2-elasticsearch|https://github.com/rfoltyns/log4j2-elasticsearch] which 
> looks pretty simple.
> I'm going to have a crack at implementing this functionality as a 
> configuration option. 





[jira] [Resolved] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3036.
-
Resolution: Fixed

> Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium
> 
>
> Key: NUTCH-3036
> URL: https://issues.apache.org/jira/browse/NUTCH-3036
> Project: Nutch
> Issue Type: Improvement
> Components: plugin, selenium
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.20
>
>
> lib-selenium currently packages org.seleniumhq.selenium:selenium-java 
> *v4.7.2* but *v4.18.1* is available on Maven Central.
> This ticket will upgrade the java dependency and validate that both 
> protocol-selenium and protocol-interactiveselenium work as expected in local 
> mode and via selenium grid.





[jira] [Closed] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3036.
---

> Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium
> 
>
> Key: NUTCH-3036
> URL: https://issues.apache.org/jira/browse/NUTCH-3036
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, selenium
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> lib-selenium currently packages org.seleniumhq.selenium:selenium-java 
> *v4.7.2* but *v4.18.1* is available on Maven Central.
> This ticket will upgrade the java dependency and validate that both 
> protocol-selenium and protocol-interactiveselenium work as expected in local 
> mode and via selenium grid.





[jira] [Closed] (NUTCH-3035) Update license and notice file for release of 1.20

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3035.
---

> Update license and notice file for release of 1.20 
> ---
>
> Key: NUTCH-3035
> URL: https://issues.apache.org/jira/browse/NUTCH-3035
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Close to the release of 1.20 the license and notice files should be updated 
> to contain all (third-party) licenses of all dependencies. Cf. NUTCH-2290 and 
> NUTCH-2981.





[jira] [Resolved] (NUTCH-3035) Update license and notice file for release of 1.20

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3035.
-
Resolution: Fixed

> Update license and notice file for release of 1.20 
> ---
>
> Key: NUTCH-3035
> URL: https://issues.apache.org/jira/browse/NUTCH-3035
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Close to the release of 1.20 the license and notice files should be updated 
> to contain all (third-party) licenses of all dependencies. Cf. NUTCH-2290 and 
> NUTCH-2981.





[jira] [Resolved] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3037.
-
Resolution: Fixed

> Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
> --
>
> Key: NUTCH-3037
> URL: https://issues.apache.org/jira/browse/NUTCH-3037
> Project: Nutch
>  Issue Type: Task
>  Components: indexer-kafka
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> We depend on v1.1.0 which is quite a bit behind the current v3.7.0 artifact, 
> I therefore propose to upgrade.
> I will also state that a _*kafka_2.13*_ artifact exists. This would demand 
> that the underlying Scala version be also upgraded... but I think this should 
> be addressed in a separate ticket.





[jira] [Closed] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3037.
---

> Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
> --
>
> Key: NUTCH-3037
> URL: https://issues.apache.org/jira/browse/NUTCH-3037
> Project: Nutch
>  Issue Type: Task
>  Components: indexer-kafka
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> We depend on v1.1.0 which is quite a bit behind the current v3.7.0 artifact, 
> I therefore propose to upgrade.
> I will also state that a _*kafka_2.13*_ artifact exists. This would demand 
> that the underlying Scala version be also upgraded... but I think this should 
> be addressed in a separate ticket.





[jira] [Work stopped] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0

2024-03-21 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3037 stopped by Lewis John McGibbney.
---
> Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
> --
>
> Key: NUTCH-3037
> URL: https://issues.apache.org/jira/browse/NUTCH-3037
> Project: Nutch
>  Issue Type: Task
>  Components: indexer-kafka
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> We depend on v1.1.0 which is quite a bit behind the current v3.7.0 artifact, 
> I therefore propose to upgrade.
> I will also state that a _*kafka_2.13*_ artifact exists. This would demand 
> that the underlying Scala version be also upgraded... but I think this should 
> be addressed in a separate ticket.





[jira] [Updated] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0

2024-03-21 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3037:

Flags: Patch

> Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
> --
>
> Key: NUTCH-3037
> URL: https://issues.apache.org/jira/browse/NUTCH-3037
> Project: Nutch
>  Issue Type: Task
>  Components: indexer-kafka
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> We depend on v1.1.0 which is quite a bit behind the current v3.7.0 artifact, 
> I therefore propose to upgrade.
> I will also state that a _*kafka_2.13*_ artifact exists. This would demand 
> that the underlying Scala version be also upgraded... but I think this should 
> be addressed in a separate ticket.





[jira] [Work started] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0

2024-03-21 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3037 started by Lewis John McGibbney.
---
> Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
> --
>
> Key: NUTCH-3037
> URL: https://issues.apache.org/jira/browse/NUTCH-3037
> Project: Nutch
>  Issue Type: Task
>  Components: indexer-kafka
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> We depend on v1.1.0 which is quite a bit behind the current v3.7.0 artifact, 
> I therefore propose to upgrade.
> I will also state that a _*kafka_2.13*_ artifact exists. This would demand 
> that the underlying Scala version be also upgraded... but I think this should 
> be addressed in a separate ticket.





[jira] [Created] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0

2024-03-21 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3037:
---

 Summary: Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
 Key: NUTCH-3037
 URL: https://issues.apache.org/jira/browse/NUTCH-3037
 Project: Nutch
  Issue Type: Task
  Components: indexer-kafka
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


We depend on v1.1.0 which is quite a bit behind the current v3.7.0 artifact, I 
therefore propose to upgrade.

I will also state that a _*kafka_2.13*_ artifact exists. This would demand that 
the underlying Scala version be also upgraded... but I think this should be 
addressed in a separate ticket.





[jira] [Work stopped] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium

2024-03-14 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3036 stopped by Lewis John McGibbney.
---
> Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium
> 
>
> Key: NUTCH-3036
> URL: https://issues.apache.org/jira/browse/NUTCH-3036
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, selenium
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> lib-selenium currently packages org.seleniumhq.selenium:selenium-java 
> *v4.7.2* but *v4.18.1* is available on Maven Central.
> This ticket will upgrade the java dependency and validate that both 
> protocol-selenium and protocol-interactiveselenium work as expected in local 
> mode and via selenium grid.





[jira] [Work started] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium

2024-03-14 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3036 started by Lewis John McGibbney.
---
> Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium
> 
>
> Key: NUTCH-3036
> URL: https://issues.apache.org/jira/browse/NUTCH-3036
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, selenium
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> lib-selenium currently packages org.seleniumhq.selenium:selenium-java 
> *v4.7.2* but *v4.18.1* is available on Maven Central.
> This ticket will upgrade the java dependency and validate that both 
> protocol-selenium and protocol-interactiveselenium work as expected in local 
> mode and via selenium grid.





[jira] [Created] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium

2024-03-14 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3036:
---

 Summary: Upgrade org.seleniumhq.selenium:selenium-java dependency 
in lib-selenium
 Key: NUTCH-3036
 URL: https://issues.apache.org/jira/browse/NUTCH-3036
 Project: Nutch
  Issue Type: Improvement
  Components: selenium, plugin
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


lib-selenium currently packages org.seleniumhq.selenium:selenium-java *v4.7.2* 
but *v4.18.1* is available on Maven Central.

This ticket will upgrade the java dependency and validate that both 
protocol-selenium and protocol-interactiveselenium work as expected in local 
mode and via selenium grid.





[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-13 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826776#comment-17826776
 ] 

Lewis John McGibbney commented on NUTCH-3029:
-

Hi [~martin.dj] [~markus17] it looks like we are missing some Javadoc

 
{quote} [javadoc] Standard Doclet version 11.0.22 {quote}
{quote} [javadoc] Building tree for all the packages and classes... 
 [javadoc] /home/runner/work/nutch/nutch/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java:193: warning: no @param for url 
 [javadoc] public static String getHostName(String url) throws URISyntaxException { 
 [javadoc] ^ 
 [javadoc] /home/runner/work/nutch/nutch/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java:193: warning: no @return 
 [javadoc] public static String getHostName(String url) throws URISyntaxException { 
 [javadoc] ^ 
 [javadoc] /home/runner/work/nutch/nutch/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java:193: warning: no @throws for java.net.URISyntaxException 
 [javadoc] public static String getHostName(String url) throws URISyntaxException { 
 [javadoc] ^ 
 [javadoc] /home/runner/work/nutch/nutch/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java:205: warning: no @return 
 [javadoc] public float getMaxInterval(Text url, float defaultMaxInterval){ 
 [javadoc] ^ 
 [javadoc] /home/runner/work/nutch/nutch/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java:227: warning: no @return 
 [javadoc] public float getMinInterval(Text url, float defaultMinInterval){ 
{quote}
{quote} [javadoc] ^{quote}
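
For reference, here is a minimal sketch of how the three warnings reported for getHostName could be resolved. The method signature is taken from the Javadoc output; the method body and the holder class name HostNameJavadocSketch are assumptions made for illustration and may differ from the actual code in AdaptiveFetchSchedule.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical holder class, used only to keep this sketch self-contained.
final class HostNameJavadocSketch {

  /**
   * Extracts the host name from a URL string.
   *
   * @param url the URL to parse, e.g. {@code https://nutch.apache.org/}
   * @return the host component of the URL, or {@code null} if it has none
   * @throws URISyntaxException if {@code url} is not a syntactically valid URI
   */
  public static String getHostName(String url) throws URISyntaxException {
    // Assumed body: delegate the parsing to java.net.URI.
    return new URI(url).getHost();
  }
}
```

The two remaining "no @return" warnings, for getMaxInterval and getMinInterval, could presumably be addressed the same way by adding an @return tag to each method's comment.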
 

> Host specific max. and min. intervals in adaptive scheduler
> ---
>
> Key: NUTCH-3029
> URL: https://issues.apache.org/jira/browse/NUTCH-3029
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.19, 1.20
>Reporter: Martin Djukanovic
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: adaptive-host-specific-intervals.txt.template, 
> new_adaptive_fetch_schedule-1.patch
>
>
> This patch implements custom max. and min. refetching intervals for specific 
> hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt 
> configuration file (template also attached).
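
To illustrate the idea of host-specific intervals, here is a rough sketch and not the attached patch. The one-line-per-host format "host min max" (whitespace-separated, intervals in seconds, '#' for comments), the class name HostIntervalsSketch, and the String-based method signatures are all assumptions; the real template file and AdaptiveFetchSchedule methods may look different.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a per-host interval lookup with fallback to defaults.
final class HostIntervalsSketch {

  // Maps a lower-cased host name to {minInterval, maxInterval} in seconds.
  private final Map<String, float[]> byHost = new HashMap<>();

  /** Loads lines of the assumed form "host min max"; blanks and '#' comments are skipped. */
  public void load(Iterable<String> lines) {
    for (String raw : lines) {
      String line = raw.trim();
      if (line.isEmpty() || line.startsWith("#")) {
        continue;
      }
      String[] parts = line.split("\\s+");
      if (parts.length != 3) {
        continue; // ignore malformed lines in this sketch
      }
      float min = Float.parseFloat(parts[1]);
      float max = Float.parseFloat(parts[2]);
      byHost.put(parts[0].toLowerCase(Locale.ROOT), new float[] {min, max});
    }
  }

  /** Returns the host-specific minimum interval, or the default if the host is not listed. */
  public float getMinInterval(String host, float defaultMinInterval) {
    float[] v = byHost.get(host.toLowerCase(Locale.ROOT));
    return v == null ? defaultMinInterval : v[0];
  }

  /** Returns the host-specific maximum interval, or the default if the host is not listed. */
  public float getMaxInterval(String host, float defaultMaxInterval) {
    float[] v = byHost.get(host.toLowerCase(Locale.ROOT));
    return v == null ? defaultMaxInterval : v[1];
  }
}
```

Falling back to the globally configured defaults for unlisted hosts means that an empty or missing configuration file leaves the scheduler's behaviour unchanged.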





[jira] [Closed] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-13 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3033.
---

> Upgrade Ivy to v2.5.2
> -
>
> Key: NUTCH-3033
> URL: https://issues.apache.org/jira/browse/NUTCH-3033
> Project: Nutch
>  Issue Type: Task
>  Components: ivy
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> Ivy v2.5.2 was released August 20th 2023. Let’s upgrade.
> [https://ant.apache.org/ivy/history/2.5.2/release-notes.html]





[jira] [Resolved] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-13 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3033.
-
Resolution: Fixed

> Upgrade Ivy to v2.5.2
> -
>
> Key: NUTCH-3033
> URL: https://issues.apache.org/jira/browse/NUTCH-3033
> Project: Nutch
>  Issue Type: Task
>  Components: ivy
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> Ivy v2.5.2 was released August 20th 2023. Let’s upgrade.
> [https://ant.apache.org/ivy/history/2.5.2/release-notes.html]





Re: [DISCUSS] Release Nutch 1.20

2024-03-12 Thread Lewis John McGibbney
I submitted a patch for the Ivy 2.5.2 upgrade. If folks could have a look at 
that it would be ideal.
https://github.com/apache/nutch/pull/803
I am free to roll a release candidate towards the end of this week.
lewismc

On 2024/03/10 15:08:36 Lewis John McGibbney wrote:
> Nice  
> I see that we are a couple of releases behind on Ivy as well, so I’ll submit a 
> patch for that.
> I can push this release this time. It’s been a while since I exercised the 
> workflow and it would be good to blow away the cobwebs.
> lewismc
> 
> On 2024/03/10 11:55:20 Markus Jelsma wrote:
> > Good idea! I'll finish work on three open issues the next week.
> > 
> > Op za 9 mrt 2024 om 13:02 schreef Sebastian Nagel <
> > wastl.na...@googlemail.com>:
> > 
> > > Hi Lewis,
> > >
> > > yes, of course!
> > >
> > > Some points we should do before the release:
> > >
> > > - address the ES licensing issue,
> > >the easiest way is to downgrade, see NUTCH-3008
> > >If done update the license-related files.
> > >
> > > - there are three short PRs open
> > >
> > > I'll try to have a look at these points the next days.
> > >
> > > Best,
> > > Sebastian
> > >
> > >
> > > On 3/8/24 01:43, lewis john mcgibbney wrote:
> > > > Hi dev@,
> > > > As of today, 51 issues have been addressed in the 1.20 development 
> > > > drive.
> > > > https://issues.apache.org/jira/projects/NUTCH/versions/12352190
> > > > <https://issues.apache.org/jira/projects/NUTCH/versions/12352190>
> > > > I would like to push a release soon and ship it to the user community.
> > > > Any objections?
> > > > Thank you
> > > > lewismc
> > > >
> > >
> > 
> 


[jira] [Updated] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3033:

Due Date: 12/Mar/24  (was: 11/Mar/24)

> Upgrade Ivy to v2.5.2
> -
>
> Key: NUTCH-3033
> URL: https://issues.apache.org/jira/browse/NUTCH-3033
> Project: Nutch
>  Issue Type: Task
>  Components: ivy
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> Ivy v2.5.2 was released August 20th 2023. Let’s upgrade.
> [https://ant.apache.org/ivy/history/2.5.2/release-notes.html]





[jira] [Work stopped] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3033 stopped by Lewis John McGibbney.
---
> Upgrade Ivy to v2.5.2
> -
>
> Key: NUTCH-3033
> URL: https://issues.apache.org/jira/browse/NUTCH-3033
> Project: Nutch
>  Issue Type: Task
>  Components: ivy
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> Ivy v2.5.2 was released August 20th 2023. Let’s upgrade.
> [https://ant.apache.org/ivy/history/2.5.2/release-notes.html]





[GSoC 2024 PROPOSAL] Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread lewis john mcgibbney
Hi user@ & dev@,

I decided to write up a GSoC’24 proposal and encourage interested
applicants to register their interest in the JIRA issue or else reach
out to the Nutch PMC over on dev@nutch.apache.org (please CC
lewi...@apache.org).

Title: Overhaul the legacy Nutch plugin framework and replace it with PF4J
JIRA: https://issues.apache.org/jira/browse/NUTCH-3034

Thanks in advance, and good luck to prospective GSoC applicants.

lewismc

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # some aspects, e.g. examples, are [fairly well 
documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|[https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]]…
 currently 7 tests as of writing. Traditionally, developers have focused on 
providing unit tests on the plugin-level as opposed to the legacy plugin 
framework.
 # the framework sees very little maintenance/attention. It is my gut feeling 
(and I may be totally wrong here) that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system but, that being said, it is clunky 
nonetheless.
 # generally speaking, any reduction of code in the Nutch codebase through 
careful selection of, and dependence on, well-maintained, well-tested 3rd party 
libraries would be a good thing.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 # {*}Document the legacy Nutch plugin lifecycle{*}: taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|https://pf4j.org/doc/plugin-lifecycle.html], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. It might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :). Generally 
speaking, just familiarize oneself with the legacy plugin framework and 
understand where the gaps are.
 # {*}Study the PF4J framework and perform a feasibility study{*}: this will 
provide an opportunity to identify gaps between what the legacy plugin 
framework does (and what Nutch needs) vs. what PF4J provides. Touch base with 
the PF4J community, describe the intention to replace the legacy Nutch plugin 
framework with PF4J. Obtain guidance on how to proceed. Document this all in 
the Nutch wiki. Create mapping of [legacy 
Classes|[https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]]
 to [PF4J 
equivalents|[https://github.com/pf4j/pf4j/tree/master/pf4j/src/main/java/org/pf4j]].
 # {*}Restructure the legacy Nutch plugin package{*}: 
[https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]
 # {*}Restructure each plugin in the plugins directory{*}: 
[https://github.com/apache/nutch/tree/master/src/plugin]
 # *Update Nutch plugin documentation* 
 # {*}Create/propose plugin utility tooling{*}: #4 in the motivation section 
states that developing plugins is clunky. A utility tool which streamlines the 
creation of new plugins would be ideal. For example, this could take the form 
of a [new bash script|[https://github.com/apache/nutch/tree/master/src/bin]] 
which prompts the developer for input and then generates the plugin skeleton. 
{*}This is a nice to have{*}.

h1. Google Summer of Code Details

This initiative is being proposed as a GSoC 2024 project. 

{*}Proposed Mentor{*}: [~lewismc] 

{*}Proposed Co-Mentor{*}:

 

  was:
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # some aspects, e.g. examples, are [fairly well 
documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|[https

[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # some aspects, e.g. examples, are [fairly well 
documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|[https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]]…
 currently 7 tests as of writing. Traditionally, developers have focused on 
providing unit tests on the plugin-level as opposed to the legacy plugin 
framework.
 # the framework sees very little maintenance/attention. It is my gut feeling 
(and I may be totally wrong here) that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system but, that being said, it is clunky 
nonetheless.
 # generally speaking, any reduction of code in the Nutch codebase through 
careful selection of, and dependence on, well-maintained, well-tested 3rd party 
libraries would be a good thing.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 # {*}Document the legacy Nutch plugin lifecycle{*}: taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|https://pf4j.org/doc/plugin-lifecycle.html], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. It might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :). Generally 
speaking, just familiarize oneself with the legacy plugin framework and 
understand where the gaps are.
 # {*}Study the PF4J framework and perform a feasibility study{*}: this will 
provide an opportunity to identify gaps between what the legacy plugin 
framework does (and what Nutch needs) vs. what PF4J provides. Touch base with 
the PF4J community, describe the intention to replace the legacy Nutch plugin 
framework with PF4J. Obtain guidance on how to proceed. Document this all in 
the Nutch wiki. Create mapping of [legacy 
Classes|[https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]]
 to [PF4J 
equivalents|[https://github.com/pf4j/pf4j/tree/master/pf4j/src/main/java/org/pf4j]].
 # {*}Restructure the legacy Nutch plugin package{*}: 
[https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]
 # {*}Restructure each plugin in the plugins directory{*}: 
[https://github.com/apache/nutch/tree/master/src/plugin]

 
h1. Google Summer of Code Details

This initiative is being proposed as a GSoC 2024 project. 

{*}Proposed Mentor{*}: [~lewismc] 

{*}Proposed Co-Mentor{*}:

 

  was:
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # some aspects, e.g. examples, are [fairly well 
documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|[https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin

[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # some aspects, e.g. examples, are [fairly well 
documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|[https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]]…
 currently 7 tests as of writing. Traditionally, developers have focused on 
providing unit tests on the plugin-level as opposed to the legacy plugin 
framework.
 # the framework sees very little maintenance/attention. It is my gut feeling 
(and I may be totally wrong here) that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system but, that being said, it is clunky 
nonetheless.
 # generally speaking, any reduction of code in the Nutch codebase through 
careful selection of, and dependence on, well-maintained, well-tested 3rd party 
libraries would be a good thing.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 # {*}Document the legacy Nutch plugin lifecycle{*}: taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|https://pf4j.org/doc/plugin-lifecycle.html], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. It might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :). Generally 
speaking, just familiarize oneself with the legacy plugin framework and 
understand where the gaps are.
 # {*}Study the PF4J framework and perform a feasibility study{*}: this will 
provide an opportunity to identify gaps between what the legacy plugin 
framework does (and what Nutch needs) vs. what PF4J provides. Touch base with 
the PF4J community, describe the intention to replace the legacy Nutch plugin 
framework with PF4J. Obtain guidance on how to proceed. Document this all in 
the Nutch wiki. Create mapping of [legacy 
Classes|[https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]]
 to [PF4J 
equivalents|[https://github.com/pf4j/pf4j/tree/master/pf4j/src/main/java/org/pf4j]].
 # {*}Restructure the legacy Nutch plugin package{*}: 
[https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]
 # {*}Restructure each plugin in the plugins directory{*}: 
[https://github.com/apache/nutch/tree/master/src/plugin]
 #  

 


[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # [some aspects e.g. examples, are fairly well 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|[https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]]…
 only 7 tests. Traditionally, developers have focused on providing unit tests 
on the plugin-level as opposed to the legacy plugin framework.
 # sees very little maintenance/attention. It is my gut feeling (and I may be 
totally wrong here) but I _think_ that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but that being said, it is clunky 
nonetheless.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 * {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|[https://pf4j.org/doc/plugin-lifecycle.html]], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. Might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :).
 * {*}study the PF4J framework and perform a feasibility study{*}; this will 
provide an opportunity to identify gaps between what the legacy plugin 
framework does (and what Nutch needs) vs. what PF4J provides. Touch base with 
the PF4J community, describe the intention to replace the legacy Nutch plugin 
framework with PF4J. Obtain guidance on how to proceed. Document this all in 
the Nutch wiki.
 *  

 


[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # [some aspects e.g. examples, are fairly well 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|[https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]]…
 only 7 tests. Traditionally, developers have focused on providing unit tests 
on the plugin-level as opposed to the legacy plugin framework.
 # sees very little maintenance/attention. It is my gut feeling (and I may be 
totally wrong here) but I _think_ that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but that being said, it is clunky 
nonetheless.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 * {*}perform feasibility study{*}; touch base with the PF4J community, 
describe the intention to replace the legacy Nutch plugin framework with PF4J. 
Obtain guidance on how to proceed. Document this all in the Nutch wiki.
 * {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|[https://pf4j.org/doc/plugin-lifecycle.html]], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. Might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :)
 *  

 


[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # [some aspects e.g. examples, are fairly well 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|[https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]]…
 only 7 tests. Traditionally, developers have focused on providing unit tests 
on the plugin-level as opposed to the legacy plugin framework.
 # sees very little maintenance/attention. It is my gut feeling (and I may be 
totally wrong here) but I _think_ that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but that being said, it is clunky 
nonetheless.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 * {*}perform feasibility study{*}; touch base with the PF4J community, 
describe the intention to replace the legacy Nutch plugin framework with PF4J. 
Obtain guidance on how to proceed. Document this all in the Nutch wiki.
 * {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|[https://pf4j.org/doc/plugin-lifecycle.html]], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. Might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :)
 *  

 


[jira] [Created] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3034:
---

 Summary: Overhaul the legacy Nutch plugin framework and replace it 
with PF4J
 Key: NUTCH-3034
 URL: https://issues.apache.org/jira/browse/NUTCH-3034
 Project: Nutch
  Issue Type: Improvement
  Components: pf4j, plugin
Reporter: Lewis John McGibbney


h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # [some aspects e.g. examples, are fairly well 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|[https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]]…
 only 7 tests. Traditionally, developers have focused on providing unit tests 
on the plugin-level as opposed to the legacy plugin framework.
 # sees very little maintenance/attention. It is my gut feeling (and I may be 
totally wrong here) but I _think_ that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but that being said, it is clunky 
nonetheless.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 * {*}perform feasibility study{*}; touch base with the PF4J community, 
describe the intention to replace the legacy Nutch plugin framework with PF4J. 
Obtain guidance on how to proceed. Document this all in the Nutch wiki.
 * {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|[https://pf4j.org/doc/plugin-lifecycle.html]], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. Might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :)
 *  

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-11 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3033:
---

 Summary: Upgrade Ivy to v2.5.2
 Key: NUTCH-3033
 URL: https://issues.apache.org/jira/browse/NUTCH-3033
 Project: Nutch
  Issue Type: Task
  Components: ivy
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


Ivy v2.5.2 was released August 20th 2023. Let’s upgrade.

[https://ant.apache.org/ivy/history/2.5.2/release-notes.html]





[jira] [Work started] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-11 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3033 started by Lewis John McGibbney.
---
> Upgrade Ivy to v2.5.2
> -
>
> Key: NUTCH-3033
> URL: https://issues.apache.org/jira/browse/NUTCH-3033
> Project: Nutch
>  Issue Type: Task
>  Components: ivy
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> Ivy v2.5.2 was released August 20th 2023. Let’s upgrade.
> [https://ant.apache.org/ivy/history/2.5.2/release-notes.html]





Re: [DISCUSS] Release Nutch 1.20

2024-03-10 Thread Lewis John McGibbney
Nice  
I see that we are a couple of releases behind on Ivy as well, so I’ll submit a 
patch for that.
I can push this release this time. It’s been a while since I exercised the 
workflow and it would be good to blow away the cobwebs.
lewismc

On 2024/03/10 11:55:20 Markus Jelsma wrote:
> Good idea! I'll finish work on three open issues the next week.
> 
> Op za 9 mrt 2024 om 13:02 schreef Sebastian Nagel <
> wastl.na...@googlemail.com>:
> 
> > Hi Lewis,
> >
> > yes, of course!
> >
> > Some points we should do before the release:
> >
> > - address the ES licensing issue,
> >the easiest way is to downgrade, see NUTCH-3008
> >If done update the license-related files.
> >
> > - there are three short PRs open
> >
> > I'll try to have a look at these points the next days.
> >
> > Best,
> > Sebastian
> >
> >
> > On 3/8/24 01:43, lewis john mcgibbney wrote:
> > > Hi dev@,
> > > As of today, 51 issues have been addressed in the 1.20 development drive.
> > > https://issues.apache.org/jira/projects/NUTCH/versions/12352190
> > > I would like to push a release soon and ship it to the user community.
> > > Any objections?
> > > Thank you
> > > lewismc
> > >
> >
> 


Re: Indexing arbitrary fields

2024-03-08 Thread Lewis John McGibbney
Hi Joe,
Thanks for describing your work in detail. It provides a great utility which I 
think could be of immense value.
Please feel free to create a JIRA ticket which can be used as the basis for 
linking to the prior similar examples you referenced.
A WIP pull request would be ideal.
Thanks
lewismc

On 2024/03/08 01:06:18 Joe Gilvary wrote:
> Good day, all,
> 
> I wanted to index some values that I had to derive from fields in the 
> NutchDocument. I started on an indexing plugin. Then I realized I would 
> need more than one, or I could generalize the plugin. I went with the 
> generalizing and wrote a plugin that will use custom POJOs to process & 
> inject whatever the Nutch user wants, based on properties in 
> NUTCH_CONF_DIR/nutch-site.xml. I've tested it so far with
> 
> one POJO that uses jsoup to extract values from the page based on a CSS 
> selector specified in nutch-site.xml,
> 
> another POJO that takes a regex from nutch-site.xml and applies it to 
> the URL to determine how "deep" the URL directory structure goes for the 
> document,
> 
> and a third toy POJO to take multiple arguments from nutch-site.xml and 
> return their product. That last test was just to be sure the plug-in 
> would handle more than two arguments in the property value.
> 
> There's an optional boolean in the config to set whether to overwrite an 
> existing field, or (by default) add to it. Finally, I hacked a naming 
> convention and the way the plugin uses the setConf() call so the plugin 
> will accept configuration for multiple different POJOs to set multiple 
> fields in the NutchDocument. I didn't see any examples of a plugin 
> running more than once for each document quite that way, so I'm not sure 
> if this conforms to whatever canonical approach might exist.
> 
> I think of this plugin as a way to extend the reach of the plugin 
> architecture's flexibility out to POJO-land :) for anyone who 
> can't/won't for whatever reason write a plugin of their own. The POJOs 
> have to accept a String in a constructor, but they don't work on 
> NutchDocument or CrawlDatum or anything. I think if the plugin wants to 
> pass all that to a POJO for reflection, it's a clever way to waste time 
> when the work could be done in the plugin itself. For some subset of 
> indexing requirements, I think this could be useful to a wider set of 
> users. Still, I'm not a wider set of users, so I'm asking here.
> 
> NUTCH-585 has a lot of discussion about a concern similar to what this 
> jsoup example enables and Solr itself includes the 
> URLClassifierProcessor that addresses the same type of task that the 
> regex example shows, so is there any interest in this kind of 
> generalized plugin? Just from those examples, it could enable some 
> altered version of those capabilities. I've only built and tested with 
> the 1.19 branch and main branch code so far, and only with a Solr 9.2.1 
> cloud install, 'cause that's what I'm running, but if it seems 
> worthwhile to others, I'll beef up the documentation and write JUnit cases.
> 
>   Thanks, stay safe, stay healthy,
> 
>   Joe
> 
> 
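The plugin described in the quoted message is not shown in this thread, so the following is only a guess at the general shape of the approach: a POJO with a single-String constructor, instantiated by class name through reflection, here counting URL path depth with a regex. Every name in it (PojoFieldSketch, FieldValueSource, UrlDepth, load) is hypothetical, and it deliberately uses no Nutch classes.

```java
import java.lang.reflect.Constructor;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these names come from the plugin discussed above.
class PojoFieldSketch {

  // Assumed contract between the plugin and a user-supplied POJO.
  public interface FieldValueSource {
    String valueFor(String input);
  }

  // Example POJO: counts how often the configured regex matches the URL.
  public static class UrlDepth implements FieldValueSource {
    private final Pattern pattern;

    public UrlDepth(String regex) {
      this.pattern = Pattern.compile(regex);
    }

    @Override
    public String valueFor(String url) {
      Matcher matcher = pattern.matcher(url);
      int depth = 0;
      while (matcher.find()) {
        depth++;
      }
      return Integer.toString(depth);
    }
  }

  // Instantiates a POJO by class name through its single-String constructor.
  public static FieldValueSource load(String className, String argument)
      throws ReflectiveOperationException {
    Constructor<?> constructor =
        Class.forName(className).getConstructor(String.class);
    return (FieldValueSource) constructor.newInstance(argument);
  }

  public static void main(String[] args) throws ReflectiveOperationException {
    // In the plugin these two strings would presumably come from nutch-site.xml.
    String className = UrlDepth.class.getName();
    String regex = "(?<!/)/(?!/)"; // a single slash that is not part of "//"

    FieldValueSource source = load(className, regex);
    System.out.println(source.valueFor("https://example.org/a/b/c.html"));
  }
}
```

Keeping the POJO contract down to "String in, String out" is what lets the loader stay generic; anything that needs the NutchDocument or CrawlDatum is, as the message argues, better done in a plugin proper.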


[DISCUSS] Release Nutch 1.20

2024-03-07 Thread lewis john mcgibbney
Hi dev@,
As of today, 51 issues have been addressed in the 1.20 development drive.
https://issues.apache.org/jira/projects/NUTCH/versions/12352190
I would like to push a release soon and ship it to the user community.
Any objections?
Thank you
lewismc


[jira] [Closed] (NUTCH-3024) Remove flaky 'dependency check' target

2023-11-24 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3024.
---

> Remove flaky 'dependency check' target
> --
>
> Key: NUTCH-3024
> URL: https://issues.apache.org/jira/browse/NUTCH-3024
> Project: Nutch
>  Issue Type: Task
>  Components: build
>Affects Versions: 1.19
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> I [started a 
> thread|https://lists.apache.org/thread/ol3ssjphdqqxwsxhc65qoqg1dj1kjbxb] 
> covering my observations running the ant _*dependency-check*_ target. It 
> fails unpredictably in both GitHub actions and our trusty Jenkins builds on 
> ci-builds.apache.org.
> I propose to simply remove this target (and associated configuration) in a 
> bid to clean up some flaky legacy build code.





[jira] [Resolved] (NUTCH-3024) Remove flaky 'dependency check' target

2023-11-24 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3024.
-
Resolution: Fixed

> Remove flaky 'dependency check' target
> --
>
> Key: NUTCH-3024
> URL: https://issues.apache.org/jira/browse/NUTCH-3024
> Project: Nutch
>  Issue Type: Task
>  Components: build
>Affects Versions: 1.19
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> I [started a 
> thread|https://lists.apache.org/thread/ol3ssjphdqqxwsxhc65qoqg1dj1kjbxb] 
> covering my observations running the ant _*dependency-check*_ target. It 
> fails unpredictably in both GitHub actions and our trusty Jenkins builds on 
> ci-builds.apache.org.
> I propose to simply remove this target (and associated configuration) in a 
> bid to clean up some flaky legacy build code.





[jira] [Closed] (NUTCH-3007) Fix impossible casts

2023-11-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3007.
---

> Fix impossible casts
> 
>
> Key: NUTCH-3007
> URL: https://issues.apache.org/jira/browse/NUTCH-3007
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Spotbugs reports two occurrences of
>   Impossible cast from java.util.ArrayList to String[] in 
> org.apache.nutch.fetcher.Fetcher.run(Map, String)
> Both were introduced later into the {{run(Map args, String 
> crawlId)}} method and obviously never used (would throw a 
> ClassCastException). The code blocks should be removed.
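
For readers unfamiliar with this Spotbugs finding, the following is a minimal standalone sketch of the pattern, not the actual Fetcher code. The class name, the method names and the "urls" key are hypothetical; it only assumes, as the report states, that the stored value is an ArrayList while the code casts it to String[]. The issue itself proposes removing the dead blocks, so the safe variant is included purely for contrast.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not code from Fetcher.
class ImpossibleCastSketch {

  // The flagged pattern: the cast compiles because the static type is Object,
  // but it fails at runtime whenever the stored value is an ArrayList.
  public static String[] brokenLookup(Map<String, Object> args) {
    return (String[]) args.get("urls");
  }

  // A working alternative: treat the value as a List and copy it to an array.
  @SuppressWarnings("unchecked")
  public static String[] safeLookup(Map<String, Object> args) {
    List<String> urls = (List<String>) args.get("urls");
    return urls.toArray(new String[0]);
  }

  public static void main(String[] unused) {
    Map<String, Object> args = new HashMap<>();
    args.put("urls", new ArrayList<>(List.of("https://example.org/")));

    try {
      brokenLookup(args);
    } catch (ClassCastException e) {
      System.out.println("cast failed as expected");
    }
    System.out.println(safeLookup(args).length + " URL(s) read safely");
  }
}
```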





[jira] [Closed] (NUTCH-2846) Fix various bugs spotted by NUTCH-2815

2023-11-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2846.
---

> Fix various bugs spotted by NUTCH-2815
> --
>
> Key: NUTCH-2846
> URL: https://issues.apache.org/jira/browse/NUTCH-2846
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.19
>
>
> This issue addresses various bugs spotted by Spotbugs (NUTCH-2815):
> - use static method Integer.parseInt(...)
> - use integer arithmetic instead of floating point with rounding floats 
> afterwards
> - erroneous declaration of constructor in BasicURLNormalizer
> - fix bracketing when calculating hash code of CrawlDatum
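
The first two bug classes in this list can be illustrated generically. The sketch below is not the Nutch change set; the class and method names are hypothetical, and it only shows the kind of replacement Spotbugs usually suggests: parsing straight to a primitive, and doing ceiling division in integer arithmetic rather than via floating point and rounding.

```java
// Hypothetical illustration of two common Spotbugs-style fixes.
class SpotbugsFixSketch {

  // Parse directly to an int instead of creating a boxed Integer first.
  public static int parseCount(String value) {
    return Integer.parseInt(value.trim());
  }

  // Ceiling division using integers only.
  // Assumes total >= 0, size > 0 and that total + size does not overflow.
  public static int chunksNeeded(int total, int size) {
    return (total + size - 1) / size;
  }

  public static void main(String[] args) {
    System.out.println(parseCount(" 42 "));
    System.out.println(chunksNeeded(10, 3));
  }
}
```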





[jira] [Closed] (NUTCH-2852) Method invokes System.exit(...) 9 bugs

2023-11-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2852.
---

> Method invokes System.exit(...) 9 bugs
> --
>
> Key: NUTCH-2852
> URL: https://issues.apache.org/jira/browse/NUTCH-2852
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.18
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> org.apache.nutch.indexer.IndexingFiltersChecker since first historized release
> In class org.apache.nutch.indexer.IndexingFiltersChecker
> In method org.apache.nutch.indexer.IndexingFiltersChecker.run(String[])
> At IndexingFiltersChecker.java:[line 96]
> Another occurrence at IndexingFiltersChecker.java:[line 129]
> org.apache.nutch.indexer.IndexingFiltersChecker.run(String[]) invokes 
> System.exit(...), which shuts down the entire virtual machine
> Invoking System.exit shuts down the entire Java virtual machine. This should 
> only be done when it is appropriate. Such calls make it hard or impossible 
> for your code to be invoked by other code. Consider throwing a 
> RuntimeException instead.
> Also occurs in
>org.apache.nutch.net.URLFilterChecker since first historized release
>org.apache.nutch.net.URLNormalizerChecker since first historized release
>org.apache.nutch.parse.ParseSegment since first historized release
>org.apache.nutch.parse.ParserChecker since first historized release
>org.apache.nutch.service.NutchServer since first historized release
>org.apache.nutch.tools.CommonCrawlDataDumper since first historized release
>org.apache.nutch.tools.DmozParser since first historized release
>org.apache.nutch.util.AbstractChecker since first historized release 
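
A minimal sketch of the pattern Spotbugs recommends, assuming a checker-style tool with a run(String[]) method. The class name, usage message and exit codes are hypothetical and do not reflect how the Nutch classes listed above were actually changed.

```java
// Hypothetical illustration only; not the Nutch implementation.
final class CheckerSketch {

  // Report bad usage by throwing instead of calling System.exit(...),
  // so that other code can invoke run() without the JVM being shut down.
  static void run(String[] args) {
    if (args.length == 0) {
      throw new IllegalArgumentException("Usage: CheckerSketch <url>");
    }
    // ... the actual checking work would go here ...
  }

  // Translate the outcome into a status code. Only a command-line main()
  // would pass this value on to System.exit(...).
  static int toExitCode(String[] args) {
    try {
      run(args);
      return 0;
    } catch (IllegalArgumentException e) {
      System.err.println(e.getMessage());
      return 1;
    }
  }
}
```

Keeping the System.exit call in the entry point alone leaves run() usable from tests and from other tools.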





[jira] [Closed] (NUTCH-2819) Move spotbugs "installation" directory to avoid that spotbugs is shipped in Nutch runtime

2023-11-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2819.
---

> Move spotbugs "installation" directory to avoid that spotbugs is shipped in 
> Nutch runtime
> -
>
> Key: NUTCH-2819
> URL: https://issues.apache.org/jira/browse/NUTCH-2819
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Assignee: Shashanka Balakuntala Srinivasa
>Priority: Minor
> Fix For: 1.19
>
>
> With NUTCH-2816 the Spotbugs tool is "installed" in lib/. However, files in 
> lib/ are copied to build/ and runtime/. To avoid that the spotbugs jars are 
> shipped in runtime and eventually also releases, the spotbugs installation 
> folder should be moved into a different directory.





[jira] [Closed] (NUTCH-2851) Random object created and used only once

2023-11-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2851.
---

> Random object created and used only once
> 
>
> Key: NUTCH-2851
> URL: https://issues.apache.org/jira/browse/NUTCH-2851
> Project: Nutch
>  Issue Type: Sub-task
>  Components: dmoz, generator, indexer, segment
>Affects Versions: 1.18
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.19
>
>
> In class org.apache.nutch.crawl.Generator
> In method org.apache.nutch.crawl.Generator.partitionSegment(Path, Path, int)
> Called method java.util.Random.nextInt()
> At Generator.java:[line 1016]
> Random object created and used only once in 
> org.apache.nutch.crawl.Generator.partitionSegment(Path, Path, int)
> This code creates a java.util.Random object, uses it to generate one random 
> number, and then discards the Random object. This produces mediocre quality 
> random numbers and is inefficient. If possible, rewrite the code so that the 
> Random object is created once and saved, and each time a new random number is 
> required invoke a method on the existing Random object to obtain it.
> If it is important that the generated Random numbers not be guessable, you 
> must not create a new Random for each random number; the values are too 
> easily guessable. You should strongly consider using a 
> java.security.SecureRandom instead (and avoid allocating a new SecureRandom 
> for each random number needed).
> This bad practice also affects the following
> org.apache.nutch.indexer.IndexingJob since first historized release
> org.apache.nutch.segment.SegmentReader since first historized release
> org.apache.nutch.tools.DmozParser$RDFProcessor since first historized release 
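
The advice above can be sketched as follows, assuming the random value does not need to be unguessable. The class and method names are hypothetical; this is not the change made to Generator.

```java
import java.util.Random;

// Hypothetical illustration only; not the Nutch implementation.
final class SeedSketch {

  // Created once and reused for every call, instead of calling
  // new Random().nextInt() each time a number is needed.
  private static final Random RANDOM = new Random();

  // Returns a non-negative pseudo-random int in [0, Integer.MAX_VALUE).
  static int nextSeed() {
    return RANDOM.nextInt(Integer.MAX_VALUE);
  }
}
```

If the values must not be guessable, a single shared java.security.SecureRandom could be used in the same way.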





[jira] [Closed] (NUTCH-2850) Method ignores exceptional return value

2023-11-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2850.
---

> Method ignores exceptional return value
> ---
>
> Key: NUTCH-2850
> URL: https://issues.apache.org/jira/browse/NUTCH-2850
> Project: Nutch
>  Issue Type: Sub-task
>  Components: dumpers
>Affects Versions: 1.18
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.19
>
>
> In class org.apache.nutch.tools.FileDumper
> In method org.apache.nutch.tools.FileDumper.dump(File, File, String[], 
> boolean, boolean, boolean)
> Called method java.io.File.mkdirs()
> At FileDumper.java:[line 237]
> Exceptional return value of java.io.File.mkdirs() ignored in 
> org.apache.nutch.tools.FileDumper.dump(File, File, String[], boolean, 
> boolean, boolean)
> This method returns a value that is not checked. The return value should be 
> checked since it can indicate an unusual or unexpected function execution. 
> For example, the File.delete() method returns false if the file could not be 
> successfully deleted (rather than throwing an Exception). If you don't check 
> the result, you won't notice if the method invocation signals unexpected 
> behavior by returning an atypical return value. 
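
A minimal sketch of checking the return value of java.io.File.mkdirs(). The helper name is hypothetical and this is not the code that went into FileDumper.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical illustration only; not the Nutch implementation.
final class DirSketch {

  // mkdirs() returns false both when creation fails and when the directory
  // already exists, so isDirectory() is re-checked before reporting an error.
  static void ensureDirectory(File dir) throws IOException {
    if (!dir.mkdirs() && !dir.isDirectory()) {
      throw new IOException("Could not create directory: " + dir);
    }
  }
}
```

Calling the helper twice on the same path is harmless, because an existing directory is not treated as a failure.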





[jira] [Created] (NUTCH-3024) Remove flaky 'dependency check' target

2023-11-03 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3024:
---

 Summary: Remove flaky 'dependency check' target
 Key: NUTCH-3024
 URL: https://issues.apache.org/jira/browse/NUTCH-3024
 Project: Nutch
  Issue Type: Task
  Components: build
Affects Versions: 1.19
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


I [started a 
thread|https://lists.apache.org/thread/ol3ssjphdqqxwsxhc65qoqg1dj1kjbxb] 
covering my observations running the ant _*dependency-check*_ target. It fails 
unpredictably in both GitHub actions and our trusty Jenkins builds on 
ci-builds.apache.org.

I propose to simply remove this target (and associated configuration) in a bid 
to clean up some flaky legacy build code.





Removing “dependency-check” target from build.xml

2023-11-03 Thread lewis john mcgibbney
Hi dev@,

Recently I was doing a bit of work on CI and made an attempt to activate
the “dependency-check” target (previously named “report-vulnerabilities”).

It appears that the underlying “dependency-check” tooling is flaky at best.
It appears to take an awful long time to execute and seems to be prone to
hanging.

I propose to remove this target and implement something more stable in the
future… when I work on finishing the Gradle build.

lewismc


[jira] [Created] (NUTCH-3023) Use mikepenz/action-junit-report to improve interpretation of failed tests during CI

2023-11-02 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3023:
---

 Summary: Use mikepenz/action-junit-report to improve 
interpretation of failed tests during CI
 Key: NUTCH-3023
 URL: https://issues.apache.org/jira/browse/NUTCH-3023
 Project: Nutch
  Issue Type: Task
  Components: build, test
Affects Versions: 1.19
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


The following GitHub action could help improve the interpretation of unit test 
anomalies during a CI run.

[https://github.com/mikepenz/action-junit-report]

Rather than having to grep through the GitHub Action log, one could save time 
by interpreting the comments posted to the PR conversation thread.





[jira] [Closed] (NUTCH-3014) Standardize Job names

2023-11-02 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3014.
---

Thanks [~snagel] for the review

> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, runtime
>Affects Versions: 1.19
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name:
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _{*}Nutch ${ClassName}{*}: *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank: Inverter_
>  * _Nutch CrawlDb: + $crawldb_
>  * _Nutch LinkDbReader: + $linkdb_
> Thanks for any suggestions/comments.
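
The proposed convention could be captured in a small helper along these lines. This is a sketch under stated assumptions: the JobNames class and its of(...) method are hypothetical and are not part of Nutch.

```java
// Hypothetical illustration only; not part of the Nutch codebase.
final class JobNames {

  // Builds "Nutch <ClassName>" or "Nutch <ClassName>: <additional info>".
  static String of(Class<?> jobClass, String additionalInfo) {
    String base = "Nutch " + jobClass.getSimpleName();
    if (additionalInfo == null || additionalInfo.isEmpty()) {
      return base;
    }
    return base + ": " + additionalInfo;
  }
}
```

A caller might then write job.setJobName(JobNames.of(LinkRank.class, "Inverter")), which would yield "Nutch LinkRank: Inverter".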





[jira] [Resolved] (NUTCH-3014) Standardize Job names

2023-11-02 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3014.
-
Resolution: Fixed

> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, runtime
>Affects Versions: 1.19
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name:
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _{*}Nutch ${ClassName}{*}: *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank: Inverter_
>  * _Nutch CrawlDb: + $crawldb_
>  * _Nutch LinkDbReader: + $linkdb_
> Thanks for any suggestions/comments.





[jira] [Created] (NUTCH-3022) Experiment formatting codebase per google-java-format

2023-11-02 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3022:
---

 Summary: Experiment formatting codebase per google-java-format
 Key: NUTCH-3022
 URL: https://issues.apache.org/jira/browse/NUTCH-3022
 Project: Nutch
  Issue Type: Task
  Components: build
Affects Versions: 1.19
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


I [started a mailing list 
thread|https://lists.apache.org/thread/ssmm6djyk5syvhmq701zjf0d9bobpk5n] which 
quizzed whether we should integrate code linting/formatting into the CI.

Seb provided some excellent, calculated input which inspired me to create this 
ticket.

I will create a PR which lints the Nutch codebase per the *google-java-format* 
and discuss the results.





[jira] [Work stopped] (NUTCH-3014) Standardize Job names

2023-11-02 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3014 stopped by Lewis John McGibbney.
---
> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, runtime
>Affects Versions: 1.19
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name:
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _{*}Nutch ${ClassName}{*}: *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank: Inverter_
>  * _Nutch CrawlDb: + $crawldb_
>  * _Nutch LinkDbReader: + $linkdb_
> Thanks for any suggestions/comments.





Re: Nutch codebase formatting

2023-11-02 Thread Lewis John McGibbney
Thanks Seb. I'll go ahead and try to build in the google Java format via 
super-linter and see where we get...!
lewismc

On 2023/10/29 17:04:47 Sebastian Nagel wrote:
> Hi Lewis,
> 
>  >> whether we need a Nutch custom code style at all… why don’t we just use
>  >> some other existing style and then enforce it?
> 
> Enforcing: yes!
> 
> However, I would try hard to keep the changes on a reasonable minimum. For 
> example, if we change the indentation, almost every code line is affected 
> which 
> makes
> - "git annotate" mostly useless (or more difficult to use because you need 
> look
>back)
> - merges of open PRs, custom patches or modifications in custom repositories
>might get quite painful, until the formatting is synchronized.
> 
> 
>  >> * google Java format [1] which offers a GitHub action for easy integration
>  >> into our CI process, or
> 
> +1
> 
> + available also for Intellij, Eclipse
> + indentation stays the same
> +/- about 25% of the code lines are changed (might be acceptable)
> 
> 
>  >> * superlinter [3] basically emerging as the industry OSS default, offers a
>  >> GitHub action and could also be configured to lint dockerfile, and other
>  >> artifacts. It can also be configured to use the google Java style as well…
> 
> +1 (with Google Java style)
> 
> 
>  > I’ll submit a PR for superlinter so everyone can see what it would look 
> like.
> 
> Great! Thanks!
> 
> 
> Best,
> Sebastian
> 
> On 10/29/23 00:38, Lewis John McGibbney wrote:
> > Any thoughts on this, folks?
> > I’ll submit a PR for superlinter so everyone can see what it would look 
> > like.
> > lewismc
> > 
> > On 2023/10/23 19:28:45 lewis john mcgibbney wrote:
> >> Hi dev@,
> >>
> >> For the longest time the Nutch codebase has shipped with a
> >> eclipse-codeformat.xml [0] file.
> >> Whilst this has been largely successful in keeping the codebase uniform, it
> >> cannot/has not been integrated into continuous integration (CI)  and
> >> subsequently not really enforced!
> >>
> >> Whilst I’m a big fan of “if it ain’t broken don’t fix it”, I think we
> >> should have some CI code formatting checks. Additionally I really question
> >> whether we need a Nutch custom code style at all… why don’t we just use
> >> some other existing style and then enforce it?
> >>
> >> I therefore propose that we replace the legacy code formatter with a
> >> convention such as
> >>
> >> * google Java format [1] which offers a GitHub action for easy integration
> >> into our CI process, or
> >> * check style [2] which offers an Ant task which we could use, this is of
> >> less utility as we think about the move to Gradle
> >> * superlinter [3] basically emerging as the industry OSS default, offers a
> >> GitHub action and could also be configured to lint dockerfile, and other
> >> artifacts. It can also be configured to use the google Java style as well…
> >>
> >> My preference would be [3] because it offers a more comprehensive linting
> >> package for the entire codebase not just the Java code.
> >>
> >> Thanks for your consideration.
> >> lewismc
> >>
> >> [0]
> >> https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml
> >> [1]
> >> https://github.com/google/google-java-format
> >> [2]
> >> https://checkstyle.sourceforge.io/
> >> [3]
> >> https://github.com/marketplace/actions/super-linter
> >>
> 


Re: Nutch codebase formatting

2023-10-28 Thread Lewis John McGibbney
Any thoughts on this, folks?
I’ll submit a PR for superlinter so everyone can see what it would look like.
lewismc 

On 2023/10/23 19:28:45 lewis john mcgibbney wrote:
> Hi dev@,
> 
> For the longest time the Nutch codebase has shipped with a
> eclipse-codeformat.xml [0] file.
> Whilst this has been largely successful in keeping the codebase uniform, it
> cannot/has not been integrated into continuous integration (CI)  and
> subsequently not really enforced!
> 
> Whilst I’m a big fan of “if it ain’t broken don’t fix it”, I think we
> should have some CI code formatting checks. Additionally I really question
> whether we need a Nutch custom code style at all… why don’t we just use
> some other existing style and then enforce it?
> 
> I therefore propose that we replace the legacy code formatter with a
> convention such as
> 
> * google Java format [1] which offers a GitHub action for easy integration
> into our CI process, or
> * check style [2] which offers an Ant task which we could use, this is of
> less utility as we think about the move to Gradle
> * superlinter [3] basically emerging as the industry OSS default, offers a
> GitHub action and could also be configured to lint dockerfile, and other
> artifacts. It can also be configured to use the google Java style as well…
> 
> My preference would be [3] because it offers a more comprehensive linting
> package for the entire codebase not just the Java code.
> 
> Thanks for your consideration.
> lewismc
> 
> [0]
> https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml
> [1]
> https://github.com/google/google-java-format
> [2]
> https://checkstyle.sourceforge.io/
> [3]
> https://github.com/marketplace/actions/super-linter
> 


[jira] [Work stopped] (NUTCH-3015) Add more CI steps to GitHub master-build.yml

2023-10-27 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3015 stopped by Lewis John McGibbney.
---
> Add more CI steps to GitHub master-build.yml
> 
>
> Key: NUTCH-3015
> URL: https://issues.apache.org/jira/browse/NUTCH-3015
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.19
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> With specific reference to the GitHub master-build.yml, we currently run 
> _*ant clean nightly javadoc -buildfile build.xml*_ as one mammoth task and if 
> something fails it is unclear as to exactly what.
>  
> There are several improvements I want to propose to the GitHub CI
>  * run workflows in multiple environments/OSes, e.g. ubuntu, macos & 
> windows
>  * define multiple jobs which can run in parallel to speed up CI e.g. javadoc 
> and nightly targets
>  * run more targets e.g. linting, rat-sources, report-vulnerabilities, 
> report-licenses, etc.





[jira] [Closed] (NUTCH-3015) Add more CI steps to GitHub master-build.yml

2023-10-27 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3015.
---

> Add more CI steps to GitHub master-build.yml
> 
>
> Key: NUTCH-3015
> URL: https://issues.apache.org/jira/browse/NUTCH-3015
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.19
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> With specific reference to the GitHub master-build.yml, we currently run 
> _*ant clean nightly javadoc -buildfile build.xml*_ as one mammoth task and if 
> something fails it is unclear as to exactly what.
>  
> There are several improvements I want to propose to the GitHub CI
>  * run workflows in multiple environments/OSes, e.g. ubuntu, macos & 
> windows
>  * define multiple jobs which can run in parallel to speed up CI e.g. javadoc 
> and nightly targets
>  * run more targets e.g. linting, rat-sources, report-vulnerabilities, 
> report-licenses, etc.





[jira] [Resolved] (NUTCH-3015) Add more CI steps to GitHub master-build.yml

2023-10-27 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3015.
-
Resolution: Fixed

> Add more CI steps to GitHub master-build.yml
> 
>
> Key: NUTCH-3015
> URL: https://issues.apache.org/jira/browse/NUTCH-3015
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.19
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> With specific reference to the GitHub master-build.yml, we currently run 
> _*ant clean nightly javadoc -buildfile build.xml*_ as one mammoth task and if 
> something fails it is unclear as to exactly what.
>  
> There are several improvements I want to propose to the GitHub CI
>  * run workflows in multiple environments/OSes, e.g. ubuntu, macos & 
> windows
>  * define multiple jobs which can run in parallel to speed up CI e.g. javadoc 
> and nightly targets
>  * run more targets e.g. linting, rat-sources, report-vulnerabilities, 
> report-licenses, etc.




