[jira] [Closed] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-05-15 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3041.
---

> Address confusing logging in o.a.n.net.URLExemptionFilters
> ---
>
> Key: NUTCH-3041
> URL: https://issues.apache.org/jira/browse/NUTCH-3041
> Project: Nutch
> Issue Type: Task
> Components: net
> Affects Versions: 1.19, 1.20
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.21
>
>
> URLExemptionFilter implementations are used to allow exemptions to external
> domain resources by overriding the {{db.ignore.external.links}} configuration
> setting. This is useful when the crawl is focused on a domain but resources
> like images are hosted on a CDN.
> Currently
> [URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
> provides the following logging:
> {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor
> #0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> {quote}
> I find this confusing. It would be better to log *only* if a
> URLExemptionFilter implementation is actually configured to be used at
> runtime.
> I will provide a patch for this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Work stopped] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-05-15 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3041 stopped by Lewis John McGibbney.
---


[jira] [Resolved] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-05-15 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3041.
-
Resolution: Fixed



[jira] [Closed] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3054.
---

> Address deprecation of Node16 for all GitHub Actions
> ---
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
> Issue Type: Task
> Components: ci/cd
> Affects Versions: 1.20
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.21
>
>
> See
> [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in
> [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>
> Patch coming up





[jira] [Resolved] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3054.
-
Resolution: Fixed



[jira] [Updated] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-29 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3054:

Affects Version/s: 1.20



[jira] [Created] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3054:
---

 Summary: Address deprecation of Node16 for all GitHub Actions
 Key: NUTCH-3054
 URL: https://issues.apache.org/jira/browse/NUTCH-3054
 Project: Nutch
  Issue Type: Task
  Components: ci/cd
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.21


See 
[https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]

We need to upgrade the setup-java action in  
[https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
 

Patch coming up





[jira] [Work started] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-29 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3054 started by Lewis John McGibbney.
---





[jira] [Commented] (NUTCH-3049) Investigate using Records

2024-04-29 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842208#comment-17842208
 ] 

Lewis John McGibbney commented on NUTCH-3049:
-

I think that each of the Writable classes listed in NutchWritable may be
fair game:

{{        org.apache.nutch.crawl.CrawlDatum.class,}}
{{        org.apache.nutch.crawl.Inlink.class,}}
{{        org.apache.nutch.crawl.Inlinks.class,}}
{{        org.apache.nutch.indexer.NutchIndexAction.class,}}
{{        org.apache.nutch.metadata.Metadata.class,}}
{{        org.apache.nutch.parse.Outlink.class,}}
{{        org.apache.nutch.parse.ParseText.class,}}
{{        org.apache.nutch.parse.ParseData.class,}}
{{        org.apache.nutch.parse.ParseImpl.class,}}
{{        org.apache.nutch.parse.ParseStatus.class,}}
{{        org.apache.nutch.protocol.Content.class,}}
{{        org.apache.nutch.protocol.ProtocolStatus.class,}}
{{        org.apache.nutch.scoring.webgraph.LinkDatum.class,}}
{{        org.apache.nutch.hostdb.HostDatum.class}}
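
As a rough illustration of what a record could look like for one of these value types, here is a minimal sketch. {{LinkRecord}} and {{RecordSketch}} are hypothetical names and are not Nutch classes. One caveat worth documenting in this ticket: Hadoop's {{Writable}} contract includes {{readFields(DataInput)}}, which fills in an existing instance, while record fields are final. The sketch therefore assumes a static {{read}} factory instead, which is a different contract from {{Writable}}.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical value type: a link target plus its anchor text.
record LinkRecord(String toUrl, String anchor) {

  // Compact constructor: validate and normalise before the fields are set.
  LinkRecord {
    if (toUrl == null || toUrl.isBlank()) {
      throw new IllegalArgumentException("toUrl must not be blank");
    }
    anchor = (anchor == null) ? "" : anchor;
  }

  // Serialise both components in declaration order.
  void write(DataOutput out) throws IOException {
    out.writeUTF(toUrl);
    out.writeUTF(anchor);
  }

  // Static factory used instead of a mutating readFields(DataInput).
  static LinkRecord read(DataInput in) throws IOException {
    return new LinkRecord(in.readUTF(), in.readUTF());
  }
}

final class RecordSketch {

  // Writes the record to a byte array and reads it back.
  static LinkRecord roundTrip(LinkRecord link) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      link.write(new DataOutputStream(bytes));
      return LinkRecord.read(
          new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    LinkRecord link = new LinkRecord("https://example.org/", "Example");
    // Records generate equals(), hashCode() and toString() from the components.
    System.out.println(link.equals(roundTrip(link)));
  }
}
```

Whether any of the classes above can drop in-place deserialisation is exactly what this ticket would need to establish.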

> Investigate using Records
> ---
>
> Key: NUTCH-3049
> URL: https://issues.apache.org/jira/browse/NUTCH-3049
> Project: Nutch
> Issue Type: Sub-task
> Reporter: Lewis John McGibbney
> Priority: Major
>
> Guidance at [https://www.baeldung.com/java-migrate-8-to-17#records]
> I think there are multiple areas where we could use Records. This ticket will
> document the opportunities and structure that work.





[jira] [Created] (NUTCH-3053) Upgrade build and CI to JDK17

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3053:
---

 Summary: Upgrade build and CI to JDK17
 Key: NUTCH-3053
 URL: https://issues.apache.org/jira/browse/NUTCH-3053
 Project: Nutch
  Issue Type: Sub-task
  Components: build, ci/cd
Reporter: Lewis John McGibbney


This will involve changes to
 * [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
 * [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/]
 * [https://github.com/apache/nutch/blob/master/default.properties#L46]
 * [https://github.com/apache/nutch/blob/master/default.properties#L57]
 * We should also investigate any deprecation notices in the build output
 * [https://github.com/apache/nutch/blob/master/ivy/mvn.template#L128-L129]





[jira] [Created] (NUTCH-3052) Investigate using sealed classes

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3052:
---

 Summary: Investigate using sealed classes
 Key: NUTCH-3052
 URL: https://issues.apache.org/jira/browse/NUTCH-3052
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Guidance available at 
[https://www.baeldung.com/java-migrate-8-to-17#sealed-classes]

First document if and where sealed classes would add value.
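
To make the evaluation concrete, here is a minimal sketch of a sealed hierarchy. All type names ({{FetchOutcome}}, {{Fetched}}, {{Redirected}}, {{Failed}}, {{SealedSketch}}) are hypothetical and do not exist in Nutch; they only show the kind of closed set of subtypes where sealing might add value.

```java
// Hypothetical closed hierarchy: only the three permitted types may implement it.
sealed interface FetchOutcome permits Fetched, Redirected, Failed {}

// Records are implicitly final, which satisfies the sealed contract.
record Fetched(String url, int httpStatus) implements FetchOutcome {}
record Redirected(String url, String target) implements FetchOutcome {}
record Failed(String url, String reason) implements FetchOutcome {}

final class SealedSketch {

  // Handles every permitted subtype; the final throw guards against a
  // subtype being added to the permits list without updating this method.
  static String describe(FetchOutcome outcome) {
    if (outcome instanceof Fetched f) {
      return "fetched " + f.url() + " (" + f.httpStatus() + ")";
    } else if (outcome instanceof Redirected r) {
      return "redirected " + r.url() + " -> " + r.target();
    } else if (outcome instanceof Failed x) {
      return "failed " + x.url() + ": " + x.reason();
    }
    throw new IllegalStateException("unhandled outcome: " + outcome);
  }

  public static void main(String[] args) {
    System.out.println(describe(new Fetched("https://example.org/", 200)));
    System.out.println(describe(new Failed("https://example.org/x", "timeout")));
  }
}
```

A plugin extension point that third parties implement would be a poor candidate, since sealing prevents outside implementations.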





[jira] [Created] (NUTCH-3051) Investigate using new pattern matching syntax in switch expressions

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3051:
---

 Summary: Investigate using new pattern matching syntax in switch 
expressions
 Key: NUTCH-3051
 URL: https://issues.apache.org/jira/browse/NUTCH-3051
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Guidance available at 
[https://www.baeldung.com/java-migrate-8-to-17#2-switch-expressions]

Apparently we use switch in 35 files:

[https://github.com/search?q=repo%3Aapache%2Fnutch+switch+language%3AJava&type=code&l=Java]
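
For reference, a minimal before/after sketch of the kind of rewrite involved. The {{Status}} enum and method names are hypothetical and not taken from the Nutch code base. The sketch deliberately uses only switch expressions with arrow labels (standard since Java 14); type patterns inside switch are still a preview feature in JDK 17, so they are left out here.

```java
final class SwitchSketch {

  // Hypothetical status values, for illustration only.
  enum Status { UNFETCHED, FETCHED, GONE, REDIR_TEMP, REDIR_PERM }

  // Classic switch statement: needs a mutable local and explicit breaks.
  static String labelOld(Status s) {
    String label;
    switch (s) {
      case FETCHED:
        label = "ok";
        break;
      case REDIR_TEMP:
      case REDIR_PERM:
        label = "redirect";
        break;
      default:
        label = "other";
    }
    return label;
  }

  // Switch expression: yields a value, no fall-through, labels can be grouped.
  static String labelNew(Status s) {
    return switch (s) {
      case FETCHED -> "ok";
      case REDIR_TEMP, REDIR_PERM -> "redirect";
      default -> "other";
    };
  }

  public static void main(String[] args) {
    for (Status s : Status.values()) {
      System.out.println(s + " -> " + labelNew(s));
    }
  }
}
```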





[jira] [Created] (NUTCH-3050) Investigate use of the enhanced instanceof operator

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3050:
---

 Summary: Investigate use of the enhanced instanceof operator
 Key: NUTCH-3050
 URL: https://issues.apache.org/jira/browse/NUTCH-3050
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Guidance at 
[https://www.baeldung.com/java-migrate-8-to-17#1-enhanced-instanceof-operator]

Apparently we use the instanceof operator in 50 files:

[https://github.com/search?q=repo%3Aapache%2Fnutch%20instanceof&type=code]
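
A minimal sketch of the change in question, using a hypothetical helper rather than real Nutch code: the pattern form (Java 16+) binds the cast variable in the same expression, so the separate cast line goes away.

```java
final class InstanceofSketch {

  // Pre-Java 16 style: test, then cast on a separate line.
  static int lengthOld(Object value) {
    if (value instanceof CharSequence) {
      CharSequence cs = (CharSequence) value;
      return cs.length();
    }
    return -1;
  }

  // Pattern matching for instanceof: 'cs' is in scope only where the test holds.
  static int lengthNew(Object value) {
    if (value instanceof CharSequence cs) {
      return cs.length();
    }
    return -1;
  }

  public static void main(String[] args) {
    System.out.println(lengthNew("nutch"));
    System.out.println(lengthNew(42));
  }
}
```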





[jira] [Created] (NUTCH-3049) Investigate using Records

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3049:
---

 Summary: Investigate using Records
 Key: NUTCH-3049
 URL: https://issues.apache.org/jira/browse/NUTCH-3049
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Guidance at [https://www.baeldung.com/java-migrate-8-to-17#records]

I think there are multiple areas where we could use Records. This ticket will 
document the opportunities and structure that work.





[jira] [Created] (NUTCH-3048) Investigate where/if new string utility methods could be used

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3048:
---

 Summary: Investigate where/if new string utility methods could be 
used
 Key: NUTCH-3048
 URL: https://issues.apache.org/jira/browse/NUTCH-3048
 Project: Nutch
  Issue Type: Sub-task
  Components: util
Reporter: Lewis John McGibbney


Guidance at [https://www.baeldung.com/java-migrate-8-to-17#3-new-string-methods]

We may also be able to revisit our usage of commons-* libraries with the goal of 
using native methods from the JDK.
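
As a starting point, here is a minimal sketch of the JDK 11 String methods that might stand in for some helper calls. The class and method names are hypothetical, and whether a given commons-lang call in Nutch can be replaced one-for-one would need checking case by case (null handling in particular differs, as the first method shows).

```java
import java.util.List;
import java.util.stream.Collectors;

final class StringMethodsSketch {

  // String.isBlank() is not null-safe, so the null check stays explicit.
  static boolean hasText(String s) {
    return s != null && !s.isBlank();
  }

  // lines() splits on line terminators; strip() is the Unicode-aware trim.
  static List<String> nonBlankLines(String text) {
    return text.lines()
        .map(String::strip)
        .filter(line -> !line.isBlank())
        .collect(Collectors.toList());
  }

  // repeat(n) replaces manual padding loops.
  static String indentBy(int n, String s) {
    return " ".repeat(n) + s;
  }

  public static void main(String[] args) {
    System.out.println(nonBlankLines("  a \n\n b\n"));
    System.out.println(indentBy(2, "x"));
  }
}
```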





[jira] [Created] (NUTCH-3047) Use multi-line text blocks

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3047:
---

 Summary: Use multi-line text blocks
 Key: NUTCH-3047
 URL: https://issues.apache.org/jira/browse/NUTCH-3047
 Project: Nutch
  Issue Type: Sub-task
  Components: CLI
Reporter: Lewis John McGibbney


Guidance available at 
[https://www.baeldung.com/java-migrate-8-to-17#2-text-block]

This will help to clean up our CLI *usage()* messages at a bare minimum.
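
A minimal sketch of what that cleanup could look like. The tool name and options below are hypothetical and not copied from any Nutch class; the point is only that a text block (Java 15+) can produce the same string as the concatenated form.

```java
final class UsageSketch {

  // Concatenated form: every line needs quotes, '+' and an explicit "\n".
  static String usageConcatenated() {
    return "Usage: ExampleTool <input> [-force] [-threads <n>]\n"
        + "  <input>        path to read from\n"
        + "  -force         overwrite existing output\n"
        + "  -threads <n>   number of worker threads\n";
  }

  // Text block form: incidental indentation is stripped relative to the
  // least-indented line, here the closing delimiter and the "Usage" line.
  static String usageTextBlock() {
    return """
        Usage: ExampleTool <input> [-force] [-threads <n>]
          <input>        path to read from
          -force         overwrite existing output
          -threads <n>   number of worker threads
        """;
  }

  public static void main(String[] args) {
    System.out.print(usageTextBlock());
  }
}
```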



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3046) Use compact strings

2024-04-29 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3046:

Description: 
Follow the guidance at 
[https://www.baeldung.com/java-migrate-8-to-17#1-compact-string]

It looks like there are 9 instances where we use _*char[]*_:

[https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D&type=code]

  was:
Follow the guidance at 
[https://www.baeldung.com/java-migrate-8-to-17#1-compact-string]

It looks like there are [9 instances where we use 
char[]|https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D&type=code].


> Use compact strings
> ---
>
> Key: NUTCH-3046
> URL: https://issues.apache.org/jira/browse/NUTCH-3046
> Project: Nutch
> Issue Type: Sub-task
> Reporter: Lewis John McGibbney
> Priority: Major
>
> Follow the guidance at
> [https://www.baeldung.com/java-migrate-8-to-17#1-compact-string]
> It looks like there are 9 instances where we use _*char[]*_:
> [https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D&type=code]





[jira] [Created] (NUTCH-3046) Use compact strings

2024-04-28 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3046:
---

 Summary: Use compact strings
 Key: NUTCH-3046
 URL: https://issues.apache.org/jira/browse/NUTCH-3046
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Follow the guidance at 
[https://www.baeldung.com/java-migrate-8-to-17#1-compact-string]

It looks like there are [9 instances where we use 
char[]|https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D&type=code].





[jira] [Created] (NUTCH-3045) Upgrade from Java 11 to 17

2024-04-28 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3045:
---

 Summary: Upgrade from Java 11 to 17
 Key: NUTCH-3045
 URL: https://issues.apache.org/jira/browse/NUTCH-3045
 Project: Nutch
  Issue Type: Task
  Components: build, ci/cd
Reporter: Lewis John McGibbney
 Fix For: 1.21


This parent issue will track and organize work pertaining to upgrading Nutch to 
JDK 17.

Premier support for Oracle JDK 11 ended 7 months ago (30 Sep 2023).





[jira] [Updated] (NUTCH-3042) Use GitHub cache action to improve CI execution time

2024-04-19 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3042:

Description: 
With the Ant+Ivy build architecture, the current GitHub Actions workflow can 
and regularly does take over 20 minutes to complete. Dependency retrieval takes 
a significant amount of time.

I think we can address the above issue and dramatically reduce the CI runtime 
by utilizing the official [GitHub cache 
action|https://github.com/actions/cache].

It appears, however, that the action does not support the Apache Ivy cache. Both 
Maven and Gradle are supported. I [created a 
discussion|https://github.com/actions/cache/discussions/1381] to get 
confirmation.

If we cannot implement a cache for the Ivy build system, then we 
will need to come back to this issue once we migrate to Gradle.

  was:
With the Ant+Ivy build architecture, the current GitHub Actions workflow can 
and regularly does take over 20 minutes to complete. Dependency retrieval takes 
a significant amount of time.

I think we can address the above issue and dramatically reduce the CI runtime 
by utilizing the official [GitHub cache 
action|https://github.com/actions/cache].

It appears, however, that the action does not support the Apache Ivy cache. Both 
Maven and Gradle are supported. I created a discussion to get confirmation that 
this is the case.

If we cannot implement a cache for the Ivy build system, then we 
will need to come back to this issue once we migrate to Gradle.


> Use GitHub cache action to improve CI execution time
> ---
>
> Key: NUTCH-3042
> URL: https://issues.apache.org/jira/browse/NUTCH-3042
> Project: Nutch
> Issue Type: Task
> Components: ci/cd
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.21
>
>
> With the Ant+Ivy build architecture, the current GitHub Actions workflow can
> and regularly does take over 20 minutes to complete. Dependency retrieval
> takes a significant amount of time.
> I think we can address the above issue and dramatically reduce the CI runtime
> by utilizing the official [GitHub cache
> action|https://github.com/actions/cache].
> It appears, however, that the action does not support the Apache Ivy cache.
> Both Maven and Gradle are supported. I [created a
> discussion|https://github.com/actions/cache/discussions/1381] to get
> confirmation.
> If we cannot implement a cache for the Ivy build system, then we
> will need to come back to this issue once we migrate to Gradle.





[jira] [Created] (NUTCH-3042) Use GitHub cache action to improve CI execution time

2024-04-19 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3042:
---

 Summary: Use GitHub cache action to improve CI execution time
 Key: NUTCH-3042
 URL: https://issues.apache.org/jira/browse/NUTCH-3042
 Project: Nutch
  Issue Type: Task
  Components: ci/cd
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.21


With the Ant+Ivy build architecture, the current GitHub Actions workflow can 
and regularly does take over 20 minutes to complete. Dependency retrieval takes 
a significant amount of time.

I think we can address the above issue and dramatically reduce the CI runtime 
by utilizing the official [GitHub cache 
action|https://github.com/actions/cache].

It appears, however, that the action does not support the Apache Ivy cache. Both 
Maven and Gradle are supported. I created a discussion to get confirmation that 
this is the case.

If we cannot implement a cache for the Ivy build system, then we 
will need to come back to this issue once we migrate to Gradle.





[jira] [Work started] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-04-19 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3041 started by Lewis John McGibbney.
---


[jira] [Updated] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-04-19 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3041:

Description: 
URLExemptionFilter implementations are used to allow exemptions to external 
domain resources by overriding the {{db.ignore.external.links}} configuration 
setting. This is useful when the crawl is focused on a domain but resources 
like images are hosted on a CDN.

Currently 
[URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
provides the following logging:
{quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor 
#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
{quote}
I find this confusing. It would be better to log *only* if a 
URLExemptionFilter implementation is actually configured to be used at runtime.

I will provide a patch for this.

  was:
URLExemptionFilter implementations are used to allow exemptions to external 
domain resources by overriding the {{db.ignore.external.links}} configuration 
setting. This is useful when the crawl is focused on a domain but resources 
like images are hosted on a CDN.

Currently 
[URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
provides the following logging:
{quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor 
#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
{quote}
I find this confusing. It would be better to log *only* if a 
URLExemptionFilter implementation actually exists for a given URL.

I will provide a patch for this.




[jira] [Updated] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-04-19 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3041:

Description: 
URLExemptionFilter implementations are used to allow exemptions to external 
domain resources by overriding the {{db.ignore.external.links}} configuration 
setting. This is useful when the crawl is focused on a domain but resources 
like images are hosted on a CDN.

Currently 
[URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
provides the following logging:
{quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor 
#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
{quote}
I find this confusing. It would be better to log *only* if a 
URLExemptionFilter implementation actually exists for a given URL.

I will provide a patch for this.

  was:
URLExemptionFilter implementations are used to allow exemptions to external 
domain resources by overriding the {{db.ignore.external.links}} configuration 
setting. This is useful when the crawl is focused on a domain but resources 
like images are hosted on a CDN.

Currently 
[URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
 provides some confusing INFO-level logging:
{quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] 
Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
{quote}
I find this confusing. It would be better to log *only* if a 
URLExemptionFilter implementation actually exists for a given URL.

I will provide a patch for this.







[jira] [Created] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-04-19 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3041:
---

 Summary: Address confusing logging in 
o.a.n.net.URLExemptionFilters 
 Key: NUTCH-3041
 URL: https://issues.apache.org/jira/browse/NUTCH-3041
 Project: Nutch
  Issue Type: Task
  Components: net
Affects Versions: 1.19, 1.20
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.21


URLExemptionFilter implementations are used to allow exemptions to external 
domain resources by overriding the {{db.ignore.external.links}} configuration 
setting. This is useful when the crawl is focused on a domain but resources 
like images are hosted on a CDN.

Currently 
[URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
 provides some confusing INFO-level logging:
{quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] 
Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
{quote}
I find this confusing. It would be better to log *only* if a 
URLExemptionFilter implementation actually exists for a given URL.

I will provide a patch for this.
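
The intended behaviour can be sketched roughly as follows. This is not the patch itself and does not use the Nutch plugin API: {{ExemptionLoggingSketch}}, {{startupMessage}} and the list of class names are hypothetical stand-ins, and the message format is simply copied from the log line quoted above.

```java
import java.util.List;
import java.util.Optional;

final class ExemptionLoggingSketch {

  // Extension point name as it appears in the quoted log line.
  static final String X_POINT_ID = "org.apache.nutch.net.URLExemptionFilter";

  // Returns the INFO message to emit, or empty when nothing is configured,
  // so that a crawl without exemption filters produces no log line at all.
  static Optional<String> startupMessage(List<String> filterClassNames) {
    if (filterClassNames.isEmpty()) {
      return Optional.empty();
    }
    return Optional.of(String.format("Found %d extensions at point:'%s'",
        filterClassNames.size(), X_POINT_ID));
  }

  public static void main(String[] args) {
    // No filters configured: nothing is printed.
    startupMessage(List.of()).ifPresent(System.out::println);
    // One (hypothetical) filter configured: the message is printed.
    startupMessage(List.of("org.example.MyExemptionFilter"))
        .ifPresent(System.out::println);
  }
}
```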





[jira] [Resolved] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-08 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3038.
-
Resolution: Fixed

> Address issues discovered during 1.20 release management dryrun
> ---
>
> Key: NUTCH-3038
> URL: https://issues.apache.org/jira/browse/NUTCH-3038
> Project: Nutch
> Issue Type: Task
> Components: build, docker
> Affects Versions: 1.20
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Blocker
> Fix For: 1.20
>
>
> During the 1.20 release management dryrun I discovered the following issues,
> which I think should be addressed in order to be satisfied with the release
> candidate:
>  # Update docker/README to remove broken badge
>  # Upgrade alpine base image in docker/Dockerfile
>  # Migrate CHANGES.txt to CHANGES.md
>  # Upgrade apache parent pom version from 23 to 31
>  # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
>  # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in
> ivy/mvn.template
>  # Remove miredot plugin usage from ivy/mvn.template





[jira] [Closed] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-08 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3038.
---

Thanks [~snagel] 

> Address issues discovered during 1.20 release management dryrun
> ---
>
> Key: NUTCH-3038
> URL: https://issues.apache.org/jira/browse/NUTCH-3038
> Project: Nutch
>  Issue Type: Task
>  Components: build, docker
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 1.20
>
>
> During the 1.20 release management dryrun I discovered the following issues 
> which I think should be addressed in order to be satisfied with the release 
> candidate
>  # Update docker/README to remove broken badge
>  # Upgrade alpine base image in docker/Dockerfile
>  # Migrate CHANGES.txt to CHANGES.md
>  # Upgrade apache parent pom version from 23 to 31
>  # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
>  # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
> ivy/mvn.template
>  # Remove miredot plugin usage from ivy/mvn.template





[jira] [Work stopped] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-08 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3038 stopped by Lewis John McGibbney.
---
> Address issues discovered during 1.20 release management dryrun
> ---
>
> Key: NUTCH-3038
> URL: https://issues.apache.org/jira/browse/NUTCH-3038
> Project: Nutch
>  Issue Type: Task
>  Components: build, docker
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 1.20
>
>
> During the 1.20 release management dryrun I discovered the following issues 
> which I think should be addressed in order to be satisfied with the release 
> candidate
>  # Update docker/README to remove broken badge
>  # Upgrade alpine base image in docker/Dockerfile
>  # Migrate CHANGES.txt to CHANGES.md
>  # Upgrade apache parent pom version from 23 to 31
>  # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
>  # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
> ivy/mvn.template
>  # Remove miredot plugin usage from ivy/mvn.template





[jira] [Work started] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-05 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3038 started by Lewis John McGibbney.
---
> Address issues discovered during 1.20 release management dryrun
> ---
>
> Key: NUTCH-3038
> URL: https://issues.apache.org/jira/browse/NUTCH-3038
> Project: Nutch
>  Issue Type: Task
>  Components: build, docker
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 1.20
>
>
> During the 1.20 release management dryrun I discovered the following issues 
> which I think should be addressed in order to be satisfied with the release 
> candidate
>  # Update docker/README to remove broken badge
>  # Upgrade alpine base image in docker/Dockerfile
>  # Migrate CHANGES.txt to CHANGES.md
>  # Upgrade apache parent pom version from 23 to 31
>  # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
>  # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
> ivy/mvn.template
>  # Remove miredot plugin usage from ivy/mvn.template





[jira] [Updated] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-05 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3038:

Description: 
During the 1.20 release management dryrun I discovered the following issues 
which I think should be addressed in order to be satisfied with the release 
candidate
 # Update docker/README to remove broken badge
 # Upgrade alpine base image in docker/Dockerfile
 # Migrate CHANGES.txt to CHANGES.md
 # Upgrade apache parent pom version from 23 to 31
 # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
 # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
ivy/mvn.template
 # Remove miredot plugin usage from ivy/mvn.template

  was:
During the 1.20 release management dryrun I discovered the following issues 
which I think should be addressed in order to be satisfied with the release 
candidate
 # Update docker/README to remove broken badge
 # Upgrade alpine base image in docker/Dockerfile
 # Migrate CHANGES.txt to CHANGES.md
 # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
 # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
ivy/mvn.template
 # Remove miredot plugin usage from ivy/mvn.template


> Address issues discovered during 1.20 release management dryrun
> ---
>
> Key: NUTCH-3038
> URL: https://issues.apache.org/jira/browse/NUTCH-3038
> Project: Nutch
>  Issue Type: Task
>  Components: build, docker
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 1.20
>
>
> During the 1.20 release management dryrun I discovered the following issues 
> which I think should be addressed in order to be satisfied with the release 
> candidate
>  # Update docker/README to remove broken badge
>  # Upgrade alpine base image in docker/Dockerfile
>  # Migrate CHANGES.txt to CHANGES.md
>  # Upgrade apache parent pom version from 23 to 31
>  # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
>  # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
> ivy/mvn.template
>  # Remove miredot plugin usage from ivy/mvn.template





[jira] [Created] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-05 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3038:
---

 Summary: Address issues discovered during 1.20 release management 
dryrun
 Key: NUTCH-3038
 URL: https://issues.apache.org/jira/browse/NUTCH-3038
 Project: Nutch
  Issue Type: Task
  Components: build, docker
Affects Versions: 1.20
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


During the 1.20 release management dryrun I discovered the following issues 
which I think should be addressed in order to be satisfied with the release 
candidate
 # Update docker/README to remove broken badge
 # Upgrade alpine base image in docker/Dockerfile
 # Migrate CHANGES.txt to CHANGES.md
 # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
 # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
ivy/mvn.template
 # Remove miredot plugin usage from ivy/mvn.template





[jira] [Closed] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-04-04 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3032.
---

Thanks [~jglvary] and congratulations on your first contribution to Apache 
Nutch :)

> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Joe Gilvary
>Assignee: Joe Gilvary
>Priority: Major
>  Labels: indexing
> Fix For: 1.20
>
> Attachments: NUTCH-3032.patch
>
>
> It could be helpful to let end users manipulate information at indexing time 
> with their own code without the need for writing their own indexing plugin. I 
> mentioned this on the dev mailing list 
> (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some 
> description of my work in progress.
> One potential use is to address some of the same concerns that NUTCH-585 
> discusses regarding an alternative approach to picking and choosing which 
> content to index, but this approach would allow making index time decisions, 
> rather than setting the configuration for all content at the start of the 
> indexing run.
>  





[jira] [Updated] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3032:

Fix Version/s: 1.20

> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Joe Gilvary
>Assignee: Joe Gilvary
>Priority: Major
>  Labels: indexing
> Fix For: 1.20
>
> Attachments: NUTCH-3032.patch
>
>
> It could be helpful to let end users manipulate information at indexing time 
> with their own code without the need for writing their own indexing plugin. I 
> mentioned this on the dev mailing list 
> (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some 
> description of my work in progress.
> One potential use is to address some of the same concerns that NUTCH-585 
> discusses regarding an alternative approach to picking and choosing which 
> content to index, but this approach would allow making index time decisions, 
> rather than setting the configuration for all content at the start of the 
> indexing run.
>  





[jira] [Assigned] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-3032:
---

Assignee: Joe Gilvary

> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Joe Gilvary
>Assignee: Joe Gilvary
>Priority: Major
>  Labels: indexing
> Attachments: NUTCH-3032.patch
>
>
> It could be helpful to let end users manipulate information at indexing time 
> with their own code without the need for writing their own indexing plugin. I 
> mentioned this on the dev mailing list 
> (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some 
> description of my work in progress.
> One potential use is to address some of the same concerns that NUTCH-585 
> discusses regarding an alternative approach to picking and choosing which 
> content to index, but this approach would allow making index time decisions, 
> rather than setting the configuration for all content at the start of the 
> indexing run.
>  





[jira] [Work stopped] (NUTCH-2856) Implement a protocol-smb plugin based on hierynomus/smbj

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2856 stopped by Lewis John McGibbney.
---
> Implement a protocol-smb plugin based on hierynomus/smbj
> 
>
> Key: NUTCH-2856
> URL: https://issues.apache.org/jira/browse/NUTCH-2856
> Project: Nutch
>  Issue Type: New Feature
>  Components: external, plugin, protocol
>Reporter: Hiran Chaudhuri
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> The plugin protocol-smb advertised on 
> [https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral] actually 
> refers to the JCIFS library. According to this library's homepage 
> [https://www.jcifs.org/]:
> _If you're looking for the latest and greatest open source Java SMB library, 
> this is not it. JCIFS has been in maintenance-mode-only for several years and 
> although what it does support works fine (SMB1, NTLMv2, midlc, MSRPC and 
> various utility classes), jCIFS does not support the newer SMB2/3 variants of 
> the SMB protocol which is slowly becoming required (Windows 10 requires 
> SMB2/3). JCIFS only supports SMB1 but Microsoft has deprecated SMB1 in their 
> products. *So if SMB1 is disabled on your network, JCIFS' file related 
> operations will NOT work.*_
> Looking at 
> [https://en.wikipedia.org/wiki/Server_Message_Block#SMB_/_CIFS_/_SMB1]:
> _Microsoft added SMB1 to the Windows Server 2012 R2 deprecation list in June 
> 2013. Windows Server 2016 and some versions of Windows 10 Fall Creators 
> Update do not have SMB1 installed by default._
> In conclusion, the chances that the SMB1 protocol is installed and/or 
> configured are shrinking rapidly. Therefore some migration towards 
> SMB2/3 is required. Luckily the JCIFS homepage lists alternatives:
>  * [jcifs-codelibs|https://github.com/codelibs/jcifs]
>  * [jcifs-ng|https://github.com/AgNO3/jcifs-ng]
>  * [smbj|https://github.com/hierynomus/smbj]





[jira] [Work stopped] (NUTCH-2887) Migrate to JUnit 5 Jupiter

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2887 stopped by Lewis John McGibbney.
---
> Migrate to JUnit 5 Jupiter
> --
>
> Key: NUTCH-2887
> URL: https://issues.apache.org/jira/browse/NUTCH-2887
> Project: Nutch
>  Issue Type: Improvement
>  Components: test
> Environment: Migrate 
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> This effort is a bit of a beast. See the [JUnit migration 
> tips|https://junit.org/junit5/docs/current/user-guide/#migrating-from-junit4-tips]
>  for general guidance. A general grep for junit in src produces the following
> {code:bash}
> ./test/nutch-site.xml
> ./test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java
> ./test/org/apache/nutch/net/TestURLNormalizers.java
> ./test/org/apache/nutch/net/protocols/TestHttpDateFormat.java
> ./test/org/apache/nutch/net/TestURLFilters.java
> ./test/org/apache/nutch/util/TestStringUtil.java
> ./test/org/apache/nutch/util/TestSuffixStringMatcher.java
> ./test/org/apache/nutch/util/TestEncodingDetector.java
> ./test/org/apache/nutch/util/TestMimeUtil.java
> ./test/org/apache/nutch/util/TestPrefixStringMatcher.java
> ./test/org/apache/nutch/util/DumpFileUtilTest.java
> ./test/org/apache/nutch/util/TestNodeWalker.java
> ./test/org/apache/nutch/util/WritableTestUtils.java
> ./test/org/apache/nutch/util/TestTableUtil.java
> ./test/org/apache/nutch/util/TestURLUtil.java
> ./test/org/apache/nutch/util/TestGZIPUtils.java
> ./test/org/apache/nutch/parse/TestParseText.java
> ./test/org/apache/nutch/parse/TestOutlinks.java
> ./test/org/apache/nutch/parse/TestParseData.java
> ./test/org/apache/nutch/parse/TestOutlinkExtractor.java
> ./test/org/apache/nutch/parse/TestParserFactory.java
> ./test/org/apache/nutch/segment/TestSegmentMerger.java
> ./test/org/apache/nutch/segment/TestSegmentMergerCrawlDatums.java
> ./test/org/apache/nutch/plugin/TestPluginSystem.java
> ./test/org/apache/nutch/fetcher/TestFetcher.java
> ./test/org/apache/nutch/protocol/TestProtocolFactory.java
> ./test/org/apache/nutch/protocol/TestContent.java
> ./test/org/apache/nutch/protocol/AbstractHttpProtocolPluginTest.java
> ./test/org/apache/nutch/crawl/TestCrawlDbFilter.java
> ./test/org/apache/nutch/crawl/TestTextProfileSignature.java
> ./test/org/apache/nutch/crawl/TestCrawlDbStates.java
> ./test/org/apache/nutch/crawl/TestGenerator.java
> ./test/org/apache/nutch/crawl/TestAdaptiveFetchSchedule.java
> ./test/org/apache/nutch/crawl/TODOTestCrawlDbStates.java
> ./test/org/apache/nutch/crawl/TestSignatureFactory.java
> ./test/org/apache/nutch/crawl/ContinuousCrawlTestUtil.java
> ./test/org/apache/nutch/crawl/TestInjector.java
> ./test/org/apache/nutch/crawl/TestLinkDbMerger.java
> ./test/org/apache/nutch/crawl/TestCrawlDbMerger.java
> ./test/org/apache/nutch/service/TestNutchServer.java
> ./test/org/apache/nutch/metadata/TestMetadata.java
> ./test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java
> ./test/org/apache/nutch/indexer/TestIndexingFilters.java
> ./test/org/apache/nutch/indexer/TestIndexerMapReduce.java
> ./bin/nutch
> ./plugin/scoring-orphan/src/test/org/apache/nutch/scoring/orphan/TestOrphanScoringFilter.java
> ./plugin/index-basic/src/test/org/apache/nutch/indexer/basic/TestBasicIndexingFilter.java
> ./plugin/urlfilter-domaindenylist/build.xml
> ./plugin/urlfilter-domaindenylist/src/test/org/apache/nutch/urlfilter/domaindenylist/TestDomainDenylistURLFilter.java
> ./plugin/protocol-imaps/plugin.xml
> ./plugin/protocol-imaps/ivy.xml
> ./plugin/protocol-imaps/lib/junit-4.13.jar
> ./plugin/protocol-imaps/lib/greenmail-junit4-1.6.0.jar
> ./plugin/protocol-imaps/lib/greenmail-1.6.0.jar
> ./plugin/protocol-imaps/src/test/org/apache/nutch/protocol/imaps/TestImaps.java
> ./plugin/protocol-file/build.xml
> ./plugin/protocol-file/src/test/org/apache/nutch/protocol/file/TestProtocolFile.java
> ./plugin/urlnormalizer-regex/build.xml
> ./plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java
> ./plugin/build-plugin.xml
> ./plugin/creativecommons/src/test/org/creativecommons/nutch/TestCCParseFilter.java
> ./plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic/TestBasicURLNormalizer.java
> ./plugin/urlnormalizer-protocol/build.xml
> ./plugin/urlnormalizer-protocol/src/test/org/apache/nutch/net/urlnormalizer/protocol/TestProtocolURLNormalizer.java
> ./plugin/urlfilter-prefix/src/test/org/apache/nutch/urlfilter/prefix/TestPrefixURLFilter.java
> ./plugin/urlfilter-suffix/src/test/org/apache/nutch/urlfilter/suffix/TestSuffixURLFilter.java
> ./plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
> 
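
As a rough illustration of the mechanical part of such a migration, the sketch below rewrites JUnit 4 import lines to their JUnit 5 Jupiter counterparts. The class name JUnitImportMigrator and its methods are hypothetical and are not part of Nutch or JUnit; only the annotation and class renames themselves come from the JUnit 5 migration guidance.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Nutch): rewrites JUnit 4 imports to JUnit 5 Jupiter ones. */
final class JUnitImportMigrator {

  // JUnit 4 type -> JUnit 5 (Jupiter) counterpart.
  private static final Map<String, String> RENAMES = new LinkedHashMap<>();
  static {
    RENAMES.put("org.junit.BeforeClass", "org.junit.jupiter.api.BeforeAll");
    RENAMES.put("org.junit.AfterClass", "org.junit.jupiter.api.AfterAll");
    RENAMES.put("org.junit.Before", "org.junit.jupiter.api.BeforeEach");
    RENAMES.put("org.junit.After", "org.junit.jupiter.api.AfterEach");
    RENAMES.put("org.junit.Ignore", "org.junit.jupiter.api.Disabled");
    RENAMES.put("org.junit.Assert", "org.junit.jupiter.api.Assertions");
    RENAMES.put("org.junit.Test", "org.junit.jupiter.api.Test");
  }

  /** Returns the source text with recognised JUnit 4 import lines rewritten. */
  static String migrateImports(String source) {
    StringBuilder out = new StringBuilder();
    for (String line : source.split("\n", -1)) {
      out.append(migrateLine(line)).append('\n');
    }
    out.setLength(out.length() - 1); // drop the newline added after the last line
    return out.toString();
  }

  private static String migrateLine(String line) {
    String trimmed = line.trim();
    if (!trimmed.startsWith("import ")) {
      return line; // only import statements are touched
    }
    for (Map.Entry<String, String> rename : RENAMES.entrySet()) {
      String old = rename.getKey();
      boolean plainImport = trimmed.equals("import " + old + ";");
      boolean staticImport = trimmed.startsWith("import static " + old + ".");
      if (plainImport || staticImport) {
        return line.replace(old, rename.getValue());
      }
    }
    return line;
  }

  public static void main(String[] args) {
    String before = "import org.junit.Before;\n"
        + "import org.junit.Test;\n"
        + "import static org.junit.Assert.assertEquals;";
    System.out.println(migrateImports(before));
  }
}
```

Import rewriting alone is not a full migration: JUnit 5 assertions take the failure message as the last argument rather than the first, and @Test(expected = ...) has to be replaced with assertThrows, so each test class listed above would still need a manual pass.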

[jira] [Closed] (NUTCH-2832) Create tutorial on sending Nutch logs to Elasticsearch

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2832.
---

> Create tutorial on sending Nutch logs to Elasticsearch
> --
>
> Key: NUTCH-2832
> URL: https://issues.apache.org/jira/browse/NUTCH-2832
> Project: Nutch
>  Issue Type: New Feature
>  Components: configuration, deployment
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> A while back I used to use [Chukwa|https://chukwa.apache.org/] for log 
> aggregation and analysis. Chukwa is now retired. 
> I did a bit of research into directly logging Log4j2 into Elasticsearch and came 
> across 
> [log4j2-elasticsearch|https://github.com/rfoltyns/log4j2-elasticsearch] which 
> looks pretty simple.
> I'm going to have a crack at implementing this functionality as a 
> configuration option. 





[jira] [Resolved] (NUTCH-2832) Create tutorial on sending Nutch logs to Elasticsearch

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2832.
-
Resolution: Won't Fix

Given the license changes regarding the concerned backend, I have no interest 
in implementing this anymore. 

> Create tutorial on sending Nutch logs to Elasticsearch
> --
>
> Key: NUTCH-2832
> URL: https://issues.apache.org/jira/browse/NUTCH-2832
> Project: Nutch
>  Issue Type: New Feature
>  Components: configuration, deployment
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> A while back I used to use [Chukwa|https://chukwa.apache.org/] for log 
> aggregation and analysis. Chukwa is now retired. 
> I did a bit of research into directly logging Log4j2 into Elasticsearch and came 
> across 
> [log4j2-elasticsearch|https://github.com/rfoltyns/log4j2-elasticsearch] which 
> looks pretty simple.
> I'm going to have a crack at implementing this functionality as a 
> configuration option. 





[jira] [Resolved] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3036.
-
Resolution: Fixed

> Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium
> 
>
> Key: NUTCH-3036
> URL: https://issues.apache.org/jira/browse/NUTCH-3036
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, selenium
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> lib-selenium currently packages org.seleniumhq.selenium:selenium-java 
> *v4.7.2* but *v4.18.1* is available on Maven Central.
> This ticket will upgrade the java dependency and validate that both 
> protocol-selenium and protocol-interactiveselenium work as expected in local 
> mode and via selenium grid.





[jira] [Closed] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3036.
---

> Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium
> 
>
> Key: NUTCH-3036
> URL: https://issues.apache.org/jira/browse/NUTCH-3036
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, selenium
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> lib-selenium currently packages org.seleniumhq.selenium:selenium-java 
> *v4.7.2* but *v4.18.1* is available on Maven Central.
> This ticket will upgrade the java dependency and validate that both 
> protocol-selenium and protocol-interactiveselenium work as expected in local 
> mode and via selenium grid.





[jira] [Closed] (NUTCH-3035) Update license and notice file for release of 1.20

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3035.
---

> Update license and notice file for release of 1.20 
> ---
>
> Key: NUTCH-3035
> URL: https://issues.apache.org/jira/browse/NUTCH-3035
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Close to the release of 1.20 the license and notice files should be updated 
> to contain all (third-party) licenses of all dependencies. Cf. NUTCH-2290 and 
> NUTCH-2981.





[jira] [Resolved] (NUTCH-3035) Update license and notice file for release of 1.20

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3035.
-
Resolution: Fixed

> Update license and notice file for release of 1.20 
> ---
>
> Key: NUTCH-3035
> URL: https://issues.apache.org/jira/browse/NUTCH-3035
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Close to the release of 1.20 the license and notice files should be updated 
> to contain all (third-party) licenses of all dependencies. Cf. NUTCH-2290 and 
> NUTCH-2981.





[jira] [Resolved] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3037.
-
Resolution: Fixed

> Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
> --
>
> Key: NUTCH-3037
> URL: https://issues.apache.org/jira/browse/NUTCH-3037
> Project: Nutch
>  Issue Type: Task
>  Components: indexer-kafka
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> We depend on v1.1.0, which is quite a bit behind the current v3.7.0 artifact; 
> I therefore propose to upgrade.
> I will also state that a _*kafka_2.13*_ artifact exists. This would demand 
> that the underlying Scala version be also upgraded... but I think this should 
> be addressed in a separate ticket.





[jira] [Closed] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3037.
---

> Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
> --
>
> Key: NUTCH-3037
> URL: https://issues.apache.org/jira/browse/NUTCH-3037
> Project: Nutch
>  Issue Type: Task
>  Components: indexer-kafka
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> We depend on v1.1.0, which is quite a bit behind the current v3.7.0 artifact; 
> I therefore propose to upgrade.
> I will also state that a _*kafka_2.13*_ artifact exists. This would demand 
> that the underlying Scala version be also upgraded... but I think this should 
> be addressed in a separate ticket.





[jira] [Work stopped] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0

2024-03-21 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3037 stopped by Lewis John McGibbney.
---
> Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
> --
>
> Key: NUTCH-3037
> URL: https://issues.apache.org/jira/browse/NUTCH-3037
> Project: Nutch
>  Issue Type: Task
>  Components: indexer-kafka
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> We depend on v1.1.0, which is quite a bit behind the current v3.7.0 artifact; 
> I therefore propose to upgrade.
> I will also state that a _*kafka_2.13*_ artifact exists. This would demand 
> that the underlying Scala version be also upgraded... but I think this should 
> be addressed in a separate ticket.





[jira] [Updated] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0

2024-03-21 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3037:

Flags: Patch

> Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
> --
>
> Key: NUTCH-3037
> URL: https://issues.apache.org/jira/browse/NUTCH-3037
> Project: Nutch
>  Issue Type: Task
>  Components: indexer-kafka
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> We depend on v1.1.0, which is quite a bit behind the current v3.7.0 artifact; 
> I therefore propose to upgrade.
> I will also state that a _*kafka_2.13*_ artifact exists. This would demand 
> that the underlying Scala version be also upgraded... but I think this should 
> be addressed in a separate ticket.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Work started] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0

2024-03-21 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3037 started by Lewis John McGibbney.
---
> Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
> --
>
> Key: NUTCH-3037
> URL: https://issues.apache.org/jira/browse/NUTCH-3037
> Project: Nutch
>  Issue Type: Task
>  Components: indexer-kafka
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> We depend on v1.1.0, which is quite a bit behind the current v3.7.0 artifact; 
> I therefore propose to upgrade.
> I will also state that a _*kafka_2.13*_ artifact exists. This would demand 
> that the underlying Scala version be also upgraded... but I think this should 
> be addressed in a separate ticket.





[jira] [Created] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0

2024-03-21 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3037:
---

 Summary: Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
 Key: NUTCH-3037
 URL: https://issues.apache.org/jira/browse/NUTCH-3037
 Project: Nutch
  Issue Type: Task
  Components: indexer-kafka
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


We depend on v1.1.0, which is quite a bit behind the current v3.7.0 artifact; I 
therefore propose to upgrade.

I will also state that a _*kafka_2.13*_ artifact exists. This would demand that 
the underlying Scala version be also upgraded... but I think this should be 
addressed in a separate ticket.





[jira] [Work stopped] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium

2024-03-14 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3036 stopped by Lewis John McGibbney.
---
> Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium
> 
>
> Key: NUTCH-3036
> URL: https://issues.apache.org/jira/browse/NUTCH-3036
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, selenium
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> lib-selenium currently packages org.seleniumhq.selenium:selenium-java 
> *v4.7.2* but *v4.18.1* is available on Maven Central.
> This ticket will upgrade the java dependency and validate that both 
> protocol-selenium and protocol-interactiveselenium work as expected in local 
> mode and via selenium grid.





[jira] [Work started] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium

2024-03-14 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3036 started by Lewis John McGibbney.
---
> Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium
> 
>
> Key: NUTCH-3036
> URL: https://issues.apache.org/jira/browse/NUTCH-3036
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, selenium
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> lib-selenium currently packages org.seleniumhq.selenium:selenium-java 
> *v4.7.2* but *v4.18.1* is available on Maven Central.
> This ticket will upgrade the java dependency and validate that both 
> protocol-selenium and protocol-interactiveselenium work as expected in local 
> mode and via selenium grid.





[jira] [Created] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium

2024-03-14 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3036:
---

 Summary: Upgrade org.seleniumhq.selenium:selenium-java dependency 
in lib-selenium
 Key: NUTCH-3036
 URL: https://issues.apache.org/jira/browse/NUTCH-3036
 Project: Nutch
  Issue Type: Improvement
  Components: selenium, plugin
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


lib-selenium currently packages org.seleniumhq.selenium:selenium-java *v4.7.2* 
but *v4.18.1* is available on Maven Central.

This ticket will upgrade the java dependency and validate that both 
protocol-selenium and protocol-interactiveselenium work as expected in local 
mode and via selenium grid.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-13 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826776#comment-17826776
 ] 

Lewis John McGibbney commented on NUTCH-3029:
-

Hi [~martin.dj] [~markus17], it looks like we are missing some Javadoc:

 
{quote}
 [javadoc] Standard Doclet version 11.0.22
 [javadoc] Building tree for all the packages and classes...
 [javadoc] /home/runner/work/nutch/nutch/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java:193: warning: no @param for url
 [javadoc] public static String getHostName(String url) throws URISyntaxException {
 [javadoc] ^
 [javadoc] /home/runner/work/nutch/nutch/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java:193: warning: no @return
 [javadoc] public static String getHostName(String url) throws URISyntaxException {
 [javadoc] ^
 [javadoc] /home/runner/work/nutch/nutch/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java:193: warning: no @throws for java.net.URISyntaxException
 [javadoc] public static String getHostName(String url) throws URISyntaxException {
 [javadoc] ^
 [javadoc] /home/runner/work/nutch/nutch/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java:205: warning: no @return
 [javadoc] public float getMaxInterval(Text url, float defaultMaxInterval){
 [javadoc] ^
 [javadoc] /home/runner/work/nutch/nutch/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java:227: warning: no @return
 [javadoc] public float getMinInterval(Text url, float defaultMinInterval){
 [javadoc] ^
{quote}
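
For illustration, the three warnings reported for getHostName could be silenced with a doc comment along the following lines. This is only a minimal sketch: the class name HostNameJavadocSketch is hypothetical, and the method body (parsing with java.net.URI) is an assumption made so the example compiles, not the actual AdaptiveFetchSchedule code. Only the method signature is taken from the warnings quoted above.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical stand-in class; only the method signature comes from the
// javadoc warnings. The body is an assumption for illustration.
public class HostNameJavadocSketch {

  /**
   * Returns the host name of the given URL.
   *
   * @param url the URL string to parse
   * @return the host component of the URL, or {@code null} if it has none
   * @throws URISyntaxException if {@code url} is not a valid URI
   */
  public static String getHostName(String url) throws URISyntaxException {
    return new URI(url).getHost();
  }

  public static void main(String[] args) throws URISyntaxException {
    System.out.println(getHostName("https://nutch.apache.org/index.html"));
  }
}
```

The getMaxInterval and getMinInterval methods would each need a matching @return tag in the same style.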
 

> Host specific max. and min. intervals in adaptive scheduler
> ---
>
> Key: NUTCH-3029
> URL: https://issues.apache.org/jira/browse/NUTCH-3029
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.19, 1.20
>Reporter: Martin Djukanovic
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: adaptive-host-specific-intervals.txt.template, 
> new_adaptive_fetch_schedule-1.patch
>
>
> This patch implements custom max. and min. refetching intervals for specific 
> hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt 
> configuration file (template also attached).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-13 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3033.
---

> Upgrade Ivy to v2.5.2
> -
>
> Key: NUTCH-3033
> URL: https://issues.apache.org/jira/browse/NUTCH-3033
> Project: Nutch
>  Issue Type: Task
>  Components: ivy
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> Ivy v2.5.2 was released August 20th 2023. Let’s upgrade.
> [https://ant.apache.org/ivy/history/2.5.2/release-notes.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-13 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3033.
-
Resolution: Fixed

> Upgrade Ivy to v2.5.2
> -
>
> Key: NUTCH-3033
> URL: https://issues.apache.org/jira/browse/NUTCH-3033
> Project: Nutch
>  Issue Type: Task
>  Components: ivy
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> Ivy v2.5.2 was released August 20th 2023. Let’s upgrade.
> [https://ant.apache.org/ivy/history/2.5.2/release-notes.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3033:

Due Date: 12/Mar/24  (was: 11/Mar/24)

> Upgrade Ivy to v2.5.2
> -
>
> Key: NUTCH-3033
> URL: https://issues.apache.org/jira/browse/NUTCH-3033
> Project: Nutch
>  Issue Type: Task
>  Components: ivy
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> Ivy v2.5.2 was released August 20th 2023. Let’s upgrade.
> [https://ant.apache.org/ivy/history/2.5.2/release-notes.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Work stopped] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3033 stopped by Lewis John McGibbney.
---
> Upgrade Ivy to v2.5.2
> -
>
> Key: NUTCH-3033
> URL: https://issues.apache.org/jira/browse/NUTCH-3033
> Project: Nutch
>  Issue Type: Task
>  Components: ivy
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> Ivy v2.5.2 was released August 20th 2023. Let’s upgrade.
> [https://ant.apache.org/ivy/history/2.5.2/release-notes.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # some aspects, e.g. examples, are [fairly well 
documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]…
 currently 7 tests as of writing. Traditionally, developers have focused on 
providing unit tests on the plugin-level as opposed to the legacy plugin 
framework.
 # it sees very little maintenance/attention. My gut feeling (and I may be 
totally wrong here) is that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but that being said, it is clunky 
nonetheless.
 # generally speaking, any reduction of code in the Nutch codebase through 
careful selection of, and dependence on, well-maintained, well-tested third-party 
libraries would be a good thing for the Nutch codebase.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 # {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|https://pf4j.org/doc/plugin-lifecycle.html], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. It might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :). Generally 
speaking, just familiarize oneself with the legacy plugin framework and 
understand where the gaps are.
 # {*}study the PF4J framework and perform a feasibility study{*}; this will 
provide an opportunity to identify gaps between what the legacy plugin 
framework does (and what Nutch needs) vs. what PF4J provides. Touch base with 
the PF4J community and describe the intention to replace the legacy Nutch plugin 
framework with PF4J. Obtain guidance on how to proceed. Document this all in 
the Nutch wiki. Create a mapping of [legacy 
classes|https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]
 to [PF4J 
equivalents|https://github.com/pf4j/pf4j/tree/master/pf4j/src/main/java/org/pf4j].
 # {*}Restructure the legacy Nutch plugin package{*}: 
[https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]
 # {*}Restructure each plugin in the plugins directory{*}: 
[https://github.com/apache/nutch/tree/master/src/plugin]
 # *Update Nutch plugin documentation* 
 # {*}Create/propose plugin utility tooling{*}: #4 in the motivation section 
states that developing plugins is clunky. A utility tool which streamlines the 
creation of new plugins would be ideal. For example, this could take the form 
of a [new bash script|https://github.com/apache/nutch/tree/master/src/bin] 
which prompts the developer for input and then generates the plugin skeleton. 
{*}This is a nice to have{*}.
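
As a rough illustration of the skeleton generation described in the last item, the following sketch creates an empty plugin directory. It is written in Java rather than bash, and everything in it is an assumption for illustration: the class PluginSkeletonSketch and the method createSkeleton are hypothetical, and the file names (plugin.xml, build.xml, ivy.xml, src/java) simply mirror the layout of existing plugins under src/plugin. It does not prompt for input or write any file content.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Nutch. It only creates empty placeholder
// files that follow the layout of existing plugins under src/plugin.
public class PluginSkeletonSketch {

  /**
   * Creates a bare plugin directory with placeholder descriptor files.
   *
   * @param pluginRoot directory that holds all plugins (e.g. src/plugin)
   * @param pluginId name of the new plugin directory
   * @return the path of the created plugin directory
   * @throws IOException if a directory or file cannot be created
   */
  public static Path createSkeleton(Path pluginRoot, String pluginId) throws IOException {
    Path pluginDir = pluginRoot.resolve(pluginId);
    // Source tree for the plugin's Java classes (also creates pluginDir itself).
    Files.createDirectories(pluginDir.resolve("src").resolve("java"));
    // Descriptor and build files; their content is left to the developer.
    for (String name : new String[] {"plugin.xml", "build.xml", "ivy.xml"}) {
      Path file = pluginDir.resolve(name);
      if (Files.notExists(file)) {
        Files.createFile(file);
      }
    }
    return pluginDir;
  }

  public static void main(String[] args) throws IOException {
    // Work in a temporary directory so the sketch never touches a real checkout.
    Path root = Files.createTempDirectory("nutch-plugin-sketch");
    System.out.println(createSkeleton(root, "my-plugin"));
  }
}
```

A real tool would additionally fill the descriptor templates and register the plugin with the build, which is where most of the current clunkiness lies.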

h1. Google Summer of Code Details

This initiative is being proposed as a GSoC 2024 project. 

{*}Proposed Mentor{*}: [~lewismc] 

{*}Proposed Co-Mentor{*}:

 

  was:
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # [some aspects e.g. examples, are [fairly well 
documented|h[ttps://cwiki.apache.org/confluence/display/NUTCH/PluginCentral|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 

[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # some aspects, e.g. examples, are [fairly well 
documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]…
 currently 7 tests as of writing. Traditionally, developers have focused on 
providing unit tests on the plugin-level as opposed to the legacy plugin 
framework.
 # it sees very little maintenance/attention. My gut feeling (and I may be 
totally wrong here) is that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but that being said, it is clunky 
nonetheless.
 # generally speaking, any reduction of code in the Nutch codebase through 
careful selection of, and dependence on, well-maintained, well-tested third-party 
libraries would be a good thing for the Nutch codebase.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 # {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|https://pf4j.org/doc/plugin-lifecycle.html], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. It might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :). Generally 
speaking, just familiarize oneself with the legacy plugin framework and 
understand where the gaps are.
 # {*}study the PF4J framework and perform a feasibility study{*}; this will 
provide an opportunity to identify gaps between what the legacy plugin 
framework does (and what Nutch needs) vs. what PF4J provides. Touch base with 
the PF4J community and describe the intention to replace the legacy Nutch plugin 
framework with PF4J. Obtain guidance on how to proceed. Document this all in 
the Nutch wiki. Create a mapping of [legacy 
classes|https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]
 to [PF4J 
equivalents|https://github.com/pf4j/pf4j/tree/master/pf4j/src/main/java/org/pf4j].
 # {*}Restructure the legacy Nutch plugin package{*}: 
[https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]
 # {*}Restructure each plugin in the plugins directory{*}: 
[https://github.com/apache/nutch/tree/master/src/plugin]

 
h1. Google Summer of Code Details

This initiative is being proposed as a GSoC 2024 project. 

{*}Proposed Mentor{*}: [~lewismc] 

{*}Proposed Co-Mentor{*}:

 

  was:
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # [some aspects e.g. examples, are [fairly well 
documented|h[ttps://cwiki.apache.org/confluence/display/NUTCH/PluginCentral|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 

[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # some aspects, e.g. examples, are [fairly well 
documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]…
 currently 7 tests as of writing. Traditionally, developers have focused on 
providing unit tests on the plugin-level as opposed to the legacy plugin 
framework.
 # it sees very little maintenance/attention. My gut feeling (and I may be 
totally wrong here) is that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but that being said, it is clunky 
nonetheless.
 # generally speaking, any reduction of code in the Nutch codebase through 
careful selection of, and dependence on, well-maintained, well-tested third-party 
libraries would be a good thing for the Nutch codebase.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 # {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|https://pf4j.org/doc/plugin-lifecycle.html], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. It might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :). Generally 
speaking, just familiarize oneself with the legacy plugin framework and 
understand where the gaps are.
 # {*}study the PF4J framework and perform a feasibility study{*}; this will 
provide an opportunity to identify gaps between what the legacy plugin 
framework does (and what Nutch needs) vs. what PF4J provides. Touch base with 
the PF4J community and describe the intention to replace the legacy Nutch plugin 
framework with PF4J. Obtain guidance on how to proceed. Document this all in 
the Nutch wiki. Create a mapping of [legacy 
classes|https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]
 to [PF4J 
equivalents|https://github.com/pf4j/pf4j/tree/master/pf4j/src/main/java/org/pf4j].
 # {*}Restructure the legacy Nutch plugin package{*}: 
[https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]
 # {*}Restructure each plugin in the plugins directory{*}: 
[https://github.com/apache/nutch/tree/master/src/plugin]
 #  

 

  was:
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # [some aspects e.g. examples, are [fairly well 
documented|h[ttps://cwiki.apache.org/confluence/display/NUTCH/PluginCentral|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|[https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]]…
 only 7 tests. Traditionally, developers have focused on providing unit tests 
on the plugin-level as opposed to the legacy plugin framework.
 # 

[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # some aspects, e.g. examples, are [fairly well 
documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]…
 only 7 tests. Traditionally, developers have focused on providing unit tests 
on the plugin-level as opposed to the legacy plugin framework.
 # it sees very little maintenance/attention. My gut feeling (and I may be 
totally wrong here) is that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but that being said, it is clunky 
nonetheless.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 * {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|https://pf4j.org/doc/plugin-lifecycle.html], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. It might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :).
 * {*}study the PF4J framework and perform a feasibility study{*}; this will 
provide an opportunity to identify gaps between what the legacy plugin 
framework does (and what Nutch needs) vs. what PF4J provides. Touch base with 
the PF4J community and describe the intention to replace the legacy Nutch plugin 
framework with PF4J. Obtain guidance on how to proceed. Document this all in 
the Nutch wiki.
 *  

 

  was:
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # [some aspects e.g. examples, are [fairly well 
documented|h[ttps://cwiki.apache.org/confluence/display/NUTCH/PluginCentral|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|[https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]]…
 only 7 tests. Traditionally, developers have focused on providing unit tests 
on the plugin-level as opposed to the legacy plugin framework.
 # see’s very low maintenance/attention. It is my gut feeling (and I may be 
totally wrong here) but I _think_ that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but that being said, it is clunky 
non-the-less.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative intp Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 * {*}perform feasibility study{*}; touch base with the PF4J community, 
describe the intention to replace the legacy Nutch 

[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # [some aspects e.g. examples, are fairly well 
documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]…
 only 7 tests. Traditionally, developers have focused on providing unit tests 
on the plugin-level as opposed to the legacy plugin framework.
 # it sees very little maintenance/attention. My gut feeling (and I may be 
totally wrong here) is that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but that being said, it is clunky 
nonetheless.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 * {*}perform feasibility study{*}; touch base with the PF4J community, 
describe the intention to replace the legacy Nutch plugin framework with PF4J. 
Obtain guidance on how to proceed. Document this all in the Nutch wiki.
 * {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|https://pf4j.org/doc/plugin-lifecycle.html], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. It might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :)
 *  

 

  was:
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # [some aspects e.g. examples, are fairly well 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|[https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]]…
 only 7 tests. Traditionally, developers have focused on providing unit tests 
on the plugin-level as opposed to the legacy plugin framework.
 # see’s very low maintenance/attention. It is my gut feeling (and I may be 
totally wrong here) but I _think_ that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but that being said, it is clunky 
non-the-less.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative intp Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 * {*}perform feasibility study{*}; touch base with the PF4J community, 
describe the intention to replace the legacy Nutch plugin framework with PF4J. 
Obtain guidance on how to proceed. Document this all in the Nutch wiki.
 * {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from 
[PF4J’s plugin lifecycle 

[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # some aspects, e.g. examples, are [fairly well 
documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]; 
this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers.
 # the core framework is somewhat [sparsely 
tested|https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java], 
with only 7 tests. Traditionally, developers have focused on providing unit 
tests at the plugin level as opposed to the legacy plugin framework itself.
 # it sees very little maintenance/attention. It is my gut feeling (and I may 
be totally wrong here) that not many people know much about the core legacy 
plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but it is clunky nonetheless.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 * {*}perform feasibility study{*}; touch base with the PF4J community, 
describe the intention to replace the legacy Nutch plugin framework with PF4J. 
Obtain guidance on how to proceed. Document this all in the Nutch wiki.
 * {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|https://pf4j.org/doc/plugin-lifecycle.html], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. It might also be a good idea to make a contribution to PF4J 
and provide them with a diagram to accompany their documentation :)

 

  was:
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # some aspects, e.g. examples, are [fairly well 
documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]; 
this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers.
 # the core framework is somewhat [sparsely 
tested|https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java], 
with only 7 tests. Traditionally, developers have focused on providing unit 
tests at the plugin level as opposed to the legacy plugin framework itself.
 # it sees very little maintenance/attention. It is my gut feeling (and I may 
be totally wrong here) that not many people know much about the core legacy 
plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but it is clunky nonetheless.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 * {*}perform feasibility study{*}; touch base with the PF4J community, 
describe the intention to replace the legacy Nutch plugin framework with PF4J. 
Obtain guidance on how to proceed. Document this all in the Nutch wiki.
 * {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from 

[jira] [Created] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3034:
---

 Summary: Overhaul the legacy Nutch plugin framework and replace it 
with PF4J
 Key: NUTCH-3034
 URL: https://issues.apache.org/jira/browse/NUTCH-3034
 Project: Nutch
  Issue Type: Improvement
  Components: pf4j, plugin
Reporter: Lewis John McGibbney


h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # some aspects, e.g. examples, are [fairly well 
documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]; 
this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers.
 # the core framework is somewhat [sparsely 
tested|https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java], 
with only 7 tests. Traditionally, developers have focused on providing unit 
tests at the plugin level as opposed to the legacy plugin framework itself.
 # it sees very little maintenance/attention. It is my gut feeling (and I may 
be totally wrong here) that not many people know much about the core legacy 
plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but it is clunky nonetheless.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 * {*}perform feasibility study{*}; touch base with the PF4J community, 
describe the intention to replace the legacy Nutch plugin framework with PF4J. 
Obtain guidance on how to proceed. Document this all in the Nutch wiki.
 * {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|https://pf4j.org/doc/plugin-lifecycle.html], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. It might also be a good idea to make a contribution to PF4J 
and provide them with a diagram to accompany their documentation :)

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-11 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3033:
---

 Summary: Upgrade Ivy to v2.5.2
 Key: NUTCH-3033
 URL: https://issues.apache.org/jira/browse/NUTCH-3033
 Project: Nutch
  Issue Type: Task
  Components: ivy
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


Ivy v2.5.2 was released August 20th 2023. Let’s upgrade.

[https://ant.apache.org/ivy/history/2.5.2/release-notes.html]





[jira] [Work started] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-11 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3033 started by Lewis John McGibbney.
---
> Upgrade Ivy to v2.5.2
> -
>
> Key: NUTCH-3033
> URL: https://issues.apache.org/jira/browse/NUTCH-3033
> Project: Nutch
>  Issue Type: Task
>  Components: ivy
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> Ivy v2.5.2 was released August 20th 2023. Let’s upgrade.
> [https://ant.apache.org/ivy/history/2.5.2/release-notes.html]





[jira] [Closed] (NUTCH-3024) Remove flaky 'dependency check' target

2023-11-24 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3024.
---

> Remove flaky 'dependency check' target
> --
>
> Key: NUTCH-3024
> URL: https://issues.apache.org/jira/browse/NUTCH-3024
> Project: Nutch
>  Issue Type: Task
>  Components: build
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> I [started a 
> thread|https://lists.apache.org/thread/ol3ssjphdqqxwsxhc65qoqg1dj1kjbxb] 
> covering my observations running the ant _*dependency-check*_ target. It 
> fails unpredictably in both GitHub actions and our trusty Jenkins builds on 
> ci-builds.apache.org.
> I propose to simply remove this target (and associated configuration) in a 
> bid to clean up some flaky legacy build code.





[jira] [Resolved] (NUTCH-3024) Remove flaky 'dependency check' target

2023-11-24 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3024.
-
Resolution: Fixed

> Remove flaky 'dependency check' target
> --
>
> Key: NUTCH-3024
> URL: https://issues.apache.org/jira/browse/NUTCH-3024
> Project: Nutch
>  Issue Type: Task
>  Components: build
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> I [started a 
> thread|https://lists.apache.org/thread/ol3ssjphdqqxwsxhc65qoqg1dj1kjbxb] 
> covering my observations running the ant _*dependency-check*_ target. It 
> fails unpredictably in both GitHub actions and our trusty Jenkins builds on 
> ci-builds.apache.org.
> I propose to simply remove this target (and associated configuration) in a 
> bid to clean up some flaky legacy build code.





[jira] [Closed] (NUTCH-3007) Fix impossible casts

2023-11-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3007.
---

> Fix impossible casts
> 
>
> Key: NUTCH-3007
> URL: https://issues.apache.org/jira/browse/NUTCH-3007
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Spotbugs reports two occurrences of
>   Impossible cast from java.util.ArrayList to String[] in 
> org.apache.nutch.fetcher.Fetcher.run(Map, String)
> Both were introduced later into the {{run(Map args, String 
> crawlId)}} method and obviously never used (would throw a 
> ClassCastException). The code blocks should be removed.
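
For illustration only, here is a minimal sketch of the kind of cast Spotbugs flags. The class and method names are hypothetical and are not taken from Fetcher; the conversion method is shown purely as a contrast, since the issue itself proposes removing the affected blocks rather than converting them.

```java
import java.util.ArrayList;
import java.util.List;

final class ImpossibleCastSketch {

  // Hypothetical stand-in for a value read from an untyped argument map.
  static Object args() {
    List<String> segments = new ArrayList<>();
    segments.add("segment-a");
    segments.add("segment-b");
    return segments;
  }

  // The flagged pattern: an ArrayList is never a String[], so this cast
  // compiles but always throws ClassCastException at runtime.
  static String[] brokenCast(Object value) {
    return (String[]) value;
  }

  // A working alternative: copy the list elements into a new array.
  static String[] toArray(Object value) {
    List<?> list = (List<?>) value;
    String[] result = new String[list.size()];
    for (int i = 0; i < list.size(); i++) {
      result[i] = String.valueOf(list.get(i));
    }
    return result;
  }
}
```

Because the cast compiles cleanly, only static analysis or actually executing the code path reveals the problem, which is consistent with the observation that the blocks were never used.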





[jira] [Closed] (NUTCH-2846) Fix various bugs spotted by NUTCH-2815

2023-11-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2846.
---

> Fix various bugs spotted by NUTCH-2815
> --
>
> Key: NUTCH-2846
> URL: https://issues.apache.org/jira/browse/NUTCH-2846
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.19
>
>
> This issue addresses various bugs spotted by Spotbugs (NUTCH-2815):
> - use static method Integer.parseInt(...)
> - use integer arithmetic instead of floating point with rounding floats 
> afterwards
> - erroneous declaration of constructor in BasicURLNormalizer
> - fix bracketing when calculating hash code of CrawlDatum
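
The following is a rough sketch of three of the bug classes listed above, written against hypothetical helper methods rather than the actual Nutch code. The exact expressions that were fixed in CrawlDatum and elsewhere are not shown in this issue, so the bodies below are assumptions chosen only to demonstrate each pattern.

```java
final class SpotbugsFixSketch {

  // Parse directly to a primitive with the static method instead of
  // creating a boxed Integer first and unboxing it.
  static int parse(String value) {
    return Integer.parseInt(value.trim());
  }

  // Integer ceiling division; avoids (int) Math.ceil((double) total / size).
  // Assumes total >= 0 and size > 0.
  static int pages(int total, int size) {
    return (total + size - 1) / size;
  }

  // Bracketing matters: '+' binds tighter than '^', so this is a ^ (b + c).
  static int hashUnbracketed(int a, int b, int c) {
    return a ^ b + c;
  }

  // Explicit brackets make the intended grouping unambiguous.
  static int hashBracketed(int a, int b, int c) {
    return (a ^ b) + c;
  }
}
```

With inputs 1, 2 and 3 the two hash methods return different values, which is the sort of silent discrepancy that missing brackets can introduce.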





[jira] [Closed] (NUTCH-2852) Method invokes System.exit(...) 9 bugs

2023-11-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2852.
---

> Method invokes System.exit(...) 9 bugs
> --
>
> Key: NUTCH-2852
> URL: https://issues.apache.org/jira/browse/NUTCH-2852
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.18
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> org.apache.nutch.indexer.IndexingFiltersChecker since first historized release
> In class org.apache.nutch.indexer.IndexingFiltersChecker
> In method org.apache.nutch.indexer.IndexingFiltersChecker.run(String[])
> At IndexingFiltersChecker.java:[line 96]
> Another occurrence at IndexingFiltersChecker.java:[line 129]
> org.apache.nutch.indexer.IndexingFiltersChecker.run(String[]) invokes 
> System.exit(...), which shuts down the entire virtual machine
> Invoking System.exit shuts down the entire Java virtual machine. This should 
> only be done when it is appropriate. Such calls make it hard or impossible 
> for your code to be invoked by other code. Consider throwing a 
> RuntimeException instead.
> Also occurs in
>org.apache.nutch.net.URLFilterChecker since first historized release
>org.apache.nutch.net.URLNormalizerChecker since first historized release
>org.apache.nutch.parse.ParseSegment since first historized release
>org.apache.nutch.parse.ParserChecker since first historized release
>org.apache.nutch.service.NutchServer since first historized release
>org.apache.nutch.tools.CommonCrawlDataDumper since first historized release
>org.apache.nutch.tools.DmozParser since first historized release
>org.apache.nutch.util.AbstractChecker since first historized release 
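
One common way to address this warning is sketched below, under the assumption that a tool exposes a run method returning a status code. The class name and usage message are hypothetical and this is not the patch applied to the Nutch classes listed above.

```java
final class ExitCodeSketch {

  // Hypothetical checker entry point: failure is reported through the
  // return value (or an exception) instead of calling System.exit(...) here,
  // so other code can invoke run(...) without losing the JVM.
  static int run(String[] args) {
    if (args.length == 0) {
      System.err.println("Usage: ExitCodeSketch <url>");
      return -1;
    }
    if (args[0].isEmpty()) {
      throw new IllegalArgumentException("URL must not be empty");
    }
    return 0;
  }

  // Only the outermost caller decides whether to terminate the JVM.
  public static void main(String[] args) {
    System.exit(run(args));
  }
}
```

Keeping the single System.exit call in main means unit tests and embedding applications can call run directly and inspect the result.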





[jira] [Closed] (NUTCH-2819) Move spotbugs "installation" directory to avoid that spotbugs is shipped in Nutch runtime

2023-11-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2819.
---

> Move spotbugs "installation" directory to avoid that spotbugs is shipped in 
> Nutch runtime
> -
>
> Key: NUTCH-2819
> URL: https://issues.apache.org/jira/browse/NUTCH-2819
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Assignee: Shashanka Balakuntala Srinivasa
>Priority: Minor
> Fix For: 1.19
>
>
> With NUTCH-2816 the Spotbugs tool is "installed" in lib/. However, files in 
> lib/ are copied to build/ and runtime/. To prevent the Spotbugs jars from 
> being shipped in the runtime and eventually in releases, the Spotbugs 
> installation folder should be moved to a different directory.





[jira] [Closed] (NUTCH-2851) Random object created and used only once

2023-11-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2851.
---

> Random object created and used only once
> 
>
> Key: NUTCH-2851
> URL: https://issues.apache.org/jira/browse/NUTCH-2851
> Project: Nutch
>  Issue Type: Sub-task
>  Components: dmoz, generator, indexer, segment
>Affects Versions: 1.18
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.19
>
>
> In class org.apache.nutch.crawl.Generator
> In method org.apache.nutch.crawl.Generator.partitionSegment(Path, Path, int)
> Called method java.util.Random.nextInt()
> At Generator.java:[line 1016]
> Random object created and used only once in 
> org.apache.nutch.crawl.Generator.partitionSegment(Path, Path, int)
> This code creates a java.util.Random object, uses it to generate one random 
> number, and then discards the Random object. This produces mediocre quality 
> random numbers and is inefficient. If possible, rewrite the code so that the 
> Random object is created once and saved, and each time a new random number is 
> required invoke a method on the existing Random object to obtain it.
> If it is important that the generated Random numbers not be guessable, you 
> must not create a new Random for each random number; the values are too 
> easily guessable. You should strongly consider using a 
> java.security.SecureRandom instead (and avoid allocating a new SecureRandom 
> for each random number needed).
> This bad practice also affects the following
> org.apache.nutch.indexer.IndexingJob since first historized release
> org.apache.nutch.segment.SegmentReader since first historized release
> org.apache.nutch.tools.DmozParser$RDFProcessor since first historized release 
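
The pattern described by Spotbugs can be sketched as follows. The class is hypothetical and only contrasts the two approaches; it does not reproduce the code in Generator or the other affected classes.

```java
import java.util.Random;

final class SharedRandomSketch {

  // One generator created once and reused for every draw.
  private static final Random RANDOM = new Random();

  // Flagged pattern: a new Random per call, used once, then discarded.
  static int wasteful(int bound) {
    return new Random().nextInt(bound);
  }

  // Preferred pattern: draw from the shared instance.
  static int reused(int bound) {
    return RANDOM.nextInt(bound);
  }
}
```

Where unpredictability matters, the same structure applies with a single shared java.security.SecureRandom in place of java.util.Random, as the Spotbugs text above suggests.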





[jira] [Closed] (NUTCH-2850) Method ignores exceptional return value

2023-11-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2850.
---

> Method ignores exceptional return value
> ---
>
> Key: NUTCH-2850
> URL: https://issues.apache.org/jira/browse/NUTCH-2850
> Project: Nutch
>  Issue Type: Sub-task
>  Components: dumpers
>Affects Versions: 1.18
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.19
>
>
> In class org.apache.nutch.tools.FileDumper
> In method org.apache.nutch.tools.FileDumper.dump(File, File, String[], 
> boolean, boolean, boolean)
> Called method java.io.File.mkdirs()
> At FileDumper.java:[line 237]
> Exceptional return value of java.io.File.mkdirs() ignored in 
> org.apache.nutch.tools.FileDumper.dump(File, File, String[], boolean, 
> boolean, boolean)
> This method returns a value that is not checked. The return value should be 
> checked since it can indicate an unusual or unexpected function execution. 
> For example, the File.delete() method returns false if the file could not be 
> successfully deleted (rather than throwing an Exception). If you don't check 
> the result, you won't notice if the method invocation signals unexpected 
> behavior by returning an atypical return value. 
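
A minimal sketch of checking the return value of java.io.File.mkdirs() is shown below. The helper class is hypothetical and is not the change made to FileDumper; it only illustrates one way to avoid ignoring the result.

```java
import java.io.File;
import java.io.IOException;

final class MkdirsSketch {

  // Create the directory tree and fail loudly if that is not possible.
  // mkdirs() also returns false when the directory already exists, so
  // existence is checked before treating false as an error.
  static void ensureDirectory(File dir) throws IOException {
    if (!dir.mkdirs() && !dir.isDirectory()) {
      throw new IOException("Could not create directory: " + dir);
    }
  }
}
```

Calling ensureDirectory twice on the same path succeeds both times, while a path nested under a regular file raises an IOException instead of failing silently.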





[jira] [Created] (NUTCH-3024) Remove flaky 'dependency check' target

2023-11-03 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3024:
---

 Summary: Remove flaky 'dependency check' target
 Key: NUTCH-3024
 URL: https://issues.apache.org/jira/browse/NUTCH-3024
 Project: Nutch
  Issue Type: Task
  Components: build
Affects Versions: 1.19
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


I [started a 
thread|https://lists.apache.org/thread/ol3ssjphdqqxwsxhc65qoqg1dj1kjbxb] 
covering my observations running the ant _*dependency-check*_ target. It fails 
unpredictably in both GitHub actions and our trusty Jenkins builds on 
ci-builds.apache.org.

I propose to simply remove this target (and associated configuration) in a bid 
to clean up some flaky legacy build code.





[jira] [Created] (NUTCH-3023) Use mikepenz/action-junit-report to improve interpretation of failed tests during CI

2023-11-02 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3023:
---

 Summary: Use mikepenz/action-junit-report to improve 
interpretation of failed tests during CI
 Key: NUTCH-3023
 URL: https://issues.apache.org/jira/browse/NUTCH-3023
 Project: Nutch
  Issue Type: Task
  Components: build, test
Affects Versions: 1.19
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


The following GitHub action could help improve the interpretation of unit test 
anomalies during a CI run.

[https://github.com/mikepenz/action-junit-report]

Rather than having to grep through the GitHub Action log, one could save time 
by interpreting the comments posted to the PR conversation thread.





[jira] [Closed] (NUTCH-3014) Standardize Job names

2023-11-02 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3014.
---

Thanks [~snagel] for the review

> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, runtime
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name:
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _{*}Nutch ${ClassName}{*}: *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank: Inverter_
>  * _Nutch CrawlDb: + $crawldb_
>  * _Nutch LinkDbReader: + $linkdb_
> Thanks for any suggestions/comments.
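
The proposed convention could be captured in a small helper along these lines. This is only a sketch: the class JobNameSketch and its method are hypothetical names, not part of NutchJob, and the actual implementation may differ.

```java
final class JobNameSketch {

  // Static prefix which marks the job as a Nutch job.
  private static final String PREFIX = "Nutch";

  // Builds "Nutch <ClassName>" or "Nutch <ClassName>: <additional info>".
  static String jobName(Class<?> jobClass, String additionalInfo) {
    String name = PREFIX + " " + jobClass.getSimpleName();
    if (additionalInfo == null || additionalInfo.trim().isEmpty()) {
      return name;
    }
    return name + ": " + additionalInfo.trim();
  }
}
```

Under this sketch, a call such as job.setJobName(JobNameSketch.jobName(LinkRank.class, "Inverter")) would produce the name "Nutch LinkRank: Inverter" from the examples above.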





[jira] [Resolved] (NUTCH-3014) Standardize Job names

2023-11-02 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3014.
-
Resolution: Fixed

> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, runtime
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name:
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _{*}Nutch ${ClassName}{*}: *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank: Inverter_
>  * _Nutch CrawlDb: + $crawldb_
>  * _Nutch LinkDbReader: + $linkdb_
> Thanks for any suggestions/comments.





[jira] [Created] (NUTCH-3022) Experiment formatting codebase per google-java-format

2023-11-02 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3022:
---

 Summary: Experiment formatting codebase per google-java-format
 Key: NUTCH-3022
 URL: https://issues.apache.org/jira/browse/NUTCH-3022
 Project: Nutch
  Issue Type: Task
  Components: build
Affects Versions: 1.19
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


I [started a mailing list 
thread|https://lists.apache.org/thread/ssmm6djyk5syvhmq701zjf0d9bobpk5n] which 
quizzed whether we should integrate code linting/formatting into the CI.

Seb provided some excellent, calculated input which inspired me to create this 
ticket.

I will create a PR which lints the Nutch codebase per the *google-java-format* 
and discuss the results.





[jira] [Work stopped] (NUTCH-3014) Standardize Job names

2023-11-02 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3014 stopped by Lewis John McGibbney.
---
> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, runtime
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name:
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _{*}Nutch ${ClassName}{*}: *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank: Inverter_
>  * _Nutch CrawlDb: + $crawldb_
>  * _Nutch LinkDbReader: + $linkdb_
> Thanks for any suggestions/comments.





[jira] [Work stopped] (NUTCH-3015) Add more CI steps to GitHub master-build.yml

2023-10-27 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3015 stopped by Lewis John McGibbney.
---
> Add more CI steps to GitHub master-build.yml
> 
>
> Key: NUTCH-3015
> URL: https://issues.apache.org/jira/browse/NUTCH-3015
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> With specific reference to the GitHub master-build.yml, we currently run 
> _*ant clean nightly javadoc -buildfile build.xml*_ as one mammoth task and if 
> something fails it is unclear as to exactly what.
>  
> There are several improvements I want to propose to the GitHub CI
>  * run workflows against multiple environments/OSes e.g. ubuntu, macos & 
> windows
>  * define multiple jobs which can run in parallel to speed up CI e.g. javadoc 
> and nightly targets
>  * run more targets e.g. linting, rat-sources, report-vulnerabilities, 
> report-licenses, etc.





[jira] [Closed] (NUTCH-3015) Add more CI steps to GitHub master-build.yml

2023-10-27 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3015.
---

> Add more CI steps to GitHub master-build.yml
> 
>
> Key: NUTCH-3015
> URL: https://issues.apache.org/jira/browse/NUTCH-3015
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> With specific reference to the GitHub master-build.yml, we currently run 
> _*ant clean nightly javadoc -buildfile build.xml*_ as one mammoth task and if 
> something fails it is unclear as to exactly what.
>  
> There are several improvements I want to propose to the GitHub CI
>  * run workflows against multiple environments/OSes e.g. ubuntu, macos & 
> windows
>  * define multiple jobs which can run in parallel to speed up CI e.g. javadoc 
> and nightly targets
>  * run more targets e.g. linting, rat-sources, report-vulnerabilities, 
> report-licenses, etc.





[jira] [Resolved] (NUTCH-3015) Add more CI steps to GitHub master-build.yml

2023-10-27 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3015.
-
Resolution: Fixed

> Add more CI steps to GitHub master-build.yml
> 
>
> Key: NUTCH-3015
> URL: https://issues.apache.org/jira/browse/NUTCH-3015
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> With specific reference to the GitHub master-build.yml, we currently run 
> _*ant clean nightly javadoc -buildfile build.xml*_ as one mammoth task and if 
> something fails it is unclear as to exactly what.
>  
> There are several improvements I want to propose to the GitHub CI
>  * run workflows against multiple environments/OSes e.g. ubuntu, macos & 
> windows
>  * define multiple jobs which can run in parallel to speed up CI e.g. javadoc 
> and nightly targets
>  * run more targets e.g. linting, rat-sources, report-vulnerabilities, 
> report-licenses, etc.





[jira] [Work started] (NUTCH-2887) Migrate to JUnit 5 Jupiter

2023-10-24 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2887 started by Lewis John McGibbney.
---
> Migrate to JUnit 5 Jupiter
> --
>
> Key: NUTCH-2887
> URL: https://issues.apache.org/jira/browse/NUTCH-2887
> Project: Nutch
>  Issue Type: Improvement
>  Components: test
> Environment: Migrate 
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> This effort is a bit of a beast. See the [JUnit migration 
> tips|https://junit.org/junit5/docs/current/user-guide/#migrating-from-junit4-tips]
>  for general guidance. A general grep for junit in src produces the following
> {code:bash}
> ./test/nutch-site.xml
> ./test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java
> ./test/org/apache/nutch/net/TestURLNormalizers.java
> ./test/org/apache/nutch/net/protocols/TestHttpDateFormat.java
> ./test/org/apache/nutch/net/TestURLFilters.java
> ./test/org/apache/nutch/util/TestStringUtil.java
> ./test/org/apache/nutch/util/TestSuffixStringMatcher.java
> ./test/org/apache/nutch/util/TestEncodingDetector.java
> ./test/org/apache/nutch/util/TestMimeUtil.java
> ./test/org/apache/nutch/util/TestPrefixStringMatcher.java
> ./test/org/apache/nutch/util/DumpFileUtilTest.java
> ./test/org/apache/nutch/util/TestNodeWalker.java
> ./test/org/apache/nutch/util/WritableTestUtils.java
> ./test/org/apache/nutch/util/TestTableUtil.java
> ./test/org/apache/nutch/util/TestURLUtil.java
> ./test/org/apache/nutch/util/TestGZIPUtils.java
> ./test/org/apache/nutch/parse/TestParseText.java
> ./test/org/apache/nutch/parse/TestOutlinks.java
> ./test/org/apache/nutch/parse/TestParseData.java
> ./test/org/apache/nutch/parse/TestOutlinkExtractor.java
> ./test/org/apache/nutch/parse/TestParserFactory.java
> ./test/org/apache/nutch/segment/TestSegmentMerger.java
> ./test/org/apache/nutch/segment/TestSegmentMergerCrawlDatums.java
> ./test/org/apache/nutch/plugin/TestPluginSystem.java
> ./test/org/apache/nutch/fetcher/TestFetcher.java
> ./test/org/apache/nutch/protocol/TestProtocolFactory.java
> ./test/org/apache/nutch/protocol/TestContent.java
> ./test/org/apache/nutch/protocol/AbstractHttpProtocolPluginTest.java
> ./test/org/apache/nutch/crawl/TestCrawlDbFilter.java
> ./test/org/apache/nutch/crawl/TestTextProfileSignature.java
> ./test/org/apache/nutch/crawl/TestCrawlDbStates.java
> ./test/org/apache/nutch/crawl/TestGenerator.java
> ./test/org/apache/nutch/crawl/TestAdaptiveFetchSchedule.java
> ./test/org/apache/nutch/crawl/TODOTestCrawlDbStates.java
> ./test/org/apache/nutch/crawl/TestSignatureFactory.java
> ./test/org/apache/nutch/crawl/ContinuousCrawlTestUtil.java
> ./test/org/apache/nutch/crawl/TestInjector.java
> ./test/org/apache/nutch/crawl/TestLinkDbMerger.java
> ./test/org/apache/nutch/crawl/TestCrawlDbMerger.java
> ./test/org/apache/nutch/service/TestNutchServer.java
> ./test/org/apache/nutch/metadata/TestMetadata.java
> ./test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java
> ./test/org/apache/nutch/indexer/TestIndexingFilters.java
> ./test/org/apache/nutch/indexer/TestIndexerMapReduce.java
> ./bin/nutch
> ./plugin/scoring-orphan/src/test/org/apache/nutch/scoring/orphan/TestOrphanScoringFilter.java
> ./plugin/index-basic/src/test/org/apache/nutch/indexer/basic/TestBasicIndexingFilter.java
> ./plugin/urlfilter-domaindenylist/build.xml
> ./plugin/urlfilter-domaindenylist/src/test/org/apache/nutch/urlfilter/domaindenylist/TestDomainDenylistURLFilter.java
> ./plugin/protocol-imaps/plugin.xml
> ./plugin/protocol-imaps/ivy.xml
> ./plugin/protocol-imaps/lib/junit-4.13.jar
> ./plugin/protocol-imaps/lib/greenmail-junit4-1.6.0.jar
> ./plugin/protocol-imaps/lib/greenmail-1.6.0.jar
> ./plugin/protocol-imaps/src/test/org/apache/nutch/protocol/imaps/TestImaps.java
> ./plugin/protocol-file/build.xml
> ./plugin/protocol-file/src/test/org/apache/nutch/protocol/file/TestProtocolFile.java
> ./plugin/urlnormalizer-regex/build.xml
> ./plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java
> ./plugin/build-plugin.xml
> ./plugin/creativecommons/src/test/org/creativecommons/nutch/TestCCParseFilter.java
> ./plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic/TestBasicURLNormalizer.java
> ./plugin/urlnormalizer-protocol/build.xml
> ./plugin/urlnormalizer-protocol/src/test/org/apache/nutch/net/urlnormalizer/protocol/TestProtocolURLNormalizer.java
> ./plugin/urlfilter-prefix/src/test/org/apache/nutch/urlfilter/prefix/TestPrefixURLFilter.java
> ./plugin/urlfilter-suffix/src/test/org/apache/nutch/urlfilter/suffix/TestSuffixURLFilter.java
> ./plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
> 

[jira] [Created] (NUTCH-3016) Upgrade Apache Ivy to 2.5.2

2023-10-24 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3016:
---

 Summary: Upgrade Apache Ivy to 2.5.2
 Key: NUTCH-3016
 URL: https://issues.apache.org/jira/browse/NUTCH-3016
 Project: Nutch
  Issue Type: Task
  Components: ivy, build
Affects Versions: 1.19
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


[Apache Ivy v2.5.2|https://ant.apache.org/ivy/history/2.5.2/release-notes.html] 
was released on August 20 2023!

We should upgrade.





[jira] [Assigned] (NUTCH-2887) Migrate to JUnit 5 Jupiter

2023-10-23 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-2887:
---

Assignee: Lewis John McGibbney

> Migrate to JUnit 5 Jupiter
> --
>
> Key: NUTCH-2887
> URL: https://issues.apache.org/jira/browse/NUTCH-2887
> Project: Nutch
>  Issue Type: Improvement
>  Components: test
> Environment: Migrate 
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> This effort is a bit of a beast. See the [JUnit migration 
> tips|https://junit.org/junit5/docs/current/user-guide/#migrating-from-junit4-tips]
>  for general guidance. A general grep for junit in src produces the following
> {code:bash}
> ./test/nutch-site.xml
> ./test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java
> ./test/org/apache/nutch/net/TestURLNormalizers.java
> ./test/org/apache/nutch/net/protocols/TestHttpDateFormat.java
> ./test/org/apache/nutch/net/TestURLFilters.java
> ./test/org/apache/nutch/util/TestStringUtil.java
> ./test/org/apache/nutch/util/TestSuffixStringMatcher.java
> ./test/org/apache/nutch/util/TestEncodingDetector.java
> ./test/org/apache/nutch/util/TestMimeUtil.java
> ./test/org/apache/nutch/util/TestPrefixStringMatcher.java
> ./test/org/apache/nutch/util/DumpFileUtilTest.java
> ./test/org/apache/nutch/util/TestNodeWalker.java
> ./test/org/apache/nutch/util/WritableTestUtils.java
> ./test/org/apache/nutch/util/TestTableUtil.java
> ./test/org/apache/nutch/util/TestURLUtil.java
> ./test/org/apache/nutch/util/TestGZIPUtils.java
> ./test/org/apache/nutch/parse/TestParseText.java
> ./test/org/apache/nutch/parse/TestOutlinks.java
> ./test/org/apache/nutch/parse/TestParseData.java
> ./test/org/apache/nutch/parse/TestOutlinkExtractor.java
> ./test/org/apache/nutch/parse/TestParserFactory.java
> ./test/org/apache/nutch/segment/TestSegmentMerger.java
> ./test/org/apache/nutch/segment/TestSegmentMergerCrawlDatums.java
> ./test/org/apache/nutch/plugin/TestPluginSystem.java
> ./test/org/apache/nutch/fetcher/TestFetcher.java
> ./test/org/apache/nutch/protocol/TestProtocolFactory.java
> ./test/org/apache/nutch/protocol/TestContent.java
> ./test/org/apache/nutch/protocol/AbstractHttpProtocolPluginTest.java
> ./test/org/apache/nutch/crawl/TestCrawlDbFilter.java
> ./test/org/apache/nutch/crawl/TestTextProfileSignature.java
> ./test/org/apache/nutch/crawl/TestCrawlDbStates.java
> ./test/org/apache/nutch/crawl/TestGenerator.java
> ./test/org/apache/nutch/crawl/TestAdaptiveFetchSchedule.java
> ./test/org/apache/nutch/crawl/TODOTestCrawlDbStates.java
> ./test/org/apache/nutch/crawl/TestSignatureFactory.java
> ./test/org/apache/nutch/crawl/ContinuousCrawlTestUtil.java
> ./test/org/apache/nutch/crawl/TestInjector.java
> ./test/org/apache/nutch/crawl/TestLinkDbMerger.java
> ./test/org/apache/nutch/crawl/TestCrawlDbMerger.java
> ./test/org/apache/nutch/service/TestNutchServer.java
> ./test/org/apache/nutch/metadata/TestMetadata.java
> ./test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java
> ./test/org/apache/nutch/indexer/TestIndexingFilters.java
> ./test/org/apache/nutch/indexer/TestIndexerMapReduce.java
> ./bin/nutch
> ./plugin/scoring-orphan/src/test/org/apache/nutch/scoring/orphan/TestOrphanScoringFilter.java
> ./plugin/index-basic/src/test/org/apache/nutch/indexer/basic/TestBasicIndexingFilter.java
> ./plugin/urlfilter-domaindenylist/build.xml
> ./plugin/urlfilter-domaindenylist/src/test/org/apache/nutch/urlfilter/domaindenylist/TestDomainDenylistURLFilter.java
> ./plugin/protocol-imaps/plugin.xml
> ./plugin/protocol-imaps/ivy.xml
> ./plugin/protocol-imaps/lib/junit-4.13.jar
> ./plugin/protocol-imaps/lib/greenmail-junit4-1.6.0.jar
> ./plugin/protocol-imaps/lib/greenmail-1.6.0.jar
> ./plugin/protocol-imaps/src/test/org/apache/nutch/protocol/imaps/TestImaps.java
> ./plugin/protocol-file/build.xml
> ./plugin/protocol-file/src/test/org/apache/nutch/protocol/file/TestProtocolFile.java
> ./plugin/urlnormalizer-regex/build.xml
> ./plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java
> ./plugin/build-plugin.xml
> ./plugin/creativecommons/src/test/org/creativecommons/nutch/TestCCParseFilter.java
> ./plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic/TestBasicURLNormalizer.java
> ./plugin/urlnormalizer-protocol/build.xml
> ./plugin/urlnormalizer-protocol/src/test/org/apache/nutch/net/urlnormalizer/protocol/TestProtocolURLNormalizer.java
> ./plugin/urlfilter-prefix/src/test/org/apache/nutch/urlfilter/prefix/TestPrefixURLFilter.java
> ./plugin/urlfilter-suffix/src/test/org/apache/nutch/urlfilter/suffix/TestSuffixURLFilter.java
> ./plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
> 

[jira] [Work started] (NUTCH-3015) Add more CI steps to GitHub master-build.yml

2023-10-23 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3015 started by Lewis John McGibbney.
---
> Add more CI steps to GitHub master-build.yml
> 
>
> Key: NUTCH-3015
> URL: https://issues.apache.org/jira/browse/NUTCH-3015
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> With specific reference to the GitHub master-build.yml, we currently run 
> _*ant clean nightly javadoc -buildfile build.xml*_ as one mammoth task, and if 
> something fails it is unclear exactly what failed.
>  
> There are several improvements I want to propose to the GitHub CI:
>  * run workflows against multiple environments/OSes, e.g. ubuntu, macos & 
> windows
>  * define multiple jobs which can run in parallel to speed up CI e.g. javadoc 
> and nightly targets
>  * run more targets e.g. linting, rat-sources, report-vulnerabilities, 
> report-licenses, etc.





[jira] [Work started] (NUTCH-3014) Standardize Job names

2023-10-23 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3014 started by Lewis John McGibbney.
---
> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, runtime
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name:
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _{*}Nutch ${ClassName}{*}: *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank: Inverter_
>  * _Nutch CrawlDb: + $crawldb_
>  * _Nutch LinkDbReader: + $linkdb_
> Thanks for any suggestions/comments.
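
The convention proposed above could be captured in a small helper so that every job builds its name the same way. The following is only a sketch: the JobNames class, its of(...) methods, and the nested LinkRank stand-in are hypothetical names introduced for illustration and are not part of Nutch.

```java
// Hypothetical helper, NOT part of Nutch: builds job names following the
// proposed "Nutch ${ClassName}: ${additional info}" convention.
final class JobNames {
  // Static prefix that marks the job as a Nutch job.
  private static final String PREFIX = "Nutch";

  private JobNames() {}

  /** Returns "Nutch <ClassName>" when there is no additional info. */
  static String of(Class<?> jobClass) {
    return PREFIX + " " + jobClass.getSimpleName();
  }

  /** Returns "Nutch <ClassName>: <additional info>", or the short form if the info is empty. */
  static String of(Class<?> jobClass, String additionalInfo) {
    if (additionalInfo == null || additionalInfo.trim().isEmpty()) {
      return of(jobClass);
    }
    return of(jobClass) + ": " + additionalInfo.trim();
  }

  // Stand-in for a real job class, used only to keep this example self-contained.
  static final class LinkRank {}

  public static void main(String[] args) {
    System.out.println(of(LinkRank.class, "Inverter")); // Nutch LinkRank: Inverter
    System.out.println(of(LinkRank.class));             // Nutch LinkRank
  }
}
```

Assuming such a helper existed, a call site like the one quoted above might become job.setJobName(JobNames.of(LinkRank.class, "Inverter")), which would keep the prefix and separator consistent across jobs.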





[jira] [Created] (NUTCH-3015) Add more CI steps to GitHub master-build.yml

2023-10-22 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3015:
---

 Summary: Add more CI steps to GitHub master-build.yml
 Key: NUTCH-3015
 URL: https://issues.apache.org/jira/browse/NUTCH-3015
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.19
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


With specific reference to the GitHub master-build.yml, we currently run 
_*ant clean nightly javadoc -buildfile build.xml*_ as one mammoth task, and if 
something fails it is unclear exactly what failed.

 

There are several improvements I want to propose to the GitHub CI:
 * run workflows against multiple environments/OSes, e.g. ubuntu, macos & 
windows
 * define multiple jobs which can run in parallel to speed up CI e.g. javadoc 
and nightly targets
 * run more targets e.g. linting, rat-sources, report-vulnerabilities, 
report-licenses, etc.





[jira] [Updated] (NUTCH-3014) Standardize Job names

2023-10-22 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3014:

Description: 
There is a large degree of variability when we set the job name:

 

{{Job job = NutchJob.getInstance(getConf());}}

{{job.setJobName("read " + segment);}}

 

Some examples mention the job name, others don't. Some use upper case, others 
don't, etc.

I think we can standardize the NutchJob job names. This would help when 
filtering jobs in YARN ResourceManager UI as well.

I propose we implement the following convention
 * *Nutch* (mandatory) - static value which prepends the job name, assists with 
distinguishing the Job as a NutchJob and making it easily findable.
 * *${ClassName}* (mandatory) - literally the name of the Class the job is 
encoded in
 * *${additional info}* (optional) - value could further distinguish the type 
of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)

_{*}Nutch ${ClassName}{*}: *${additional info}*_

_Examples:_
 * _Nutch LinkRank: Inverter_
 * _Nutch CrawlDb: + $crawldb_
 * _Nutch LinkDbReader: + $linkdb_

Thanks for any suggestions/comments.

  was:
There is a large degree of variability when we set the job name:

 

{{Job job = NutchJob.getInstance(getConf());}}

{{job.setJobName("read " + segment);}}

 

Some examples mention the job name, others don't. Some use upper case, others 
don't, etc.

I think we can standardize the NutchJob job names. This would help when 
filtering jobs in YARN ResourceManager UI as well.

I propose we implement the following convention
 * *Nutch* (mandatory) - static value which prepends the job name, assists with 
distinguishing the Job as a NutchJob and making it easily findable.
 * *${ClassName}* (mandatory) - literally the name of the Class the job is 
encoded in
 * *${additional info}* (optional) - value could further distinguish the type 
of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)

_{*}Nutch ${ClassName}{*}: *${additional info}*_

_Examples:_
 * _Nutch LinkRank Inverter_
 * _Nutch CrawlDb + $crawldb_
 * _Nutch LinkDbReader + $linkdb_

Thanks for any suggestions/comments.


> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, runtime
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name:
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _{*}Nutch ${ClassName}{*}: *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank: Inverter_
>  * _Nutch CrawlDb: + $crawldb_
>  * _Nutch LinkDbReader: + $linkdb_
> Thanks for any suggestions/comments.





[jira] [Updated] (NUTCH-3014) Standardize Job names

2023-10-22 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3014:

Description: 
There is a large degree of variability when we set the job name:

 

{{Job job = NutchJob.getInstance(getConf());}}

{{job.setJobName("read " + segment);}}

 

Some examples mention the job name, others don't. Some use upper case, others 
don't, etc.

I think we can standardize the NutchJob job names. This would help when 
filtering jobs in YARN ResourceManager UI as well.

I propose we implement the following convention
 * *Nutch* (mandatory) - static value which prepends the job name, assists with 
distinguishing the Job as a NutchJob and making it easily findable.
 * *${ClassName}* (mandatory) - literally the name of the Class the job is 
encoded in
 * *${additional info}* (optional) - value could further distinguish the type 
of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)

_{*}Nutch ${ClassName}{*}: *${additional info}*_

_Examples:_
 * _Nutch LinkRank Inverter_
 * _Nutch CrawlDb + $crawldb_
 * _Nutch LinkDbReader + $linkdb_

Thanks for any suggestions/comments.

  was:
There is a large degree of variability when we set the job name:

 

{{Job job = NutchJob.getInstance(getConf());}}

{{job.setJobName("read " + segment);}}

 

Some examples mention the job name, others don't. Some use upper case, others 
don't, etc.

I think we can standardize the NutchJob job names. This would help when 
filtering jobs in YARN ResourceManager UI as well.

I propose we implement the following convention
 * *Nutch* (mandatory) - static value which prepends the job name, assists with 
distinguishing the Job as a NutchJob and making it easily findable.
 * *${ClassName}* (mandatory) - literally the name of the Class the job is 
encoded in
 * *${additional info}* (optional) - value could further distinguish the type 
of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)

_*Nutch ${ClassName}* *${additional info}*_

_Examples:_
 * _Nutch LinkRank Inverter_
 * _Nutch CrawlDb + $crawldb_
 * _Nutch LinkDbReader + $linkdb_

Thanks for any suggestions/comments.


> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, runtime
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name:
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _{*}Nutch ${ClassName}{*}: *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank Inverter_
>  * _Nutch CrawlDb + $crawldb_
>  * _Nutch LinkDbReader + $linkdb_
> Thanks for any suggestions/comments.





[jira] [Updated] (NUTCH-3014) Standardize Job names

2023-10-22 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3014:

Summary: Standardize Job names  (was: Standardize NutchJob job names)

> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, runtime
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name:
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _*Nutch ${ClassName}* *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank Inverter_
>  * _Nutch CrawlDb + $crawldb_
>  * _Nutch LinkDbReader + $linkdb_
> Thanks for any suggestions/comments.





[jira] [Resolved] (NUTCH-3013) Employ commons-lang3's StopWatch to simplify timing logic

2023-10-21 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3013.
-
Resolution: Fixed

Thanks for the review [~snagel] 

> Employ commons-lang3's StopWatch to simplify timing logic
> -
>
> Key: NUTCH-3013
> URL: https://issues.apache.org/jira/browse/NUTCH-3013
> Project: Nutch
>  Issue Type: Improvement
>  Components: logging, runtime, util
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: timing
> Fix For: 1.20
>
>
> I ended up running some experiments integrating Nutch and [Celeborn 
> (Incubating)|https://celeborn.apache.org/] and it got me thinking about 
> runtime timings. After some investigation I came across [commons-lang3's 
> StopWatch 
> Class|https://commons.apache.org/proper/commons-lang/javadocs/api-release/index.html?org/apache/commons/lang3/time/StopWatch.html]
>  which provides a convenient API for timings.
> Seeing as we already declare the commons-lang3 dependency, I think StopWatch 
> could help us clean up some timing logic in Nutch. Specifically, it would 
> reduce redundancy in terms of duplicated code and logic. It would also open 
> the door to introduce timing _*splits*_ if anyone is so inclined to dig 
> deeper into runtime timings.
> A cursory search for *_"long start = System.currentTimeMillis();"_* returns 
> hits for 32 files so it's fair to say that timing already affects lots of 
> aspects of the Nutch execution workflow.
>  





[jira] [Closed] (NUTCH-3013) Employ commons-lang3's StopWatch to simplify timing logic

2023-10-21 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3013.
---

> Employ commons-lang3's StopWatch to simplify timing logic
> -
>
> Key: NUTCH-3013
> URL: https://issues.apache.org/jira/browse/NUTCH-3013
> Project: Nutch
>  Issue Type: Improvement
>  Components: logging, runtime, util
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: timing
> Fix For: 1.20
>
>
> I ended up running some experiments integrating Nutch and [Celeborn 
> (Incubating)|https://celeborn.apache.org/] and it got me thinking about 
> runtime timings. After some investigation I came across [commons-lang3's 
> StopWatch 
> Class|https://commons.apache.org/proper/commons-lang/javadocs/api-release/index.html?org/apache/commons/lang3/time/StopWatch.html]
>  which provides a convenient API for timings.
> Seeing as we already declare the commons-lang3 dependency, I think StopWatch 
> could help us clean up some timing logic in Nutch. Specifically, it would 
> reduce redundancy in terms of duplicated code and logic. It would also open 
> the door to introduce timing _*splits*_ if anyone is so inclined to dig 
> deeper into runtime timings.
> A cursory search for *_"long start = System.currentTimeMillis();"_* returns 
> hits for 32 files so it's fair to say that timing already affects lots of 
> aspects of the Nutch execution workflow.
>  





[jira] [Created] (NUTCH-3014) Standardize NutchJob job names

2023-10-21 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3014:
---

 Summary: Standardize NutchJob job names
 Key: NUTCH-3014
 URL: https://issues.apache.org/jira/browse/NUTCH-3014
 Project: Nutch
  Issue Type: Improvement
  Components: configuration, runtime
Affects Versions: 1.19
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


There is a large degree of variability when we set the job name:

 

{{Job job = NutchJob.getInstance(getConf());}}

{{job.setJobName("read " + segment);}}

 

Some examples mention the job name, others don't. Some use upper case, others 
don't, etc.

I think we can standardize the NutchJob job names. This would help when 
filtering jobs in YARN ResourceManager UI as well.

I propose we implement the following convention
 * *Nutch* (mandatory) - static value which prepends the job name, assists with 
distinguishing the Job as a NutchJob and making it easily findable.
 * *${ClassName}* (mandatory) - literally the name of the Class the job is 
encoded in
 * *${additional info}* (optional) - value could further distinguish the type 
of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)

_*Nutch ${ClassName}* *${additional info}*_

_Examples:_
 * _Nutch LinkRank Inverter_
 * _Nutch CrawlDb + $crawldb_
 * _Nutch LinkDbReader + $linkdb_

Thanks for any suggestions/comments.





[jira] [Work started] (NUTCH-3013) Employ commons-lang3's StopWatch to simplify timing logic

2023-10-20 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3013 started by Lewis John McGibbney.
---
> Employ commons-lang3's StopWatch to simplify timing logic
> -
>
> Key: NUTCH-3013
> URL: https://issues.apache.org/jira/browse/NUTCH-3013
> Project: Nutch
>  Issue Type: Improvement
>  Components: logging, runtime, util
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: timing
> Fix For: 1.20
>
>
> I ended up running some experiments integrating Nutch and [Celeborn 
> (Incubating)|https://celeborn.apache.org/] and it got me thinking about 
> runtime timings. After some investigation I came across [commons-lang3's 
> StopWatch 
> Class|https://commons.apache.org/proper/commons-lang/javadocs/api-release/index.html?org/apache/commons/lang3/time/StopWatch.html]
>  which provides a convenient API for timings.
> Seeing as we already declare the commons-lang3 dependency, I think StopWatch 
> could help us clean up some timing logic in Nutch. Specifically, it would 
> reduce redundancy in terms of duplicated code and logic. It would also open 
> the door to introduce timing _*splits*_ if anyone is so inclined to dig 
> deeper into runtime timings.
> A cursory search for *_"long start = System.currentTimeMillis();"_* returns 
> hits for 32 files so it's fair to say that timing already affects lots of 
> aspects of the Nutch execution workflow.
>  





[jira] [Created] (NUTCH-3013) Employ commons-lang3's StopWatch to simplify timing logic

2023-10-20 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3013:
---

 Summary: Employ commons-lang3's StopWatch to simplify timing logic
 Key: NUTCH-3013
 URL: https://issues.apache.org/jira/browse/NUTCH-3013
 Project: Nutch
  Issue Type: Improvement
  Components: logging, runtime, util
Affects Versions: 1.19
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


I ended up running some experiments integrating Nutch and [Celeborn 
(Incubating)|https://celeborn.apache.org/] and it got me thinking about runtime 
timings. After some investigation I came across [commons-lang3's StopWatch 
Class|https://commons.apache.org/proper/commons-lang/javadocs/api-release/index.html?org/apache/commons/lang3/time/StopWatch.html]
 which provides a convenient API for timings.

Seeing as we already declare the commons-lang3 dependency, I think StopWatch 
could help us clean up some timing logic in Nutch. Specifically, it would 
reduce redundancy in terms of duplicated code and logic. It would also open the 
door to introduce timing _*splits*_ if anyone is so inclined to dig deeper into 
runtime timings.

A cursory search for *_"long start = System.currentTimeMillis();"_* returns 
hits for 32 files so it's fair to say that timing already affects lots of 
aspects of the Nutch execution workflow.
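
To illustrate the kind of cleanup described above, here is a minimal sketch that contrasts the manual System.currentTimeMillis() pattern with a stopwatch-style object. To keep the example runnable without extra dependencies, it uses a tiny hypothetical SimpleStopWatch class as a stand-in; it is not commons-lang3's StopWatch, and the method names (createStarted, start, stop, getTime) are chosen here only to suggest the general shape of such an API.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a stopwatch utility. This is NOT commons-lang3's
// StopWatch; it only illustrates how timing logic can be encapsulated.
final class SimpleStopWatch {
  private long startNanos;
  private long stopNanos;
  private boolean running;

  /** Creates a stopwatch that is already running. */
  static SimpleStopWatch createStarted() {
    SimpleStopWatch sw = new SimpleStopWatch();
    sw.start();
    return sw;
  }

  void start() {
    if (running) throw new IllegalStateException("already started");
    startNanos = System.nanoTime();
    running = true;
  }

  void stop() {
    if (!running) throw new IllegalStateException("not running");
    stopNanos = System.nanoTime();
    running = false;
  }

  /** Elapsed milliseconds: up to now while running, otherwise up to stop(). */
  long getTime() {
    long end = running ? System.nanoTime() : stopNanos;
    return TimeUnit.NANOSECONDS.toMillis(end - startNanos);
  }

  public static void main(String[] args) throws InterruptedException {
    // Manual pattern of the kind found by the search mentioned above.
    long start = System.currentTimeMillis();
    Thread.sleep(20); // placeholder for real work
    long elapsedManual = System.currentTimeMillis() - start;

    // Same measurement with the bookkeeping hidden behind one object.
    SimpleStopWatch sw = SimpleStopWatch.createStarted();
    Thread.sleep(20); // placeholder for real work
    sw.stop();

    System.out.println("manual: " + elapsedManual + " ms, stopwatch: " + sw.getTime() + " ms");
  }
}
```

The point of the sketch is only that start/stop bookkeeping and unit conversion live in one place instead of being repeated at every call site; the actual change would use the library class rather than a hand-rolled one.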

 





[jira] [Assigned] (NUTCH-2856) Implement a protocol-smb plugin based on hierynomus/smbj

2023-02-28 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-2856:
---

Assignee: (was: Lewis John McGibbney)

> Implement a protocol-smb plugin based on hierynomus/smbj
> 
>
> Key: NUTCH-2856
> URL: https://issues.apache.org/jira/browse/NUTCH-2856
> Project: Nutch
>  Issue Type: New Feature
>  Components: external, plugin, protocol
>Reporter: Hiran Chaudhuri
>Priority: Major
> Fix For: 1.20
>
>
> The plugin protocol-smb advertised on 
> [https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral] actually 
> refers to the JCIFS library. According to this library's homepage 
> [https://www.jcifs.org/]:
> _If you're looking for the latest and greatest open source Java SMB library, 
> this is not it. JCIFS has been in maintenance-mode-only for several years and 
> although what it does support works fine (SMB1, NTLMv2, midlc, MSRPC and 
> various utility classes), jCIFS does not support the newer SMB2/3 variants of 
> the SMB protocol which is slowly becoming required (Windows 10 requires 
> SMB2/3). JCIFS only supports SMB1 but Microsoft has deprecated SMB1 in their 
> products. *So if SMB1 is disabled on your network, JCIFS' file related 
> operations will NOT work.*_
> Looking at 
> [https://en.wikipedia.org/wiki/Server_Message_Block#SMB_/_CIFS_/_SMB1]:
> _Microsoft added SMB1 to the Windows Server 2012 R2 deprecation list in June 
> 2013. Windows Server 2016 and some versions of Windows 10 Fall Creators 
> Update do not have SMB1 installed by default._
> In conclusion, the chances that the SMB1 protocol is installed and/or 
> configured are shrinking rapidly. Therefore some migration towards 
> SMB2/3 is required. Luckily the JCIFS homepage lists alternatives:
>  * [jcifs-codelibs|https://github.com/codelibs/jcifs]
>  * [jcifs-ng|https://github.com/AgNO3/jcifs-ng]
>  * [smbj|https://github.com/hierynomus/smbj]





[jira] [Commented] (NUTCH-2988) Elasticsearch 7.13.2 compatible with ASL 2.0?

2023-02-28 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694741#comment-17694741
 ] 

Lewis John McGibbney commented on NUTCH-2988:
-

Actually, digging deeper it looks like the v7.13.2 we consume is licensed under 
[Elastic License 2.0|https://raw.githubusercontent.com/elastic/elasticsearch/v7.13.2/licenses/ELASTIC-LICENSE-2.0.txt].
This is confirmed by
# https://central.sonatype.com/artifact/org.elasticsearch.client/elasticsearch-rest-high-level-client/7.13.2 and
# https://mvnrepository.com/artifact/org.elasticsearch.client/elasticsearch-rest-high-level-client/7.13.2

> Elasticsearch 7.13.2 compatible with ASL 2.0?
> -
>
> Key: NUTCH-2988
> URL: https://issues.apache.org/jira/browse/NUTCH-2988
> Project: Nutch
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>
> In the latest release of at least the 1.x branch of Nutch, the elasticsearch 
> high level java client is at 7.13.2, which is after the great schism. That is, 
> the last purely ASL 2.0 licensed release was 7.10.2.
> So, do we need to downgrade to 7.10.2 or is Elasticsearch's new licensing 
> plan suitable to be released within an ASF project?
> Or, is the client as opposed to the main search project still actually ASL 
> 2.0?
> Ref: https://github.com/elastic/elasticsearch/blob/v7.13.2/LICENSE.txt





[jira] [Commented] (NUTCH-2988) Elasticsearch 7.13.2 compatible with ASL 2.0?

2023-02-28 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694736#comment-17694736
 ] 

Lewis John McGibbney commented on NUTCH-2988:
-

It looks like the [elasticsearch-java 
client|https://github.com/elastic/elasticsearch-java/blob/v8.6.2/LICENSE.txt] 
is licensed under ALv2.0.

> Elasticsearch 7.13.2 compatible with ASL 2.0?
> -
>
> Key: NUTCH-2988
> URL: https://issues.apache.org/jira/browse/NUTCH-2988
> Project: Nutch
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>
> In the latest release of at least the 1.x branch of Nutch, the elasticsearch 
> high level java client is at 7.13.2, which is after the great schism. That is, 
> the last purely ASL 2.0 licensed release was 7.10.2.
> So, do we need to downgrade to 7.10.2 or is Elasticsearch's new licensing 
> plan suitable to be released within an ASF project?
> Or, is the client still actually ASL 2.0?
> Ref: https://github.com/elastic/elasticsearch/blob/v7.13.2/LICENSE.txt





[jira] [Commented] (NUTCH-2940) Develop Gradle Core Build for Apache Nutch

2022-06-15 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554866#comment-17554866
 ] 

Lewis John McGibbney commented on NUTCH-2940:
-

WIP PR available at https://github.com/apache/nutch/pull/735

> Develop Gradle Core Build for Apache Nutch
> --
>
> Key: NUTCH-2940
> URL: https://issues.apache.org/jira/browse/NUTCH-2940
> Project: Nutch
>  Issue Type: Sub-task
>  Components: build
>Reporter: James Simmons
>Assignee: Lewis John McGibbney
>Priority: Major
>
> This issue will focus on the build lifecycle management for the core build of 
> Apache Nutch as seen here: 
> [https://github.com/apache/nutch/tree/master/src/java]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

