[jira] [Created] (NUTCH-3059) Generator: selector job does not count reduce output records

2024-06-05 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3059:
--

 Summary: Generator: selector job does not count reduce output 
records
 Key: NUTCH-3059
 URL: https://issues.apache.org/jira/browse/NUTCH-3059
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.20
Reporter: Sebastian Nagel
 Fix For: 1.21


The selector step (job) of the Generator does not count the reduce output 
records; the counter shows "0":
{noformat}
2024-06-05 13:57:09,299 INFO o.a.n.c.Generator [main] Generator: starting

2024-06-05 13:57:09,299 INFO o.a.n.c.Generator [main] Generator: selecting 
best-scoring urls due for fetch.
...
         Map-Reduce Framework
                Map input records=6
                Map output records=6
                ...
                Combine input records=0
                Combine output records=0
                Reduce input groups=1
                Reduce shuffle bytes=594
                Reduce input records=6
                Reduce output records=0
                Spilled Records=12
                ...
{noformat}
Not a big issue, but we should investigate why this happens. The other counters 
seem to work properly, and the partitioner job does show the reduce output 
records. The issue is observed in both local and distributed mode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3058) Fetcher: counter for hung threads

2024-06-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17852421#comment-17852421
 ] 

ASF GitHub Bot commented on NUTCH-3058:
---

sebastian-nagel opened a new pull request, #820:
URL: https://github.com/apache/nutch/pull/820

   - count the number of hung threads in a fetcher job
   - log and count the number of fetch items still queued when the "hard" 
timeout is reached




> Fetcher: counter for hung threads
> -
>
> Key: NUTCH-3058
> URL: https://issues.apache.org/jira/browse/NUTCH-3058
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> The Fetcher class defines a "hard" timeout as 50% of the MapReduce task 
> timeout, see {{mapreduce.task.timeout}} and 
> {{fetcher.threads.timeout.divisor}}. If fetcher threads are still running 
> but make no progress during the timeout period (in terms of newly started 
> fetch items), the Fetcher is shut down so that the task timeout is not 
> reached and the fetcher job does not fail. The "hung threads" are logged 
> together with the URL being fetched and (at DEBUG level) the Java stack.
> In addition to logging, a job counter should indicate the number of hung 
> threads. This would make it possible to see at the job level whether there 
> are issues with hung threads. Tracing the issues still requires looking 
> into the Hadoop task logs.





[jira] [Created] (NUTCH-3058) Fetcher: counter for hung threads

2024-06-05 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3058:
--

 Summary: Fetcher: counter for hung threads
 Key: NUTCH-3058
 URL: https://issues.apache.org/jira/browse/NUTCH-3058
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.20
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.21


The Fetcher class defines a "hard" timeout as 50% of the MapReduce task 
timeout, see {{mapreduce.task.timeout}} and 
{{fetcher.threads.timeout.divisor}}. If fetcher threads are still running but 
make no progress during the timeout period (in terms of newly started fetch 
items), the Fetcher is shut down so that the task timeout is not reached and 
the fetcher job does not fail. The "hung threads" are logged together with the 
URL being fetched and (at DEBUG level) the Java stack.

In addition to logging, a job counter should indicate the number of hung 
threads. This would make it possible to see at the job level whether there are 
issues with hung threads. Tracing the issues still requires looking into the 
Hadoop task logs.
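
The timeout arithmetic described above can be sketched as follows. This is a 
minimal illustration, not the Fetcher code: the class and method names are 
hypothetical, and it assumes the hard timeout is simply the task timeout 
divided by the divisor, so that a divisor of 2 gives the 50% mentioned in the 
description. Threads still alive once the hard timeout is reached are the ones 
a counter would report as hung.

```java
import java.util.List;

/** Illustration only: hypothetical names, not the Nutch Fetcher implementation. */
class HungThreadSketch {

  /** Assumed relation: hard timeout = task timeout / divisor. */
  static long hardTimeoutMillis(long taskTimeoutMillis, int divisor) {
    if (divisor < 1) {
      throw new IllegalArgumentException("divisor must be >= 1");
    }
    return taskTimeoutMillis / divisor;
  }

  /** True if no fetch item has been started for longer than the hard timeout. */
  static boolean hardTimeoutReached(long nowMillis, long lastItemStartedMillis,
      long hardTimeoutMillis) {
    return nowMillis - lastItemStartedMillis > hardTimeoutMillis;
  }

  /** Counts the threads that are still alive, i.e. the candidates for "hung". */
  static int countHungThreads(List<Thread> fetcherThreads) {
    int hung = 0;
    for (Thread thread : fetcherThreads) {
      if (thread.isAlive()) {
        hung++;
      }
    }
    return hung;
  }

  public static void main(String[] args) {
    // Example values only: a task timeout of 600,000 ms and a divisor of 2.
    long hard = hardTimeoutMillis(600_000L, 2);
    System.out.println("hard timeout (ms): " + hard);
    System.out.println("reached: " + hardTimeoutReached(400_000L, 50_000L, hard));
  }
}
```

In a real job the value returned by countHungThreads would be added to a job 
counter rather than printed, which is what this issue asks for.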





[jira] [Commented] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails

2024-05-28 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850039#comment-17850039
 ] 

Hudson commented on NUTCH-3044:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #163 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/163/])
NUTCH-3044 Generator: NPE when extracting the host part of a URL fails (snagel: 
[https://github.com/apache/nutch/commit/4b263533a9cdea208383fdbb0a8cc0b537423d7f])
* (edit) src/java/org/apache/nutch/crawl/Generator.java
NUTCH-3044 Generator: NPE when extracting the host part of a URL fails (snagel: 
[https://github.com/apache/nutch/commit/4729786e4d7f9e1136580ceb191274862d03ba5b])
* (edit) src/test/org/apache/nutch/crawl/TestGenerator.java
NUTCH-3044 Generator: NPE when extracting the host part of a URL fails (snagel: 
[https://github.com/apache/nutch/commit/b153279ad5844b32560ecf62a8e7f83f8ecbd43c])
* (edit) src/java/org/apache/nutch/crawl/Generator.java
* (edit) src/test/org/apache/nutch/crawl/TestGenerator.java


> Generator: NPE when extracting the host part of a URL fails
> ---
>
> Key: NUTCH-3044
> URL: https://issues.apache.org/jira/browse/NUTCH-3044
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> When extracting the host part of a URL fails, the Generator job fails because 
> of an NPE in the SelectorReducer. This issue is reproducible if the CrawlDb 
> contains a malformed URL, for example, a URL with an unsupported scheme 
> (smb://).
> {noformat}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:439)
>   at 
> org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:300)
> {noformat}
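
A minimal sketch of the failure class described above, using only the JDK. It 
is not the Generator code and not necessarily the fix that was committed: the 
class name, the method names and the fallback to the full URL as grouping key 
are assumptions for illustration. The point is that host extraction can 
legitimately yield null (here for the unsupported smb:// scheme), so the caller 
has to check before using the result.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Illustration only: hypothetical names, not the Nutch Generator code. */
class HostExtractionSketch {

  /** Returns the lower-cased host, or null if the URL cannot be parsed. */
  static String hostOrNull(String url) {
    try {
      String host = new URI(url).toURL().getHost();
      return (host == null || host.isEmpty()) ? null : host.toLowerCase(Locale.ROOT);
    } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
      // Unknown scheme (e.g. smb://), syntax error or relative URL.
      return null;
    }
  }

  /** Assumed fallback: group by the URL itself when no host is available. */
  static String groupKey(String url) {
    String host = hostOrNull(url);
    return host != null ? host : url;
  }

  public static void main(String[] args) {
    System.out.println(groupKey("https://nutch.apache.org/index.html"));
    System.out.println(groupKey("smb://fileserver/share"));
  }
}
```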





[jira] [Commented] (NUTCH-3055) README: fix Github "hub" commands

2024-05-28 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850040#comment-17850040
 ] 

Hudson commented on NUTCH-3055:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #163 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/163/])
NUTCH-3055 README: fix Github "hub" commands (snagel: 
[https://github.com/apache/nutch/commit/ca03d9b76485b7c9d50dff2c3946bb8189daf5e1])
* (edit) README.md


> README: fix Github "hub" commands
> -
>
> Key: NUTCH-3055
> URL: https://issues.apache.org/jira/browse/NUTCH-3055
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.21
>
>
> The [README.md|https://github.com/apache/nutch/blob/master/README.md] 
> contains [GitHub hub|https://hub.github.com/] commands, but with "git" as 
> the command (executable) name, which presumably relies on an alias or similar 
> setup. If hub isn't installed, these commands fail with 
> {{git: 'pull-request' is not a git command. See 'git --help'.}} or similar.
> We should use the command "hub" instead.





[jira] [Resolved] (NUTCH-3055) README: fix Github "hub" commands

2024-05-28 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3055.

Resolution: Fixed



[jira] [Commented] (NUTCH-3055) README: fix Github "hub" commands

2024-05-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850005#comment-17850005
 ] 

ASF GitHub Bot commented on NUTCH-3055:
---

sebastian-nagel merged PR #818:
URL: https://github.com/apache/nutch/pull/818






[jira] [Resolved] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails

2024-05-28 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3044.

Resolution: Fixed



[jira] [Commented] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails

2024-05-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850004#comment-17850004
 ] 

ASF GitHub Bot commented on NUTCH-3044:
---

sebastian-nagel merged PR #815:
URL: https://github.com/apache/nutch/pull/815






[jira] [Commented] (NUTCH-3057) Arbitrary indexer "leaks" previous value into a field processed after an exception

2024-05-18 Thread Joe Gilvary (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847521#comment-17847521
 ] 

Joe Gilvary commented on NUTCH-3057:


Happy Saturday, [~lewi...@apache.org],

I worked on the plugin and this fix with some Raspberry Pi hosts at home but, 
of course, found the error at work. I didn't see it until I was running the 
1.20 release in a pre-prod system. I set up individual POJOs for a few 
fields and added a typo in nutch-site.xml. As soon as I saw the exception 
during indexing and what made it into Solr, I knew what was wrong. A D'oh! 
moment indeed.

Let me know, please, if there's anything else I need to do, process-wise, to 
have this correct for the next distro.

> Arbitrary indexer "leaks" previous value into a field processed after an 
> exception
> --
>
> Key: NUTCH-3057
> URL: https://issues.apache.org/jira/browse/NUTCH-3057
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.20
>Reporter: Joe Gilvary
>Priority: Major
>






[jira] [Commented] (NUTCH-3057) Arbitrary indexer "leaks" previous value into a field processed after an exception

2024-05-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847470#comment-17847470
 ] 

ASF GitHub Bot commented on NUTCH-3057:
---

lewismc commented on PR #819:
URL: https://github.com/apache/nutch/pull/819#issuecomment-2118551238

   Thanks for reporting, @CatChullain. I didn’t catch this edge case either when 
reviewing or testing. 
   Out of curiosity, what does your deployment look like? Local or deploy?






[jira] [Commented] (NUTCH-3057) Arbitrary indexer "leaks" previous value into a field processed after an exception

2024-05-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847462#comment-17847462
 ] 

ASF GitHub Bot commented on NUTCH-3057:
---

CatChullain opened a new pull request, #819:
URL: https://github.com/apache/nutch/pull/819

   Fix for NUTCH-3057, where the index-arbitrary plugin retained the value of a 
field and erroneously set it on the next field declared in its config stanzas






[jira] [Commented] (NUTCH-3057) Arbitrary indexer "leaks" previous value into a field processed after an exception

2024-05-17 Thread Joe Gilvary (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847453#comment-17847453
 ] 

Joe Gilvary commented on NUTCH-3057:


The arbitrary indexer plugin can add multiple new fields to a doc by appending 
numeric suffixes to the config values for each. If an exception interferes with 
setting a value and a successive field is configured, the plugin can insert the 
wrong value for that successively configured field.
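
One way such a leak can arise is sketched below. This is not the plugin's 
code: the class, the method names and the use of Supplier to stand in for a 
configured field calculation are all hypothetical. In the buggy variant the 
variable holding the computed value is declared outside the loop, so when the 
calculation for one field throws, the value from the previously processed field 
is still there and gets written to the wrong field. Resetting the variable per 
field avoids that.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Illustration only: hypothetical names, not the index-arbitrary plugin code. */
class FieldLeakSketch {

  /** Buggy variant: "value" is declared outside the loop and survives a failure. */
  static Map<String, String> indexBuggy(Map<String, Supplier<String>> fieldConfigs) {
    Map<String, String> doc = new LinkedHashMap<>();
    String value = null;
    for (Map.Entry<String, Supplier<String>> config : fieldConfigs.entrySet()) {
      try {
        value = config.getValue().get();
      } catch (RuntimeException e) {
        // Exception swallowed: "value" still holds the previous field's result.
      }
      if (value != null) {
        doc.put(config.getKey(), value);
      }
    }
    return doc;
  }

  /** Fixed variant: the value is reset for every configured field. */
  static Map<String, String> indexFixed(Map<String, Supplier<String>> fieldConfigs) {
    Map<String, String> doc = new LinkedHashMap<>();
    for (Map.Entry<String, Supplier<String>> config : fieldConfigs.entrySet()) {
      String value = null;
      try {
        value = config.getValue().get();
      } catch (RuntimeException e) {
        // "value" stays null, so the failing field is skipped.
      }
      if (value != null) {
        doc.put(config.getKey(), value);
      }
    }
    return doc;
  }

  public static void main(String[] args) {
    Map<String, Supplier<String>> configs = new LinkedHashMap<>();
    configs.put("field1", () -> "value1");
    configs.put("field2", () -> { throw new IllegalStateException("misconfigured"); });
    System.out.println("buggy: " + indexBuggy(configs));
    System.out.println("fixed: " + indexFixed(configs));
  }
}
```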



[jira] [Created] (NUTCH-3057) Arbitrary indexer "leaks" previous value into a field processed after an exception

2024-05-17 Thread Joe Gilvary (Jira)
Joe Gilvary created NUTCH-3057:
--

 Summary: Arbitrary indexer "leaks" previous value into a field 
processed after an exception
 Key: NUTCH-3057
 URL: https://issues.apache.org/jira/browse/NUTCH-3057
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.20
Reporter: Joe Gilvary








[jira] [Updated] (NUTCH-3056) Injector to support resolving seed URLs

2024-05-16 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3056:
-
Description: 
We have a case where clients submit huge uncurated seed files: a host may no 
longer exist or may redirect elsewhere through several hops, the protocol may 
be incorrect, etc.

The large crawl itself is not supposed to venture much beyond the seed list, 
except for regex exceptions listed in db-ignore-external-exemptions. To control 
the size of the crawl, it is also not allowed to jump to other domains/hosts. 
This means externally redirecting seeds will not be crawled.

This ticket will add support for a multi-threaded 
host/domain/protocol/redirecter/resolver to the injector. Seeds that do not 
lead to a URL returning HTTP 200 will be discarded. Enabling filtering and 
normalization is highly recommended for handling the redirects.

If you have a seed file with 10k+ or millions of records, splitting the input 
file into chunks is highly recommended so that multiple mappers can get to 
work. Passing a few million records through one mapper without resolving is no 
problem, but resolving millions with one mapper, even if threaded, will take 
many hours.

  was:
We have a case where clients submit huge uncurated seed files: a host may no 
longer exist or may redirect elsewhere through several hops, the protocol may 
be incorrect, etc.

The large crawl itself is not supposed to venture much beyond the seed list, 
except for regex exceptions listed in db-ignore-external-exemptions. To control 
the size of the crawl, it is also not allowed to jump to other domains/hosts. 
This means externally redirecting seeds will not be crawled.

This ticket will add support for a multi-threaded 
host/domain/protocol/redirecter/resolver to the injector.

If you have a seed file with 10k+ or millions of records, splitting the input 
file into chunks is highly recommended so that multiple mappers can get to 
work. Passing a few million records through one mapper without resolving is no 
problem, but resolving millions with one mapper, even if threaded, will take 
many hours.


> Injector to support resolving seed URLs
> ---
>
> Key: NUTCH-3056
> URL: https://issues.apache.org/jira/browse/NUTCH-3056
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.21
>
>
> We have a case where clients submit huge uncurated seed files: a host may no 
> longer exist or may redirect elsewhere through several hops, the protocol 
> may be incorrect, etc.
> The large crawl itself is not supposed to venture much beyond the seed list, 
> except for regex exceptions listed in db-ignore-external-exemptions. To 
> control the size of the crawl, it is also not allowed to jump to other 
> domains/hosts. This means externally redirecting seeds will not be crawled.
> This ticket will add support for a multi-threaded 
> host/domain/protocol/redirecter/resolver to the injector. Seeds that do not 
> lead to a URL returning HTTP 200 will be discarded. Enabling filtering and 
> normalization is highly recommended for handling the redirects.
> If you have a seed file with 10k+ or millions of records, splitting the 
> input file into chunks is highly recommended so that multiple mappers can 
> get to work. Passing a few million records through one mapper without 
> resolving is no problem, but resolving millions with one mapper, even if 
> threaded, will take many hours.
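
The chunking recommended above can be sketched in a few lines. The class and 
method names are hypothetical and the chunk size is an arbitrary example; the 
sketch only shows the partitioning step. It assumes each chunk would then be 
written to its own file in the seed directory, relying on Hadoop's file-based 
input formats creating at least one input split per non-empty file so that 
several mappers can work in parallel.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: hypothetical helper, not part of the Nutch injector. */
class SeedChunkSketch {

  /** Splits the seed list into consecutive chunks of at most chunkSize URLs. */
  static List<List<String>> split(List<String> seeds, int chunkSize) {
    if (chunkSize < 1) {
      throw new IllegalArgumentException("chunkSize must be >= 1");
    }
    List<List<String>> chunks = new ArrayList<>();
    for (int start = 0; start < seeds.size(); start += chunkSize) {
      int end = Math.min(start + chunkSize, seeds.size());
      // Copy the view so that each chunk is independent of the input list.
      chunks.add(new ArrayList<>(seeds.subList(start, end)));
    }
    return chunks;
  }

  public static void main(String[] args) {
    List<String> seeds = List.of("https://a.example/", "https://b.example/",
        "https://c.example/", "https://d.example/", "https://e.example/");
    List<List<String>> chunks = split(seeds, 2);
    for (int i = 0; i < chunks.size(); i++) {
      System.out.println("chunk " + i + ": " + chunks.get(i));
    }
  }
}
```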





[jira] [Updated] (NUTCH-3056) Injector to support resolving seed URLs

2024-05-16 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3056:
-
Description: 
We have a case where clients submit huge uncurated seed files: a host may no 
longer exist or may redirect elsewhere through several hops, the protocol may 
be incorrect, etc.

The large crawl itself is not supposed to venture much beyond the seed list, 
except for regex exceptions listed in db-ignore-external-exemptions. To control 
the size of the crawl, it is also not allowed to jump to other domains/hosts. 
This means externally redirecting seeds will not be crawled.

This ticket will add support for a multi-threaded 
host/domain/protocol/redirecter/resolver to the injector.

If you have a seed file with 10k+ or millions of records, splitting the input 
file into chunks is highly recommended so that multiple mappers can get to 
work. Passing a few million records through one mapper without resolving is no 
problem, but resolving millions with one mapper, even if threaded, will take 
many hours.

  was:
We have a case where clients submit huge uncurated seed files: a host may no 
longer exist or may redirect elsewhere through several hops, the protocol may 
be incorrect, etc.

The large crawl itself is not supposed to venture much beyond the seed list, 
except for regex exceptions listed in db-ignore-external-exemptions. To control 
the size of the crawl, it is also not allowed to jump to other domains/hosts. 
This means externally redirecting seeds will not be crawled.

This ticket will add support for a multi-threaded 
host/domain/protocol/redirecter/resolver to the injector.




[jira] [Created] (NUTCH-3056) Injector to support resolving seed URLs

2024-05-16 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-3056:


 Summary: Injector to support resolving seed URLs
 Key: NUTCH-3056
 URL: https://issues.apache.org/jira/browse/NUTCH-3056
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.21


We have a case where clients submit huge uncurated seed files: a host may no 
longer exist or may redirect elsewhere through several hops, the protocol may 
be incorrect, etc.

The large crawl itself is not supposed to venture much beyond the seed list, 
except for regex exceptions listed in db-ignore-external-exemptions. To control 
the size of the crawl, it is also not allowed to jump to other domains/hosts. 
This means externally redirecting seeds will not be crawled.

This ticket will add support for a multi-threaded 
host/domain/protocol/redirecter/resolver to the injector.





[jira] [Commented] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-05-15 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846795#comment-17846795
 ] 

Hudson commented on NUTCH-3041:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #162 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/162/])
NUTCH-3041 Address confusing logging in o.a.n.net.URLExemptionFilters (#813) 
(github: 
[https://github.com/apache/nutch/commit/8abc78a653eb7970def10031d732fb4c7aa0fb6f])
* (edit) 
src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java
* (edit) src/java/org/apache/nutch/net/URLExemptionFilters.java
* (edit) src/plugin/urlfilter-ignoreexempt/README.md


> Address confusing logging in o.a.n.net.URLExemptionFilters 
> ---
>
> Key: NUTCH-3041
> URL: https://issues.apache.org/jira/browse/NUTCH-3041
> Project: Nutch
>  Issue Type: Task
>  Components: net
>Affects Versions: 1.19, 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> URLExemptionFilter implementations are used to allow exemptions to external 
> domain resources by overriding the {{db.ignore.external.links}} configuration 
> setting. This is useful when the crawl is focused on a domain but resources 
> like images are hosted on a CDN.
> Currently [URLExemptionFilters|#L47-L48]] provides the following logging:
> {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor 
> #0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> {quote}
> I find this confusing. It would be better to log *only* if a 
> URLExemptionFilter implementation is actually configured to be used at 
> runtime.
> I will provide a patch for this.
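
The proposed behaviour amounts to guarding the log statement, as in the sketch 
below. It is not the committed patch: the class and method names are 
hypothetical, and the message is returned as an Optional instead of being 
passed to a logger so that the example stays self-contained. Only the wording 
of the message is taken from the log line quoted above.

```java
import java.util.List;
import java.util.Optional;

/** Illustration only: hypothetical names, not the URLExemptionFilters code. */
class ExemptionLogSketch {

  /** Returns the message to log, or empty if no filter is configured. */
  static Optional<String> startupMessage(List<String> filterClassNames) {
    if (filterClassNames.isEmpty()) {
      // Nothing configured: stay silent instead of logging "Found 0 extensions".
      return Optional.empty();
    }
    return Optional.of("Found " + filterClassNames.size()
        + " extensions at point:'org.apache.nutch.net.URLExemptionFilter'");
  }

  public static void main(String[] args) {
    // A caller would pass the message to its logger only if one is present.
    startupMessage(List.of()).ifPresent(System.out::println);
    startupMessage(List.of("ExemptionUrlFilter")).ifPresent(System.out::println);
  }
}
```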





[jira] [Closed] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-05-15 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3041.
---



[jira] [Work stopped] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-05-15 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3041 stopped by Lewis John McGibbney.
---


[jira] [Resolved] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-05-15 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3041.
-
Resolution: Fixed



[jira] [Commented] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-05-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846788#comment-17846788
 ] 

ASF GitHub Bot commented on NUTCH-3041:
---

lewismc merged PR #813:
URL: https://github.com/apache/nutch/pull/813




> Address confusing logging in o.a.n.net.URLExemptionFilters 
> ---
>
> Key: NUTCH-3041
> URL: https://issues.apache.org/jira/browse/NUTCH-3041
> Project: Nutch
>  Issue Type: Task
>  Components: net
>Affects Versions: 1.19, 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> URLExemptionFilter impementations are used to allow exemptions to external 
> domain resources by overriding the {{db.ignore.external.links}} configuration 
> setting. This is useful when the crawl is focused to a domain but resources 
> like images are hosted on CDN.
> Currently [URLExemptionFilters|#L47-L48]] provides the following logging
> {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor 
> #0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> {quote}
> I find this confusing. It would be better to log *only* if an 
> URLExemptionFilter implementation is actually configured to be used at 
> runtime.
> I will provide a patch for this.





[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-05-14 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846402#comment-17846402
 ] 

Hudson commented on NUTCH-3043:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #161 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/161/])
NUTCH-3043 Generator: count URLs rejected by URL filters (#814) (github: 
[https://github.com/apache/nutch/commit/5f1330a03d136440a167a85da6cfe8ac4b3f61b9])
* (edit) src/java/org/apache/nutch/crawl/Generator.java


> Generator: count URLs rejected by URL filters
> -
>
> Key: NUTCH-3043
> URL: https://issues.apache.org/jira/browse/NUTCH-3043
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> Generator already counts URLs rejected by the (re)fetch scheduler, by fetch 
> interval or status. It should also count the number of URLs rejected by URL 
> filters.
> See also [Generator 
> metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator].





[jira] [Commented] (NUTCH-3039) Failure to handle ftp:// URLs

2024-05-14 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846401#comment-17846401
 ] 

Hudson commented on NUTCH-3039:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #161 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/161/])
NUTCH-3039 Failure to handle ftp:// URLs (snagel: 
[https://github.com/apache/nutch/commit/ea9c7ee5d6635405b31b4a1d462cca746478b040])
* (edit) src/java/org/apache/nutch/plugin/URLStreamHandlerFactory.java


> Failure to handle ftp:// URLs
> -
>
> Key: NUTCH-3039
> URL: https://issues.apache.org/jira/browse/NUTCH-3039
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> Nutch fails to handle ftp:// URLs:
> - URLNormalizerBasic returns the empty string because creating the URL 
> instance fails with a MalformedURLException:
>   {noformat}
> echo "ftp://ftp.example.com/path/file.txt" \
>   | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat}
> - fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due 
> to a MalformedURLException:
>   {noformat}
> bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \
>    "ftp://ftp.example.com/path/file.txt"
> ...
> Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: 
> java.net.MalformedURLException
> at 
> org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)
> ...{noformat}
> The issue is caused by NUTCH-2429:
> - we do not provide a dedicated URL stream handler for ftp URLs
> - but also do not pass ftp:// URLs to the standard JVM handler
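The two points above can be illustrated with a small sketch of a `java.net.URLStreamHandlerFactory`. When such a factory returns `null` for a protocol, `java.net.URL` falls back to the JVM's built-in handler lookup, which is how ftp:// URLs can be passed to the standard JVM handler. This is not the code from Nutch's own factory class: the class name, the protocol list and the stub handler are assumptions made for illustration only.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Set;

class DelegatingHandlerFactorySketch implements URLStreamHandlerFactory {
  // Protocols left to the JVM's built-in handlers (assumed list, for illustration)
  private static final Set<String> JVM_HANDLED = Set.of("http", "https", "file", "jar", "ftp");

  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    if (JVM_HANDLED.contains(protocol)) {
      return null; // null makes java.net.URL fall back to its default handler lookup
    }
    // Every other protocol gets a stub handler here; a real factory would
    // look up a handler provided by a protocol plugin instead.
    return new URLStreamHandler() {
      @Override
      protected URLConnection openConnection(URL u) throws IOException {
        throw new IOException("No connection support for " + u.getProtocol());
      }
    };
  }

  public static void main(String[] args) {
    URLStreamHandlerFactory factory = new DelegatingHandlerFactorySketch();
    System.out.println("ftp delegated to JVM: " + (factory.createURLStreamHandler("ftp") == null));
  }
}
```

Installing such a factory would be done once per JVM with `URL.setURLStreamHandlerFactory(...)`; the sketch calls the factory method directly to keep the example side-effect free.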





[jira] [Resolved] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-05-14 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3043.

Resolution: Implemented

> Generator: count URLs rejected by URL filters
> -
>
> Key: NUTCH-3043
> URL: https://issues.apache.org/jira/browse/NUTCH-3043
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> Generator already counts URLs rejected by the (re)fetch scheduler, by fetch 
> interval or status. It should also count the number of URLs rejected by URL 
> filters.
> See also [Generator 
> metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator].





[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-05-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846357#comment-17846357
 ] 

ASF GitHub Bot commented on NUTCH-3043:
---

sebastian-nagel commented on PR #814:
URL: https://github.com/apache/nutch/pull/814#issuecomment-2110558876

   Thanks, @lewismc! The metrics wiki page was updated.




> Generator: count URLs rejected by URL filters
> -
>
> Key: NUTCH-3043
> URL: https://issues.apache.org/jira/browse/NUTCH-3043
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> Generator already counts URLs rejected by the (re)fetch scheduler, by fetch 
> interval or status. It should also count the number of URLs rejected by URL 
> filters.
> See also [Generator 
> metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator].





[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-05-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846355#comment-17846355
 ] 

ASF GitHub Bot commented on NUTCH-3043:
---

sebastian-nagel merged PR #814:
URL: https://github.com/apache/nutch/pull/814




> Generator: count URLs rejected by URL filters
> -
>
> Key: NUTCH-3043
> URL: https://issues.apache.org/jira/browse/NUTCH-3043
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> Generator already counts URLs rejected by the (re)fetch scheduler, by fetch 
> interval or status. It should also count the number of URLs rejected by URL 
> filters.
> See also [Generator 
> metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator].





[jira] [Resolved] (NUTCH-3039) Failure to handle ftp:// URLs

2024-05-14 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3039.

Resolution: Fixed

> Failure to handle ftp:// URLs
> -
>
> Key: NUTCH-3039
> URL: https://issues.apache.org/jira/browse/NUTCH-3039
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> Nutch fails to handle ftp:// URLs:
> - URLNormalizerBasic returns the empty string because creating the URL 
> instance fails with a MalformedURLException:
>   {noformat}
> echo "ftp://ftp.example.com/path/file.txt" \
>   | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat}
> - fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due 
> to a MalformedURLException:
>   {noformat}
> bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \
>    "ftp://ftp.example.com/path/file.txt"
> ...
> Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: 
> java.net.MalformedURLException
> at 
> org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)
> ...{noformat}
> The issue is caused by NUTCH-2429:
> - we do not provide a dedicated URL stream handler for ftp URLs
> - but also do not pass ftp:// URLs to the standard JVM handler





[jira] [Commented] (NUTCH-3039) Failure to handle ftp:// URLs

2024-05-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846345#comment-17846345
 ] 

ASF GitHub Bot commented on NUTCH-3039:
---

sebastian-nagel merged PR #812:
URL: https://github.com/apache/nutch/pull/812




> Failure to handle ftp:// URLs
> -
>
> Key: NUTCH-3039
> URL: https://issues.apache.org/jira/browse/NUTCH-3039
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> Nutch fails to handle ftp:// URLs:
> - URLNormalizerBasic returns the empty string because creating the URL 
> instance fails with a MalformedURLException:
>   {noformat}
> echo "ftp://ftp.example.com/path/file.txt" \
>   | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat}
> - fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due 
> to a MalformedURLException:
>   {noformat}
> bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \
>    "ftp://ftp.example.com/path/file.txt"
> ...
> Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: 
> java.net.MalformedURLException
> at 
> org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)
> ...{noformat}
> The issue is caused by NUTCH-2429:
> - we do not provide a dedicated URL stream handler for ftp URLs
> - but also do not pass ftp:// URLs to the standard JVM handler





[jira] [Commented] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2024-04-30 Thread Joe Gilvary (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842526#comment-17842526
 ] 

Joe Gilvary commented on NUTCH-585:
---

[~dbeckstrom] I'm not sure which patch you were asking about. I used the source 
for the new 1.20 release and applied the patch that [~ad-...@gmx.at] posted 
after an edit to the line numbers for the update to src/plugin/build.xml. It 
built cleanly and seems to work exactly as advertised in my tests with 
indexchecker.

> [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
> ---
>
> Key: NUTCH-585
> URL: https://issues.apache.org/jira/browse/NUTCH-585
> Project: Nutch
>  Issue Type: Improvement
>  Components: HTML, parse-filter, parser, plugin
>Affects Versions: 0.9.0
> Environment: All operating systems
>Reporter: Andrea Spinelli
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
> Attachments: blacklist_whitelist_plugin.patch, 
> nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch
>
>
> We are using nutch to index our own web sites; we would like not to index 
> certain parts of our pages, because we know they are not relevant (for 
> instance, there are several links to change the background color) and 
> generate spurious matches.
> We have modified the plugin so that it ignores HTML code between certain HTML 
> comments, like
> 
> ... ignored part ...
> 
> We feel this might be useful to someone else, perhaps with the comment 
> strings factored out as constants in the configuration files (say 
> parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml).
> We are almost ready to contribute our code snippet. We look forward to any 
> expression of interest, or to an explanation of why what we are doing is 
> plain wrong!
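The idea described in the issue (dropping everything between a start and a stop comment before indexing) can be sketched as below. This is only an illustration under stated assumptions: the marker strings and the `stripIgnored` helper are hypothetical, the attached patches may work differently (for example on the DOM rather than on the raw text), and the two configuration property names are the ones suggested in the description.

```java
import java.util.regex.Pattern;

class IgnoreSectionSketch {
  // Hypothetical marker comments; the issue suggests reading them from
  // parser.html.ignore.start and parser.html.ignore.stop instead.
  static final String START = "<!-- ignore-start -->";
  static final String STOP = "<!-- ignore-stop -->";

  /** Removes every span from a start marker up to the nearest following stop marker. */
  static String stripIgnored(String html, String start, String stop) {
    Pattern span = Pattern.compile(
        Pattern.quote(start) + ".*?" + Pattern.quote(stop), Pattern.DOTALL);
    return span.matcher(html).replaceAll("");
  }

  public static void main(String[] args) {
    String html = "<p>keep</p>" + START + "<a>skip</a>" + STOP + "<p>also keep</p>";
    System.out.println(stripIgnored(html, START, STOP));
  }
}
```

The reluctant quantifier `.*?` keeps two separate ignored sections from swallowing the content between them.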





[jira] [Commented] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-30 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842426#comment-17842426
 ] 

Hudson commented on NUTCH-3054:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #160 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/160/])
NUTCH-3054 Address deprecation of Node16 for all GitHub Actions (#817) (github: 
[https://github.com/apache/nutch/commit/7ac3ce28e065fb5160f96ce7bce1ec840f87d0dc])
* (edit) .github/workflows/master-build.yml


> Address deprecation of Node16 for all GitHub Actions
> 
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
>  Issue Type: Task
>  Components: ci/cd
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> See 
> [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in  
> [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>  
> Patch coming up





[jira] [Closed] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3054.
---

> Address deprecation of Node16 for all GitHub Actions
> 
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
>  Issue Type: Task
>  Components: ci/cd
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> See 
> [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in  
> [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>  
> Patch coming up





[jira] [Resolved] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3054.
-
Resolution: Fixed

> Address deprecation of Node16 for all GitHub Actions
> 
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
>  Issue Type: Task
>  Components: ci/cd
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> See 
> [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in  
> [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>  
> Patch coming up





[jira] [Commented] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842410#comment-17842410
 ] 

ASF GitHub Bot commented on NUTCH-3054:
---

lewismc merged PR #817:
URL: https://github.com/apache/nutch/pull/817




> Address deprecation of Node16 for all GitHub Actions
> 
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
>  Issue Type: Task
>  Components: ci/cd
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> See 
> [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in  
> [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>  
> Patch coming up





[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842384#comment-17842384
 ] 

Markus Jelsma commented on NUTCH-3028:
--

Ok, the Content object is now also available in the evaluation. I added an 
example of it to the description above.

 

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3028-1.patch, NUTCH-3028-2.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'
> or
> -expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")'





[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3028:
-
Attachment: NUTCH-3028-2.patch

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3028-1.patch, NUTCH-3028-2.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'
> or
> -expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")'





[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3028:
-
Description: 
Filtering segment data to WARC is now possible using JEXL expressions. In the 
next example, all records with SOME_KEY=SOME_VALUE in their parseData metadata 
are exported to WARC.

-expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'

or

-expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")'

  was:
Filtering segment data to WARC is now possible using JEXL expressions. In the 
next example, all records with SOME_KEY=SOME_VALUE in their parseData metadata 
are exported to WARC.

-expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'


> WARCExported to support filtering by JEXL
> -
>
>     Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'
> or
> -expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")'





[jira] [Commented] (NUTCH-3055) README: fix Github "hub" commands

2024-04-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842308#comment-17842308
 ] 

ASF GitHub Bot commented on NUTCH-3055:
---

sebastian-nagel opened a new pull request, #818:
URL: https://github.com/apache/nutch/pull/818

   (no comment)




> README: fix Github "hub" commands
> -
>
> Key: NUTCH-3055
> URL: https://issues.apache.org/jira/browse/NUTCH-3055
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.21
>
>
> The [README.md|https://github.com/apache/nutch/blob/master/README.md] 
> contains [Github hub|https://hub.github.com/] commands but with "git" as 
> command (executable) name, maybe an alias or some other magic. However, if 
> hub isn't installed, these commands fail with {{git: 'pull-request' is not a 
> git command. See 'git --help'.}} or similar.
> We should use the command "hub" instead.





[jira] [Created] (NUTCH-3055) README: fix Github "hub" commands

2024-04-30 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3055:
--

 Summary: README: fix Github "hub" commands
 Key: NUTCH-3055
 URL: https://issues.apache.org/jira/browse/NUTCH-3055
 Project: Nutch
  Issue Type: Bug
  Components: documentation
Affects Versions: 1.20
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.21


The [README.md|https://github.com/apache/nutch/blob/master/README.md] contains 
[Github hub|https://hub.github.com/] commands but with "git" as command 
(executable) name, maybe an alias or some other magic. However, if hub isn't 
installed, these commands fail with {{git: 'pull-request' is not a git command. 
See 'git --help'.}} or similar.

We should use the command "hub" instead.





[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842291#comment-17842291
 ] 

Sebastian Nagel commented on NUTCH-3028:


+1 lgtm.

One question: if there is no parseData, the JEXL expression is not evaluated. 
Since WARC files may inlcude only the raw HTML plus fetch/capture metadata, 
successfully parsing a document is not a requirement to archive it in a WARC 
file. Might be useful to have the JEXL filtering also available for unparsed 
docs.

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'





[jira] [Commented] (NUTCH-3045) Upgrade from Java 11 to 17

2024-04-30 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842284#comment-17842284
 ] 

Sebastian Nagel commented on NUTCH-3045:


See also NUTCH-2987. Until HADOOP-17177 / HADOOP-18887 are done, we might be 
forced to maintain JDK 11 runtime compatibility, so that Nutch runs on recent 
Hadoop versions and distributions. I fully agree that Java 17 offers some nice 
syntax improvements, though. :)

> Upgrade from Java 11 to 17
> --
>
> Key: NUTCH-3045
> URL: https://issues.apache.org/jira/browse/NUTCH-3045
> Project: Nutch
>  Issue Type: Task
>  Components: build, ci/cd
>Reporter: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.21
>
>
> This parent issue will track and organize work pertaining to upgrading Nutch 
> to JDK 17.
> Premier support for Oracle JDK 11 ended 7 months ago (30 Sep 2023).





[jira] [Commented] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842209#comment-17842209
 ] 

ASF GitHub Bot commented on NUTCH-3054:
---

lewismc opened a new pull request, #817:
URL: https://github.com/apache/nutch/pull/817

   Addresses https://issues.apache.org/jira/browse/NUTCH-3054




> Address deprecation of Node16 for all GitHub Actions
> 
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
>  Issue Type: Task
>  Components: ci/cd
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> See 
> [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in  
> [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>  
> Patch coming up





[jira] [Updated] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-29 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3054:

Affects Version/s: 1.20

> Address deprecation of Node16 for all GitHub Actions
> 
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
>  Issue Type: Task
>  Components: ci/cd
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> See 
> [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in  
> [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>  
> Patch coming up





[jira] [Created] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3054:
---

 Summary: Address deprecation of Node16 for all GitHub Actions
 Key: NUTCH-3054
 URL: https://issues.apache.org/jira/browse/NUTCH-3054
 Project: Nutch
  Issue Type: Task
  Components: ci/cd
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.21


See 
[https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]

We need to upgrade the setup-java action in  
[https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
 

Patch coming up





[jira] [Work started] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-29 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3054 started by Lewis John McGibbney.
---
> Address deprecation of Node16 for all GitHub Actions
> 
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
>  Issue Type: Task
>  Components: ci/cd
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> See 
> [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in  
> [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>  
> Patch coming up





[jira] [Commented] (NUTCH-3049) Investigate using Records

2024-04-29 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842208#comment-17842208
 ] 

Lewis John McGibbney commented on NUTCH-3049:
-

I think that each of the Writable classes mentioned in NutchWritable may be 
fair game

{{        org.apache.nutch.crawl.CrawlDatum.class,}}
{{        org.apache.nutch.crawl.Inlink.class,}}
{{        org.apache.nutch.crawl.Inlinks.class,}}
{{        org.apache.nutch.indexer.NutchIndexAction.class,}}
{{        org.apache.nutch.metadata.Metadata.class,}}
{{        org.apache.nutch.parse.Outlink.class,}}
{{        org.apache.nutch.parse.ParseText.class,}}
{{        org.apache.nutch.parse.ParseData.class,}}
{{        org.apache.nutch.parse.ParseImpl.class,}}
{{        org.apache.nutch.parse.ParseStatus.class,}}
{{        org.apache.nutch.protocol.Content.class,}}
{{        org.apache.nutch.protocol.ProtocolStatus.class,}}
{{        org.apache.nutch.scoring.webgraph.LinkDatum.class,}}
{{        org.apache.nutch.hostdb.HostDatum.class}}

> Investigate using Records
> -
>
> Key: NUTCH-3049
> URL: https://issues.apache.org/jira/browse/NUTCH-3049
> Project: Nutch
>  Issue Type: Sub-task
>Reporter: Lewis John McGibbney
>Priority: Major
>
> Guidance at [https://www.baeldung.com/java-migrate-8-to-17#records]
> I think there are multiple areas where we could use Records. This ticket will 
> document the opportunities and structure that work.





[jira] [Created] (NUTCH-3053) Upgrade build and CI to JDK17

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3053:
---

 Summary: Upgrade build and CI to JDK17
 Key: NUTCH-3053
 URL: https://issues.apache.org/jira/browse/NUTCH-3053
 Project: Nutch
  Issue Type: Sub-task
  Components: build, ci/cd
Reporter: Lewis John McGibbney


This will involve changes to
 * [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
 * [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/]
 * [https://github.com/apache/nutch/blob/master/default.properties#L46]
 * [https://github.com/apache/nutch/blob/master/default.properties#L57]
 * We should also investigate any deprecation notices in the build output
 * [https://github.com/apache/nutch/blob/master/ivy/mvn.template#L128-L129]





[jira] [Created] (NUTCH-3052) Investigate using sealed classes

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3052:
---

 Summary: Investigate using sealed classes
 Key: NUTCH-3052
 URL: https://issues.apache.org/jira/browse/NUTCH-3052
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Guidance available at 
[https://www.baeldung.com/java-migrate-8-to-17#sealed-classes]

First document if and where sealed classes would add value.
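For readers unfamiliar with the feature, a sealed hierarchy (Java 17) restricts which types may implement an interface. The sketch below is purely illustrative: `FetchResult`, `Success` and `Failure` are hypothetical names and not existing Nutch classes.

```java
class SealedSketch {
  // Only the two listed types may implement FetchResult (hypothetical hierarchy)
  sealed interface FetchResult permits Success, Failure {}

  record Success(int bytes) implements FetchResult {}

  record Failure(String reason) implements FetchResult {}

  static String summarize(FetchResult result) {
    if (result instanceof Success s) {
      return "fetched " + s.bytes() + " bytes";
    }
    if (result instanceof Failure f) {
      return "failed: " + f.reason();
    }
    // The compiler does not check if-chains for exhaustiveness, hence this guard
    throw new IllegalStateException("unexpected result type");
  }

  public static void main(String[] args) {
    System.out.println(summarize(new Success(594)));
    System.out.println(summarize(new Failure("timeout")));
  }
}
```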





[jira] [Created] (NUTCH-3051) Investigate using new pattern matching syntax in switch expressions

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3051:
---

 Summary: Investigate using new pattern matching syntax in switch 
expressions
 Key: NUTCH-3051
 URL: https://issues.apache.org/jira/browse/NUTCH-3051
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Guidance available at 
[https://www.baeldung.com/java-migrate-8-to-17#2-switch-expressions]

Apparently we use switch in 35 files

[https://github.com/search?q=repo%3Aapache%2Fnutch+switch+language%3AJava=code=Java]
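
For reference, a small before/after sketch of the switch expression syntax (final since Java 14). Note that pattern matching for switch is still a preview feature in Java 17, so the sketch sticks to the arrow form. The Status enum is hypothetical and not one of Nutch's status constants.

```java
final class SwitchDemo {

  // Hypothetical status values, used only for this illustration.
  enum Status { UNFETCHED, FETCHED, GONE, REDIRECT }

  /** Classic switch statement with fall-through and explicit breaks. */
  static String oldStyle(Status s) {
    String label;
    switch (s) {
      case UNFETCHED:
        label = "todo";
        break;
      case FETCHED:
        label = "done";
        break;
      case GONE:
      case REDIRECT:
        label = "skip";
        break;
      default:
        throw new IllegalArgumentException("unknown status: " + s);
    }
    return label;
  }

  /** Switch expression: no break, no fall-through, exhaustive over the enum. */
  static String newStyle(Status s) {
    return switch (s) {
      case UNFETCHED -> "todo";
      case FETCHED -> "done";
      case GONE, REDIRECT -> "skip";
    };
  }

  public static void main(String[] args) {
    for (Status s : Status.values()) {
      System.out.println(s + ": " + oldStyle(s) + " / " + newStyle(s));
    }
  }
}
```

Because the expression form must cover every enum constant, adding a new constant later becomes a compile error instead of a silent fall into a default branch.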





[jira] [Created] (NUTCH-3050) Investigate use of the enhanced instanceof operator

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3050:
---

 Summary: Investigate use of the enhanced instanceof operator
 Key: NUTCH-3050
 URL: https://issues.apache.org/jira/browse/NUTCH-3050
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Guidance at 
[https://www.baeldung.com/java-migrate-8-to-17#1-enhanced-instanceof-operator]

Apparently we use the instanceof operator in 50 files

[https://github.com/search?q=repo%3Aapache%2Fnutch%20instanceof=code]
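
A minimal sketch of the change this would mean in practice: the pattern form of instanceof (final since Java 16) binds a variable and removes the explicit cast. The helper methods are hypothetical and only serve as an illustration.

```java
final class InstanceofDemo {

  /** Java 8 style: type test followed by an explicit cast. */
  static int lengthOld(Object value) {
    if (value instanceof CharSequence) {
      CharSequence cs = (CharSequence) value;
      return cs.length();
    }
    return -1;
  }

  /** Java 16+ style: the pattern variable cs is in scope only when the test succeeds. */
  static int lengthNew(Object value) {
    if (value instanceof CharSequence cs) {
      return cs.length();
    }
    return -1;
  }

  public static void main(String[] args) {
    System.out.println(lengthOld("nutch") + " " + lengthNew("nutch"));
    System.out.println(lengthOld(42) + " " + lengthNew(42));
  }
}
```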





[jira] [Created] (NUTCH-3049) Investigate using Records

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3049:
---

 Summary: Investigate using Records
 Key: NUTCH-3049
 URL: https://issues.apache.org/jira/browse/NUTCH-3049
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Guidance at [https://www.baeldung.com/java-migrate-8-to-17#records]

I think there are multiple areas where we could use Records. This ticket will 
document the opportunities and structure that work.
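
To make the discussion concrete, a minimal sketch of a record used as a small immutable value type. HostPort is a hypothetical example and not an existing Nutch class; candidates in the code base would need to be identified separately.

```java
import java.util.Objects;

final class RecordDemo {
  public static void main(String[] args) {
    HostPort a = new HostPort("example.org", 443);
    HostPort b = new HostPort("example.org", 443);
    // equals(), hashCode() and the accessors are generated by the compiler.
    System.out.println(a.equals(b));
    System.out.println(a.host() + ":" + a.port());
  }
}

// Hypothetical value type; the compact constructor validates the components.
record HostPort(String host, int port) {
  HostPort {
    Objects.requireNonNull(host, "host");
    if (port < 0 || port > 65535) {
      throw new IllegalArgumentException("port out of range: " + port);
    }
  }
}
```

Records suit plain data carriers. Classes that need mutable state or a custom serialization contract are likely poor candidates.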





[jira] [Created] (NUTCH-3048) Investigate where/if new string utility methods could be used

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3048:
---

 Summary: Investigate where/if new string utility methods could be 
used
 Key: NUTCH-3048
 URL: https://issues.apache.org/jira/browse/NUTCH-3048
 Project: Nutch
  Issue Type: Sub-task
  Components: util
Reporter: Lewis John McGibbney


Guidance at [https://www.baeldung.com/java-migrate-8-to-17#3-new-string-methods]

We may also be able to revisit our usage of commons-* libraries with the goal of 
using native methods from the JDK.
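
A short sketch of some of the String methods added in Java 11 (isBlank, strip, lines, repeat). The helper names are hypothetical. One caveat when replacing library helpers: String.isBlank() throws a NullPointerException on null, so a null-tolerant helper still needs its own null check.

```java
import java.util.List;
import java.util.stream.Collectors;

final class StringMethodsDemo {

  /** Null-tolerant blank check built on String.isBlank() (Java 11). */
  static boolean isBlank(String s) {
    return s == null || s.isBlank();
  }

  /** Splits text into lines, strips each line and drops empty ones. */
  static List<String> nonBlankLines(String text) {
    return text.lines()
        .map(String::strip)
        .filter(line -> !line.isEmpty())
        .collect(Collectors.toList());
  }

  /** Builds a separator line with String.repeat() (Java 11). */
  static String rule(int width) {
    return "-".repeat(width);
  }

  public static void main(String[] args) {
    System.out.println(isBlank("  \t"));
    System.out.println(nonBlankLines("  a \n\n b\n"));
    System.out.println(rule(10));
  }
}
```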





[jira] [Created] (NUTCH-3047) Use multi-line text blocks

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3047:
---

 Summary: Use multi-line text blocks
 Key: NUTCH-3047
 URL: https://issues.apache.org/jira/browse/NUTCH-3047
 Project: Nutch
  Issue Type: Sub-task
  Components: CLI
Reporter: Lewis John McGibbney


Guidance available at 
[https://www.baeldung.com/java-migrate-8-to-17#2-text-block]

This will help to clean up our CLI *usage()* messages at a bare minimum.
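
A minimal sketch of what this could look like for a usage message. The tool name and options are hypothetical and not copied from any Nutch class; the point is only that the text block (final since Java 15) yields the same string as the concatenated form.

```java
final class UsageDemo {

  // Before: concatenation with explicit newlines (hypothetical tool and options).
  static final String USAGE_OLD =
      "Usage: ExampleTool <crawldb> [-force]\n"
          + "  <crawldb>  path to the CrawlDb\n"
          + "  -force     overwrite existing output\n";

  // After: a text block; the common leading indentation is stripped.
  static final String USAGE_NEW = """
      Usage: ExampleTool <crawldb> [-force]
        <crawldb>  path to the CrawlDb
        -force     overwrite existing output
      """;

  public static void main(String[] args) {
    System.err.print(USAGE_NEW);
    System.out.println(USAGE_OLD.equals(USAGE_NEW));
  }
}
```

The position of the closing delimiter determines how much indentation is removed, so it is worth keeping it aligned with the first content line.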





[jira] [Updated] (NUTCH-3046) Use compact strings

2024-04-29 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3046:

Description: 
Follow the guidance at 
[https://www.baeldung.com/java-migrate-8-to-17#1-compact-string]

It looks like there are 9 instances where we use _*char []*_

|[https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D=code]].

  was:
Follow the guidance at 
[https://www.baeldung.com/java-migrate-8-to-17#1-compact-string]

It looks like there are [9 instances where we use 
char[]|[https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D=code]].


> Use compact strings
> ---
>
> Key: NUTCH-3046
> URL: https://issues.apache.org/jira/browse/NUTCH-3046
> Project: Nutch
>  Issue Type: Sub-task
>Reporter: Lewis John McGibbney
>Priority: Major
>
> Follow the guidance at 
> [https://www.baeldung.com/java-migrate-8-to-17#1-compact-string]
> It looks like there are 9 instances where we use _*char []*_
> |[https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D=code]].





[jira] [Commented] (NUTCH-1806) Delegate processing of URL domains to crawler commons

2024-04-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841995#comment-17841995
 ] 

ASF GitHub Bot commented on NUTCH-1806:
---

sebastian-nagel opened a new pull request, #816:
URL: https://github.com/apache/nutch/pull/816

   and NUTCH-1942 Remove TopLevelDomain
   
   - use methods from crawler-commons' EffectiveTldFinder in URLUtil, replacing 
classes and methods from the "org.apache.nutch.util.domain" package
   
   - adapt and extend unit tests
 - add tests for URLUtil.getTopLevelDomainName(url)
 - reflect changes to the public suffix list since 2014 ("xyz" is now a 
public suffix / ICANN suffix)
 - adapt to minor API changes
- URLUtil.getDomainName(url) returns the host name in case no valid 
public suffix is found
- for Unicode suffixes and TLDs the methods 
URLUtil.getDomainSuffix(url) and URLUtil.getTopLevelDomainName(url), 
respectively, now return the ASCII representation
  - add unit tests for host names with trailing dot ("www.apache.org.")
  - add unit test for URLs without host/domain (cf. NUTCH-2450)
   
   - update and complete Javadoc
   
   - update DomainStatistics, TLDIndexingFilter and domain URL filters to use 
the updated methods in URLUtil
   - remove the class TLDScoringFilter. The configuration is bound to the 
domain-suffixes.xml which wasn't maintained anymore and is now removed
   - remove package org.apache.nutch.util.domain
   - move DomainStatistics to org.apache.nutch.util
   - remove configuration files of domain utils




> Delegate processing of URL domains to crawler commons
> -
>
> Key: NUTCH-1806
>     URL: https://issues.apache.org/jira/browse/NUTCH-1806
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.8
>Reporter: Julien Nioche
>Priority: Major
>  Labels: crawler-commons
> Fix For: 1.21
>
>
> We have code in src/java/org/apache/nutch/util/domain and a resource file 
> conf/domain-suffixes.xml to handle URL domains. This is used mostly from 
> URLUtil.getDomainName.
> The resource file is not necessarily up to date and since crawler commons has 
> a similar functionality we should use it instead of having to maintain our 
> own resources.





[jira] [Created] (NUTCH-3046) Use compact strings

2024-04-28 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3046:
---

 Summary: Use compact strings
 Key: NUTCH-3046
 URL: https://issues.apache.org/jira/browse/NUTCH-3046
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Follow the guidance at 
[https://www.baeldung.com/java-migrate-8-to-17#1-compact-string]

It looks like there are [9 instances where we use 
char[]|[https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D=code]].





[jira] [Created] (NUTCH-3045) Upgrade from Java 11 to 17

2024-04-28 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3045:
---

 Summary: Upgrade from Java 11 to 17
 Key: NUTCH-3045
 URL: https://issues.apache.org/jira/browse/NUTCH-3045
 Project: Nutch
  Issue Type: Task
  Components: build, ci/cd
Reporter: Lewis John McGibbney
 Fix For: 1.21


This parent issue will track and organize work pertaining to upgrading Nutch to 
JDK 17.

Premier support for Oracle JDK 11 ended 7 months ago (30 Sep 2023).





[jira] [Commented] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails

2024-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841682#comment-17841682
 ] 

ASF GitHub Bot commented on NUTCH-3044:
---

lewismc commented on PR #815:
URL: https://github.com/apache/nutch/pull/815#issuecomment-2081564107

   Excellent @sebastian-nagel +1




> Generator: NPE when extracting the host part of a URL fails
> ---
>
> Key: NUTCH-3044
> URL: https://issues.apache.org/jira/browse/NUTCH-3044
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> When extracting the host part of a URL fails, the Generator job fails because 
> of an NPE in the SelectorReducer. This issue is reproducible if the CrawlDb 
> contains a malformed URL, for example, a URL with an unsupported scheme 
> (smb://).
> {noformat}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:439)
>   at 
> org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:300)
> {noformat}





[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841681#comment-17841681
 ] 

ASF GitHub Bot commented on NUTCH-3043:
---

lewismc commented on PR #814:
URL: https://github.com/apache/nutch/pull/814#issuecomment-2081563229

   Excellent @sebastian-nagel 




> Generator: count URLs rejected by URL filters
> -
>
> Key: NUTCH-3043
> URL: https://issues.apache.org/jira/browse/NUTCH-3043
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> Generator already counts URLs rejected by the (re)fetch scheduler, by fetch 
> interval or status. It should also count the number of URLs rejected by URL 
> filters.
> See also [Generator 
> metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator].





[jira] [Commented] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails

2024-04-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841481#comment-17841481
 ] 

ASF GitHub Bot commented on NUTCH-3044:
---

sebastian-nagel commented on PR #815:
URL: https://github.com/apache/nutch/pull/815#issuecomment-2080743831

   ... also fixed the Javadoc error.




> Generator: NPE when extracting the host part of a URL fails
> ---
>
> Key: NUTCH-3044
> URL: https://issues.apache.org/jira/browse/NUTCH-3044
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> When extracting the host part of a URL fails, the Generator job fails because 
> of an NPE in the SelectorReducer. This issue is reproducible if the CrawlDb 
> contains a malformed URL, for example, a URL with an unsupported scheme 
> (smb://).
> {noformat}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:439)
>   at 
> org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:300)
> {noformat}





[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-04-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841472#comment-17841472
 ] 

ASF GitHub Bot commented on NUTCH-3043:
---

sebastian-nagel commented on PR #814:
URL: https://github.com/apache/nutch/pull/814#issuecomment-2080634329

   Hi @lewismc:
   - "use parameterized logging": done
   - "augment the [metrics 
documentation](https://cwiki.apache.org/confluence/display/NUTCH/Metrics) once 
this is merged.": will do
   - "we could also [create a test for the 
counters](https://cwiki.apache.org/confluence/display/MRUNIT/MRUnit+Tutorial#MRUnitTutorial-TestingCounters).":
 for now, TestGenerator is not based on MRUNIT. The various 
Generator::generate(...) methods return the number of generated segments without a way 
to access the counters (they're logged, however). I'd prefer to track this in a 
separate issue, because it would require too many code changes to read the 
counters.




> Generator: count URLs rejected by URL filters
> -
>
> Key: NUTCH-3043
>     URL: https://issues.apache.org/jira/browse/NUTCH-3043
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> Generator already counts URLs rejected by the (re)fetch scheduler, by fetch 
> interval or status. It should also count the number of URLs rejected by URL 
> filters.
> See also [Generator 
> metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator].





[jira] [Commented] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails

2024-04-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841470#comment-17841470
 ] 

ASF GitHub Bot commented on NUTCH-3044:
---

sebastian-nagel commented on PR #815:
URL: https://github.com/apache/nutch/pull/815#issuecomment-2080603546

   > we could provide a TestGenerator#testNullHostInReducer test case
   
   Good idea! Done, see 4729786.




> Generator: NPE when extracting the host part of a URL fails
> ---
>
> Key: NUTCH-3044
> URL: https://issues.apache.org/jira/browse/NUTCH-3044
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> When extracting the host part of a URL fails, the Generator job fails because 
> of an NPE in the SelectorReducer. This issue is reproducible if the CrawlDb 
> contains a malformed URL, for example, a URL with an unsupported scheme 
> (smb://).
> {noformat}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:439)
>   at 
> org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:300)
> {noformat}





[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-04-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840892#comment-17840892
 ] 

ASF GitHub Bot commented on NUTCH-3043:
---

lewismc commented on code in PR #814:
URL: https://github.com/apache/nutch/pull/814#discussion_r1579883313


##
src/java/org/apache/nutch/crawl/Generator.java:
##
@@ -253,10 +256,7 @@ public void map(Text key, CrawlDatum value, Context 
context)
   try {
 sort = scfilters.generatorSortValue(key, crawlDatum, sort);
   } catch (ScoringFilterException sfe) {
-if (LOG.isWarnEnabled()) {
-  LOG.warn(
-  "Couldn't filter generatorSortValue for " + key + ": " + sfe);
-}
+LOG.warn("Couldn't filter generatorSortValue for " + key + ": " + sfe);

Review Comment:
   Please use parameterized logging.
   ```
   LOG.warn("Couldn't filter generatorSortValue for {}: {}", key, sfe);
   ```





> Generator: count URLs rejected by URL filters
> -
>
> Key: NUTCH-3043
> URL: https://issues.apache.org/jira/browse/NUTCH-3043
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> Generator already counts URLs rejected by the (re)fetch scheduler, by fetch 
> interval or status. It should also count the number of URLs rejected by URL 
> filters.
> See also [Generator 
> metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator].





[jira] [Commented] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails

2024-04-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840854#comment-17840854
 ] 

ASF GitHub Bot commented on NUTCH-3044:
---

sebastian-nagel opened a new pull request, #815:
URL: https://github.com/apache/nutch/pull/815

   (no comment)




> Generator: NPE when extracting the host part of a URL fails
> ---
>
> Key: NUTCH-3044
> URL: https://issues.apache.org/jira/browse/NUTCH-3044
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> When extracting the host part of a URL fails, the Generator job fails because 
> of an NPE in the SelectorReducer. This issue is reproducible if the CrawlDb 
> contains a malformed URL, for example, a URL with an unsupported scheme 
> (smb://).
> {noformat}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:439)
>   at 
> org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:300)
> {noformat}





[jira] [Created] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails

2024-04-25 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3044:
--

 Summary: Generator: NPE when extracting the host part of a URL 
fails
 Key: NUTCH-3044
 URL: https://issues.apache.org/jira/browse/NUTCH-3044
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.20
Reporter: Sebastian Nagel
 Fix For: 1.21


When extracting the host part of a URL fails, the Generator job fails because 
of an NPE in the SelectorReducer. This issue is reproducible if the CrawlDb 
contains a malformed URL, for example, a URL with an unsupported scheme 
(smb://).

{noformat}
Caused by: java.lang.NullPointerException
  at org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:439)
  at org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:300)
{noformat}
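
The failure mode can be sketched without Hadoop: a helper that returns null when the host cannot be extracted, and a caller that has to check for null before using the host as a map key. The class and method names below are hypothetical and this is not the actual Generator code or the actual fix; it only illustrates the kind of guard that avoids the NPE.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class HostGuardDemo {

  /** Returns the lower-cased host, or null if the URL cannot be parsed. */
  static String hostOrNull(String url) {
    try {
      // toURL() rejects schemes without a registered handler, e.g. smb://
      String host = new URI(url).toURL().getHost();
      return (host == null || host.isEmpty()) ? null : host.toLowerCase(Locale.ROOT);
    } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
      return null;
    }
  }

  /** Counts URLs per host; URLs without a usable host go to the skipped list. */
  static Map<String, Integer> countByHost(List<String> urls, List<String> skipped) {
    Map<String, Integer> counts = new HashMap<>();
    for (String url : urls) {
      String host = hostOrNull(url);
      if (host == null) {
        // Without this guard the null host would be dereferenced later on.
        skipped.add(url);
        continue;
      }
      counts.merge(host, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    List<String> skipped = new ArrayList<>();
    Map<String, Integer> counts = countByHost(
        List.of("https://example.org/a", "smb://fileserver/share", "https://Example.org/b"),
        skipped);
    System.out.println(counts + " skipped=" + skipped);
  }
}
```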





[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-04-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840845#comment-17840845
 ] 

ASF GitHub Bot commented on NUTCH-3043:
---

sebastian-nagel opened a new pull request, #814:
URL: https://github.com/apache/nutch/pull/814

   - add counters URL_FILTERS_REJECTED and URL_FILTER_EXCEPTION
   - simplify logging statement
   - remove unnecessary cast
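
   The counting logic can be sketched roughly as follows. This is a simplified, Hadoop-free illustration and not the code of the pull request: the EnumMap stands in for the job counters, a UnaryOperator stands in for the URL filter chain (assuming the convention that a filter returns null to reject a URL), and RuntimeException stands in for the filter exception. Only the two counter names are taken from the list above.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

final class FilterCounterDemo {

  // Counter names as in the pull request; the enum is a stand-in for job counters.
  enum Counter { URL_FILTERS_REJECTED, URL_FILTER_EXCEPTION }

  /** Applies the filter to every URL and counts rejections and failures. */
  static List<String> filterAll(List<String> urls, UnaryOperator<String> filter,
      Map<Counter, Long> counters) {
    List<String> accepted = new ArrayList<>();
    for (String url : urls) {
      try {
        String filtered = filter.apply(url);
        if (filtered == null) {
          // Assumed convention: null means the URL was rejected.
          counters.merge(Counter.URL_FILTERS_REJECTED, 1L, Long::sum);
        } else {
          accepted.add(filtered);
        }
      } catch (RuntimeException e) {
        // Stand-in for a filter exception; the URL is dropped and counted.
        counters.merge(Counter.URL_FILTER_EXCEPTION, 1L, Long::sum);
      }
    }
    return accepted;
  }

  public static void main(String[] args) {
    Map<Counter, Long> counters = new EnumMap<>(Counter.class);
    UnaryOperator<String> httpsOnly = url -> {
      if (url.isEmpty()) {
        throw new IllegalArgumentException("empty URL");
      }
      return url.startsWith("https://") ? url : null;
    };
    List<String> kept = filterAll(
        List.of("https://example.org/", "ftp://example.org/", ""), httpsOnly, counters);
    System.out.println(kept + " " + counters);
  }
}
```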




> Generator: count URLs rejected by URL filters
> -
>
> Key: NUTCH-3043
> URL: https://issues.apache.org/jira/browse/NUTCH-3043
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> Generator already counts URLs rejected by the (re)fetch scheduler, by fetch 
> interval or status. It should also count the number of URLs rejected by URL 
> filters.
> See also [Generator 
> metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-04-25 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3043:
--

 Summary: Generator: count URLs rejected by URL filters
 Key: NUTCH-3043
 URL: https://issues.apache.org/jira/browse/NUTCH-3043
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Affects Versions: 1.20
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.21


Generator already counts URLs rejected by the (re)fetch scheduler, by fetch 
interval or status. It should also count the number of URLs rejected by URL 
filters.

See also [Generator 
metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator].





[jira] [Commented] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-04-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839186#comment-17839186
 ] 

ASF GitHub Bot commented on NUTCH-3041:
---

lewismc commented on PR #813:
URL: https://github.com/apache/nutch/pull/813#issuecomment-2067543713

   The logging now looks as follows
   ```INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] 
Found 1 URLExemptionFilter implementations: 
'[org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter@3090c372]'```.
   If no URLExemptionFilter implementations are found then no log statement is 
produced. 




> Address confusing logging in o.a.n.net.URLExemptionFilters 
> ---
>
> Key: NUTCH-3041
> URL: https://issues.apache.org/jira/browse/NUTCH-3041
> Project: Nutch
>  Issue Type: Task
>  Components: net
>Affects Versions: 1.19, 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> URLExemptionFilter implementations are used to allow exemptions to external 
> domain resources by overriding the {{db.ignore.external.links}} configuration 
> setting. This is useful when the crawl is focused on a domain but resources 
> like images are hosted on a CDN.
> Currently [URLExemptionFilters|#L47-L48]] provides the following logging
> {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor 
> #0|#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> {quote}
> I find this confusing. It would be better to log *only* if an 
> URLExemptionFilter implementation is actually configured to be used at 
> runtime.
> I will provide a patch for this.





[jira] [Updated] (NUTCH-3042) Use GitHub cache action to improve CI execution time

2024-04-19 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3042:

Description: 
With the Ant+Ivy build architecture, the current GitHub actions workflow can 
and regularly does take over 20 minutes to complete. Dependency retrieval takes 
a significant amount of time.

I think we can address the above issue and dramatically reduce the CI runtime 
by utilizing the official [GitHub cache 
action|[https://github.com/actions/cache]].

It appears however that the action does not support the Apache Ivy cache. Both 
Maven and Gradle are supported. I [created a 
discussion|[https://github.com/actions/cache/discussions/1381]] to get 
confirmation.

In the case that we cannot implement a cache for the Ivy build system then we 
will need to come back to this issue once we migrate to Gradle.

  was:
With the Ant+Ivy build architecture, the current GitHub actions workflow can 
and regularly does take over 20 minutes to complete. Dependency retrieval takes 
a significant amount of time.

I think we can address the above issue and dramatically reduce the CI runtime 
by utilizing the official [GitHub cache 
action|[https://github.com/actions/cache]].

It appears however that the action does not support the Apache Ivy cache. Both 
Maven and Gradle are supported. I created a discussion to get confirmation if 
this is the case.

In the case that we cannot implement a cache for the Ivy build system then we 
will need to come back to this issue once we migrate to Gradle.


> Use GitHub cache action to improve CI execution time
> 
>
> Key: NUTCH-3042
> URL: https://issues.apache.org/jira/browse/NUTCH-3042
> Project: Nutch
>  Issue Type: Task
>  Components: ci/cd
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.21
>
>
> With the Ant+Ivy build architecture, the current GitHub actions workflow can 
> and regularly does take over 20 minutes to complete. Dependency retrieval 
> takes a significant amount of time.
> I think we can address the above issue and dramatically reduce the CI runtime 
> by utilizing the official [GitHub cache 
> action|[https://github.com/actions/cache]].
> It appears however that the action does not support the Apache Ivy cache. 
> Both Maven and Gradle are supported. I [created a 
> discussion|[https://github.com/actions/cache/discussions/1381]] to get 
> confirmation.
> In the case that we cannot implement a cache for the Ivy build system then we 
> will need to come back to this issue once we migrate to Gradle.





[jira] [Created] (NUTCH-3042) Use GitHub cache action to improve CI execution time

2024-04-19 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3042:
---

 Summary: Use GitHub cache action to improve CI execution time
 Key: NUTCH-3042
 URL: https://issues.apache.org/jira/browse/NUTCH-3042
 Project: Nutch
  Issue Type: Task
  Components: ci/cd
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.21


With the Ant+Ivy build architecture, the current GitHub actions workflow can 
and regularly does take over 20 minutes to complete. Dependency retrieval takes 
a significant amount of time.

I think we can address the above issue and dramatically reduce the CI runtime 
by utilizing the official [GitHub cache 
action|[https://github.com/actions/cache]].

It appears however that the action does not support the Apache Ivy cache. Both 
Maven and Gradle are supported. I created a discussion to get confirmation if 
this is the case.

In the case that we cannot implement a cache for the Ivy build system then we 
will need to come back to this issue once we migrate to Gradle.





[jira] [Work started] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-04-19 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3041 started by Lewis John McGibbney.
---
> Address confusing logging in o.a.n.net.URLExemptionFilters 
> ---
>
> Key: NUTCH-3041
> URL: https://issues.apache.org/jira/browse/NUTCH-3041
> Project: Nutch
>  Issue Type: Task
>  Components: net
>Affects Versions: 1.19, 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> URLExemptionFilter implementations are used to allow exemptions to external 
> domain resources by overriding the {{db.ignore.external.links}} configuration 
> setting. This is useful when the crawl is focused on a domain but resources 
> like images are hosted on a CDN.
> Currently [URLExemptionFilters|#L47-L48]] provides the following logging
> {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor 
> #0|#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> {quote}
> I find this confusing. It would be better to log *only* if an 
> URLExemptionFilter implementation is actually configured to be used at 
> runtime.
> I will provide a patch for this.





[jira] [Commented] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-04-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839181#comment-17839181
 ] 

ASF GitHub Bot commented on NUTCH-3041:
---

lewismc opened a new pull request, #813:
URL: https://github.com/apache/nutch/pull/813

   PR to address https://issues.apache.org/jira/browse/NUTCH-3041




> Address confusing logging in o.a.n.net.URLExemptionFilters 
> ---
>
> Key: NUTCH-3041
> URL: https://issues.apache.org/jira/browse/NUTCH-3041
> Project: Nutch
>  Issue Type: Task
>  Components: net
>Affects Versions: 1.19, 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> URLExemptionFilter implementations are used to allow exemptions to external 
> domain resources by overriding the {{db.ignore.external.links}} configuration 
> setting. This is useful when the crawl is focused on a domain but resources 
> like images are hosted on a CDN.
> Currently [URLExemptionFilters|#L47-L48]] provides the following logging
> {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor 
> #0|#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> {quote}
> I find this confusing. It would be better to log *only* if an 
> URLExemptionFilter implementation is actually configured to be used at 
> runtime.
> I will provide a patch for this.





[jira] [Updated] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-04-19 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3041:

Description: 
URLExemptionFilter implementations are used to allow exemptions to external 
domain resources by overriding the {{db.ignore.external.links}} configuration 
setting. This is useful when the crawl is focused on a domain but resources 
like images are hosted on a CDN.

Currently [URLExemptionFilters|#L47-L48]] provides the following logging
{quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor 
#0|#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
{quote}
I find this confusing. It would be better to log *only* if an 
URLExemptionFilter implementation is actually configured to be used at runtime.

I will provide a patch for this.

  was:
URLExemptionFilter implementations are used to allow exemptions to external 
domain resources by overriding the {{db.ignore.external.links}} configuration 
setting. This is useful when the crawl is focused on a domain but resources 
like images are hosted on a CDN.

Currently [URLExemptionFilters|#L47-L48]] provides the following logging
{quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor 
#0|#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
{quote}
I find this confusing. It would be better to log *only* if an 
URLExemptionFilter implementation actually exists for a given URL.

I will provide a patch for this.


> Address confusing logging in o.a.n.net.URLExemptionFilters 
> ---
>
> Key: NUTCH-3041
> URL: https://issues.apache.org/jira/browse/NUTCH-3041
> Project: Nutch
>  Issue Type: Task
>  Components: net
>Affects Versions: 1.19, 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> URLExemptionFilter implementations are used to allow exemptions to external 
> domain resources by overriding the {{db.ignore.external.links}} configuration 
> setting. This is useful when the crawl is focused on a domain but resources 
> like images are hosted on a CDN.
> Currently [URLExemptionFilters|#L47-L48]] provides the following logging
> {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor 
> #0|#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> {quote}
> I find this confusing. It would be better to log *only* if an 
> URLExemptionFilter implementation is actually configured to be used at 
> runtime.
> I will provide a patch for this.





[jira] [Updated] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-04-19 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3041:

Description: 
URLExemptionFilter implementations are used to allow exemptions to external 
domain resources by overriding the {{db.ignore.external.links}} configuration 
setting. This is useful when the crawl is focused on a domain but resources 
like images are hosted on a CDN.

Currently 
[URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
 provides the following logging
{quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] 
Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
{quote}
I find this confusing. It would be better to log *only* if a 
URLExemptionFilter implementation actually exists for a given URL.

I will provide a patch for this.

  was:
URLExemptionFilter implementations are used to allow exemptions to external 
domain resources by overriding the {{db.ignore.external.links}} configuration 
setting. This is useful when the crawl is focused on a domain but resources 
like images are hosted on a CDN.

Currently 
[URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
 provides some confusing INFO-level logging
{quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] 
Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
{quote}
I find this confusing. It would be better to log *only* if a 
URLExemptionFilter implementation actually exists for a given URL.

I will provide a patch for this.


> Address confusing logging in o.a.n.net.URLExemptionFilters 
> ---
>
> Key: NUTCH-3041
> URL: https://issues.apache.org/jira/browse/NUTCH-3041
> Project: Nutch
>  Issue Type: Task
>  Components: net
>Affects Versions: 1.19, 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> URLExemptionFilter implementations are used to allow exemptions to external 
> domain resources by overriding the {{db.ignore.external.links}} configuration 
> setting. This is useful when the crawl is focused on a domain but resources 
> like images are hosted on a CDN.
> Currently 
> [URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
>  provides the following logging
> {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] 
> Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> {quote}
> I find this confusing. It would be better to log *only* if a 
> URLExemptionFilter implementation actually exists for a given URL.
> I will provide a patch for this.





[jira] [Created] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-04-19 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3041:
---

 Summary: Address confusing logging in 
o.a.n.net.URLExemptionFilters 
 Key: NUTCH-3041
 URL: https://issues.apache.org/jira/browse/NUTCH-3041
 Project: Nutch
  Issue Type: Task
  Components: net
Affects Versions: 1.19, 1.20
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.21


URLExemptionFilter implementations are used to allow exemptions to external 
domain resources by overriding the {{db.ignore.external.links}} configuration 
setting. This is useful when the crawl is focused on a domain but resources 
like images are hosted on a CDN.

Currently 
[URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
 provides some confusing INFO-level logging
{quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] 
Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
{quote}
I find this confusing. It would be better to log *only* if a 
URLExemptionFilter implementation actually exists for a given URL.

I will provide a patch for this.
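
The logging change proposed above could look roughly like the sketch below. It is not the actual patch and does not use the real URLExemptionFilters class: the class name ExemptionLoggingSketch, the method startupMessage and the filter name in main are illustrative assumptions. The point is only that the "Found N ..." line is produced when at least one filter is configured and suppressed otherwise.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for illustration; not the Nutch URLExemptionFilters class. */
final class ExemptionLoggingSketch {

  /**
   * Returns the message to log for the configured exemption filters,
   * or an empty Optional when no filter is configured (nothing to log).
   */
  static Optional<String> startupMessage(List<String> filterNames) {
    if (filterNames.isEmpty()) {
      // No implementation configured: stay silent instead of "Found 0 ...".
      return Optional.empty();
    }
    return Optional.of("Found " + filterNames.size()
        + " URLExemptionFilter implementation(s): "
        + String.join(", ", filterNames));
  }

  public static void main(String[] args) {
    // Nothing is printed for an empty configuration.
    startupMessage(List.of()).ifPresent(System.out::println);
    // The filter name here is only an example value.
    startupMessage(List.of("urlfilter-ignoreexempt")).ifPresent(System.out::println);
  }
}
```

In real code the Optional would likely be replaced by a guarded call to the logger; it is used here only to keep the sketch free of logging dependencies.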





[jira] [Commented] (NUTCH-3040) Upgrade to Hadoop 3.4.0

2024-04-11 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836191#comment-17836191
 ] 

Tim Allison commented on NUTCH-3040:


:cry-sob: This is great news!

> Upgrade to Hadoop 3.4.0
> ---
>
> Key: NUTCH-3040
> URL: https://issues.apache.org/jira/browse/NUTCH-3040
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> [Hadoop 3.4.0|https://hadoop.apache.org/release/3.4.0.html] has been released.
> Many dependencies are upgraded, including commons-io 2.14.0 which would have 
> saved us a lot of work in NUTCH-2959.





[jira] [Created] (NUTCH-3040) Upgrade to Hadoop 3.4.0

2024-04-11 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3040:
--

 Summary: Upgrade to Hadoop 3.4.0
 Key: NUTCH-3040
 URL: https://issues.apache.org/jira/browse/NUTCH-3040
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.20
Reporter: Sebastian Nagel
 Fix For: 1.21


[Hadoop 3.4.0|https://hadoop.apache.org/release/3.4.0.html] has been released.

Many dependencies are upgraded, including commons-io 2.14.0 which would have 
saved us a lot of work in NUTCH-2959.





[jira] [Commented] (NUTCH-3039) Failure to handle ftp:// URLs

2024-04-11 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836133#comment-17836133
 ] 

Markus Jelsma commented on NUTCH-3039:
--

Thanks for spotting that!

> Failure to handle ftp:// URLs
> -
>
> Key: NUTCH-3039
> URL: https://issues.apache.org/jira/browse/NUTCH-3039
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> Nutch fails to handle ftp:// URLs:
> - URLNormalizerBasic returns the empty string because creating the URL 
> instance fails with a MalformedURLException:
>   {noformat}
> echo "ftp://ftp.example.com/path/file.txt" \
>   | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat}
> - fetching an ftp:// URL with the protocol-ftp plugin enabled also fails due 
> to a MalformedURLException:
>   {noformat}
> bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \
>"ftp://ftp.example.com/path/file.txt"
> ...
> Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: 
> java.net.MalformedURLException
> at 
> org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)
> ...{noformat}
> The issue is caused by NUTCH-2429:
> - we do not provide a dedicated URL stream handler for ftp URLs
> - but also do not pass ftp:// URLs to the standard JVM handler





[jira] [Commented] (NUTCH-3039) Failure to handle ftp:// URLs

2024-04-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836126#comment-17836126
 ] 

ASF GitHub Bot commented on NUTCH-3039:
---

sebastian-nagel opened a new pull request, #812:
URL: https://github.com/apache/nutch/pull/812

   Pass ftp:// URLs to the standard JVM URLStreamHandler




> Failure to handle ftp:// URLs
> -
>
> Key: NUTCH-3039
> URL: https://issues.apache.org/jira/browse/NUTCH-3039
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> Nutch fails to handle ftp:// URLs:
> - URLNormalizerBasic returns the empty string because creating the URL 
> instance fails with a MalformedURLException:
>   {noformat}
> echo "ftp://ftp.example.com/path/file.txt" \
>   | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat}
> - fetching an ftp:// URL with the protocol-ftp plugin enabled also fails due 
> to a MalformedURLException:
>   {noformat}
> bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \
>"ftp://ftp.example.com/path/file.txt"
> ...
> Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: 
> java.net.MalformedURLException
> at 
> org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)
> ...{noformat}
> The issue is caused by NUTCH-2429:
> - we do not provide a dedicated URL stream handler for ftp URLs
> - but also do not pass ftp:// URLs to the standard JVM handler



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (NUTCH-3039) Failure to handle ftp:// URLs

2024-04-11 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-3039:
--

Assignee: Sebastian Nagel

> Failure to handle ftp:// URLs
> -
>
> Key: NUTCH-3039
> URL: https://issues.apache.org/jira/browse/NUTCH-3039
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> Nutch fails to handle ftp:// URLs:
> - URLNormalizerBasic returns the empty string because creating the URL 
> instance fails with a MalformedURLException:
>   {noformat}
> echo "ftp://ftp.example.com/path/file.txt" \
>   | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat}
> - fetching an ftp:// URL with the protocol-ftp plugin enabled also fails due 
> to a MalformedURLException:
>   {noformat}
> bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \
>"ftp://ftp.example.com/path/file.txt"
> ...
> Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: 
> java.net.MalformedURLException
> at 
> org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)
> ...{noformat}
> The issue is caused by NUTCH-2429:
> - we do not provide a dedicated URL stream handler for ftp URLs
> - but also do not pass ftp:// URLs to the standard JVM handler





[jira] [Created] (NUTCH-3039) Failure to handle ftp:// URLs

2024-04-11 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3039:
--

 Summary: Failure to handle ftp:// URLs
 Key: NUTCH-3039
 URL: https://issues.apache.org/jira/browse/NUTCH-3039
 Project: Nutch
  Issue Type: Bug
  Components: plugin, protocol
Affects Versions: 1.19
Reporter: Sebastian Nagel
 Fix For: 1.21


Nutch fails to handle ftp:// URLs:
- URLNormalizerBasic returns the empty string because creating the URL instance 
fails with a MalformedURLException:
  {noformat}
echo "ftp://ftp.example.com/path/file.txt" \
  | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat}
- fetching an ftp:// URL with the protocol-ftp plugin enabled also fails due to 
a MalformedURLException:
  {noformat}
bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \
   "ftp://ftp.example.com/path/file.txt"
...
Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: 
java.net.MalformedURLException
at 
org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)
...{noformat}


The issue is caused by NUTCH-2429:
- we do not provide a dedicated URL stream handler for ftp URLs
- but also do not pass ftp:// URLs to the standard JVM handler
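
The second cause listed above can be illustrated with a minimal sketch of a URLStreamHandlerFactory that hands ftp back to the JVM. This is not the Nutch implementation: the class name FtpFallbackSketch and the set of protocols left to the JVM are assumptions made for the example. It relies on documented java.net.URL behaviour: when an installed factory returns null for a protocol, the JVM falls back to its built-in handler.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Set;

/** Hypothetical factory for illustration; not the Nutch URL stream handler factory. */
final class FtpFallbackSketch implements URLStreamHandlerFactory {

  /** Protocols left to the JVM's built-in handlers (an assumed set). */
  private static final Set<String> JVM_HANDLED =
      Set.of("http", "https", "file", "jar", "ftp");

  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    if (JVM_HANDLED.contains(protocol)) {
      // Returning null makes java.net.URL use its default handler.
      return null;
    }
    // Placeholder handler for any other protocol: URLs can be constructed
    // and parsed, but opening a connection is not supported.
    return new URLStreamHandler() {
      @Override
      protected URLConnection openConnection(URL u) {
        throw new UnsupportedOperationException(
            "no connection support for " + u.getProtocol());
      }
    };
  }

  public static void main(String[] args) throws MalformedURLException {
    FtpFallbackSketch factory = new FtpFallbackSketch();
    System.out.println("ftp delegated to JVM: "
        + (factory.createURLStreamHandler("ftp") == null));
    // The factory is deliberately not installed here; this line only shows
    // that the stock JVM handler can construct an ftp:// URL.
    URL url = new URL("ftp://ftp.example.com/path/file.txt");
    System.out.println(url.getProtocol() + " " + url.getHost());
  }
}
```

The sketch does not call URL.setURLStreamHandlerFactory, because that can be done only once per JVM and would affect the whole process.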





[jira] [Commented] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-08 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835083#comment-17835083
 ] 

Hudson commented on NUTCH-3038:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #157 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/157/])
NUTCH-3038 Address issues discovered during 1.20 release management dryrun 
(#811) (github: 
[https://github.com/apache/nutch/commit/271f92e11c39b7a3583cfcd8d664262cfac59674])
* (edit) ivy/mvn.template
* (add) CHANGES.md
* (delete) CHANGES.txt
* (edit) build.xml
* (edit) docker/Dockerfile
* (edit) docker/README.md


> Address issues discovered during 1.20 release management dryrun
> ---
>
> Key: NUTCH-3038
> URL: https://issues.apache.org/jira/browse/NUTCH-3038
> Project: Nutch
>  Issue Type: Task
>  Components: build, docker
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 1.20
>
>
> During the 1.20 release management dryrun I discovered the following issues 
> which I think should be addressed in order to be satisfied with the release 
> candidate
>  # Update docker/README to remove broken badge
>  # Upgrade alpine base image in docker/Dockerfile
>  # Migrate CHANGES.txt to CHANGES.md
>  # Upgrade apache parent pom version from 23 to 31
>  # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
>  # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
> ivy/mvn.template
>  # Remove miredot plugin usage from ivy/mvn.template





[jira] [Commented] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835078#comment-17835078
 ] 

ASF GitHub Bot commented on NUTCH-3038:
---

lewismc merged PR #811:
URL: https://github.com/apache/nutch/pull/811




> Address issues discovered during 1.20 release management dryrun
> ---
>
> Key: NUTCH-3038
> URL: https://issues.apache.org/jira/browse/NUTCH-3038
> Project: Nutch
>  Issue Type: Task
>  Components: build, docker
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 1.20
>
>
> During the 1.20 release management dryrun I discovered the following issues 
> which I think should be addressed in order to be satisfied with the release 
> candidate
>  # Update docker/README to remove broken badge
>  # Upgrade alpine base image in docker/Dockerfile
>  # Migrate CHANGES.txt to CHANGES.md
>  # Upgrade apache parent pom version from 23 to 31
>  # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
>  # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
> ivy/mvn.template
>  # Remove miredot plugin usage from ivy/mvn.template





[jira] [Resolved] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-08 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3038.
-
Resolution: Fixed

> Address issues discovered during 1.20 release management dryrun
> ---
>
> Key: NUTCH-3038
> URL: https://issues.apache.org/jira/browse/NUTCH-3038
> Project: Nutch
>  Issue Type: Task
>  Components: build, docker
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 1.20
>
>
> During the 1.20 release management dryrun I discovered the following issues 
> which I think should be addressed in order to be satisfied with the release 
> candidate
>  # Update docker/README to remove broken badge
>  # Upgrade alpine base image in docker/Dockerfile
>  # Migrate CHANGES.txt to CHANGES.md
>  # Upgrade apache parent pom version from 23 to 31
>  # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
>  # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
> ivy/mvn.template
>  # Remove miredot plugin usage from ivy/mvn.template





[jira] [Closed] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-08 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3038.
---

Thanks [~snagel] 

> Address issues discovered during 1.20 release management dryrun
> ---
>
> Key: NUTCH-3038
> URL: https://issues.apache.org/jira/browse/NUTCH-3038
> Project: Nutch
>  Issue Type: Task
>  Components: build, docker
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 1.20
>
>
> During the 1.20 release management dryrun I discovered the following issues 
> which I think should be addressed in order to be satisfied with the release 
> candidate
>  # Update docker/README to remove broken badge
>  # Upgrade alpine base image in docker/Dockerfile
>  # Migrate CHANGES.txt to CHANGES.md
>  # Upgrade apache parent pom version from 23 to 31
>  # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
>  # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
> ivy/mvn.template
>  # Remove miredot plugin usage from ivy/mvn.template





[jira] [Work stopped] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-08 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3038 stopped by Lewis John McGibbney.
---
> Address issues discovered during 1.20 release management dryrun
> ---
>
> Key: NUTCH-3038
> URL: https://issues.apache.org/jira/browse/NUTCH-3038
> Project: Nutch
>  Issue Type: Task
>  Components: build, docker
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 1.20
>
>
> During the 1.20 release management dryrun I discovered the following issues 
> which I think should be addressed in order to be satisfied with the release 
> candidate
>  # Update docker/README to remove broken badge
>  # Upgrade alpine base image in docker/Dockerfile
>  # Migrate CHANGES.txt to CHANGES.md
>  # Upgrade apache parent pom version from 23 to 31
>  # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
>  # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
> ivy/mvn.template
>  # Remove miredot plugin usage from ivy/mvn.template





[jira] [Commented] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode

2024-04-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834532#comment-17834532
 ] 

Tim Allison commented on NUTCH-2937:


I really, really, really wish we didn't have to do this! :P

Happy to help!

> parse-tika: review dependency exclusions and avoid dependency conflicts in 
> distributed mode
> ---
>
> Key: NUTCH-2937
> URL: https://issues.apache.org/jira/browse/NUTCH-2937
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, plugin
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Tim Allison
>Priority: Major
> Fix For: 1.20
>
>
> While testing NUTCH-2919 I've seen the following error caused by a 
> conflicting dependency to commons-io:
> - 2.11.0 Nutch core
> - 2.11.0 parse-tika (excluded to avoid duplicated dependencies)
> - 2.5 provided by Hadoop
> This causes errors parsing some office and other documents (but not all), for 
> example:
> {noformat}
> 2022-01-15 01:36:31,365 WARN [FetcherThread] 
> org.apache.nutch.parse.ParseUtil: Error parsing 
> http://kurskrun.ru/privacypolicy with org.apache.nutch.parse.tika.TikaParser
> java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 
> 'org.apache.commons.io.input.CloseShieldInputStream 
> org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
> at 
> java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
> at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
> at 
> org.apache.nutch.fetcher.FetcherThread.output(FetcherThread.java:715)
> at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:431)
> Caused by: java.lang.NoSuchMethodError: 
> 'org.apache.commons.io.input.CloseShieldInputStream 
> org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120)
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
> at 
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151)
> at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> {noformat}
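
A NoSuchMethodError like the one in the quoted trace usually means that an older jar of the same library comes first on the classpath. The following diagnostic sketch is not part of Nutch or Tika, and the class and method names ClasspathConflictSketch, hasMethod and loadedFrom are made up for the example. It uses reflection to check whether a method is present in the class version that actually gets loaded, and from which jar or directory that class comes.

```java
import java.net.URL;
import java.security.CodeSource;

/** Hypothetical diagnostic helper for illustration only. */
final class ClasspathConflictSketch {

  /** Returns true if the named class can be loaded and has the public method. */
  static boolean hasMethod(String className, String methodName, Class<?>... params) {
    try {
      Class.forName(className).getMethod(methodName, params);
      return true;
    } catch (ClassNotFoundException | NoSuchMethodException e) {
      return false;
    }
  }

  /** Returns the jar or directory a class was loaded from, if known. */
  static String loadedFrom(Class<?> clazz) {
    CodeSource source = clazz.getProtectionDomain().getCodeSource();
    URL location = (source == null) ? null : source.getLocation();
    // JDK classes typically report no code source.
    return (location == null) ? "(no code source, e.g. JDK class)" : location.toString();
  }

  public static void main(String[] args) {
    // Class and method taken from the stack trace above.
    String name = "org.apache.commons.io.input.CloseShieldInputStream";
    System.out.println("wrap(InputStream) available: "
        + hasMethod(name, "wrap", java.io.InputStream.class));
    System.out.println("String loaded from: " + loadedFrom(String.class));
  }
}
```

Run inside the affected task's classpath, a false result for wrap(InputStream) would be consistent with the old commons-io jar provided by Hadoop taking precedence.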





[jira] [Resolved] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2937.

Resolution: Fixed

Fixed in NUTCH-2959 by using the shaded Tika package. Thanks, [~tallison]!

> parse-tika: review dependency exclusions and avoid dependency conflicts in 
> distributed mode
> ---
>
> Key: NUTCH-2937
> URL: https://issues.apache.org/jira/browse/NUTCH-2937
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, plugin
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> While testing NUTCH-2919 I've seen the following error caused by a 
> conflicting dependency to commons-io:
> - 2.11.0 Nutch core
> - 2.11.0 parse-tika (excluded to avoid duplicated dependencies)
> - 2.5 provided by Hadoop
> This causes errors parsing some office and other documents (but not all), for 
> example:
> {noformat}
> 2022-01-15 01:36:31,365 WARN [FetcherThread] 
> org.apache.nutch.parse.ParseUtil: Error parsing 
> http://kurskrun.ru/privacypolicy with org.apache.nutch.parse.tika.TikaParser
> java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 
> 'org.apache.commons.io.input.CloseShieldInputStream 
> org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
> at 
> java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
> at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
> at 
> org.apache.nutch.fetcher.FetcherThread.output(FetcherThread.java:715)
> at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:431)
> Caused by: java.lang.NoSuchMethodError: 
> 'org.apache.commons.io.input.CloseShieldInputStream 
> org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120)
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
> at 
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151)
> at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> {noformat}





[jira] [Assigned] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2937:
--

Assignee: Tim Allison

> parse-tika: review dependency exclusions and avoid dependency conflicts in 
> distributed mode
> ---
>
> Key: NUTCH-2937
> URL: https://issues.apache.org/jira/browse/NUTCH-2937
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, plugin
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Tim Allison
>Priority: Major
> Fix For: 1.20
>
>
> While testing NUTCH-2919 I've seen the following error caused by a 
> conflicting dependency to commons-io:
> - 2.11.0 Nutch core
> - 2.11.0 parse-tika (excluded to avoid duplicated dependencies)
> - 2.5 provided by Hadoop
> This causes errors parsing some office and other documents (but not all), for 
> example:
> {noformat}
> 2022-01-15 01:36:31,365 WARN [FetcherThread] 
> org.apache.nutch.parse.ParseUtil: Error parsing 
> http://kurskrun.ru/privacypolicy with org.apache.nutch.parse.tika.TikaParser
> java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 
> 'org.apache.commons.io.input.CloseShieldInputStream 
> org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
> at 
> java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
> at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
> at 
> org.apache.nutch.fetcher.FetcherThread.output(FetcherThread.java:715)
> at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:431)
> Caused by: java.lang.NoSuchMethodError: 
> 'org.apache.commons.io.input.CloseShieldInputStream 
> org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120)
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
> at 
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151)
> at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> {noformat}





[jira] [Updated] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2937:
---
Fix Version/s: 1.20
   (was: 1.21)

> parse-tika: review dependency exclusions and avoid dependency conflicts in 
> distributed mode
> ---
>
> Key: NUTCH-2937
> URL: https://issues.apache.org/jira/browse/NUTCH-2937
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, plugin
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> While testing NUTCH-2919 I've seen the following error caused by a 
> conflicting dependency to commons-io:
> - 2.11.0 Nutch core
> - 2.11.0 parse-tika (excluded to avoid duplicated dependencies)
> - 2.5 provided by Hadoop
> This causes errors parsing some office and other documents (but not all), for 
> example:
> {noformat}
> 2022-01-15 01:36:31,365 WARN [FetcherThread] 
> org.apache.nutch.parse.ParseUtil: Error parsing 
> http://kurskrun.ru/privacypolicy with org.apache.nutch.parse.tika.TikaParser
> java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 
> 'org.apache.commons.io.input.CloseShieldInputStream 
> org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
> at 
> java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
> at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
> at 
> org.apache.nutch.fetcher.FetcherThread.output(FetcherThread.java:715)
> at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:431)
> Caused by: java.lang.NoSuchMethodError: 
> 'org.apache.commons.io.input.CloseShieldInputStream 
> org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120)
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
> at 
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151)
> at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> {noformat}





[jira] [Resolved] (NUTCH-3005) Upgrade selenium as needed

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3005.

Resolution: Implemented

Done by [~lewismc] as part of NUTCH-3036, commit 
[1563396|https://github.com/apache/nutch/blob/1563396d952393462fffab1f686e9ffd5d006cf6/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java#L151]
 .

> Upgrade selenium as needed
> --
>
> Key: NUTCH-3005
> URL: https://issues.apache.org/jira/browse/NUTCH-3005
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Tim Allison
> Priority: Trivial
> Fix For: 1.20
>
>
> When we choose to upgrade selenium, we should take note of this blog about 
> changes in headless chromium: 
> https://www.selenium.dev/blog/2023/headless-is-going-away/
> ChromeOptions options = new ChromeOptions();
> options.addArguments("--headless=new");
> WebDriver driver = new ChromeDriver(options);
> driver.get("https://selenium.dev");
> driver.quit();





[jira] [Resolved] (NUTCH-3016) Upgrade Apache Ivy to 2.5.2

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3016.

Resolution: Duplicate

> Upgrade Apache Ivy to 2.5.2
> ---
>
> Key: NUTCH-3016
> URL: https://issues.apache.org/jira/browse/NUTCH-3016
> Project: Nutch
> Issue Type: Task
> Components: build, ivy
> Affects Versions: 1.19
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.20
>
>
> [Apache Ivy 
> v2.5.2|https://ant.apache.org/ivy/history/2.5.2/release-notes.html] was 
> released on August 20 2023!
> We should upgrade.





[jira] [Updated] (NUTCH-3016) Upgrade Apache Ivy to 2.5.2

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3016:
---
Fix Version/s: 1.20
   (was: 1.21)

> Upgrade Apache Ivy to 2.5.2
> ---
>
> Key: NUTCH-3016
> URL: https://issues.apache.org/jira/browse/NUTCH-3016
> Project: Nutch
> Issue Type: Task
> Components: build, ivy
> Affects Versions: 1.19
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.20
>
>
> [Apache Ivy 
> v2.5.2|https://ant.apache.org/ivy/history/2.5.2/release-notes.html] was 
> released on August 20 2023!
> We should upgrade.





[jira] [Updated] (NUTCH-3005) Upgrade selenium as needed

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3005:
---
Affects Version/s: 1.19

> Upgrade selenium as needed
> --
>
> Key: NUTCH-3005
> URL: https://issues.apache.org/jira/browse/NUTCH-3005
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Tim Allison
> Priority: Trivial
>
> When we choose to upgrade selenium, we should take note of this blog about 
> changes in headless chromium: 
> https://www.selenium.dev/blog/2023/headless-is-going-away/
> ChromeOptions options = new ChromeOptions();
> options.addArguments("--headless=new");
> WebDriver driver = new ChromeDriver(options);
> driver.get("https://selenium.dev");
> driver.quit();





[jira] [Updated] (NUTCH-3005) Upgrade selenium as needed

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3005:
---
Fix Version/s: 1.20

> Upgrade selenium as needed
> --
>
> Key: NUTCH-3005
> URL: https://issues.apache.org/jira/browse/NUTCH-3005
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Tim Allison
> Priority: Trivial
> Fix For: 1.20
>
>
> When we choose to upgrade selenium, we should take note of this blog about 
> changes in headless chromium: 
> https://www.selenium.dev/blog/2023/headless-is-going-away/
> ChromeOptions options = new ChromeOptions();
> options.addArguments("--headless=new");
> WebDriver driver = new ChromeDriver(options);
> driver.get("https://selenium.dev");
> driver.quit();





[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3028:
---
Affects Version/s: 1.19

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'
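
The selection described above can be sketched in plain Java. This is only an illustration of the predicate's meaning, under loud assumptions: it does not use the JEXL engine or any Nutch class, a Map<String, String> stands in for a record's parse metadata, and the names MetadataFilterSketch, metaEquals and select are hypothetical. In plain Java, calling equals on the result of get would throw a NullPointerException for a record that lacks the key, so the sketch puts the expected value on the left-hand side of the comparison.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch: selects records whose metadata has key == value. */
public class MetadataFilterSketch {

    /** Predicate that is true when the metadata maps key to exactly the given value. */
    static Predicate<Map<String, String>> metaEquals(String key, String value) {
        // value.equals(...) tolerates records where the key is missing (get returns null).
        return meta -> meta != null && value.equals(meta.get(key));
    }

    /** Returns the records accepted by the predicate, in their original order. */
    static List<Map<String, String>> select(List<Map<String, String>> records,
                                             Predicate<Map<String, String>> filter) {
        List<Map<String, String>> selected = new ArrayList<>();
        for (Map<String, String> record : records) {
            if (filter.test(record)) {
                selected.add(record);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, String> matching = new HashMap<>();
        matching.put("SOME_KEY", "SOME_VALUE");
        Map<String, String> other = new HashMap<>();
        other.put("SOME_KEY", "OTHER_VALUE");
        Map<String, String> missing = new HashMap<>();

        List<Map<String, String>> records = new ArrayList<>();
        records.add(matching);
        records.add(other);
        records.add(missing);

        // Only the first record would be exported under this predicate.
        System.out.println(select(records, metaEquals("SOME_KEY", "SOME_VALUE")).size());
    }
}
```

How the JEXL expression itself behaves for records without SOME_KEY depends on the engine configuration and is not stated in the issue.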





[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3028:
---
Fix Version/s: 1.21

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'





[jira] [Commented] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834481#comment-17834481
 ] 

ASF GitHub Bot commented on NUTCH-3038:
---

lewismc opened a new pull request, #811:
URL: https://github.com/apache/nutch/pull/811

   PR for https://issues.apache.org/jira/browse/NUTCH-3038




> Address issues discovered during 1.20 release management dryrun
> ---
>
> Key: NUTCH-3038
> URL: https://issues.apache.org/jira/browse/NUTCH-3038
> Project: Nutch
> Issue Type: Task
> Components: build, docker
> Affects Versions: 1.20
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Blocker
> Fix For: 1.20
>
>
> During the 1.20 release management dryrun I discovered the following issues 
> which I think should be addressed in order to be satisfied with the release 
> candidate
>  # Update docker/README to remove broken badge
>  # Upgrade alpine base image in docker/Dockerfile
>  # Migrate CHANGES.txt to CHANGES.md
>  # Upgrade apache parent pom version from 23 to 31
>  # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
>  # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
> ivy/mvn.template
>  # Remove miredot plugin usage from ivy/mvn.template





[jira] [Work started] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-05 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3038 started by Lewis John McGibbney.
---
> Address issues discovered during 1.20 release management dryrun
> ---
>
> Key: NUTCH-3038
> URL: https://issues.apache.org/jira/browse/NUTCH-3038
> Project: Nutch
> Issue Type: Task
> Components: build, docker
> Affects Versions: 1.20
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Blocker
> Fix For: 1.20
>
>
> During the 1.20 release management dryrun I discovered the following issues 
> which I think should be addressed in order to be satisfied with the release 
> candidate
>  # Update docker/README to remove broken badge
>  # Upgrade alpine base image in docker/Dockerfile
>  # Migrate CHANGES.txt to CHANGES.md
>  # Upgrade apache parent pom version from 23 to 31
>  # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
>  # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
> ivy/mvn.template
>  # Remove miredot plugin usage from ivy/mvn.template





[jira] [Updated] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-05 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3038:

Description: 
During the 1.20 release management dryrun I discovered the following issues 
which I think should be addressed in order to be satisfied with the release 
candidate
 # Update docker/README to remove broken badge
 # Upgrade alpine base image in docker/Dockerfile
 # Migrate CHANGES.txt to CHANGES.md
 # Upgrade apache parent pom version from 23 to 31
 # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
 # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
ivy/mvn.template
 # Remove miredot plugin usage from ivy/mvn.template

  was:
During the 1.20 release management dryrun I discovered the following issues 
which I think should be addressed in order to be satisfied with the release 
candidate
 # Update docker/README to remove broken badge
 # Upgrade alpine base image in docker/Dockerfile
 # Migrate CHANGES.txt to CHANGES.md
 # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
 # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
ivy/mvn.template
 # Remove miredot plugin usage from ivy/mvn.template


> Address issues discovered during 1.20 release management dryrun
> ---
>
> Key: NUTCH-3038
> URL: https://issues.apache.org/jira/browse/NUTCH-3038
> Project: Nutch
> Issue Type: Task
> Components: build, docker
> Affects Versions: 1.20
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Blocker
> Fix For: 1.20
>
>
> During the 1.20 release management dryrun I discovered the following issues 
> which I think should be addressed in order to be satisfied with the release 
> candidate
>  # Update docker/README to remove broken badge
>  # Upgrade alpine base image in docker/Dockerfile
>  # Migrate CHANGES.txt to CHANGES.md
>  # Upgrade apache parent pom version from 23 to 31
>  # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
>  # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
> ivy/mvn.template
>  # Remove miredot plugin usage from ivy/mvn.template




