[jira] [Created] (NUTCH-3059) Generator: selector job does not count reduce output records
Sebastian Nagel created NUTCH-3059: -- Summary: Generator: selector job does not count reduce output records Key: NUTCH-3059 URL: https://issues.apache.org/jira/browse/NUTCH-3059 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 1.20 Reporter: Sebastian Nagel Fix For: 1.21 The selector step (job) of the Generator does not count the reduce output records, that is, it shows the count "0": {noformat} 2024-06-05 13:57:09,299 INFO o.a.n.c.Generator [main] Generator: starting 2024-06-05 13:57:09,299 INFO o.a.n.c.Generator [main] Generator: selecting best-scoring urls due for fetch. ... Map-Reduce Framework Map input records=6 Map output records=6 ... Combine input records=0 Combine output records=0 Reduce input groups=1 Reduce shuffle bytes=594 Reduce input records=6 Reduce output records=0 Spilled Records=12 ... {noformat} Not a big issue, but we should investigate why this happens. The other counters seem to work properly, and the partitioner job does show the reduce output records. The issue is observed in both local and distributed mode. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3058) Fetcher: counter for hung threads
[ https://issues.apache.org/jira/browse/NUTCH-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17852421#comment-17852421 ] ASF GitHub Bot commented on NUTCH-3058: --- sebastian-nagel opened a new pull request, #820: URL: https://github.com/apache/nutch/pull/820 - count the number of hung threads in a fetcher job - log and count the number of fetch items still queued when the "hard" timeout is reached > Fetcher: counter for hung threads > - > > Key: NUTCH-3058 > URL: https://issues.apache.org/jira/browse/NUTCH-3058 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.21 > > > The Fetcher class defines a "hard" timeout as 50% of the MapReduce > task timeout, see {{mapreduce.task.timeout}} and > {{fetcher.threads.timeout.divisor}}. If fetcher threads are running but > make no progress during the timeout period (in terms of newly started > fetch items), the Fetcher is shut down so that the task timeout is not reached > and the fetcher job does not fail. The "hung threads" are logged together with > the URL being fetched and (at DEBUG level) the Java stack. > In addition to logging, a job counter should indicate the number of hung > threads. This would make it possible to see at the job level whether there are issues > with hung threads. To trace the issues, it is still necessary to look into the > Hadoop task logs.
[jira] [Created] (NUTCH-3058) Fetcher: counter for hung threads
Sebastian Nagel created NUTCH-3058: -- Summary: Fetcher: counter for hung threads Key: NUTCH-3058 URL: https://issues.apache.org/jira/browse/NUTCH-3058 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.20 Reporter: Sebastian Nagel Assignee: Sebastian Nagel Fix For: 1.21 The Fetcher class defines a "hard" timeout as 50% of the MapReduce task timeout, see {{mapreduce.task.timeout}} and {{fetcher.threads.timeout.divisor}}. If fetcher threads are running but make no progress during the timeout period (in terms of newly started fetch items), the Fetcher is shut down so that the task timeout is not reached and the fetcher job does not fail. The "hung threads" are logged together with the URL being fetched and (at DEBUG level) the Java stack. In addition to logging, a job counter should indicate the number of hung threads. This would make it possible to see at the job level whether there are issues with hung threads. To trace the issues, it is still necessary to look into the Hadoop task logs.
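For readers of the archive, the timeout arithmetic described in NUTCH-3058 can be sketched as follows. This is a minimal illustration and not the Fetcher code: the class `HardTimeoutSketch`, its methods, and the `HUNG_THREADS` counter are hypothetical names, and the values 600000 ms and 2 are assumptions chosen to match the "50%" mentioned in the issue.

```java
import java.util.concurrent.atomic.AtomicLong;

class HardTimeoutSketch {

    /** Hypothetical stand-in for a job counter of hung threads. */
    static final AtomicLong HUNG_THREADS = new AtomicLong();

    /** Hard timeout = task timeout divided by the configured divisor. */
    static long hardTimeoutMillis(long taskTimeoutMillis, int divisor) {
        if (divisor <= 0) {
            throw new IllegalArgumentException("divisor must be positive");
        }
        return taskTimeoutMillis / divisor;
    }

    /**
     * Returns true if threads are active but no fetch item was started
     * within the hard timeout; in that case the active threads are counted
     * as hung.
     */
    static boolean checkHung(long nowMillis, long lastFetchStartMillis,
                             long hardTimeoutMillis, int activeThreads) {
        boolean timedOut = activeThreads > 0
                && nowMillis - lastFetchStartMillis > hardTimeoutMillis;
        if (timedOut) {
            HUNG_THREADS.addAndGet(activeThreads);
        }
        return timedOut;
    }

    public static void main(String[] args) {
        // Assumed values for mapreduce.task.timeout and
        // fetcher.threads.timeout.divisor; check your own configuration.
        long hard = hardTimeoutMillis(600_000L, 2);
        System.out.println("hard timeout (ms): " + hard);
        boolean shutDown = checkHung(400_000L, 0L, hard, 3);
        System.out.println("shut down: " + shutDown
                + ", hung threads: " + HUNG_THREADS.get());
    }
}
```

A counter like this only shows at the job level that threads were hung; as the issue notes, the task logs are still needed to find out why.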
[jira] [Commented] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails
[ https://issues.apache.org/jira/browse/NUTCH-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850039#comment-17850039 ] Hudson commented on NUTCH-3044: --- SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #163 (See [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/163/]) NUTCH-3044 Generator: NPE when extracting the host part of a URL fails (snagel: [https://github.com/apache/nutch/commit/4b263533a9cdea208383fdbb0a8cc0b537423d7f]) * (edit) src/java/org/apache/nutch/crawl/Generator.java NUTCH-3044 Generator: NPE when extracting the host part of a URL fails (snagel: [https://github.com/apache/nutch/commit/4729786e4d7f9e1136580ceb191274862d03ba5b]) * (edit) src/test/org/apache/nutch/crawl/TestGenerator.java NUTCH-3044 Generator: NPE when extracting the host part of a URL fails (snagel: [https://github.com/apache/nutch/commit/b153279ad5844b32560ecf62a8e7f83f8ecbd43c]) * (edit) src/java/org/apache/nutch/crawl/Generator.java * (edit) src/test/org/apache/nutch/crawl/TestGenerator.java > Generator: NPE when extracting the host part of a URL fails > --- > > Key: NUTCH-3044 > URL: https://issues.apache.org/jira/browse/NUTCH-3044 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Priority: Minor > Fix For: 1.21 > > > When extracting the host part of a URL fails, the Generator job fails because > of an NPE in the SelectorReducer. This issue is reproducible if the CrawlDb > contains a malformed URL, for example, a URL with an unsupported scheme > (smb://). > {noformat} > Caused by: java.lang.NullPointerException > at > org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:439) > at > org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:300) > {noformat}
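The failure mode in NUTCH-3044 (host extraction fails, a later step dereferences the missing host) can be illustrated with a small guard. This is only a sketch under assumptions, not the change made in Generator.java: `HostGuardSketch`, `hostOrNull` and `keepUrlsWithHost` are hypothetical names, and the sketch relies on a plain JVM having no handler for the `smb` scheme, so that `java.net.URL` rejects it.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class HostGuardSketch {

    /** Returns the lower-cased host, or null if it cannot be extracted. */
    @SuppressWarnings("deprecation") // new URL(String) is deprecated in recent JDKs but still available
    static String hostOrNull(String url) {
        try {
            String host = new URL(url).getHost();
            return (host == null || host.isEmpty())
                    ? null
                    : host.toLowerCase(Locale.ROOT);
        } catch (MalformedURLException e) {
            // e.g. unknown protocol: smb
            return null;
        }
    }

    /** Keeps only URLs with an extractable host instead of failing on null. */
    static List<String> keepUrlsWithHost(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (hostOrNull(url) == null) {
                System.err.println("Skipping URL without extractable host: " + url);
                continue;
            }
            kept.add(url);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        urls.add("https://Example.org/index.html");
        urls.add("smb://fileserver/share/doc.txt");
        System.out.println(keepUrlsWithHost(urls));
    }
}
```

The design choice here is to skip and report the offending URL rather than let one malformed CrawlDb entry fail the whole job.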
[jira] [Commented] (NUTCH-3055) README: fix Github "hub" commands
[ https://issues.apache.org/jira/browse/NUTCH-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850040#comment-17850040 ] Hudson commented on NUTCH-3055: --- SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #163 (See [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/163/]) NUTCH-3055 README: fix Github "hub" commands (snagel: [https://github.com/apache/nutch/commit/ca03d9b76485b7c9d50dff2c3946bb8189daf5e1]) * (edit) README.md > README: fix Github "hub" commands > - > > Key: NUTCH-3055 > URL: https://issues.apache.org/jira/browse/NUTCH-3055 > Project: Nutch > Issue Type: Bug > Components: documentation >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Trivial > Fix For: 1.21 > > > The [README.md|https://github.com/apache/nutch/blob/master/README.md] > contains [Github hub|https://hub.github.com/] commands but with "git" as > command (executable) name, maybe an alias or some other magic. However, if > hub isn't installed, these commands fail with {{git: 'pull-request' is not a > git command. See 'git --help'.}} or similar. > We should use the command "hub" instead.
[jira] [Resolved] (NUTCH-3055) README: fix Github "hub" commands
[ https://issues.apache.org/jira/browse/NUTCH-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3055. Resolution: Fixed > README: fix Github "hub" commands > - > > Key: NUTCH-3055 > URL: https://issues.apache.org/jira/browse/NUTCH-3055 > Project: Nutch > Issue Type: Bug > Components: documentation >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Trivial > Fix For: 1.21 > > > The [README.md|https://github.com/apache/nutch/blob/master/README.md] > contains [Github hub|https://hub.github.com/] commands but with "git" as > command (executable) name, maybe an alias or some other magic. However, if > hub isn't installed, these commands fail with {{git: 'pull-request' is not a > git command. See 'git --help'.}} or similar. > We should use the command "hub" instead.
[jira] [Commented] (NUTCH-3055) README: fix Github "hub" commands
[ https://issues.apache.org/jira/browse/NUTCH-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850005#comment-17850005 ] ASF GitHub Bot commented on NUTCH-3055: --- sebastian-nagel merged PR #818: URL: https://github.com/apache/nutch/pull/818 > README: fix Github "hub" commands > - > > Key: NUTCH-3055 > URL: https://issues.apache.org/jira/browse/NUTCH-3055 > Project: Nutch > Issue Type: Bug > Components: documentation >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Trivial > Fix For: 1.21 > > > The [README.md|https://github.com/apache/nutch/blob/master/README.md] > contains [Github hub|https://hub.github.com/] commands but with "git" as > command (executable) name, maybe an alias or some other magic. However, if > hub isn't installed, these commands fail with {{git: 'pull-request' is not a > git command. See 'git --help'.}} or similar. > We should use the command "hub" instead.
[jira] [Resolved] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails
[ https://issues.apache.org/jira/browse/NUTCH-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3044. Resolution: Fixed > Generator: NPE when extracting the host part of a URL fails > --- > > Key: NUTCH-3044 > URL: https://issues.apache.org/jira/browse/NUTCH-3044 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Priority: Minor > Fix For: 1.21 > > > When extracting the host part of a URL fails, the Generator job fails because > of an NPE in the SelectorReducer. This issue is reproducible if the CrawlDb > contains a malformed URL, for example, a URL with an unsupported scheme > (smb://). > {noformat} > Caused by: java.lang.NullPointerException > at > org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:439) > at > org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:300) > {noformat}
[jira] [Commented] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails
[ https://issues.apache.org/jira/browse/NUTCH-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850004#comment-17850004 ] ASF GitHub Bot commented on NUTCH-3044: --- sebastian-nagel merged PR #815: URL: https://github.com/apache/nutch/pull/815 > Generator: NPE when extracting the host part of a URL fails > --- > > Key: NUTCH-3044 > URL: https://issues.apache.org/jira/browse/NUTCH-3044 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Priority: Minor > Fix For: 1.21 > > > When extracting the host part of a URL fails, the Generator job fails because > of an NPE in the SelectorReducer. This issue is reproducible if the CrawlDb > contains a malformed URL, for example, a URL with an unsupported scheme > (smb://). > {noformat} > Caused by: java.lang.NullPointerException > at > org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:439) > at > org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:300) > {noformat}
[jira] [Commented] (NUTCH-3057) Arbitrary indexer "leaks" previous value into a field processed after an exception
[ https://issues.apache.org/jira/browse/NUTCH-3057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847521#comment-17847521 ] Joe Gilvary commented on NUTCH-3057: Happy Saturday, [~lewi...@apache.org], I worked on the plugin and this fix with some Raspberry Pi hosts at home, but of course, found the error at work. I didn't see it until I was running with the 1.20 release in a pre-prod system. I set up individual POJOs for a few fields and added a typo in nutch-site.xml. As soon as I saw the exception during indexing and what made it into Solr, I knew what was wrong. A D'oh! moment indeed. Let me know, please, if there's anything else I need to do, process-wise, to have this correct for the next distro. > Arbitrary indexer "leaks" previous value into a field processed after an > exception > -- > > Key: NUTCH-3057 > URL: https://issues.apache.org/jira/browse/NUTCH-3057 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.20 >Reporter: Joe Gilvary >Priority: Major >
[jira] [Commented] (NUTCH-3057) Arbitrary indexer "leaks" previous value into a field processed after an exception
[ https://issues.apache.org/jira/browse/NUTCH-3057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847470#comment-17847470 ] ASF GitHub Bot commented on NUTCH-3057: --- lewismc commented on PR #819: URL: https://github.com/apache/nutch/pull/819#issuecomment-2118551238 Thanks for reporting @CatChullain, I didn’t catch this edge case either when reviewing or testing. Out of curiosity, what does your deployment look like? Local or deploy? > Arbitrary indexer "leaks" previous value into a field processed after an > exception > -- > > Key: NUTCH-3057 > URL: https://issues.apache.org/jira/browse/NUTCH-3057 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.20 >Reporter: Joe Gilvary >Priority: Major >
[jira] [Commented] (NUTCH-3057) Arbitrary indexer "leaks" previous value into a field processed after an exception
[ https://issues.apache.org/jira/browse/NUTCH-3057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847462#comment-17847462 ] ASF GitHub Bot commented on NUTCH-3057: --- CatChullain opened a new pull request, #819: URL: https://github.com/apache/nutch/pull/819 Fix for NUTCH-3057 where index-arbitrary plugin retained value for a field and erroneously set it to the next field declared in its config stanzas > Arbitrary indexer "leaks" previous value into a field processed after an > exception > -- > > Key: NUTCH-3057 > URL: https://issues.apache.org/jira/browse/NUTCH-3057 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.20 >Reporter: Joe Gilvary >Priority: Major >
[jira] [Commented] (NUTCH-3057) Arbitrary indexer "leaks" previous value into a field processed after an exception
[ https://issues.apache.org/jira/browse/NUTCH-3057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847453#comment-17847453 ] Joe Gilvary commented on NUTCH-3057: The arbitrary indexer plug-in can add multiple new fields to a doc by appending numeric suffixes to the config values for each. If an exception interferes with setting a value and there's a config for a successive field to process, the plug-in can insert the wrong value for that successively-configured field. > Arbitrary indexer "leaks" previous value into a field processed after an > exception > -- > > Key: NUTCH-3057 > URL: https://issues.apache.org/jira/browse/NUTCH-3057 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.20 >Reporter: Joe Gilvary >Priority: Major >
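The kind of "leak" described in NUTCH-3057 typically comes from a value variable that outlives one loop iteration. The following sketch shows one way to avoid it; it is not the index-arbitrary plugin code. `ArbitraryFieldsSketch`, `computeFields` and the field names are hypothetical, and the per-field value providers are modelled as plain `Supplier` objects for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

class ArbitraryFieldsSketch {

    /**
     * Computes one value per configured field. The value variable is scoped
     * to a single iteration, so a failure for one field cannot carry an
     * earlier value into that field or into the next one.
     */
    static Map<String, Object> computeFields(Map<String, Supplier<Object>> configured) {
        Map<String, Object> doc = new LinkedHashMap<>();
        for (Map.Entry<String, Supplier<Object>> entry : configured.entrySet()) {
            Object value = null; // fresh for every field
            try {
                value = entry.getValue().get();
            } catch (RuntimeException e) {
                System.err.println("Skipping field " + entry.getKey() + ": " + e.getMessage());
                continue; // do not add anything for a field that failed
            }
            if (value != null) {
                doc.put(entry.getKey(), value);
            }
        }
        return doc;
    }

    /** Three configured fields; the second one fails. */
    static Map<String, Object> demo() {
        Map<String, Supplier<Object>> configured = new LinkedHashMap<>();
        configured.put("first", () -> "A");
        configured.put("second", () -> { throw new IllegalStateException("misconfigured"); });
        configured.put("third", () -> "C");
        return computeFields(configured);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With the value declared outside the loop and never reset, the failing field could instead end up with the value computed for the field before it, which matches the symptom reported above.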
[jira] [Created] (NUTCH-3057) Arbitrary indexer "leaks" previous value into a field processed after an exception
Joe Gilvary created NUTCH-3057: -- Summary: Arbitrary indexer "leaks" previous value into a field processed after an exception Key: NUTCH-3057 URL: https://issues.apache.org/jira/browse/NUTCH-3057 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.20 Reporter: Joe Gilvary
[jira] [Updated] (NUTCH-3056) Injector to support resolving seed URLs
[ https://issues.apache.org/jira/browse/NUTCH-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3056: - Description: We have a case where clients submit huge uncurated seed files: the host may no longer exist or may redirect via-via to elsewhere, the protocol may be incorrect, etc. The large crawl itself is not supposed to venture much beyond the seed list, except for regex exceptions listed in db-ignore-external-exemptions. It is also not allowed to jump to other domains/hosts, to control the size of the crawl. This means externally redirecting seeds will not be crawled. This ticket will add support for a multi-threaded host/domain/protocol/redirecter/resolver to the injector. Seeds leading to a non-200 URL will be discarded. Enabling filtering and normalization is highly recommended for handling the redirects. If you have a seed file with 10k+ or millions of records, it is highly recommended to split the input file into chunks so that multiple mappers can get to work. Passing a few million records through one mapper without resolving is no problem, but resolving millions with one mapper, even if threaded, will take many hours. was: We have a case where clients submit huge uncurated seed files: the host may no longer exist or may redirect via-via to elsewhere, the protocol may be incorrect, etc. The large crawl itself is not supposed to venture much beyond the seed list, except for regex exceptions listed in db-ignore-external-exemptions. It is also not allowed to jump to other domains/hosts, to control the size of the crawl. This means externally redirecting seeds will not be crawled. This ticket will add support for a multi-threaded host/domain/protocol/redirecter/resolver to the injector. If you have a seed file with 10k+ or millions of records, it is highly recommended to split the input file into chunks so that multiple mappers can get to work. Passing a few million records through one mapper without resolving is no problem, but resolving millions with one mapper, even if threaded, will take many hours. > Injector to support resolving seed URLs > --- > > Key: NUTCH-3056 > URL: https://issues.apache.org/jira/browse/NUTCH-3056 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.21 > > > We have a case where clients submit huge uncurated seed files: the host may > no longer exist or may redirect via-via to elsewhere, the protocol may be > incorrect, etc. > The large crawl itself is not supposed to venture much beyond the seed list, > except for regex exceptions listed in > db-ignore-external-exemptions. It is also not allowed > to jump to other domains/hosts, to control the size of the crawl. This means > externally redirecting seeds will not be crawled. > This ticket will add support for a multi-threaded > host/domain/protocol/redirecter/resolver to the injector. Seeds leading > to a non-200 URL will be discarded. Enabling filtering and normalization is > highly recommended for handling the redirects. > If you have a seed file with 10k+ or millions of records, it is highly > recommended to split the input file into chunks so that multiple mappers can > get to work. Passing a few million records through one > mapper without resolving is no problem, but resolving millions with one mapper, even if > threaded, will take many hours.
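The recommendation in NUTCH-3056 to split a large seed file into chunks can be sketched like this. It is a minimal illustration, assuming the injector is pointed at a directory of seed files and that separate files can be handled by separate mappers; `SeedSplitterSketch`, the chunk naming scheme and the chunk size are hypothetical choices.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class SeedSplitterSketch {

    /** Splits seedFile into files of at most linesPerChunk lines inside outDir. */
    static List<Path> split(Path seedFile, Path outDir, int linesPerChunk) throws IOException {
        if (linesPerChunk <= 0) {
            throw new IllegalArgumentException("linesPerChunk must be positive");
        }
        Files.createDirectories(outDir);
        List<Path> chunks = new ArrayList<>();
        BufferedWriter out = null;
        int linesInChunk = 0;
        try (BufferedReader in = Files.newBufferedReader(seedFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (out == null || linesInChunk == linesPerChunk) {
                    if (out != null) {
                        out.close();
                    }
                    // Hypothetical naming scheme: seeds-00000.txt, seeds-00001.txt, ...
                    Path chunk = outDir.resolve(String.format("seeds-%05d.txt", chunks.size()));
                    chunks.add(chunk);
                    out = Files.newBufferedWriter(chunk, StandardCharsets.UTF_8);
                    linesInChunk = 0;
                }
                out.write(line);
                out.newLine();
                linesInChunk++;
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: SeedSplitterSketch <seedFile> <outDir> <linesPerChunk>");
            return;
        }
        List<Path> chunks = split(Paths.get(args[0]), Paths.get(args[1]), Integer.parseInt(args[2]));
        System.out.println("wrote " + chunks.size() + " chunk(s)");
    }
}
```

How small the chunks should be depends on how slow resolving is for the seeds at hand; the issue only states that a single mapper resolving millions of seeds will take many hours.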
[jira] [Updated] (NUTCH-3056) Injector to support resolving seed URLs
[ https://issues.apache.org/jira/browse/NUTCH-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3056: - Description: We have a case where clients submit huge uncurated seed files: the host may no longer exist or may redirect via-via to elsewhere, the protocol may be incorrect, etc. The large crawl itself is not supposed to venture much beyond the seed list, except for regex exceptions listed in db-ignore-external-exemptions. It is also not allowed to jump to other domains/hosts, to control the size of the crawl. This means externally redirecting seeds will not be crawled. This ticket will add support for a multi-threaded host/domain/protocol/redirecter/resolver to the injector. If you have a seed file with 10k+ or millions of records, it is highly recommended to split the input file into chunks so that multiple mappers can get to work. Passing a few million records through one mapper without resolving is no problem, but resolving millions with one mapper, even if threaded, will take many hours. was: We have a case where clients submit huge uncurated seed files: the host may no longer exist or may redirect via-via to elsewhere, the protocol may be incorrect, etc. The large crawl itself is not supposed to venture much beyond the seed list, except for regex exceptions listed in db-ignore-external-exemptions. It is also not allowed to jump to other domains/hosts, to control the size of the crawl. This means externally redirecting seeds will not be crawled. This ticket will add support for a multi-threaded host/domain/protocol/redirecter/resolver to the injector. > Injector to support resolving seed URLs > --- > > Key: NUTCH-3056 > URL: https://issues.apache.org/jira/browse/NUTCH-3056 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.21 > > > We have a case where clients submit huge uncurated seed files: the host may > no longer exist or may redirect via-via to elsewhere, the protocol may be > incorrect, etc. > The large crawl itself is not supposed to venture much beyond the seed list, > except for regex exceptions listed in > db-ignore-external-exemptions. It is also not allowed > to jump to other domains/hosts, to control the size of the crawl. This means > externally redirecting seeds will not be crawled. > This ticket will add support for a multi-threaded > host/domain/protocol/redirecter/resolver to the injector. > If you have a seed file with 10k+ or millions of records, it is highly > recommended to split the input file into chunks so that multiple mappers can > get to work. Passing a few million records through one > mapper without resolving is no problem, but resolving millions with one mapper, even if > threaded, will take many hours.
[jira] [Created] (NUTCH-3056) Injector to support resolving seed URLs
Markus Jelsma created NUTCH-3056: Summary: Injector to support resolving seed URLs Key: NUTCH-3056 URL: https://issues.apache.org/jira/browse/NUTCH-3056 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.21 We have a case where clients submit huge uncurated seed files: the host may no longer exist or may redirect via-via to elsewhere, the protocol may be incorrect, etc. The large crawl itself is not supposed to venture much beyond the seed list, except for regex exceptions listed in db-ignore-external-exemptions. It is also not allowed to jump to other domains/hosts, to control the size of the crawl. This means externally redirecting seeds will not be crawled. This ticket will add support for a multi-threaded host/domain/protocol/redirecter/resolver to the injector.
[jira] [Commented] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters
[ https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846795#comment-17846795 ] Hudson commented on NUTCH-3041: --- SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #162 (See [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/162/]) NUTCH-3041 Address confusing logging in o.a.n.net.URLExemptionFilters (#813) (github: [https://github.com/apache/nutch/commit/8abc78a653eb7970def10031d732fb4c7aa0fb6f]) * (edit) src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java * (edit) src/java/org/apache/nutch/net/URLExemptionFilters.java * (edit) src/plugin/urlfilter-ignoreexempt/README.md > Address confusing logging in o.a.n.net.URLExemptionFilters > --- > > Key: NUTCH-3041 > URL: https://issues.apache.org/jira/browse/NUTCH-3041 > Project: Nutch > Issue Type: Task > Components: net >Affects Versions: 1.19, 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > URLExemptionFilter implementations are used to allow exemptions to external > domain resources by overriding the {{db.ignore.external.links}} configuration > setting. This is useful when the crawl is focused on a domain but resources > like images are hosted on a CDN. > Currently [URLExemptionFilters|#L47-L48]] provides the following logging > {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor > #0|#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' > {quote} > I find this confusing. It would be better to log *only* if a > URLExemptionFilter implementation is actually configured to be used at > runtime. > I will provide a patch for this.
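The behaviour requested in NUTCH-3041 (log only when at least one exemption filter is configured) amounts to a simple guard around the log statement. The sketch below is an assumption-laden illustration, not the patch merged in PR #813: `ExemptionFilterLoggingSketch`, the message wording and the use of `java.util.logging` are all hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.logging.Logger;

class ExemptionFilterLoggingSketch {

    private static final Logger LOG = Logger.getLogger("ExemptionFilterLoggingSketch");

    /** Builds the log message, or returns empty if no filter is configured. */
    static Optional<String> logMessage(List<String> configuredFilters) {
        if (configuredFilters.isEmpty()) {
            return Optional.empty(); // nothing configured, nothing to report
        }
        return Optional.of("Found " + configuredFilters.size()
                + " URL exemption filter(s): " + String.join(", ", configuredFilters));
    }

    /** Logs at INFO level only if at least one filter is configured. */
    static void logIfConfigured(List<String> configuredFilters) {
        logMessage(configuredFilters).ifPresent(msg -> LOG.info(msg));
    }

    public static void main(String[] args) {
        logIfConfigured(Collections.<String>emptyList());          // stays silent
        logIfConfigured(Arrays.asList("urlfilter-ignoreexempt"));  // logs one line
    }
}
```

Separating the message construction from the logging call keeps the "stay silent when empty" rule easy to test.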
[jira] [Closed] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters
[ https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-3041. --- > Address confusing logging in o.a.n.net.URLExemptionFilters > --- > > Key: NUTCH-3041 > URL: https://issues.apache.org/jira/browse/NUTCH-3041 > Project: Nutch > Issue Type: Task > Components: net >Affects Versions: 1.19, 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > URLExemptionFilter implementations are used to allow exemptions to external > domain resources by overriding the {{db.ignore.external.links}} configuration > setting. This is useful when the crawl is focused on a domain but resources > like images are hosted on a CDN. > Currently [URLExemptionFilters|#L47-L48]] provides the following logging > {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor > #0|#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' > {quote} > I find this confusing. It would be better to log *only* if a > URLExemptionFilter implementation is actually configured to be used at > runtime. > I will provide a patch for this.
[jira] [Work stopped] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters
[ https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3041 stopped by Lewis John McGibbney. --- > Address confusing logging in o.a.n.net.URLExemptionFilters > --- > > Key: NUTCH-3041 > URL: https://issues.apache.org/jira/browse/NUTCH-3041 > Project: Nutch > Issue Type: Task > Components: net >Affects Versions: 1.19, 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > URLExemptionFilter implementations are used to allow exemptions to external > domain resources by overriding the {{db.ignore.external.links}} configuration > setting. This is useful when the crawl is focused on a domain but resources > like images are hosted on a CDN. > Currently [URLExemptionFilters|#L47-L48]] provides the following logging > {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor > #0|#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' > {quote} > I find this confusing. It would be better to log *only* if a > URLExemptionFilter implementation is actually configured to be used at > runtime. > I will provide a patch for this.
[jira] [Resolved] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters
[ https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-3041. - Resolution: Fixed > Address confusing logging in o.a.n.net.URLExemptionFilters > --- > > Key: NUTCH-3041 > URL: https://issues.apache.org/jira/browse/NUTCH-3041 > Project: Nutch > Issue Type: Task > Components: net >Affects Versions: 1.19, 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > URLExemptionFilter implementations are used to allow exemptions to external > domain resources by overriding the {{db.ignore.external.links}} configuration > setting. This is useful when the crawl is focused on a domain but resources > like images are hosted on a CDN. > Currently [URLExemptionFilters|#L47-L48]] provides the following logging > {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor > #0|#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' > {quote} > I find this confusing. It would be better to log *only* if a > URLExemptionFilter implementation is actually configured to be used at > runtime. > I will provide a patch for this.
[jira] [Commented] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters
[ https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846788#comment-17846788 ] ASF GitHub Bot commented on NUTCH-3041: --- lewismc merged PR #813: URL: https://github.com/apache/nutch/pull/813 > Address confusing logging in o.a.n.net.URLExemptionFilters > --- > > Key: NUTCH-3041 > URL: https://issues.apache.org/jira/browse/NUTCH-3041 > Project: Nutch > Issue Type: Task > Components: net >Affects Versions: 1.19, 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > URLExemptionFilter implementations are used to allow exemptions to external > domain resources by overriding the {{db.ignore.external.links}} configuration > setting. This is useful when the crawl is focused on a domain but resources > like images are hosted on a CDN. > Currently [URLExemptionFilters|#L47-L48]] provides the following logging > {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor > #0|#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' > {quote} > I find this confusing. It would be better to log *only* if a > URLExemptionFilter implementation is actually configured to be used at > runtime. > I will provide a patch for this.
[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters
[ https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846402#comment-17846402 ] Hudson commented on NUTCH-3043: --- SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #161 (See [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/161/]) NUTCH-3043 Generator: count URLs rejected by URL filters (#814) (github: [https://github.com/apache/nutch/commit/5f1330a03d136440a167a85da6cfe8ac4b3f61b9]) * (edit) src/java/org/apache/nutch/crawl/Generator.java > Generator: count URLs rejected by URL filters > - > > Key: NUTCH-3043 > URL: https://issues.apache.org/jira/browse/NUTCH-3043 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.21 > > > Generator already counts URLs rejected by the (re)fetch scheduler, by fetch > interval or status. It should also count the number of URLs rejected by URL > filters. > See also [Generator > metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator].
[jira] [Commented] (NUTCH-3039) Failure to handle ftp:// URLs
[ https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846401#comment-17846401 ] Hudson commented on NUTCH-3039: --- SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #161 (See [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/161/]) NUTCH-3039 Failure to handle ftp:// URLs (snagel: [https://github.com/apache/nutch/commit/ea9c7ee5d6635405b31b4a1d462cca746478b040]) * (edit) src/java/org/apache/nutch/plugin/URLStreamHandlerFactory.java > Failure to handle ftp:// URLs > - > > Key: NUTCH-3039 > URL: https://issues.apache.org/jira/browse/NUTCH-3039 > Project: Nutch > Issue Type: Bug > Components: plugin, protocol >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.21 > > > Nutch fails to handle ftp:// URLs: > - URLNormalizerBasic returns the empty string because creating the URL > instance fails with a MalformedURLException: > {noformat} > echo "ftp://ftp.example.com/path/file.txt" \ > | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat} > - fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due > to a MalformedURLException: > {noformat} > bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \ >"ftp://ftp.example.com/path/file.txt" > ... > Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: > java.net.MalformedURLException > at > org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113) > ...{noformat} > The issue is caused by NUTCH-2429: > - we do not provide a dedicated URL stream handler for ftp URLs > - but also do not pass ftp:// URLs to the standard JVM handler -- This message was sent by Atlassian Jira (v8.20.10#820010)
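The root cause named in NUTCH-3039 (no dedicated handler for ftp, and no delegation to the JVM) can be illustrated with a minimal sketch. It relies on the documented java.net.URL behaviour that a URLStreamHandlerFactory returning null lets the JVM fall back to its default handler lookup. The class name FallbackStreamHandlerFactory, the protocol set and the placeholder handler are assumptions made for illustration; this is not the code of the actual fix in src/java/org/apache/nutch/plugin/URLStreamHandlerFactory.java.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Locale;
import java.util.Set;

/** Hypothetical factory that defers selected protocols to the JVM's built-in handlers. */
class FallbackStreamHandlerFactory implements URLStreamHandlerFactory {

  // Assumption: protocols for which the JVM ships its own handler.
  private static final Set<String> JVM_PROTOCOLS =
      Set.of("http", "https", "file", "jar", "ftp");

  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    if (JVM_PROTOCOLS.contains(protocol.toLowerCase(Locale.ROOT))) {
      // Returning null tells java.net.URL to continue with its default lookup.
      return null;
    }
    // Placeholder for a plugin-provided handler; it only reports the gap.
    return new URLStreamHandler() {
      @Override
      protected URLConnection openConnection(URL u) throws IOException {
        throw new IOException("No handler implemented for " + u.getProtocol());
      }
    };
  }
}
```

A factory of this kind would be registered once per JVM with URL.setURLStreamHandlerFactory(...). If "ftp" were missing from the delegated set and no dedicated handler existed, constructing an ftp:// URL could fail in the way the issue describes.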
[jira] [Resolved] (NUTCH-3043) Generator: count URLs rejected by URL filters
[ https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3043. Resolution: Implemented > Generator: count URLs rejected by URL filters > - > > Key: NUTCH-3043 > URL: https://issues.apache.org/jira/browse/NUTCH-3043 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.21 > > > Generator already counts URLs rejected by the (re)fetch scheduler, by fetch > interval or status. It should also count the number of URLs rejected by URL > filters. > See also [Generator > metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters
[ https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846357#comment-17846357 ] ASF GitHub Bot commented on NUTCH-3043: --- sebastian-nagel commented on PR #814: URL: https://github.com/apache/nutch/pull/814#issuecomment-2110558876 Thanks, @lewismc! The metrics wiki page was updated. > Generator: count URLs rejected by URL filters > - > > Key: NUTCH-3043 > URL: https://issues.apache.org/jira/browse/NUTCH-3043 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.21 > > > Generator already counts URLs rejected by the (re)fetch scheduler, by fetch > interval or status. It should also count the number of URLs rejected by URL > filters. > See also [Generator > metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters
[ https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846355#comment-17846355 ] ASF GitHub Bot commented on NUTCH-3043: --- sebastian-nagel merged PR #814: URL: https://github.com/apache/nutch/pull/814 > Generator: count URLs rejected by URL filters > - > > Key: NUTCH-3043 > URL: https://issues.apache.org/jira/browse/NUTCH-3043 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.21 > > > Generator already counts URLs rejected by the (re)fetch scheduler, by fetch > interval or status. It should also count the number of URLs rejected by URL > filters. > See also [Generator > metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3039) Failure to handle ftp:// URLs
[ https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3039. Resolution: Fixed > Failure to handle ftp:// URLs > - > > Key: NUTCH-3039 > URL: https://issues.apache.org/jira/browse/NUTCH-3039 > Project: Nutch > Issue Type: Bug > Components: plugin, protocol >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.21 > > > Nutch fails to handle ftp:// URLs: > - URLNormalizerBasic returns the empty string because creating the URL > instance fails with a MalformedURLException: > {noformat} > echo "ftp://ftp.example.com/path/file.txt" \ > | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat} > - fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due > to a MalformedURLException: > {noformat} > bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \ >"ftp://ftp.example.com/path/file.txt" > ... > Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: > java.net.MalformedURLException > at > org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113) > ...{noformat} > The issue is caused by NUTCH-2429: > - we do not provide a dedicated URL stream handler for ftp URLs > - but also do not pass ftp:// URLs to the standard JVM handler -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3039) Failure to handle ftp:// URLs
[ https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846345#comment-17846345 ] ASF GitHub Bot commented on NUTCH-3039: --- sebastian-nagel merged PR #812: URL: https://github.com/apache/nutch/pull/812 > Failure to handle ftp:// URLs > - > > Key: NUTCH-3039 > URL: https://issues.apache.org/jira/browse/NUTCH-3039 > Project: Nutch > Issue Type: Bug > Components: plugin, protocol >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.21 > > > Nutch fails to handle ftp:// URLs: > - URLNormalizerBasic returns the empty string because creating the URL > instance fails with a MalformedURLException: > {noformat} > echo "ftp://ftp.example.com/path/file.txt" \ > | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat} > - fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due > to a MalformedURLException: > {noformat} > bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \ >"ftp://ftp.example.com/path/file.txt" > ... > Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: > java.net.MalformedURLException > at > org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113) > ...{noformat} > The issue is caused by NUTCH-2429: > - we do not provide a dedicated URL stream handler for ftp URLs > - but also do not pass ftp:// URLs to the standard JVM handler -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842526#comment-17842526 ] Joe Gilvary commented on NUTCH-585: --- [~dbeckstrom] I'm not sure which patch you were asking about. I used the source for the new 1.20 release and applied the patch that [~ad-...@gmx.at] posted after an edit to the line numbers for the update to src/plugin/build.xml. It built cleanly and seems to work exactly as advertised in my tests with indexchecker. > [PARSE-HTML plugin] Block certain parts of HTML code from being indexed > --- > > Key: NUTCH-585 > URL: https://issues.apache.org/jira/browse/NUTCH-585 > Project: Nutch > Issue Type: Improvement > Components: HTML, parse-filter, parser, plugin >Affects Versions: 0.9.0 > Environment: All operating systems >Reporter: Andrea Spinelli >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.21 > > Attachments: blacklist_whitelist_plugin.patch, > nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch > > > We are using nutch to index our own web sites; we would like not to index > certain parts of our pages, because we know they are not relevant (for > instance, there are several links to change the background color) and > generate spurious matches. > We have modified the plugin so that it ignores HTML code between certain HTML > comments, like > > ... ignored part ... > > We feel this might be useful to someone else, maybe factorizing the comment > strings as constants in the configuration files (say parser.html.ignore.start > and parser.html.ignore.stop in nutch-site.xml). > We are almost ready to contribute our code snippet. Looking forward to any > expression of interest - or to an explanation why what we are doing is > plain wrong! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions
[ https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842426#comment-17842426 ] Hudson commented on NUTCH-3054: --- SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #160 (See [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/160/]) NUTCH-3054 Address deprecation of Node16 for all GitHub Actions (#817) (github: [https://github.com/apache/nutch/commit/7ac3ce28e065fb5160f96ce7bce1ec840f87d0dc]) * (edit) .github/workflows/master-build.yml > Address deprecation of Node16 for all GitHub Actions > > > Key: NUTCH-3054 > URL: https://issues.apache.org/jira/browse/NUTCH-3054 > Project: Nutch > Issue Type: Task > Components: ci/cd >Affects Versions: 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > See > [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/] > We need to upgrade the setup-java action in > [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml] > > Patch coming up -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions
[ https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-3054. --- > Address deprecation of Node16 for all GitHub Actions > > > Key: NUTCH-3054 > URL: https://issues.apache.org/jira/browse/NUTCH-3054 > Project: Nutch > Issue Type: Task > Components: ci/cd >Affects Versions: 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > See > [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/] > We need to upgrade the setup-java action in > [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml] > > Patch coming up -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions
[ https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-3054. - Resolution: Fixed > Address deprecation of Node16 for all GitHub Actions > > > Key: NUTCH-3054 > URL: https://issues.apache.org/jira/browse/NUTCH-3054 > Project: Nutch > Issue Type: Task > Components: ci/cd >Affects Versions: 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > See > [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/] > We need to upgrade the setup-java action in > [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml] > > Patch coming up -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions
[ https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842410#comment-17842410 ] ASF GitHub Bot commented on NUTCH-3054: --- lewismc merged PR #817: URL: https://github.com/apache/nutch/pull/817 > Address deprecation of Node16 for all GitHub Actions > > > Key: NUTCH-3054 > URL: https://issues.apache.org/jira/browse/NUTCH-3054 > Project: Nutch > Issue Type: Task > Components: ci/cd >Affects Versions: 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > See > [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/] > We need to upgrade the setup-java action in > [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml] > > Patch coming up -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842384#comment-17842384 ] Markus Jelsma commented on NUTCH-3028: -- Ok, the Content object is now also available in the evaluation. I added an example of it to the description above. > WARCExported to support filtering by JEXL > - > > Key: NUTCH-3028 > URL: https://issues.apache.org/jira/browse/NUTCH-3028 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.21 > > Attachments: NUTCH-3028-1.patch, NUTCH-3028-2.patch, NUTCH-3028.patch > > > Filtering segment data to WARC is now possible using JEXL expressions. In the > next example, all records with SOME_KEY=SOME_VALUE in their parseData > metadata are exported to WARC. > {color:#00}-expr > 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'{color} > {color:#00}or {color} > {color:#00}-expr > 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")'{color} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3028: - Attachment: NUTCH-3028-2.patch > WARCExported to support filtering by JEXL > - > > Key: NUTCH-3028 > URL: https://issues.apache.org/jira/browse/NUTCH-3028 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.21 > > Attachments: NUTCH-3028-1.patch, NUTCH-3028-2.patch, NUTCH-3028.patch > > > Filtering segment data to WARC is now possible using JEXL expressions. In the > next example, all records with SOME_KEY=SOME_VALUE in their parseData > metadata are exported to WARC. > {color:#00}-expr > 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'{color} > {color:#00}or {color} > {color:#00}-expr > 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")'{color} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3028: - Description: Filtering segment data to WARC is now possible using JEXL expressions. In the next example, all records with SOME_KEY=SOME_VALUE in their parseData metadata are exported to WARC. {color:#00}-expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'{color} {color:#00}or {color} {color:#00}-expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")'{color} was: Filtering segment data to WARC is now possible using JEXL expressions. In the next example, all records with SOME_KEY=SOME_VALUE in their parseData metadata are exported to WARC. {color:#00}-expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'{color} > WARCExported to support filtering by JEXL > - > > Key: NUTCH-3028 > URL: https://issues.apache.org/jira/browse/NUTCH-3028 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.21 > > Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch > > > Filtering segment data to WARC is now possible using JEXL expressions. In the > next example, all records with SOME_KEY=SOME_VALUE in their parseData > metadata are exported to WARC. > {color:#00}-expr > 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'{color} > {color:#00}or {color} > {color:#00}-expr > 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")'{color} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3055) README: fix Github "hub" commands
[ https://issues.apache.org/jira/browse/NUTCH-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842308#comment-17842308 ] ASF GitHub Bot commented on NUTCH-3055: --- sebastian-nagel opened a new pull request, #818: URL: https://github.com/apache/nutch/pull/818 (no comment) > README: fix Github "hub" commands > - > > Key: NUTCH-3055 > URL: https://issues.apache.org/jira/browse/NUTCH-3055 > Project: Nutch > Issue Type: Bug > Components: documentation >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Trivial > Fix For: 1.21 > > > The [README.md|https://github.com/apache/nutch/blob/master/README.md] > contains [Github hub|https://hub.github.com/] commands but with "git" as > command (executable) name, maybe an alias or some other magic. However, if > hub isn't installed, these commands fail with {{git: 'pull-request' is not a > git command. See 'git --help'.}} or similar. > We should use the command "hub" instead. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3055) README: fix Github "hub" commands
Sebastian Nagel created NUTCH-3055: -- Summary: README: fix Github "hub" commands Key: NUTCH-3055 URL: https://issues.apache.org/jira/browse/NUTCH-3055 Project: Nutch Issue Type: Bug Components: documentation Affects Versions: 1.20 Reporter: Sebastian Nagel Assignee: Sebastian Nagel Fix For: 1.21 The [README.md|https://github.com/apache/nutch/blob/master/README.md] contains [Github hub|https://hub.github.com/] commands but with "git" as command (executable) name, maybe an alias or some other magic. However, if hub isn't installed, these commands fail with {{git: 'pull-request' is not a git command. See 'git --help'.}} or similar. We should use the command "hub" instead. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842291#comment-17842291 ] Sebastian Nagel commented on NUTCH-3028: +1 lgtm. One question: if there is no parseData, the JEXL expression is not evaluated. Since WARC files may include only the raw HTML plus fetch/capture metadata, successfully parsing a document is not a requirement to archive it in a WARC file. Might be useful to have the JEXL filtering also available for unparsed docs. > WARCExported to support filtering by JEXL > - > > Key: NUTCH-3028 > URL: https://issues.apache.org/jira/browse/NUTCH-3028 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.21 > > Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch > > > Filtering segment data to WARC is now possible using JEXL expressions. In the > next example, all records with SOME_KEY=SOME_VALUE in their parseData > metadata are exported to WARC. > {color:#00}-expr > 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'{color} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3045) Upgrade from Java 11 to 17
[ https://issues.apache.org/jira/browse/NUTCH-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842284#comment-17842284 ] Sebastian Nagel commented on NUTCH-3045: See also NUTCH-2987. Until HADOOP-17177 / HADOOP-18887 are done, we might be forced to upkeep JDK 11 runtime compatibility, so that Nutch runs on recent Hadoop versions and distributions. I fully agree that Java 17 offers some nice syntax improvements, though. :) > Upgrade from Java 11 to 17 > -- > > Key: NUTCH-3045 > URL: https://issues.apache.org/jira/browse/NUTCH-3045 > Project: Nutch > Issue Type: Task > Components: build, ci/cd >Reporter: Lewis John McGibbney >Priority: Critical > Fix For: 1.21 > > > This parent issue will track and organize work pertaining to upgrading Nutch > to JDK 17. > Premier support for Oracle JDK 11 ended 7 months ago (30 Sep 2023). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions
[ https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842209#comment-17842209 ] ASF GitHub Bot commented on NUTCH-3054: --- lewismc opened a new pull request, #817: URL: https://github.com/apache/nutch/pull/817 Addresses https://issues.apache.org/jira/browse/NUTCH-3054 > Address deprecation of Node16 for all GitHub Actions > > > Key: NUTCH-3054 > URL: https://issues.apache.org/jira/browse/NUTCH-3054 > Project: Nutch > Issue Type: Task > Components: ci/cd >Affects Versions: 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > See > [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/] > We need to upgrade the setup-java action in > [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml] > > Patch coming up -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions
[ https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-3054: Affects Version/s: 1.20 > Address deprecation of Node16 for all GitHub Actions > > > Key: NUTCH-3054 > URL: https://issues.apache.org/jira/browse/NUTCH-3054 > Project: Nutch > Issue Type: Task > Components: ci/cd >Affects Versions: 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > See > [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/] > We need to upgrade the setup-java action in > [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml] > > Patch coming up -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions
Lewis John McGibbney created NUTCH-3054: --- Summary: Address deprecation of Node16 for all GitHub Actions Key: NUTCH-3054 URL: https://issues.apache.org/jira/browse/NUTCH-3054 Project: Nutch Issue Type: Task Components: ci/cd Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.21 See [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/] We need to upgrade the setup-java action in [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml] Patch coming up -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work started] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions
[ https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3054 started by Lewis John McGibbney. --- > Address deprecation of Node16 for all GitHub Actions > > > Key: NUTCH-3054 > URL: https://issues.apache.org/jira/browse/NUTCH-3054 > Project: Nutch > Issue Type: Task > Components: ci/cd >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > See > [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/] > We need to upgrade the setup-java action in > [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml] > > Patch coming up -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3049) Investigate using Records
[ https://issues.apache.org/jira/browse/NUTCH-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842208#comment-17842208 ] Lewis John McGibbney commented on NUTCH-3049: - I think that each of the Writable classes mentioned in NutchWritable may be fair game {{ org.apache.nutch.crawl.CrawlDatum.class,}} {{ org.apache.nutch.crawl.Inlink.class,}} {{ org.apache.nutch.crawl.Inlinks.class,}} {{ org.apache.nutch.indexer.NutchIndexAction.class,}} {{ org.apache.nutch.metadata.Metadata.class,}} {{ org.apache.nutch.parse.Outlink.class,}} {{ org.apache.nutch.parse.ParseText.class,}} {{ org.apache.nutch.parse.ParseData.class,}} {{ org.apache.nutch.parse.ParseImpl.class,}} {{ org.apache.nutch.parse.ParseStatus.class,}} {{ org.apache.nutch.protocol.Content.class,}} {{ org.apache.nutch.protocol.ProtocolStatus.class,}} {{ org.apache.nutch.scoring.webgraph.LinkDatum.class,}} {{ org.apache.nutch.hostdb.HostDatum.class}} > Investigate using Records > - > > Key: NUTCH-3049 > URL: https://issues.apache.org/jira/browse/NUTCH-3049 > Project: Nutch > Issue Type: Sub-task >Reporter: Lewis John McGibbney >Priority: Major > > Guidance at [https://www.baeldung.com/java-migrate-8-to-17#records] > I think there are multiple areas where we could use Records. This ticket will > document the opportunities and structure that work. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3053) Upgrade build and CI to JDK17
Lewis John McGibbney created NUTCH-3053: --- Summary: Upgrade build and CI to JDK17 Key: NUTCH-3053 URL: https://issues.apache.org/jira/browse/NUTCH-3053 Project: Nutch Issue Type: Sub-task Components: build, ci/cd Reporter: Lewis John McGibbney This will involve changes to * [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml] * [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/] * [https://github.com/apache/nutch/blob/master/default.properties#L46] * [https://github.com/apache/nutch/blob/master/default.properties#L57] * We should also investigate any deprecation notices in the build output * [https://github.com/apache/nutch/blob/master/ivy/mvn.template#L128-L129] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3052) Investigate using sealed classes
Lewis John McGibbney created NUTCH-3052: --- Summary: Investigate using sealed classes Key: NUTCH-3052 URL: https://issues.apache.org/jira/browse/NUTCH-3052 Project: Nutch Issue Type: Sub-task Reporter: Lewis John McGibbney Guidance available at [https://www.baeldung.com/java-migrate-8-to-17#sealed-classes] First document if and where sealed classes would add value. -- This message was sent by Atlassian Jira (v8.20.10#820010)
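As a starting point for documenting where sealed classes might add value, here is a minimal sketch of the Java 17 feature. All type names (FetchResult, Fetched, Failed, SealedDemo) are hypothetical and do not correspond to existing Nutch classes.

```java
/** Hypothetical closed hierarchy: only the two listed types may implement it. */
sealed interface FetchResult permits Fetched, Failed {}

/** Records are implicitly final, so they satisfy the sealed contract. */
record Fetched(int httpStatus) implements FetchResult {}

record Failed(String reason) implements FetchResult {}

class SealedDemo {
  /** Handles every permitted subtype; no other implementation can exist. */
  static String summarize(FetchResult result) {
    if (result instanceof Fetched f) {
      return "fetched with status " + f.httpStatus();
    }
    if (result instanceof Failed x) {
      return "failed: " + x.reason();
    }
    throw new AssertionError("unreachable: FetchResult is sealed");
  }
}
```

The possible benefit is that the set of subtypes is fixed at compile time, which makes such case analyses easier to review.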
[jira] [Created] (NUTCH-3051) Investigate using new pattern matching syntax in switch expressions
Lewis John McGibbney created NUTCH-3051: --- Summary: Investigate using new pattern matching syntax in switch expressions Key: NUTCH-3051 URL: https://issues.apache.org/jira/browse/NUTCH-3051 Project: Nutch Issue Type: Sub-task Reporter: Lewis John McGibbney Guidance available at [https://www.baeldung.com/java-migrate-8-to-17#2-switch-expressions] Apparently we use switch in 35 files [https://github.com/search?q=repo%3Aapache%2Fnutch+switch+language%3AJava=code=Java] -- This message was sent by Atlassian Jira (v8.20.10#820010)
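For orientation, a minimal sketch of a switch expression (standard since Java 14) follows. The enum Outcome and the method shouldRefetch are hypothetical and only illustrate the syntax. Note that type patterns inside switch were still a preview feature in JDK 17, so the sketch limits itself to the arrow form with multiple labels.

```java
class SwitchDemo {
  /** Hypothetical outcome of processing a URL. */
  enum Outcome { SUCCESS, REDIRECT, GONE, RETRY }

  /** The switch yields a value directly; no break statements, no fall-through. */
  static boolean shouldRefetch(Outcome outcome) {
    return switch (outcome) {
      case SUCCESS, GONE -> false;
      case REDIRECT, RETRY -> true;
    };
  }
}
```

Because the switch covers every enum constant, no default branch is needed, and the compiler reports an error if a constant is added later without a matching case.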
[jira] [Created] (NUTCH-3050) Investigate use of the enhanced instanceof operator
Lewis John McGibbney created NUTCH-3050: --- Summary: Investigate use of the enhanced instanceof operator Key: NUTCH-3050 URL: https://issues.apache.org/jira/browse/NUTCH-3050 Project: Nutch Issue Type: Sub-task Reporter: Lewis John McGibbney Guidance at [https://www.baeldung.com/java-migrate-8-to-17#1-enhanced-instanceof-operator] Apparently we use instanceof operator in 50 files [https://github.com/search?q=repo%3Aapache%2Fnutch%20instanceof=code] -- This message was sent by Atlassian Jira (v8.20.10#820010)
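To show what such a migration would look like, here is a minimal sketch comparing the classic instanceof-plus-cast idiom with pattern matching for instanceof (standard since Java 16). The class and method names are hypothetical and not taken from the Nutch code base.

```java
class InstanceofDemo {
  /** Classic style: type test followed by an explicit cast. */
  static String describeOld(Object value) {
    if (value instanceof CharSequence) {
      CharSequence cs = (CharSequence) value;
      return "text of length " + cs.length();
    }
    return "other";
  }

  /** Java 16+ style: the binding variable cs replaces the cast. */
  static String describeNew(Object value) {
    if (value instanceof CharSequence cs) {
      return "text of length " + cs.length();
    }
    return "other";
  }
}
```

Both methods behave identically; the second only removes the redundant cast and the extra local variable declaration.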
[jira] [Created] (NUTCH-3049) Investigate using Records
Lewis John McGibbney created NUTCH-3049: --- Summary: Investigate using Records Key: NUTCH-3049 URL: https://issues.apache.org/jira/browse/NUTCH-3049 Project: Nutch Issue Type: Sub-task Reporter: Lewis John McGibbney Guidance at [https://www.baeldung.com/java-migrate-8-to-17#records] I think there are multiple areas where we could use Records. This ticket will document the opportunities and structure that work. -- This message was sent by Atlassian Jira (v8.20.10#820010)
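A minimal sketch of a record (standard since Java 16) may help when collecting candidates. The type LinkInfo and its fields are hypothetical and not an existing Nutch class.

```java
import java.util.Objects;

/** Hypothetical immutable value type; accessors, equals, hashCode and toString are generated. */
record LinkInfo(String url, String anchor) {

  /** Compact constructor: validates and normalizes before the fields are assigned. */
  LinkInfo {
    Objects.requireNonNull(url, "url");
    anchor = (anchor == null) ? "" : anchor;
  }
}
```

One point to weigh during the investigation: record fields are final, whereas Hadoop's Writable.readFields(DataInput) populates an existing instance, so classes that implement Writable could not be converted one-to-one without some adapter.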
[jira] [Created] (NUTCH-3048) Investigate where/if new string utility methods could be used
Lewis John McGibbney created NUTCH-3048: --- Summary: Investigate where/if new string utility methods could be used Key: NUTCH-3048 URL: https://issues.apache.org/jira/browse/NUTCH-3048 Project: Nutch Issue Type: Sub-task Components: util Reporter: Lewis John McGibbney Guidance at [https://www.baeldung.com/java-migrate-8-to-17#3-new-string-methods] We may also be able to revisit our usage of common-* libraries with the goal of using native methods from the JDK. -- This message was sent by Atlassian Jira (v8.20.10#820010)
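As an illustration of the kind of replacement this sub-task could look for, here is a minimal sketch using String.isBlank(), String.strip() and String.lines(), all available since Java 11. The helper class StringDemo and its methods are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

class StringDemo {
  /**
   * Null-safe blank check using only JDK methods. String.isBlank() alone is
   * not null-safe, so the explicit null test is still needed.
   */
  static boolean isBlank(String s) {
    return s == null || s.isBlank();
  }

  /** Splits text into stripped, non-empty lines. */
  static List<String> cleanLines(String text) {
    return text.lines()
        .map(String::strip)
        .filter(line -> !line.isEmpty())
        .collect(Collectors.toList());
  }
}
```

Whether a given call into a utility library can be replaced depends on such edge cases (null handling, definition of whitespace), so each usage would need to be checked individually.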
[jira] [Created] (NUTCH-3047) Use multi-line text blocks
Lewis John McGibbney created NUTCH-3047: --- Summary: Use multi-line text blocks Key: NUTCH-3047 URL: https://issues.apache.org/jira/browse/NUTCH-3047 Project: Nutch Issue Type: Sub-task Components: CLI Reporter: Lewis John McGibbney Guidance available at [https://www.baeldung.com/java-migrate-8-to-17#2-text-block] This will help to cleanup our CLI *usage()* messages at a bare minimum. -- This message was sent by Atlassian Jira (v8.20.10#820010)
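A minimal sketch of how a usage() message could look with a text block (standard since Java 15) follows. The tool name "mytool" and its options are hypothetical and not an actual Nutch command line.

```java
class UsageDemo {
  /** Pre-Java-15 style: concatenated literals with explicit newlines. */
  static final String USAGE_OLD =
      "Usage: mytool [options]\n"
      + "  -in <dir>   input directory\n"
      + "  -out <dir>  output directory\n";

  /** Text block: common leading indentation is stripped by the compiler. */
  static final String USAGE = """
      Usage: mytool [options]
        -in <dir>   input directory
        -out <dir>  output directory
      """;
}
```

Both constants hold the same string; the text block simply keeps the message readable in the source as it will appear on the console.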
[jira] [Updated] (NUTCH-3046) Use compact strings
[ https://issues.apache.org/jira/browse/NUTCH-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-3046: Description: Follow the guidance at [https://www.baeldung.com/java-migrate-8-to-17#1-compact-string] It looks like there are 9 instances where we use _*char []*_ |[https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D=code]]. was: Follow the guidance at [https://www.baeldung.com/java-migrate-8-to-17#1-compact-string] It looks like there are [9 instances where we use char[]|[https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D=code]]. > Use compact strings > --- > > Key: NUTCH-3046 > URL: https://issues.apache.org/jira/browse/NUTCH-3046 > Project: Nutch > Issue Type: Sub-task >Reporter: Lewis John McGibbney >Priority: Major > > Follow the guidance at > [https://www.baeldung.com/java-migrate-8-to-17#1-compact-string] > It looks like there are 9 instances where we use _*char []*_ > |[https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D=code]]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-1806) Delegate processing of URL domains to crawler commons
[ https://issues.apache.org/jira/browse/NUTCH-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841995#comment-17841995 ] ASF GitHub Bot commented on NUTCH-1806: --- sebastian-nagel opened a new pull request, #816: URL: https://github.com/apache/nutch/pull/816 and NUTCH-1942 Remove TopLevelDomain - use methods from crawler-commons' EffectiveTldFinder in URLUtil replacing classes and methods from the "org.apache.nutch.util.domain" package - adapt and extend unit tests - add tests for URLUtil.getTopLevelDomainName(url) - reflect changes to the public suffix list since 2014 ("xyz" is now a public suffix / ICANN suffix) - adapt to minor API changes - URLUtil.getDomainName(url) returns the host name in case no valid public suffix is found - for Unicode suffixes and TLDs the methods URLUtil.getDomainSuffix(url) and URLUtil.getTopLevelDomainName(url) now return the ASCII representation - add unit tests for host names with trailing dot ("www.apache.org.") - add unit test for URLs without host/domain (cf. NUTCH-2450) - update and complete Javadoc - update DomainStatistics, TLDIndexingFilter and domain URL filters to use the updated methods in URLUtil - remove the class TLDScoringFilter. The configuration is bound to the domain-suffixes.xml which wasn't maintained anymore and is now removed - remove package org.apache.nutch.util.domain - move DomainStatistics to org.apache.nutch.util - remove configuration files of domain utils > Delegate processing of URL domains to crawler commons > - > > Key: NUTCH-1806 > URL: https://issues.apache.org/jira/browse/NUTCH-1806 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.8 >Reporter: Julien Nioche >Priority: Major > Labels: crawler-commons > Fix For: 1.21 > > > We have code in src/java/org/apache/nutch/util/domain and a resource file > conf/domain-suffixes.xml to handle URL domains. This is used mostly from > URLUtil.getDomainName. 
> The resource file is not necessarily up to date and since crawler commons has > a similar functionality we should use it instead of having to maintain our > own resources. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3046) Use compact strings
Lewis John McGibbney created NUTCH-3046: --- Summary: Use compact strings Key: NUTCH-3046 URL: https://issues.apache.org/jira/browse/NUTCH-3046 Project: Nutch Issue Type: Sub-task Reporter: Lewis John McGibbney Follow the guidance at [https://www.baeldung.com/java-migrate-8-to-17#1-compact-string] It looks like there are [9 instances where we use char[]|[https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D=code]]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3045) Upgrade from Java 11 to 17
Lewis John McGibbney created NUTCH-3045: --- Summary: Upgrade from Java 11 to 17 Key: NUTCH-3045 URL: https://issues.apache.org/jira/browse/NUTCH-3045 Project: Nutch Issue Type: Task Components: build, ci/cd Reporter: Lewis John McGibbney Fix For: 1.21 This parent issue will track and organize work pertaining to upgrading Nutch to JDK 17. Premier support for Oracle JDK 11 ended 7 months ago (30 Sep 2023). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails
[ https://issues.apache.org/jira/browse/NUTCH-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841682#comment-17841682 ] ASF GitHub Bot commented on NUTCH-3044: --- lewismc commented on PR #815: URL: https://github.com/apache/nutch/pull/815#issuecomment-2081564107 Excellent @sebastian-nagel +1 > Generator: NPE when extracting the host part of a URL fails > --- > > Key: NUTCH-3044 > URL: https://issues.apache.org/jira/browse/NUTCH-3044 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Priority: Minor > Fix For: 1.21 > > > When extracting the host part of a URL fails, the Generator job fails because > of an NPE in the SelectorReducer. This issue is reproducible if the CrawlDb > contains a malformed URL, for example, a URL with an unsupported scheme > (smb://). > {noformat} > Caused by: java.lang.NullPointerException > at > org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:439) > at > org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:300) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters
[ https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841681#comment-17841681 ] ASF GitHub Bot commented on NUTCH-3043: --- lewismc commented on PR #814: URL: https://github.com/apache/nutch/pull/814#issuecomment-2081563229 Excellent @sebastian-nagel > Generator: count URLs rejected by URL filters > - > > Key: NUTCH-3043 > URL: https://issues.apache.org/jira/browse/NUTCH-3043 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.21 > > > Generator already counts URLs rejected by the (re)fetch scheduler, by fetch > interval or status. It should also count the number of URLs rejected by URL > filters. > See also [Generator > metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails
[ https://issues.apache.org/jira/browse/NUTCH-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841481#comment-17841481 ] ASF GitHub Bot commented on NUTCH-3044: --- sebastian-nagel commented on PR #815: URL: https://github.com/apache/nutch/pull/815#issuecomment-2080743831 ... also fixed the Javadoc error. > Generator: NPE when extracting the host part of a URL fails > --- > > Key: NUTCH-3044 > URL: https://issues.apache.org/jira/browse/NUTCH-3044 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Priority: Minor > Fix For: 1.21 > > > When extracting the host part of a URL fails, the Generator job fails because > of an NPE in the SelectorReducer. This issue is reproducible if the CrawlDb > contains a malformed URL, for example, a URL with an unsupported scheme > (smb://). > {noformat} > Caused by: java.lang.NullPointerException > at > org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:439) > at > org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:300) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters
[ https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841472#comment-17841472 ] ASF GitHub Bot commented on NUTCH-3043: --- sebastian-nagel commented on PR #814: URL: https://github.com/apache/nutch/pull/814#issuecomment-2080634329 Hi @lewismc: - "use parameterized logging": done - "augment the [metrics documentation](https://cwiki.apache.org/confluence/display/NUTCH/Metrics) once this is merged.": will do - "we could also [create a test for the counters](https://cwiki.apache.org/confluence/display/MRUNIT/MRUnit+Tutorial#MRUnitTutorial-TestingCounters).": for now, TestGenerator is not based on MRUnit. The various Generator::generate(...) methods return the number of generated segments without a way to access the counters (they're logged, however). I'd prefer to track this in a separate issue, because it would require too many code changes to read the counters. > Generator: count URLs rejected by URL filters > - > > Key: NUTCH-3043 > URL: https://issues.apache.org/jira/browse/NUTCH-3043 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.21 > > > Generator already counts URLs rejected by the (re)fetch scheduler, by fetch > interval or status. It should also count the number of URLs rejected by URL > filters. > See also [Generator > metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails
[ https://issues.apache.org/jira/browse/NUTCH-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841470#comment-17841470 ] ASF GitHub Bot commented on NUTCH-3044: --- sebastian-nagel commented on PR #815: URL: https://github.com/apache/nutch/pull/815#issuecomment-2080603546 > we could provide a TestGenerator#testNullHostInReducer test case Good idea! Done, see 4729786. > Generator: NPE when extracting the host part of a URL fails > --- > > Key: NUTCH-3044 > URL: https://issues.apache.org/jira/browse/NUTCH-3044 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Priority: Minor > Fix For: 1.21 > > > When extracting the host part of a URL fails, the Generator job fails because > of an NPE in the SelectorReducer. This issue is reproducible if the CrawlDb > contains a malformed URL, for example, a URL with an unsupported scheme > (smb://). > {noformat} > Caused by: java.lang.NullPointerException > at > org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:439) > at > org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:300) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters
[ https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840892#comment-17840892 ] ASF GitHub Bot commented on NUTCH-3043: --- lewismc commented on code in PR #814: URL: https://github.com/apache/nutch/pull/814#discussion_r1579883313 ## src/java/org/apache/nutch/crawl/Generator.java: ## @@ -253,10 +256,7 @@ public void map(Text key, CrawlDatum value, Context context) try { sort = scfilters.generatorSortValue(key, crawlDatum, sort); } catch (ScoringFilterException sfe) { -if (LOG.isWarnEnabled()) { - LOG.warn( - "Couldn't filter generatorSortValue for " + key + ": " + sfe); -} +LOG.warn("Couldn't filter generatorSortValue for " + key + ": " + sfe); Review Comment: Please use parameterized logging. ``` LOG.warn("Couldn't filter generatorSortValue for {}: {}", key, sfe); ``` > Generator: count URLs rejected by URL filters > - > > Key: NUTCH-3043 > URL: https://issues.apache.org/jira/browse/NUTCH-3043 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.21 > > > Generator already counts URLs rejected by the (re)fetch scheduler, by fetch > interval or status. It should also count the number of URLs rejected by URL > filters. > See also [Generator > metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails
[ https://issues.apache.org/jira/browse/NUTCH-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840854#comment-17840854 ] ASF GitHub Bot commented on NUTCH-3044: --- sebastian-nagel opened a new pull request, #815: URL: https://github.com/apache/nutch/pull/815 (no comment) > Generator: NPE when extracting the host part of a URL fails > --- > > Key: NUTCH-3044 > URL: https://issues.apache.org/jira/browse/NUTCH-3044 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Priority: Minor > Fix For: 1.21 > > > When extracting the host part of a URL fails, the Generator job fails because > of an NPE in the SelectorReducer. This issue is reproducible if the CrawlDb > contains a malformed URL, for example, a URL with an unsupported scheme > (smb://). > {noformat} > Caused by: java.lang.NullPointerException > at > org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:439) > at > org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:300) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails
Sebastian Nagel created NUTCH-3044: -- Summary: Generator: NPE when extracting the host part of a URL fails Key: NUTCH-3044 URL: https://issues.apache.org/jira/browse/NUTCH-3044 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 1.20 Reporter: Sebastian Nagel Fix For: 1.21 When extracting the host part of a URL fails, the Generator job fails because of an NPE in the SelectorReducer. This issue is reproducible if the CrawlDb contains a malformed URL, for example, a URL with an unsupported scheme (smb://). {noformat} Caused by: java.lang.NullPointerException at org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:439) at org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:300) {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
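The failure mode described in NUTCH-3044 can be sketched with plain JDK classes. The code below is not the patch from PR #815 and does not reproduce Generator.java; the class and method names are hypothetical. It assumes a stock JVM with no stream handler registered for smb://, so that java.net.URL rejects such URLs, and it shows one way to skip entries whose host cannot be extracted instead of dereferencing null.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not the actual Nutch fix.
class HostCountSketch {

  /** Returns the lower-cased host, or null if it cannot be extracted. */
  static String hostOrNull(String url) {
    try {
      String host = new URL(url).getHost();
      return host.isEmpty() ? null : host.toLowerCase(Locale.ROOT);
    } catch (MalformedURLException e) {
      return null; // e.g. unsupported scheme such as smb://
    }
  }

  /** Counts URLs per host, skipping URLs without a usable host. */
  static Map<String, Integer> countByHost(List<String> urls) {
    Map<String, Integer> counts = new HashMap<>();
    for (String url : urls) {
      String host = hostOrNull(url);
      if (host == null) {
        continue; // guard: never use a null host as a key
      }
      counts.merge(host, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    System.out.println(countByHost(List.of(
        "https://nutch.apache.org/a",
        "https://nutch.apache.org/b",
        "smb://fileserver/share/file.txt")));
  }
}
```

Whether such URLs should be skipped silently, counted, or logged is a design decision this sketch does not settle.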
[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters
[ https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840845#comment-17840845 ] ASF GitHub Bot commented on NUTCH-3043: --- sebastian-nagel opened a new pull request, #814: URL: https://github.com/apache/nutch/pull/814 - add counters URL_FILTERS_REJECTED and URL_FILTER_EXCEPTION - simplify logging statement - remove unnecessary cast > Generator: count URLs rejected by URL filters > - > > Key: NUTCH-3043 > URL: https://issues.apache.org/jira/browse/NUTCH-3043 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.21 > > > Generator already counts URLs rejected by the (re)fetch scheduler, by fetch > interval or status. It should also count the number of URLs rejected by URL > filters. > See also [Generator > metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3043) Generator: count URLs rejected by URL filters
Sebastian Nagel created NUTCH-3043: -- Summary: Generator: count URLs rejected by URL filters Key: NUTCH-3043 URL: https://issues.apache.org/jira/browse/NUTCH-3043 Project: Nutch Issue Type: Improvement Components: generator Affects Versions: 1.20 Reporter: Sebastian Nagel Assignee: Sebastian Nagel Fix For: 1.21 Generator already counts URLs rejected by the (re)fetch scheduler, by fetch interval or status. It should also count the number of URLs rejected by URL filters. See also [Generator metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
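The counting requested in NUTCH-3043 can be illustrated without Hadoop. The sketch below uses a plain map in place of MapReduce job counters. The counter names URL_FILTERS_REJECTED and URL_FILTER_EXCEPTION are those named in PR #814; the ACCEPTED counter, the class name, and the convention that a filter returns null to reject a URL are assumptions made for this example.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: a map stands in for Hadoop job counters.
class FilterCounterSketch {

  // ACCEPTED is an assumption of this sketch, not a counter from PR #814.
  enum Counter { URL_FILTERS_REJECTED, URL_FILTER_EXCEPTION, ACCEPTED }

  /**
   * Applies the filter to every URL and counts the outcomes.
   * Assumption: the filter returns null to reject a URL.
   */
  static Map<Counter, Long> count(List<String> urls, UnaryOperator<String> filter) {
    Map<Counter, Long> counters = new EnumMap<>(Counter.class);
    for (String url : urls) {
      Counter outcome;
      try {
        outcome = filter.apply(url) == null
            ? Counter.URL_FILTERS_REJECTED
            : Counter.ACCEPTED;
      } catch (RuntimeException e) {
        outcome = Counter.URL_FILTER_EXCEPTION; // filter failed on this URL
      }
      counters.merge(outcome, 1L, Long::sum);
    }
    return counters;
  }

  public static void main(String[] args) {
    UnaryOperator<String> httpsOnly = u -> u.startsWith("https://") ? u : null;
    System.out.println(count(
        List.of("https://a.example/", "ftp://b.example/"), httpsOnly));
  }
}
```

Keeping rejections and exceptions in separate counters makes it possible to tell a restrictive filter configuration apart from a failing filter.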
[jira] [Commented] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters
[ https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839186#comment-17839186 ] ASF GitHub Bot commented on NUTCH-3041: --- lewismc commented on PR #813: URL: https://github.com/apache/nutch/pull/813#issuecomment-2067543713 The logging now looks as follows ```INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] Found 1 URLExemptionFilter implementations: '[org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter@3090c372]'```. If no URLExemptionFilter implementations are found then no log statement is produced. > Address confusing logging in o.a.n.net.URLExemptionFilters > --- > > Key: NUTCH-3041 > URL: https://issues.apache.org/jira/browse/NUTCH-3041 > Project: Nutch > Issue Type: Task > Components: net >Affects Versions: 1.19, 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > URLExemptionFilter implementations are used to allow exemptions to external > domain resources by overriding the {{db.ignore.external.links}} configuration > setting. This is useful when the crawl is focused to a domain but resources > like images are hosted on CDN. > Currently [URLExemptionFilters|#L47-L48]] provides the following logging > {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor > #0|#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' > {quote} > I find this confusing. It would be better to log *only* if an > URLExemptionFilter implementation is actually configured to be used at > runtime. > I will provide a patch for this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
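The behaviour described in this comment (log only when at least one implementation is found) can be sketched as follows. This is not the code from PR #813. The class and method names are hypothetical, and the message text is modelled on the log line quoted above.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of "log only when something was found".
class ExemptionLogSketch {

  /** Returns the message to log, or empty if nothing should be logged. */
  static Optional<String> logLine(List<?> filters) {
    if (filters.isEmpty()) {
      return Optional.empty(); // no implementations: stay silent
    }
    return Optional.of("Found " + filters.size()
        + " URLExemptionFilter implementations: '" + filters + "'");
  }

  public static void main(String[] args) {
    // Prints nothing for an empty list, one line otherwise.
    logLine(List.of()).ifPresent(System.out::println);
    logLine(List.of("ExampleFilter")).ifPresent(System.out::println);
  }
}
```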
[jira] [Updated] (NUTCH-3042) Use GitHub cache action to improve CI execution time
[ https://issues.apache.org/jira/browse/NUTCH-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-3042: Description: With the Ant+Ivy build architecture, the current GitHub actions workflow can and regularly does take over 20 minutes to complete. Dependency retrieval takes a significant amount of time. I think we can address the above issue and dramatically reduce the CI runtime by utilizing the official [GitHub cache action|[https://github.com/actions/cache]]. It appears however that the action does not support the Apache Ivy cache. Both Maven and Gradle are supported. I [created a discussion|[https://github.com/actions/cache/discussions/1381]] to get confirmation. In the case that we cannot implement a cache for the Ivy build system then we will need to come back to this issue once we migrate to Gradle. was: With the Ant+Ivy build architecture, the current GitHub actions workflow can and regularly does take over 20 minutes to complete. Dependency retrieval takes a significant amount of time. I think we can address the above issue and dramatically reduce the CI runtime by utilizing the official [GitHub cache action|[https://github.com/actions/cache]]. It appears however that the action does not support the Apache Ivy cache. Both Maven and Gradle are supported. I created a discussion to get confirmation if this is the case. In the case that we cannot implement a cache for the Ivy build system then we will need to come back to this issue once we migrate to Gradle. > Use GitHub cache action to improve CI execution time > > > Key: NUTCH-3042 > URL: https://issues.apache.org/jira/browse/NUTCH-3042 > Project: Nutch > Issue Type: Task > Components: ci/cd >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.21 > > > With the Ant+Ivy build architecture, the current GitHub actions workflow can > and regularly does take over 20 minutes to complete. 
Dependency retrieval > takes a significant amount of time. > I think we can address the above issue and dramatically reduce the CI runtime > by utilizing the official [GitHub cache > action|[https://github.com/actions/cache]]. > It appears however that the action does not support the Apache Ivy cache. > Both Maven and Gradle are supported. I [created a > discussion|[https://github.com/actions/cache/discussions/1381]] to get > confirmation. > In the case that we cannot implement a cache for the Ivy build system then we > will need to come back to this issue once we migrate to Gradle. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3042) Use GitHub cache action to improve CI execution time
Lewis John McGibbney created NUTCH-3042: --- Summary: Use GitHub cache action to improve CI execution time Key: NUTCH-3042 URL: https://issues.apache.org/jira/browse/NUTCH-3042 Project: Nutch Issue Type: Task Components: ci/cd Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.21 With the Ant+Ivy build architecture, the current GitHub actions workflow can and regularly does take over 20 minutes to complete. Dependency retrieval takes a significant amount of time. I think we can address the above issue and dramatically reduce the CI runtime by utilizing the official [GitHub cache action|[https://github.com/actions/cache]]. It appears however that the action does not support the Apache Ivy cache. Both Maven and Gradle are supported. I created a discussion to get confirmation if this is the case. In the case that we cannot implement a cache for the Ivy build system then we will need to come back to this issue once we migrate to Gradle. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work started] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters
[ https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3041 started by Lewis John McGibbney. --- > Address confusing logging in o.a.n.net.URLExemptionFilters > --- > > Key: NUTCH-3041 > URL: https://issues.apache.org/jira/browse/NUTCH-3041 > Project: Nutch > Issue Type: Task > Components: net >Affects Versions: 1.19, 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > URLExemptionFilter implementations are used to allow exemptions to external > domain resources by overriding the {{db.ignore.external.links}} configuration > setting. This is useful when the crawl is focused to a domain but resources > like images are hosted on CDN. > Currently [URLExemptionFilters|#L47-L48]] provides the following logging > {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor > #0|#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' > {quote} > I find this confusing. It would be better to log *only* if an > URLExemptionFilter implementation is actually configured to be used at > runtime. > I will provide a patch for this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters
[ https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839181#comment-17839181 ] ASF GitHub Bot commented on NUTCH-3041: --- lewismc opened a new pull request, #813: URL: https://github.com/apache/nutch/pull/813 PR to address https://issues.apache.org/jira/browse/NUTCH-3041 > Address confusing logging in o.a.n.net.URLExemptionFilters > --- > > Key: NUTCH-3041 > URL: https://issues.apache.org/jira/browse/NUTCH-3041 > Project: Nutch > Issue Type: Task > Components: net >Affects Versions: 1.19, 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > URLExemptionFilter implementations are used to allow exemptions to external > domain resources by overriding the {{db.ignore.external.links}} configuration > setting. This is useful when the crawl is focused to a domain but resources > like images are hosted on CDN. > Currently [URLExemptionFilters|#L47-L48]] provides the following logging > {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor > #0|#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' > {quote} > I find this confusing. It would be better to log *only* if an > URLExemptionFilter implementation is actually configured to be used at > runtime. > I will provide a patch for this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters
[ https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-3041: Description: URLExemptionFilter implementations are used to allow exemptions to external domain resources by overriding the {{db.ignore.external.links}} configuration setting. This is useful when the crawl is focused to a domain but resources like images are hosted on CDN. Currently [URLExemptionFilters|#L47-L48]] provides the following logging {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0|#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' {quote} I find this confusing. It would be better to log *only* if an URLExemptionFilter implementation is actually configured to be used at runtime. I will provide a patch for this. was: URLExemptionFilter implementations are used to allow exemptions to external domain resources by overriding the {{db.ignore.external.links}} configuration setting. This is useful when the crawl is focused to a domain but resources like images are hosted on CDN. Currently [URLExemptionFilters|#L47-L48]] provides the following logging {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0|#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' {quote} I find this confusing. It would be better to log *only* if an URLExemptionFilter implementation actually exists for a given URL. I will provide a patch for this. 
> Address confusing logging in o.a.n.net.URLExemptionFilters > --- > > Key: NUTCH-3041 > URL: https://issues.apache.org/jira/browse/NUTCH-3041 > Project: Nutch > Issue Type: Task > Components: net >Affects Versions: 1.19, 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > URLExemptionFilter implementations are used to allow exemptions to external > domain resources by overriding the {{db.ignore.external.links}} configuration > setting. 
This is useful when the crawl is focused to a domain but resources > like images are hosted on CDN. > Currently [URLExemptionFilters|#L47-L48]] provides the following logging > {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor > #0|#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' > {quote} > I find this confusing. It would be better to log *only* if an > URLExemptionFilter implementation is actually configured to be used at > runtime. > I will provide a patch for this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters
[ https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-3041: Description: URLExemptionFilter implementations are used to allow exemptions to external domain resources by overriding the {{db.ignore.external.links}} configuration setting. This is useful when the crawl is focused to a domain but resources like images are hosted on CDN. Currently [URLExemptionFilters|#L47-L48]] provides the following logging {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0|#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' {quote} I find this confusing. It would be better to log *only* if an URLExemptionFilter implementation actually exists for a given URL. I will provide a patch for this. was: URLExemptionFilter implementations are used to allow exemptions to external domain resources by overriding the {{db.ignore.external.links}} configuration setting. This is useful when the crawl is focused to a domain but resources like images are hosted on CDN. Currently [URLExemptionFilters|[https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]] provides some confusing INFO-level logging {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' {quote} I find this confusing. It would be better to log *only* if an URLExemptionFilter implementation actually exists for a given URL. I will provide a patch for this. 
> Address confusing logging in o.a.n.net.URLExemptionFilters > --- > > Key: NUTCH-3041 > URL: https://issues.apache.org/jira/browse/NUTCH-3041 > Project: Nutch > Issue Type: Task > Components: net >Affects Versions: 1.19, 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > URLExemptionFilter implementations are used to allow exemptions to external > domain resources by overriding the {{db.ignore.external.links}} configuration > setting. This is useful when the crawl is focused to a domain but resources > like images are hosted on CDN. > Currently [URLExemptionFilters|#L47-L48]] provides the following logging > {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor > #0|#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' > {quote} > I find this confusing. It would be better to log *only* if an > URLExemptionFilter implementation actually exists for a given URL. > I will provide a patch for this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters
Lewis John McGibbney created NUTCH-3041: --- Summary: Address confusing logging in o.a.n.net.URLExemptionFilters Key: NUTCH-3041 URL: https://issues.apache.org/jira/browse/NUTCH-3041 Project: Nutch Issue Type: Task Components: net Affects Versions: 1.19, 1.20 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.21 URLExemptionFilter implementations are used to allow exemptions to external domain resources by overriding the {{db.ignore.external.links}} configuration setting. This is useful when the crawl is focused to a domain but resources like images are hosted on CDN. Currently [URLExemptionFilters|[https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]] provides some confusing INFO-level logging {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' {quote} I find this confusing. It would be better to log *only* if an URLExemptionFilter implementation actually exists for a given URL. I will provide a patch for this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3040) Upgrade to Hadoop 3.4.0
[ https://issues.apache.org/jira/browse/NUTCH-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836191#comment-17836191 ] Tim Allison commented on NUTCH-3040: :cry-sob: This is great news! > Upgrade to Hadoop 3.4.0 > --- > > Key: NUTCH-3040 > URL: https://issues.apache.org/jira/browse/NUTCH-3040 > Project: Nutch > Issue Type: Improvement > Components: build >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Priority: Major > Fix For: 1.21 > > > [Hadoop 3.4.0|https://hadoop.apache.org/release/3.4.0.html] has been released. > Many dependencies are upgraded, including commons-io 2.14.0 which would have > saved us a lot of work in NUTCH-2959. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3040) Upgrade to Hadoop 3.4.0
Sebastian Nagel created NUTCH-3040: -- Summary: Upgrade to Hadoop 3.4.0 Key: NUTCH-3040 URL: https://issues.apache.org/jira/browse/NUTCH-3040 Project: Nutch Issue Type: Improvement Components: build Affects Versions: 1.20 Reporter: Sebastian Nagel Fix For: 1.21 [Hadoop 3.4.0|https://hadoop.apache.org/release/3.4.0.html] has been released. Many dependencies are upgraded, including commons-io 2.14.0 which would have saved us a lot of work in NUTCH-2959. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3039) Failure to handle ftp:// URLs
[ https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836133#comment-17836133 ] Markus Jelsma commented on NUTCH-3039: -- Thanks for spotting that! > Failure to handle ftp:// URLs > - > > Key: NUTCH-3039 > URL: https://issues.apache.org/jira/browse/NUTCH-3039 > Project: Nutch > Issue Type: Bug > Components: plugin, protocol >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.21 > > > Nutch fails to handle ftp:// URLs: > - URLNormalizerBasic returns the empty string because creating the URL > instance fails with a MalformedURLException: > {noformat} > echo "ftp://ftp.example.com/path/file.txt" \ > | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat} > - fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due > to a MalformedURLException: > {noformat} > bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \ >"ftp://ftp.example.com/path/file.txt" > ... > Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: > java.net.MalformedURLException > at > org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113) > ...{noformat} > The issue is caused by NUTCH-2429: > - we do not provide a dedicated URL stream handler for ftp URLs > - but also do not pass ftp:// URLs to the standard JVM handler -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3039) Failure to handle ftp:// URLs
[ https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836126#comment-17836126 ] ASF GitHub Bot commented on NUTCH-3039: --- sebastian-nagel opened a new pull request, #812: URL: https://github.com/apache/nutch/pull/812 Pass ftp:// URLs to the standard JVM URLStreamHandler > Failure to handle ftp:// URLs > - > > Key: NUTCH-3039 > URL: https://issues.apache.org/jira/browse/NUTCH-3039 > Project: Nutch > Issue Type: Bug > Components: plugin, protocol >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Priority: Major > Fix For: 1.21 > > > Nutch fails to handle ftp:// URLs: > - URLNormalizerBasic returns the empty string because creating the URL > instance fails with a MalformedURLException: > {noformat} > echo "ftp://ftp.example.com/path/file.txt" \ > | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat} > - fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due > to a MalformedURLException: > {noformat} > bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \ >"ftp://ftp.example.com/path/file.txt" > ... > Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: > java.net.MalformedURLException > at > org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113) > ...{noformat} > The issue is caused by NUTCH-2429: > - we do not provide a dedicated URL stream handler for ftp URLs > - but also do not pass ftp:// URLs to the standard JVM handler -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (NUTCH-3039) Failure to handle ftp:// URLs
[ https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-3039: -- Assignee: Sebastian Nagel > Failure to handle ftp:// URLs > - > > Key: NUTCH-3039 > URL: https://issues.apache.org/jira/browse/NUTCH-3039 > Project: Nutch > Issue Type: Bug > Components: plugin, protocol >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.21 > > > Nutch fails to handle ftp:// URLs: > - URLNormalizerBasic returns the empty string because creating the URL > instance fails with a MalformedURLException: > {noformat} > echo "ftp://ftp.example.com/path/file.txt" \ > | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat} > - fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due > to a MalformedURLException: > {noformat} > bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \ >"ftp://ftp.example.com/path/file.txt" > ... > Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: > java.net.MalformedURLException > at > org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113) > ...{noformat} > The issue is caused by NUTCH-2429: > - we do not provide a dedicated URL stream handler for ftp URLs > - but also do not pass ftp:// URLs to the standard JVM handler -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3039) Failure to handle ftp:// URLs
Sebastian Nagel created NUTCH-3039: -- Summary: Failure to handle ftp:// URLs Key: NUTCH-3039 URL: https://issues.apache.org/jira/browse/NUTCH-3039 Project: Nutch Issue Type: Bug Components: plugin, protocol Affects Versions: 1.19 Reporter: Sebastian Nagel Fix For: 1.21 Nutch fails to handle ftp:// URLs: - URLNormalizerBasic returns the empty string because creating the URL instance fails with a MalformedURLException: {noformat} echo "ftp://ftp.example.com/path/file.txt" \ | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat} - fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due to a MalformedURLException: {noformat} bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \ "ftp://ftp.example.com/path/file.txt" ... Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113) ...{noformat} The issue is caused by NUTCH-2429: - we do not provide a dedicated URL stream handler for ftp URLs - but also do not pass ftp:// URLs to the standard JVM handler -- This message was sent by Atlassian Jira (v8.20.10#820010)
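For reference, a small sketch of why passing ftp:// URLs on to the standard JVM handler should be enough. It assumes a stock JVM with no custom URLStreamHandlerFactory installed (i.e. outside of Nutch); the class FtpUrlCheck and the method protocolOf are made up for illustration.

```java
import java.net.MalformedURLException;
import java.net.URI;

// Illustrative only: checks whether the running JVM can build a URL
// instance for a given string. Not part of Nutch.
public class FtpUrlCheck {

    /** Returns the URL's protocol, or null if the JVM rejects the URL. */
    public static String protocolOf(String url) {
        try {
            // toURL() looks up a stream handler for the scheme and throws
            // MalformedURLException if none is available.
            return URI.create(url).toURL().getProtocol();
        } catch (MalformedURLException | IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // The JDK ships a built-in ftp handler, so this is accepted.
        System.out.println(protocolOf("ftp://ftp.example.com/path/file.txt"));
        // A scheme without any handler is rejected.
        System.out.println(protocolOf("foo://example.com/"));
    }
}
```

The same check run inside Nutch 1.19 would presumably fail for the ftp URL, because the handler lookup no longer reaches the JVM default, as described in the issue.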
[jira] [Commented] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun
[ https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835083#comment-17835083 ] Hudson commented on NUTCH-3038: --- SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #157 (See [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/157/]) NUTCH-3038 Address issues discovered during 1.20 release management dryrun (#811) (github: [https://github.com/apache/nutch/commit/271f92e11c39b7a3583cfcd8d664262cfac59674]) * (edit) ivy/mvn.template * (add) CHANGES.md * (delete) CHANGES.txt * (edit) build.xml * (edit) docker/Dockerfile * (edit) docker/README.md > Address issues discovered during 1.20 release management dryrun > --- > > Key: NUTCH-3038 > URL: https://issues.apache.org/jira/browse/NUTCH-3038 > Project: Nutch > Issue Type: Task > Components: build, docker >Affects Versions: 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Blocker > Fix For: 1.20 > > > During the 1.20 release management dryrun I discovered the following issues > which I think should be addressed in order to be satisfied with the release > candidate > # Update docker/README to remove broken badge > # Upgrade alpine base image in docker/Dockerfile > # Migrate CHANGES.txt to CHANGES.md > # Upgrade apache parent pom version from 23 to 31 > # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml > # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in > ivy/mvn.template > # Remove miredot plugin usage from ivy/mvn.template -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun
[ https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835078#comment-17835078 ] ASF GitHub Bot commented on NUTCH-3038: --- lewismc merged PR #811: URL: https://github.com/apache/nutch/pull/811 > Address issues discovered during 1.20 release management dryrun > --- > > Key: NUTCH-3038 > URL: https://issues.apache.org/jira/browse/NUTCH-3038 > Project: Nutch > Issue Type: Task > Components: build, docker >Affects Versions: 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Blocker > Fix For: 1.20 > > > During the 1.20 release management dryrun I discovered the following issues > which I think should be addressed in order to be satisfied with the release > candidate > # Update docker/README to remove broken badge > # Upgrade alpine base image in docker/Dockerfile > # Migrate CHANGES.txt to CHANGES.md > # Upgrade apache parent pom version from 23 to 31 > # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml > # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in > ivy/mvn.template > # Remove miredot plugin usage from ivy/mvn.template -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun
[ https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-3038. - Resolution: Fixed > Address issues discovered during 1.20 release management dryrun > --- > > Key: NUTCH-3038 > URL: https://issues.apache.org/jira/browse/NUTCH-3038 > Project: Nutch > Issue Type: Task > Components: build, docker >Affects Versions: 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Blocker > Fix For: 1.20 > > > During the 1.20 release management dryrun I discovered the following issues > which I think should be addressed in order to be satisfied with the release > candidate > # Update docker/README to remove broken badge > # Upgrade alpine base image in docker/Dockerfile > # Migrate CHANGES.txt to CHANGES.md > # Upgrade apache parent pom version from 23 to 31 > # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml > # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in > ivy/mvn.template > # Remove miredot plugin usage from ivy/mvn.template -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun
[ https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-3038. --- Thanks [~snagel] > Address issues discovered during 1.20 release management dryrun > --- > > Key: NUTCH-3038 > URL: https://issues.apache.org/jira/browse/NUTCH-3038 > Project: Nutch > Issue Type: Task > Components: build, docker >Affects Versions: 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Blocker > Fix For: 1.20 > > > During the 1.20 release management dryrun I discovered the following issues > which I think should be addressed in order to be satisfied with the release > candidate > # Update docker/README to remove broken badge > # Upgrade alpine base image in docker/Dockerfile > # Migrate CHANGES.txt to CHANGES.md > # Upgrade apache parent pom version from 23 to 31 > # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml > # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in > ivy/mvn.template > # Remove miredot plugin usage from ivy/mvn.template -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work stopped] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun
[ https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3038 stopped by Lewis John McGibbney. --- > Address issues discovered during 1.20 release management dryrun > --- > > Key: NUTCH-3038 > URL: https://issues.apache.org/jira/browse/NUTCH-3038 > Project: Nutch > Issue Type: Task > Components: build, docker >Affects Versions: 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Blocker > Fix For: 1.20 > > > During the 1.20 release management dryrun I discovered the following issues > which I think should be addressed in order to be satisfied with the release > candidate > # Update docker/README to remove broken badge > # Upgrade alpine base image in docker/Dockerfile > # Migrate CHANGES.txt to CHANGES.md > # Upgrade apache parent pom version from 23 to 31 > # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml > # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in > ivy/mvn.template > # Remove miredot plugin usage from ivy/mvn.template -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode
[ https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834532#comment-17834532 ] Tim Allison commented on NUTCH-2937: I really, really, really wish we didn't have to do this! :P Happy to help! > parse-tika: review dependency exclusions and avoid dependency conflicts in > distributed mode > --- > > Key: NUTCH-2937 > URL: https://issues.apache.org/jira/browse/NUTCH-2937 > Project: Nutch > Issue Type: Bug > Components: parser, plugin >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Tim Allison >Priority: Major > Fix For: 1.20 > > > While testing NUTCH-2919 I've seen the following error caused by a > conflicting dependency to commons-io: > - 2.11.0 Nutch core > - 2.11.0 parse-tika (excluded to avoid duplicated dependencies) > - 2.5 provided by Hadoop > This causes errors parsing some office and other documents (but not all), for > example: > {noformat} > 2022-01-15 01:36:31,365 WARN [FetcherThread] > org.apache.nutch.parse.ParseUtil: Error parsing > http://kurskrun.ru/privacypolicy with org.apache.nutch.parse.tika.TikaParser > java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: > 'org.apache.commons.io.input.CloseShieldInputStream > org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)' > at > java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205) > at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188) > at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92) > at > org.apache.nutch.fetcher.FetcherThread.output(FetcherThread.java:715) > at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:431) > Caused by: java.lang.NoSuchMethodError: > 'org.apache.commons.io.input.CloseShieldInputStream > org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)' > at > 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120) > at > org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) > at > org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151) > at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90) > at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34) > at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode
[ https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2937. Resolution: Fixed Fixed NUTCH-2959 by using the shaded Tika package. Thanks, [~tallison]! > parse-tika: review dependency exclusions and avoid dependency conflicts in > distributed mode > --- > > Key: NUTCH-2937 > URL: https://issues.apache.org/jira/browse/NUTCH-2937 > Project: Nutch > Issue Type: Bug > Components: parser, plugin >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > While testing NUTCH-2919 I've seen the following error caused by a > conflicting dependency to commons-io: > - 2.11.0 Nutch core > - 2.11.0 parse-tika (excluded to avoid duplicated dependencies) > - 2.5 provided by Hadoop > This causes errors parsing some office and other documents (but not all), for > example: > {noformat} > 2022-01-15 01:36:31,365 WARN [FetcherThread] > org.apache.nutch.parse.ParseUtil: Error parsing > http://kurskrun.ru/privacypolicy with org.apache.nutch.parse.tika.TikaParser > java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: > 'org.apache.commons.io.input.CloseShieldInputStream > org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)' > at > java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205) > at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188) > at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92) > at > org.apache.nutch.fetcher.FetcherThread.output(FetcherThread.java:715) > at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:431) > Caused by: java.lang.NoSuchMethodError: > 'org.apache.commons.io.input.CloseShieldInputStream > org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)' > at > org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120) > 
at > org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) > at > org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151) > at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90) > at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34) > at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode
[ https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2937: -- Assignee: Tim Allison > parse-tika: review dependency exclusions and avoid dependency conflicts in > distributed mode > --- > > Key: NUTCH-2937 > URL: https://issues.apache.org/jira/browse/NUTCH-2937 > Project: Nutch > Issue Type: Bug > Components: parser, plugin >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Tim Allison >Priority: Major > Fix For: 1.20 > > > While testing NUTCH-2919 I've seen the following error caused by a > conflicting dependency to commons-io: > - 2.11.0 Nutch core > - 2.11.0 parse-tika (excluded to avoid duplicated dependencies) > - 2.5 provided by Hadoop > This causes errors parsing some office and other documents (but not all), for > example: > {noformat} > 2022-01-15 01:36:31,365 WARN [FetcherThread] > org.apache.nutch.parse.ParseUtil: Error parsing > http://kurskrun.ru/privacypolicy with org.apache.nutch.parse.tika.TikaParser > java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: > 'org.apache.commons.io.input.CloseShieldInputStream > org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)' > at > java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205) > at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188) > at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92) > at > org.apache.nutch.fetcher.FetcherThread.output(FetcherThread.java:715) > at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:431) > Caused by: java.lang.NoSuchMethodError: > 'org.apache.commons.io.input.CloseShieldInputStream > org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)' > at > org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120) > at > 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) > at > org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151) > at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90) > at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34) > at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode
[ https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2937: --- Fix Version/s: 1.20 (was: 1.21) > parse-tika: review dependency exclusions and avoid dependency conflicts in > distributed mode > --- > > Key: NUTCH-2937 > URL: https://issues.apache.org/jira/browse/NUTCH-2937 > Project: Nutch > Issue Type: Bug > Components: parser, plugin >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > While testing NUTCH-2919 I've seen the following error caused by a > conflicting dependency to commons-io: > - 2.11.0 Nutch core > - 2.11.0 parse-tika (excluded to avoid duplicated dependencies) > - 2.5 provided by Hadoop > This causes errors parsing some office and other documents (but not all), for > example: > {noformat} > 2022-01-15 01:36:31,365 WARN [FetcherThread] > org.apache.nutch.parse.ParseUtil: Error parsing > http://kurskrun.ru/privacypolicy with org.apache.nutch.parse.tika.TikaParser > java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: > 'org.apache.commons.io.input.CloseShieldInputStream > org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)' > at > java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205) > at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188) > at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92) > at > org.apache.nutch.fetcher.FetcherThread.output(FetcherThread.java:715) > at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:431) > Caused by: java.lang.NoSuchMethodError: > 'org.apache.commons.io.input.CloseShieldInputStream > org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)' > at > org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120) > at > 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) > at > org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151) > at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90) > at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34) > at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
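When chasing a NoSuchMethodError like the one quoted above, it can help to print which jar a class was actually loaded from. A minimal sketch using only the JDK; the class WhichJar and the method locationOf are hypothetical names, and the class name to inspect is passed on the command line.

```java
import java.security.CodeSource;

// Diagnostic sketch, not part of Nutch: reports where a class was loaded from.
public class WhichJar {

    /** Returns the code source location of a class, or a placeholder for JDK runtime classes. */
    public static String locationOf(Class<?> clazz) {
        CodeSource src = clazz.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "(bootstrap or JDK runtime)"; // core JDK classes have no code source
        }
        return src.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Example argument: org.apache.commons.io.input.CloseShieldInputStream
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " <- " + locationOf(Class.forName(name)));
    }
}
```

Run with the same classpath as the failing task, this should show whether the commons-io classes come from the Nutch job file or from the Hadoop installation.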
[jira] [Resolved] (NUTCH-3005) Upgrade selenium as needed
[ https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3005. Resolution: Implemented Done by [~lewismc] as part of NUTCH-3036, commit [1563396|https://github.com/apache/nutch/blob/1563396d952393462fffab1f686e9ffd5d006cf6/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java#L151] . > Upgrade selenium as needed > -- > > Key: NUTCH-3005 > URL: https://issues.apache.org/jira/browse/NUTCH-3005 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Tim Allison >Priority: Trivial > Fix For: 1.20 > > > When we choose to upgrade selenium, we should take note of this blog about > changes in headless chromium: > https://www.selenium.dev/blog/2023/headless-is-going-away/ > ChromeOptions options = new ChromeOptions(); > options.addArguments("--headless=new"); > WebDriver driver = new ChromeDriver(options); > driver.get("https://selenium.dev"); > driver.quit(); -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3016) Upgrade Apache Ivy to 2.5.2
[ https://issues.apache.org/jira/browse/NUTCH-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3016. Resolution: Duplicate > Upgrade Apache Ivy to 2.5.2 > --- > > Key: NUTCH-3016 > URL: https://issues.apache.org/jira/browse/NUTCH-3016 > Project: Nutch > Issue Type: Task > Components: build, ivy >Affects Versions: 1.19 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > [Apache Ivy > v2.5.2|https://ant.apache.org/ivy/history/2.5.2/release-notes.html] was > released on August 20 2023! > We should upgrade. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3016) Upgrade Apache Ivy to 2.5.2
[ https://issues.apache.org/jira/browse/NUTCH-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3016: --- Fix Version/s: 1.20 (was: 1.21) > Upgrade Apache Ivy to 2.5.2 > --- > > Key: NUTCH-3016 > URL: https://issues.apache.org/jira/browse/NUTCH-3016 > Project: Nutch > Issue Type: Task > Components: build, ivy >Affects Versions: 1.19 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > [Apache Ivy > v2.5.2|https://ant.apache.org/ivy/history/2.5.2/release-notes.html] was > released on August 20 2023! > We should upgrade. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3005) Upgrade selenium as needed
[ https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3005: --- Affects Version/s: 1.19 > Upgrade selenium as needed > -- > > Key: NUTCH-3005 > URL: https://issues.apache.org/jira/browse/NUTCH-3005 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Tim Allison >Priority: Trivial > > When we choose to upgrade selenium, we should take note of this blog about > changes in headless chromium: > https://www.selenium.dev/blog/2023/headless-is-going-away/ > ChromeOptions options = new ChromeOptions(); > options.addArguments("--headless=new"); > WebDriver driver = new ChromeDriver(options); > driver.get("https://selenium.dev"); > driver.quit(); -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3005) Upgrade selenium as needed
[ https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3005: --- Fix Version/s: 1.20 > Upgrade selenium as needed > -- > > Key: NUTCH-3005 > URL: https://issues.apache.org/jira/browse/NUTCH-3005 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Tim Allison >Priority: Trivial > Fix For: 1.20 > > > When we choose to upgrade selenium, we should take note of this blog about > changes in headless chromium: > https://www.selenium.dev/blog/2023/headless-is-going-away/ > ChromeOptions options = new ChromeOptions(); > options.addArguments("--headless=new"); > WebDriver driver = new ChromeDriver(options); > driver.get("https://selenium.dev"); > driver.quit(); -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3028: --- Affects Version/s: 1.19 > WARCExported to support filtering by JEXL > - > > Key: NUTCH-3028 > URL: https://issues.apache.org/jira/browse/NUTCH-3028 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch > > > Filtering segment data to WARC is now possible using JEXL expressions. In the > next example, all records with SOME_KEY=SOME_VALUE in their parseData > metadata are exported to WARC. > {color:#00}-expr > 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'{color} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3028: --- Fix Version/s: 1.21 > WARCExported to support filtering by JEXL > - > > Key: NUTCH-3028 > URL: https://issues.apache.org/jira/browse/NUTCH-3028 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.21 > > Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch > > > Filtering segment data to WARC is now possible using JEXL expressions. In the > next example, all records with SOME_KEY=SOME_VALUE in their parseData > metadata are exported to WARC. > {color:#00}-expr > 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'{color} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun
[ https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834481#comment-17834481 ] ASF GitHub Bot commented on NUTCH-3038: --- lewismc opened a new pull request, #811: URL: https://github.com/apache/nutch/pull/811 PR for https://issues.apache.org/jira/browse/NUTCH-3038 > Address issues discovered during 1.20 release management dryrun > --- > > Key: NUTCH-3038 > URL: https://issues.apache.org/jira/browse/NUTCH-3038 > Project: Nutch > Issue Type: Task > Components: build, docker >Affects Versions: 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Blocker > Fix For: 1.20 > > > During the 1.20 release management dryrun I discovered the following issues > which I think should be addressed in order to be satisfied with the release > candidate > # Update docker/README to remove broken badge > # Upgrade alpine base image in docker/Dockerfile > # Migrate CHANGES.txt to CHANGES.md > # Upgrade apache parent pom version from 23 to 31 > # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml > # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in > ivy/mvn.template > # Remove miredot plugin usage from ivy/mvn.template -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work started] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun
[ https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3038 started by Lewis John McGibbney. --- > Address issues discovered during 1.20 release management dryrun > --- > > Key: NUTCH-3038 > URL: https://issues.apache.org/jira/browse/NUTCH-3038 > Project: Nutch > Issue Type: Task > Components: build, docker >Affects Versions: 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Blocker > Fix For: 1.20 > > > During the 1.20 release management dryrun I discovered the following issues > which I think should be addressed in order to be satisfied with the release > candidate > # Update docker/README to remove broken badge > # Upgrade alpine base image in docker/Dockerfile > # Migrate CHANGES.txt to CHANGES.md > # Upgrade apache parent pom version from 23 to 31 > # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml > # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in > ivy/mvn.template > # Remove miredot plugin usage from ivy/mvn.template -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun
[ https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-3038: Description: During the 1.20 release management dryrun I discovered the following issues which I think should be addressed in order to be satisfied with the release candidate # Update docker/README to remove broken badge # Upgrade alpine base image in docker/Dockerfile # Migrate CHANGES.txt to CHANGES.md # Upgrade apache parent pom version from 23 to 31 # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in ivy/mvn.template # Remove miredot plugin usage from ivy/mvn.template was: During the 1.20 release management dryrun I discovered the following issues which I think should be addressed in order to be satisfied with the release candidate # Update docker/README to remove broken badge # Upgrade alpine base image in docker/Dockerfile # Migrate CHANGES.txt to CHANGES.md # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in ivy/mvn.template # Remove miredot plugin usage from ivy/mvn.template > Address issues discovered during 1.20 release management dryrun > --- > > Key: NUTCH-3038 > URL: https://issues.apache.org/jira/browse/NUTCH-3038 > Project: Nutch > Issue Type: Task > Components: build, docker >Affects Versions: 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Blocker > Fix For: 1.20 > > > During the 1.20 release management dryrun I discovered the following issues > which I think should be addressed in order to be satisfied with the release > candidate > # Update docker/README to remove broken badge > # Upgrade alpine base image in docker/Dockerfile > # Migrate CHANGES.txt to CHANGES.md > # Upgrade apache parent pom version from 23 to 31 > # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml > # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in > ivy/mvn.template > # Remove miredot plugin usage from ivy/mvn.template -- This message was sent by Atlassian Jira (v8.20.10#820010)