[jira] [Resolved] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-05-14 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3043.

Resolution: Implemented

> Generator: count URLs rejected by URL filters
> -
>
> Key: NUTCH-3043
> URL: https://issues.apache.org/jira/browse/NUTCH-3043
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> Generator already counts URLs rejected by the (re)fetch scheduler, by fetch 
> interval or status. It should also count the number of URLs rejected by URL 
> filters.
> See also [Generator 
> metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator].
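
A minimal sketch of how such a counter could be wired in, assuming Nutch's
URLFilters.filter(String) API and a Hadoop Counter passed in by the caller; the
helper class and the counter semantics are made up for illustration, not the
committed change:
{noformat}
import org.apache.hadoop.mapreduce.Counter;
import org.apache.nutch.net.URLFilters;

// Hypothetical helper: run the configured URL filters and increment a
// dedicated counter when a URL is rejected, in analogy to the existing
// counters for rejections by schedule, interval, or status.
public class GeneratorFilterCounterSketch {

  /** Returns true if the URL passes the filters, false (and counts) otherwise. */
  public static boolean accept(String url, URLFilters filters, Counter rejectedByFilters) {
    try {
      if (filters.filter(url) != null) {
        return true;
      }
    } catch (Exception e) {
      // in this sketch, filter errors are counted as rejections, too
    }
    rejectedByFilters.increment(1);
    return false;
  }
}
{noformat}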



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-3039) Failure to handle ftp:// URLs

2024-05-14 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3039.

Resolution: Fixed

> Failure to handle ftp:// URLs
> -
>
> Key: NUTCH-3039
> URL: https://issues.apache.org/jira/browse/NUTCH-3039
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> Nutch fails to handle ftp:// URLs:
> - URLNormalizerBasic returns the empty string because creating the URL 
> instance fails with a MalformedURLException:
>   {noformat}
> echo "ftp://ftp.example.com/path/file.txt; \
>   | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat}
> - fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due 
> to a MalformedURLException:
>   {noformat}
> bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \
>"ftp://ftp.example.com/path/file.txt;
> ...
> Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: 
> java.net.MalformedURLException
> at 
> org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)
> ...{noformat}
> The issue is caused by NUTCH-2429:
> - we do not provide a dedicated URL stream handler for ftp URLs
> - but also do not pass ftp:// URLs to the standard JVM handler



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3055) README: fix Github "hub" commands

2024-04-30 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3055:
--

 Summary: README: fix Github "hub" commands
 Key: NUTCH-3055
 URL: https://issues.apache.org/jira/browse/NUTCH-3055
 Project: Nutch
  Issue Type: Bug
  Components: documentation
Affects Versions: 1.20
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.21


The [README.md|https://github.com/apache/nutch/blob/master/README.md] contains 
[Github hub|https://hub.github.com/] commands, but with "git" as the command 
(executable) name, presumably relying on an alias or some other magic. However, 
if hub isn't installed, these commands fail with {{git: 'pull-request' is not a 
git command. See 'git --help'.}} or similar.

We should use the command "hub" instead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842291#comment-17842291
 ] 

Sebastian Nagel commented on NUTCH-3028:


+1 lgtm.

One question: if there is no parseData, the JEXL expression is not evaluated. 
Since WARC files may include only the raw HTML plus fetch/capture metadata, 
successfully parsing a document is not a requirement for archiving it in a WARC 
file. It might be useful to make the JEXL filtering available for unparsed 
documents as well.

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3045) Upgrade from Java 11 to 17

2024-04-30 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842284#comment-17842284
 ] 

Sebastian Nagel commented on NUTCH-3045:


See also NUTCH-2987. Until HADOOP-17177 / HADOOP-18887 are done, we might be 
forced to maintain JDK 11 runtime compatibility, so that Nutch runs on recent 
Hadoop versions and distributions. I fully agree that Java 17 offers some nice 
syntax improvements, though. :)

> Upgrade from Java 11 to 17
> --
>
> Key: NUTCH-3045
> URL: https://issues.apache.org/jira/browse/NUTCH-3045
> Project: Nutch
>  Issue Type: Task
>  Components: build, ci/cd
>Reporter: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.21
>
>
> This parent issue will track and organize work pertaining to upgrading Nutch 
> to JDK 17.
> Premier support for Oracle JDK 11 ended 7 months ago (30 Sep 2023).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails

2024-04-25 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3044:
--

 Summary: Generator: NPE when extracting the host part of a URL 
fails
 Key: NUTCH-3044
 URL: https://issues.apache.org/jira/browse/NUTCH-3044
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.20
Reporter: Sebastian Nagel
 Fix For: 1.21


When extracting the host part of a URL fails, the Generator job fails because 
of an NPE in the SelectorReducer. This issue is reproducible if the CrawlDb 
contains a malformed URL, for example, a URL with an unsupported scheme 
(smb://).

{noformat}
Caused by: java.lang.NullPointerException
  at org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:439)
  at org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:300)
{noformat}
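
A minimal standalone sketch of the defensive host extraction such a fix needs
(not the actual SelectorReducer code; the class and method names are made up):
{noformat}
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch: return null instead of letting a malformed URL
// (e.g. an unsupported smb:// scheme) propagate as an NPE; callers can then
// skip and/or count such records.
public class HostExtractionSketch {

  public static String hostOrNull(String urlString) {
    try {
      String host = new URL(urlString).getHost();
      return (host == null || host.isEmpty()) ? null : host;
    } catch (MalformedURLException e) {
      return null; // unsupported scheme or otherwise malformed URL
    }
  }

  public static void main(String[] args) {
    System.out.println(hostOrNull("https://example.com/page")); // example.com
    System.out.println(hostOrNull("smb://fileserver/share"));   // null
  }
}
{noformat}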



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-04-25 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3043:
--

 Summary: Generator: count URLs rejected by URL filters
 Key: NUTCH-3043
 URL: https://issues.apache.org/jira/browse/NUTCH-3043
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Affects Versions: 1.20
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.21


Generator already counts URLs rejected by the (re)fetch scheduler, by fetch 
interval or status. It should also count the number of URLs rejected by URL 
filters.

See also [Generator 
metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3040) Upgrade to Hadoop 3.4.0

2024-04-11 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3040:
--

 Summary: Upgrade to Hadoop 3.4.0
 Key: NUTCH-3040
 URL: https://issues.apache.org/jira/browse/NUTCH-3040
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.20
Reporter: Sebastian Nagel
 Fix For: 1.21


[Hadoop 3.4.0|https://hadoop.apache.org/release/3.4.0.html] has been released.

Many dependencies are upgraded, including commons-io 2.14.0 which would have 
saved us a lot of work in NUTCH-2959.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (NUTCH-3039) Failure to handle ftp:// URLs

2024-04-11 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-3039:
--

Assignee: Sebastian Nagel

> Failure to handle ftp:// URLs
> -
>
> Key: NUTCH-3039
> URL: https://issues.apache.org/jira/browse/NUTCH-3039
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> Nutch fails to handle ftp:// URLs:
> - URLNormalizerBasic returns the empty string because creating the URL 
> instance fails with a MalformedURLException:
>   {noformat}
> echo "ftp://ftp.example.com/path/file.txt; \
>   | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat}
> - fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due 
> to a MalformedURLException:
>   {noformat}
> bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \
>"ftp://ftp.example.com/path/file.txt;
> ...
> Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: 
> java.net.MalformedURLException
> at 
> org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)
> ...{noformat}
> The issue is caused by NUTCH-2429:
> - we do not provide a dedicated URL stream handler for ftp URLs
> - but also do not pass ftp:// URLs to the standard JVM handler



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3039) Failure to handle ftp:// URLs

2024-04-11 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3039:
--

 Summary: Failure to handle ftp:// URLs
 Key: NUTCH-3039
 URL: https://issues.apache.org/jira/browse/NUTCH-3039
 Project: Nutch
  Issue Type: Bug
  Components: plugin, protocol
Affects Versions: 1.19
Reporter: Sebastian Nagel
 Fix For: 1.21


Nutch fails to handle ftp:// URLs:
- URLNormalizerBasic returns the empty string because creating the URL instance 
fails with a MalformedURLException:
  {noformat}
echo "ftp://ftp.example.com/path/file.txt; \
  | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat}
- fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due to 
a MalformedURLException:
  {noformat}
bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \
   "ftp://ftp.example.com/path/file.txt"
...
Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: 
java.net.MalformedURLException
at 
org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)
...{noformat}


The issue is caused by NUTCH-2429:
- we do not provide a dedicated URL stream handler for ftp URLs
- but also do not pass ftp:// URLs to the standard JVM handler
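
For reference, a standalone check (run with a stock JDK, outside Nutch and its
custom URLStreamHandlerFactory) showing that the JVM's built-in handlers do
accept ftp:// URLs, so the MalformedURLException above stems from the missing
fallback, not from the URL itself:
{noformat}
import java.net.MalformedURLException;
import java.net.URL;

// Standalone check: with the default JVM URL stream handlers, ftp:// URLs
// parse fine; the exceptions reported above only occur once a custom
// URLStreamHandlerFactory is installed without a fallback for ftp.
public class FtpUrlCheck {
  public static void main(String[] args) throws MalformedURLException {
    URL url = new URL("ftp://ftp.example.com/path/file.txt");
    // prints: ftp ftp.example.com /path/file.txt
    System.out.println(url.getProtocol() + " " + url.getHost() + " " + url.getPath());
  }
}
{noformat}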



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2937.

Resolution: Fixed

Fixed via NUTCH-2959 by using the shaded Tika package. Thanks, [~tallison]!

> parse-tika: review dependency exclusions and avoid dependency conflicts in 
> distributed mode
> ---
>
> Key: NUTCH-2937
> URL: https://issues.apache.org/jira/browse/NUTCH-2937
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, plugin
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> While testing NUTCH-2919 I've seen the following error, caused by a 
> conflicting dependency on commons-io:
> - 2.11.0 Nutch core
> - 2.11.0 parse-tika (excluded to avoid duplicated dependencies)
> - 2.5 provided by Hadoop
> This causes errors parsing some office and other documents (but not all), for 
> example:
> {noformat}
> 2022-01-15 01:36:31,365 WARN [FetcherThread] 
> org.apache.nutch.parse.ParseUtil: Error parsing 
> http://kurskrun.ru/privacypolicy with org.apache.nutch.parse.tika.TikaParser
> java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 
> 'org.apache.commons.io.input.CloseShieldInputStream 
> org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
> at 
> java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
> at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
> at 
> org.apache.nutch.fetcher.FetcherThread.output(FetcherThread.java:715)
> at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:431)
> Caused by: java.lang.NoSuchMethodError: 
> 'org.apache.commons.io.input.CloseShieldInputStream 
> org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120)
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
> at 
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151)
> at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2937:
--

Assignee: Tim Allison

> parse-tika: review dependency exclusions and avoid dependency conflicts in 
> distributed mode
> ---
>
> Key: NUTCH-2937
> URL: https://issues.apache.org/jira/browse/NUTCH-2937
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, plugin
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Tim Allison
>Priority: Major
> Fix For: 1.20
>
>
> While testing NUTCH-2919 I've seen the following error, caused by a 
> conflicting dependency on commons-io:
> - 2.11.0 Nutch core
> - 2.11.0 parse-tika (excluded to avoid duplicated dependencies)
> - 2.5 provided by Hadoop
> This causes errors parsing some office and other documents (but not all), for 
> example:
> {noformat}
> 2022-01-15 01:36:31,365 WARN [FetcherThread] 
> org.apache.nutch.parse.ParseUtil: Error parsing 
> http://kurskrun.ru/privacypolicy with org.apache.nutch.parse.tika.TikaParser
> java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 
> 'org.apache.commons.io.input.CloseShieldInputStream 
> org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
> at 
> java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
> at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
> at 
> org.apache.nutch.fetcher.FetcherThread.output(FetcherThread.java:715)
> at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:431)
> Caused by: java.lang.NoSuchMethodError: 
> 'org.apache.commons.io.input.CloseShieldInputStream 
> org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120)
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
> at 
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151)
> at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2937:
---
Fix Version/s: 1.20
   (was: 1.21)

> parse-tika: review dependency exclusions and avoid dependency conflicts in 
> distributed mode
> ---
>
> Key: NUTCH-2937
> URL: https://issues.apache.org/jira/browse/NUTCH-2937
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, plugin
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> While testing NUTCH-2919 I've seen the following error, caused by a 
> conflicting dependency on commons-io:
> - 2.11.0 Nutch core
> - 2.11.0 parse-tika (excluded to avoid duplicated dependencies)
> - 2.5 provided by Hadoop
> This causes errors parsing some office and other documents (but not all), for 
> example:
> {noformat}
> 2022-01-15 01:36:31,365 WARN [FetcherThread] 
> org.apache.nutch.parse.ParseUtil: Error parsing 
> http://kurskrun.ru/privacypolicy with org.apache.nutch.parse.tika.TikaParser
> java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 
> 'org.apache.commons.io.input.CloseShieldInputStream 
> org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
> at 
> java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
> at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
> at 
> org.apache.nutch.fetcher.FetcherThread.output(FetcherThread.java:715)
> at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:431)
> Caused by: java.lang.NoSuchMethodError: 
> 'org.apache.commons.io.input.CloseShieldInputStream 
> org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120)
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
> at 
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151)
> at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-3005) Upgrade selenium as needed

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3005.

Resolution: Implemented

Done by [~lewismc] as part of NUTCH-3036, commit 
[1563396|https://github.com/apache/nutch/blob/1563396d952393462fffab1f686e9ffd5d006cf6/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java#L151]
 .

> Upgrade selenium as needed
> --
>
> Key: NUTCH-3005
> URL: https://issues.apache.org/jira/browse/NUTCH-3005
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 1.20
>
>
> When we choose to upgrade selenium, we should take note of this blog about 
> changes in headless chromium: 
> https://www.selenium.dev/blog/2023/headless-is-going-away/
> ChromeOptions options = new ChromeOptions();
> options.addArguments("--headless=new");
> WebDriver driver = new ChromeDriver(options);
> driver.get("https://selenium.dev;);
> driver.quit();



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-3016) Upgrade Apache Ivy to 2.5.2

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3016.

Resolution: Duplicate

> Upgrade Apache Ivy to 2.5.2
> ---
>
> Key: NUTCH-3016
> URL: https://issues.apache.org/jira/browse/NUTCH-3016
> Project: Nutch
>  Issue Type: Task
>  Components: build, ivy
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> [Apache Ivy 
> v2.5.2|https://ant.apache.org/ivy/history/2.5.2/release-notes.html] was 
> released on August 20 2023!
> We should upgrade.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3016) Upgrade Apache Ivy to 2.5.2

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3016:
---
Fix Version/s: 1.20
   (was: 1.21)

> Upgrade Apache Ivy to 2.5.2
> ---
>
> Key: NUTCH-3016
> URL: https://issues.apache.org/jira/browse/NUTCH-3016
> Project: Nutch
>  Issue Type: Task
>  Components: build, ivy
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> [Apache Ivy 
> v2.5.2|https://ant.apache.org/ivy/history/2.5.2/release-notes.html] was 
> released on August 20 2023!
> We should upgrade.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3005) Upgrade selenium as needed

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3005:
---
Affects Version/s: 1.19

> Upgrade selenium as needed
> --
>
> Key: NUTCH-3005
> URL: https://issues.apache.org/jira/browse/NUTCH-3005
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Tim Allison
>Priority: Trivial
>
> When we choose to upgrade selenium, we should take note of this blog about 
> changes in headless chromium: 
> https://www.selenium.dev/blog/2023/headless-is-going-away/
> ChromeOptions options = new ChromeOptions();
> options.addArguments("--headless=new");
> WebDriver driver = new ChromeDriver(options);
> driver.get("https://selenium.dev;);
> driver.quit();



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3005) Upgrade selenium as needed

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3005:
---
Fix Version/s: 1.20

> Upgrade selenium as needed
> --
>
> Key: NUTCH-3005
> URL: https://issues.apache.org/jira/browse/NUTCH-3005
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 1.20
>
>
> When we choose to upgrade selenium, we should take note of this blog about 
> changes in headless chromium: 
> https://www.selenium.dev/blog/2023/headless-is-going-away/
> ChromeOptions options = new ChromeOptions();
> options.addArguments("--headless=new");
> WebDriver driver = new ChromeDriver(options);
> driver.get("https://selenium.dev;);
> driver.quit();



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3028:
---
Affects Version/s: 1.19

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3028:
---
Fix Version/s: 1.21

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-2960) indexer-elastic: remove plugin from binary package to address licensing issues

2024-03-14 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2960.

Resolution: Won't Fix

The license issue is addressed by NUTCH-3008.

> indexer-elastic: remove plugin from binary package to address licensing issues
> --
>
> Key: NUTCH-2960
> URL: https://issues.apache.org/jira/browse/NUTCH-2960
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Priority: Major
>
> The license of Elasticsearch changed as of v7.11.0 and is (if our 
> understanding is correct) no longer compatible with the Apache license. 
> Accordingly, we should no longer ship the Elasticsearch jars with the binary 
> package.
> It should be possible to keep the indexer-elastic plugin in the source 
> package as an [optional|https://www.apache.org/legal/resolved.html#optional] 
> dependency (indexer-solr is the default indexing backend and more are 
> available).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (NUTCH-2960) indexer-elastic: remove plugin from binary package to address licensing issues

2024-03-14 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-2960.
--

> indexer-elastic: remove plugin from binary package to address licensing issues
> --
>
> Key: NUTCH-2960
> URL: https://issues.apache.org/jira/browse/NUTCH-2960
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Priority: Major
>
> The license of Elasticsearch changed as of v7.11.0 and is (if our 
> understanding is correct) no longer compatible with the Apache license. 
> Accordingly, we should no longer ship the Elasticsearch jars with the binary 
> package.
> It should be possible to keep the indexer-elastic plugin in the source 
> package as an [optional|https://www.apache.org/legal/resolved.html#optional] 
> dependency (indexer-solr is the default indexing backend and more are 
> available).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-2960) indexer-elastic: remove plugin from binary package to address licensing issues

2024-03-14 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2960:
---
Fix Version/s: (was: 1.20)

> indexer-elastic: remove plugin from binary package to address licensing issues
> --
>
> Key: NUTCH-2960
> URL: https://issues.apache.org/jira/browse/NUTCH-2960
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Priority: Major
>
> The license of Elasticsearch changed as of v7.11.0 and is (if our 
> understanding is correct) no longer compatible with the Apache license. 
> Accordingly, we should no longer ship the Elasticsearch jars with the binary 
> package.
> It should be possible to keep the indexer-elastic plugin in the source 
> package as an [optional|https://www.apache.org/legal/resolved.html#optional] 
> dependency (indexer-solr is the default indexing backend and more are 
> available).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-3008) indexer-elastic: downgrade to ES 7.10.2 to address licensing issues

2024-03-14 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3008.

Resolution: Fixed

> indexer-elastic: downgrade to ES 7.10.2 to address licensing issues
> ---
>
> Key: NUTCH-3008
> URL: https://issues.apache.org/jira/browse/NUTCH-3008
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer, plugin
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Downgrade to ES 7.10.2 (licensed under ASF 2.0) as an alternative solution to 
> address the licensing issues of the indexer-elastic plugin.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-14 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3029.

Resolution: Implemented

> Host specific max. and min. intervals in adaptive scheduler
> ---
>
> Key: NUTCH-3029
> URL: https://issues.apache.org/jira/browse/NUTCH-3029
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.19, 1.20
>Reporter: Martin Djukanovic
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.20
>
> Attachments: adaptive-host-specific-intervals.txt.template, 
> new_adaptive_fetch_schedule-1.patch
>
>
> This patch implements custom max. and min. refetching intervals for specific 
> hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt 
> configuration file (template also attached).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-14 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-3029.
--

> Host specific max. and min. intervals in adaptive scheduler
> ---
>
> Key: NUTCH-3029
> URL: https://issues.apache.org/jira/browse/NUTCH-3029
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.19, 1.20
>Reporter: Martin Djukanovic
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.20
>
> Attachments: adaptive-host-specific-intervals.txt.template, 
> new_adaptive_fetch_schedule-1.patch
>
>
> This patch implements custom max. and min. refetching intervals for specific 
> hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt 
> configuration file (template also attached).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-14 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reopened NUTCH-3029:

  Assignee: Sebastian Nagel  (was: Markus Jelsma)

Reopen to update "Fix version(s)" - add 1.20, to make it appear in the release 
notes.

> Host specific max. and min. intervals in adaptive scheduler
> ---
>
> Key: NUTCH-3029
> URL: https://issues.apache.org/jira/browse/NUTCH-3029
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.19, 1.20
>Reporter: Martin Djukanovic
>Assignee: Sebastian Nagel
>Priority: Minor
> Attachments: adaptive-host-specific-intervals.txt.template, 
> new_adaptive_fetch_schedule-1.patch
>
>
> This patch implements custom max. and min. refetching intervals for specific 
> hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt 
> configuration file (template also attached).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-14 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3029:
---
Fix Version/s: 1.20

> Host specific max. and min. intervals in adaptive scheduler
> ---
>
> Key: NUTCH-3029
> URL: https://issues.apache.org/jira/browse/NUTCH-3029
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.19, 1.20
>Reporter: Martin Djukanovic
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.20
>
> Attachments: adaptive-host-specific-intervals.txt.template, 
> new_adaptive_fetch_schedule-1.patch
>
>
> This patch implements custom max. and min. refetching intervals for specific 
> hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt 
> configuration file (template also attached).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3035) Update license and notice file for release of 1.20

2024-03-13 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3035:
--

 Summary: Update license and notice file for release of 1.20 
 Key: NUTCH-3035
 URL: https://issues.apache.org/jira/browse/NUTCH-3035
 Project: Nutch
  Issue Type: Bug
  Components: documentation
Affects Versions: 1.20
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.20


Close to the release of 1.20, the license and notice files should be updated to 
contain all (third-party) licenses of all dependencies. Cf. NUTCH-2290 and 
NUTCH-2981.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL

2023-11-08 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3025.

Resolution: Implemented

> urlfilter-fast to filter based on the length of the URL
> ---
>
> Key: NUTCH-3025
> URL: https://issues.apache.org/jira/browse/NUTCH-3025
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, urlfilter
>Affects Versions: 1.19
>Reporter: Julien Nioche
>Priority: Major
> Fix For: 1.20
>
>
> There currently is no filter implementation to remove URLs based on their 
> length or the length of their path / query.
> Doing so with the regex filter would be inefficient; instead, we could 
> implement it in _urlfilter-fast_.
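
A minimal sketch of such a length check (not the actual urlfilter-fast
implementation; the limit values and class name are made-up examples):
{noformat}
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch: reject URLs whose total length, path length, or query
// length exceeds a configurable limit; returning null means "rejected", as in
// Nutch's URLFilter contract.
public class UrlLengthFilterSketch {
  static final int MAX_URL_LENGTH = 512;
  static final int MAX_PATH_LENGTH = 256;
  static final int MAX_QUERY_LENGTH = 256;

  public static String filter(String urlString) {
    if (urlString.length() > MAX_URL_LENGTH) {
      return null;
    }
    try {
      URL url = new URL(urlString);
      String path = url.getPath();
      String query = url.getQuery();
      if (path != null && path.length() > MAX_PATH_LENGTH) {
        return null;
      }
      if (query != null && query.length() > MAX_QUERY_LENGTH) {
        return null;
      }
    } catch (MalformedURLException e) {
      return null; // unparseable URLs are rejected as well
    }
    return urlString;
  }
}
{noformat}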



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL

2023-11-08 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3025:
---
Component/s: plugin
 urlfilter

> urlfilter-fast to filter based on the length of the URL
> ---
>
> Key: NUTCH-3025
> URL: https://issues.apache.org/jira/browse/NUTCH-3025
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, urlfilter
>Affects Versions: 1.19
>Reporter: Julien Nioche
>Priority: Major
> Fix For: 1.20
>
>
> There currently is no filter implementation to remove URLs based on their 
> length or the length of their path / query.
> Doing so with the regex filter would be inefficient; instead, we could 
> implement it in _urlfilter-fast_.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input

2023-11-08 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784030#comment-17784030
 ] 

Sebastian Nagel commented on NUTCH-3017:


Thanks, [~jnioche]

> Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
> ---
>
> Key: NUTCH-3017
> URL: https://issues.apache.org/jira/browse/NUTCH-3017
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, urlfilter
>Affects Versions: 1.19
>Reporter: Julien Nioche
>Priority: Minor
> Fix For: 1.20
>
>
> This provides an easier way to refresh the resources since no rebuild of the 
> jar is needed. The path can point to either HDFS or S3. Additionally, .gz 
> files should be handled automatically.
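
A minimal sketch of the loading logic described here, using the Hadoop
FileSystem API (the class and method names are made up; the real plugin code
may differ):
{noformat}
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: resolve the rules file through the Hadoop FileSystem
// API, so the location may be local, HDFS, or S3 (e.g. s3a://bucket/rules.txt.gz),
// and transparently decompress it when the name ends in ".gz".
public class RulesLoaderSketch {
  public static BufferedReader open(Configuration conf, String location)
      throws Exception {
    Path path = new Path(location);
    FileSystem fs = path.getFileSystem(conf);
    InputStream in = fs.open(path);
    if (location.endsWith(".gz")) {
      in = new GZIPInputStream(in);
    }
    return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
  }
}
{noformat}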



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input

2023-11-08 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3017.

Resolution: Implemented

> Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
> ---
>
> Key: NUTCH-3017
> URL: https://issues.apache.org/jira/browse/NUTCH-3017
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, urlfilter
>Affects Versions: 1.19
>Reporter: Julien Nioche
>Priority: Minor
> Fix For: 1.20
>
>
> This provides an easier way to refresh the resources since no rebuild of the 
> jar is needed. The path can point to either HDFS or S3. Additionally, .gz 
> files should be handled automatically.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input

2023-10-30 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3017:
---
Component/s: plugin
 urlfilter

> Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
> ---
>
> Key: NUTCH-3017
> URL: https://issues.apache.org/jira/browse/NUTCH-3017
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, urlfilter
>Affects Versions: 1.19
>Reporter: Julien Nioche
>Priority: Minor
> Fix For: 1.20
>
>
> This provides an easier way to refresh the resources since no rebuild of the 
> jar is needed. The path can point to either HDFS or S3. Additionally, .gz 
> files should be handled automatically.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input

2023-10-30 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3017:
---
Fix Version/s: 1.20

> Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
> ---
>
> Key: NUTCH-3017
> URL: https://issues.apache.org/jira/browse/NUTCH-3017
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Julien Nioche
>Priority: Minor
> Fix For: 1.20
>
>
> This provides an easier way to refresh the resources since no rebuild of the 
> jar is needed. The path can point to either HDFS or S3. Additionally, .gz 
> files should be handled automatically.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on unparsed documents

2023-10-21 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3012.

Resolution: Fixed

> SegmentReader when dumping with option -recode: NPE on unparsed documents
> -
>
> Key: NUTCH-3012
> URL: https://issues.apache.org/jira/browse/NUTCH-3012
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> SegmentReader when called with the flag {{-recode}} fails with a NPE when 
> trying to stringify the raw content of unparsed documents:
> {noformat}
> $> bin/nutch readseg  -dump crawl/segments/20231009065431 
> crawl/segreader/20231009065431 -recode
> ...
> 2023-10-09 07:55:18,451 INFO mapreduce.Job: Task Id : 
> attempt_1696825862783_0005_r_00_0, Status : FAILED
> Error: java.lang.NullPointerException: charset
> at java.base/java.lang.String.<init>(String.java:504)
> at java.base/java.lang.String.<init>(String.java:561)
> at org.apache.nutch.protocol.Content.toString(Content.java:297)
> at 
> org.apache.nutch.segment.SegmentReader$InputCompatReducer.reduce(SegmentReader.java:189)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-3011) HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)

2023-10-21 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3011.

Resolution: Implemented

> HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors 
> (HTTP 5xx)
> 
>
> Key: NUTCH-3011
> URL: https://issues.apache.org/jira/browse/NUTCH-3011
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> HttpRobotRulesParser should handle HTTP 429 Too Many Requests the same as server 
> errors (HTTP 5xx), that is, if configured, signal the Fetcher to delay requests. 
> See also NUTCH-2573 and 
> https://support.google.com/webmasters/answer/9679690#robots_details
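
The gist of the change, as a tiny hedged sketch (not the actual
HttpRobotRulesParser code; class and method names are made up):
{noformat}
// Hypothetical sketch: HTTP 429 is treated like a server error (5xx), i.e.
// depending on configuration the fetcher is told to defer visits to the host
// instead of assuming "allow all".
public class RobotsStatusSketch {
  public static boolean deferVisits(int httpStatusCode) {
    return httpStatusCode == 429
        || (httpStatusCode >= 500 && httpStatusCode < 600);
  }
}
{noformat}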



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-2990) HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

2023-10-21 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2990.

Resolution: Implemented

Thanks, everybody!

> HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
> ---
>
> Key: NUTCH-2990
> URL: https://issues.apache.org/jira/browse/NUTCH-2990
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol, robots
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> The robots.txt parser 
> ([HttpRobotRulesParser|https://nutch.apache.org/documentation/javadoc/apidocs/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.html])
>  follows only one redirect when fetching the robots.txt while the robots.txt 
> RFC 9309 recommends to follow 5 redirects:
> {quote} 2.3.1.2. Redirects
> It's possible that a server responds to a robots.txt fetch request with a 
> redirect, such as HTTP 301 or HTTP 302 in the case of HTTP. The crawlers 
> SHOULD follow at least five consecutive redirects, even across authorities 
> (for example, hosts in the case of HTTP).
> If a robots.txt file is reached within five consecutive redirects, the 
> robots.txt file MUST be fetched, parsed, and its rules followed in the 
> context of the initial authority. If there are more than five consecutive 
> redirects, crawlers MAY assume that the robots.txt file is unavailable.
> (https://datatracker.ietf.org/doc/html/rfc9309#name-redirects){quote}
> While following redirects, the parser should check whether the redirect 
> location is itself a "/robots.txt" on a different host and then try to read 
> it from the cache.
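
A standalone sketch of the redirect handling described above (not the actual
HttpRobotRulesParser implementation; it uses plain HttpURLConnection and omits
the cache lookup for redirects to another host's /robots.txt):
{noformat}
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical sketch: follow at most five consecutive robots.txt redirects,
// as recommended by RFC 9309, and treat the robots.txt as unavailable beyond
// that.
public class RobotsRedirectSketch {
  static final int MAX_REDIRECTS = 5;

  public static URL resolveRobotsUrl(URL robotsUrl) throws Exception {
    URL current = robotsUrl;
    for (int i = 0; i < MAX_REDIRECTS; i++) {
      HttpURLConnection conn = (HttpURLConnection) current.openConnection();
      conn.setInstanceFollowRedirects(false);
      int code = conn.getResponseCode();
      boolean redirect = code == HttpURLConnection.HTTP_MOVED_PERM
          || code == HttpURLConnection.HTTP_MOVED_TEMP
          || code == 307 || code == 308;
      if (!redirect) {
        return current; // final robots.txt location
      }
      String location = conn.getHeaderField("Location");
      if (location == null) {
        return current;
      }
      current = new URL(current, location); // may cross hosts ("authorities")
    }
    return null; // more than five consecutive redirects: treat as unavailable
  }
}
{noformat}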



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (NUTCH-3009) Upgrade to Hadoop 3.3.6

2023-10-21 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-3009:
--

Assignee: Sebastian Nagel

> Upgrade to Hadoop 3.3.6
> ---
>
> Key: NUTCH-3009
> URL: https://issues.apache.org/jira/browse/NUTCH-3009
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Upgrade to [Hadoop 3.3.6|https://hadoop.apache.org/release/3.3.6.html], the 
> latest available release of Hadoop (release date: 2023-06-23).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-3009) Upgrade to Hadoop 3.3.6

2023-10-21 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3009.

Resolution: Implemented

> Upgrade to Hadoop 3.3.6
> ---
>
> Key: NUTCH-3009
> URL: https://issues.apache.org/jira/browse/NUTCH-3009
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Upgrade to [Hadoop 3.3.6|https://hadoop.apache.org/release/3.3.6.html], the 
> latest available release of Hadoop (release date: 2023-06-23).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-3006) Downgrade Tika dependency to 2.2.1 (core and parse-tika)

2023-10-21 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3006.

Fix Version/s: (was: 1.20)
   Resolution: Abandoned

> Downgrade Tika dependency to 2.2.1 (core and parse-tika)
> 
>
> Key: NUTCH-3006
> URL: https://issues.apache.org/jira/browse/NUTCH-3006
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Priority: Major
>
> Tika 2.3.0 and upwards depend on commons-io 2.11.0 (or even higher), which 
> is not available when Nutch is used on Hadoop. Only Hadoop 3.4.0 is expected 
> to ship with commons-io 2.11.0 (HADOOP-18301); all currently released 
> versions provide commons-io 2.8.0. Because Hadoop-required dependencies are 
> enforced in (pseudo)distributed mode, using Tika may cause issues, see 
> NUTCH-2937 and NUTCH-2959.
> [~lewismc] suggested in the discussion of [Github PR 
> #776|https://github.com/apache/nutch/pull/776] to downgrade to Tika 2.2.1 to 
> resolve these issues for now, until Hadoop 3.4.0 becomes available.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (NUTCH-3002) Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive

2023-10-21 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-3002:
--

Assignee: Sebastian Nagel

> Protocol-okhttp HttpResponse: HTTP header metadata lookup should be 
> case-insensitive
> 
>
> Key: NUTCH-3002
> URL: https://issues.apache.org/jira/browse/NUTCH-3002
> Project: Nutch
>  Issue Type: Bug
>  Components: metadata, plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Lookup of HTTP headers in the class HttpResponse should be case-insensitive - 
> for example, any "Location" header should be returned independent from the 
> casing sent by the sender.
> While protocol-http uses the class SpellCheckedMetadata which provides 
> case-insensitive lookups (as part of the spell-checking functionality), 
> protocol-okhttp relies on the class Metadata which stores the metadata values 
> case-sensitively.
> It's a good question whether we still need to spell-check HTTP headers. 
> However, case-insensitive look-ups are definitely required, especially since 
> HTTP header names are case-insensitive in HTTP/2.
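
An illustration of the required behavior (not the Nutch Metadata or
SpellCheckedMetadata classes), using a plain case-insensitive map:
{noformat}
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a case-insensitive header lookup, so that
// get("location") and get("Location") return the same value, which matters
// for HTTP/2 where header names are transmitted in lower case.
public class CaseInsensitiveHeadersSketch {
  private final Map<String, String> headers =
      new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

  public void add(String name, String value) {
    headers.put(name, value);
  }

  public String get(String name) {
    return headers.get(name);
  }

  public static void main(String[] args) {
    CaseInsensitiveHeadersSketch h = new CaseInsensitiveHeadersSketch();
    h.add("location", "https://example.com/");
    System.out.println(h.get("Location")); // https://example.com/
  }
}
{noformat}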



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-3002) Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive

2023-10-21 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3002.

Resolution: Fixed

> Protocol-okhttp HttpResponse: HTTP header metadata lookup should be 
> case-insensitive
> 
>
> Key: NUTCH-3002
> URL: https://issues.apache.org/jira/browse/NUTCH-3002
> Project: Nutch
>  Issue Type: Bug
>  Components: metadata, plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Lookup of HTTP headers in the class HttpResponse should be case-insensitive - 
> for example, any "Location" header should be returned independent from the 
> casing sent by the sender.
> While protocol-http uses the class SpellCheckedMetadata which provides 
> case-insensitive lookups (as part of the spell-checking functionality), 
> protocol-okhttp relies on the class Metadata which stores the metadata values 
> case-sensitively.
> It's a good question whether we still need to spell-check HTTP headers. 
> However, case-insensitive look-ups are definitely required, especially since 
> HTTP header names are case-insensitive in HTTP/2.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3014) Standardize NutchJob job names

2023-10-21 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17778103#comment-17778103
 ] 

Sebastian Nagel commented on NUTCH-3014:


If there is a single data name/directory (CrawlDb, segment, etc.), using it as 
part of the additional info would make the job name more unique. Imagine a long 
list of generate - fetch - updatedb jobs: adding the segment to the "generator 
partition" and fetcher jobs makes it easier to figure out where in the crawl 
workflow a job was located. If multiple workflows are running concurrently, the 
CrawlDb name/path would also be a useful discriminating component.
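
A short sketch of how the convention proposed in NUTCH-3014, extended with the
data directory as additional info, could look in code (hypothetical helper, not
committed code):
{noformat}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical helper: build job names as "Nutch ${ClassName} ${additional info}",
// where the additional info is e.g. the segment or CrawlDb path of the job.
public class JobNameSketch {
  public static Job createJob(Configuration conf, Class<?> toolClass,
      String additionalInfo) throws IOException {
    Job job = Job.getInstance(conf);
    String name = "Nutch " + toolClass.getSimpleName();
    if (additionalInfo != null && !additionalInfo.isEmpty()) {
      name += " " + additionalInfo;  // e.g. "crawl/segments/20231009065431"
    }
    job.setJobName(name);
    return job;
  }
}
{noformat}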

> Standardize NutchJob job names
> --
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, runtime
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name:
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value prepended to the job name; it helps 
> distinguish the job as a NutchJob and makes it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _*Nutch ${ClassName}* *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank Inverter_
>  * _Nutch CrawlDb + $crawldb_
>  * _Nutch LinkDbReader + $linkdb_
> Thanks for any suggestions/comments.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on unparsed documents

2023-10-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3012:
---
Description: 
SegmentReader when called with the flag {{-recode}} fails with a NPE when 
trying to stringify the raw content of unparsed documents:
{noformat}
$> bin/nutch readseg  -dump crawl/segments/20231009065431 
crawl/segreader/20231009065431 -recode
...
2023-10-09 07:55:18,451 INFO mapreduce.Job: Task Id : 
attempt_1696825862783_0005_r_00_0, Status : FAILED
Error: java.lang.NullPointerException: charset
at java.base/java.lang.String.<init>(String.java:504)
at java.base/java.lang.String.<init>(String.java:561)
at org.apache.nutch.protocol.Content.toString(Content.java:297)
at 
org.apache.nutch.segment.SegmentReader$InputCompatReducer.reduce(SegmentReader.java:189)
{noformat}
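
A minimal sketch of the kind of guard needed (not the actual Content.toString()
fix; the class and method names are made up): fall back to a default charset
when none was detected for unparsed or binary content.
{noformat}
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: never pass a null charset to new String(...); unknown
// or missing charset names fall back to UTF-8.
public class SafeRecodeSketch {
  public static String recode(byte[] rawContent, String detectedCharset) {
    Charset cs = StandardCharsets.UTF_8;
    if (detectedCharset != null) {
      try {
        cs = Charset.forName(detectedCharset);
      } catch (Exception e) {
        // unknown or unsupported charset name: keep the UTF-8 fallback
      }
    }
    return new String(rawContent, cs);
  }
}
{noformat}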

> SegmentReader when dumping with option -recode: NPE on unparsed documents
> -
>
> Key: NUTCH-3012
> URL: https://issues.apache.org/jira/browse/NUTCH-3012
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> SegmentReader when called with the flag {{-recode}} fails with a NPE when 
> trying to stringify the raw content of unparsed documents:
> {noformat}
> $> bin/nutch readseg  -dump crawl/segments/20231009065431 
> crawl/segreader/20231009065431 -recode
> ...
> 2023-10-09 07:55:18,451 INFO mapreduce.Job: Task Id : 
> attempt_1696825862783_0005_r_00_0, Status : FAILED
> Error: java.lang.NullPointerException: charset
> at java.base/java.lang.String.<init>(String.java:504)
> at java.base/java.lang.String.<init>(String.java:561)
> at org.apache.nutch.protocol.Content.toString(Content.java:297)
> at 
> org.apache.nutch.segment.SegmentReader$InputCompatReducer.reduce(SegmentReader.java:189)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on unparsed documents

2023-10-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3012:
---
Summary: SegmentReader when dumping with option -recode: NPE on unparsed 
documents  (was: SegmentReader when dumping with option -recode: NPE on 
documents without charset defined)

> SegmentReader when dumping with option -recode: NPE on unparsed documents
> -
>
> Key: NUTCH-3012
> URL: https://issues.apache.org/jira/browse/NUTCH-3012
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on documents without charset defined

2023-10-09 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3012:
--

 Summary: SegmentReader when dumping with option -recode: NPE on 
documents without charset defined
 Key: NUTCH-3012
 URL: https://issues.apache.org/jira/browse/NUTCH-3012
 Project: Nutch
  Issue Type: Bug
  Components: segment
Affects Versions: 1.19
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.20






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.9.0

2023-10-03 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17771445#comment-17771445
 ] 

Sebastian Nagel commented on NUTCH-2959:


Hi [~tallison], it's your decision whether it's worth spending the time. If you 
want to continue, I'm happy to test it. Personally, I'd just wait for Hadoop 
3.4.0, which means downgrading to Tika 2.2.1 for now and likely taking a huge 
jump forward in the included Tika version only with Nutch 1.21. We have the open 
PR: anybody working in local mode and relying on a newer Tika version can pick 
the PR. Unfortunately (or not?), Hadoop is sometimes conservative or slow with 
upgrades, same for the support of Java 17 (NUTCH-2987 / HADOOP-17177).

> Upgrade to Apache Tika 2.9.0
> 
>
> Key: NUTCH-2959
> URL: https://issues.apache.org/jira/browse/NUTCH-2959
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2959.patch
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-1130) JUnit test for Any23 RDF plugin

2023-10-03 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1130.

Resolution: Won't Do

Closing - the any23 project has been retired and the any23 plugin was removed 
from Nutch (NUTCH-2998).

> JUnit test for Any23 RDF plugin
> ---
>
> Key: NUTCH-1130
> URL: https://issues.apache.org/jira/browse/NUTCH-1130
> Project: Nutch
>  Issue Type: Sub-task
>  Components: build
>Affects Versions: 1.4
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
>
> The JUnit test should be written prior to the progression of the Any23 Nutch 
> plugin



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (NUTCH-1130) JUnit test for Any23 RDF plugin

2023-10-03 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-1130.
--

> JUnit test for Any23 RDF plugin
> ---
>
> Key: NUTCH-1130
> URL: https://issues.apache.org/jira/browse/NUTCH-1130
> Project: Nutch
>  Issue Type: Sub-task
>  Components: build
>Affects Versions: 1.4
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
>
> The JUnit test should be written prior to the progression of the Any23 Nutch 
> plugin



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-2938) Use Any23's RepositoryWriter to write structured data to Rdf4j repository

2023-10-03 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2938.

Resolution: Won't Do

Closing - the any23 project has been retired and the any23 plugin was removed 
from Nutch (NUTCH-2998). See also the comment in the linked PR.

> Use Any23's RepositoryWriter to write structured data to Rdf4j repository
> -
>
> Key: NUTCH-2938
> URL: https://issues.apache.org/jira/browse/NUTCH-2938
> Project: Nutch
>  Issue Type: Improvement
>  Components: any23, plugin
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
>
> I have been running a patch which leverages [Any23's 
> RepositoryWriter|https://any23.apache.org/apidocs/org/apache/any23/writer/RepositoryWriter.html]
>  (implemented as one of a number of TripleHandler's via 
> [CompositeTripleHandler|https://any23.apache.org/apidocs/org/apache/any23/writer/CompositeTripleHandler.html])
>  to write Any23 extractions to 
> [GraphDB|https://www.ontotext.com/products/graphdb/]. This enables us to 
> build a content graph from data across the enterprise.
> This feature is turned off by default so will not change existing Any23 
> behaviour. I have concerns about the performance of this patch because right 
> now we need to create a new repository connection for each URL. This is not 
> great so I will definitely improve on it.
> PR coming up.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (NUTCH-2938) Use Any23's RepositoryWriter to write structured data to Rdf4j repository

2023-10-03 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-2938.
--

> Use Any23's RepositoryWriter to write structured data to Rdf4j repository
> -
>
> Key: NUTCH-2938
> URL: https://issues.apache.org/jira/browse/NUTCH-2938
> Project: Nutch
>  Issue Type: Improvement
>  Components: any23, plugin
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
>
> I have been running a patch which leverages [Any23's 
> RepositoryWriter|https://any23.apache.org/apidocs/org/apache/any23/writer/RepositoryWriter.html]
>  (implemented as one of a number of TripleHandler's via 
> [CompositeTripleHandler|https://any23.apache.org/apidocs/org/apache/any23/writer/CompositeTripleHandler.html])
>  to write Any23 extractions to 
> [GraphDB|https://www.ontotext.com/products/graphdb/]. This enables us to 
> build a content graph from data across the enterprise.
> This feature is turned off by default so will not change existing Any23 
> behaviour. I have concerns about the performance of this patch because right 
> now we need to create a new repository connection for each URL. This is not 
> great so I will definitely improve on it.
> PR coming up.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-2938) Use Any23's RepositoryWriter to write structured data to Rdf4j repository

2023-10-03 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2938:
---
Fix Version/s: (was: 1.20)

> Use Any23's RepositoryWriter to write structured data to Rdf4j repository
> -
>
> Key: NUTCH-2938
> URL: https://issues.apache.org/jira/browse/NUTCH-2938
> Project: Nutch
>  Issue Type: Improvement
>  Components: any23, plugin
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
>
> I have been running a patch which leverages [Any23's 
> RepositoryWriter|https://any23.apache.org/apidocs/org/apache/any23/writer/RepositoryWriter.html]
>  (implemented as one of a number of TripleHandler's via 
> [CompositeTripleHandler|https://any23.apache.org/apidocs/org/apache/any23/writer/CompositeTripleHandler.html])
>  to write Any23 extractions to 
> [GraphDB|https://www.ontotext.com/products/graphdb/]. This enables us to 
> build a content graph from data across the enterprise.
> This feature is turned off by default so will not change existing Any23 
> behaviour. I have concerns about the performance of this patch because right 
> now we need to create a new repository connection for each URL. This is not 
> great so I will definitely improve on it.
> PR coming up.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-2853) bin/nutch: remove deprecated commands solrindex, solrdedup, solrclean

2023-10-03 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2853.

Resolution: Fixed

> bin/nutch: remove deprecated commands solrindex, solrdedup, solrclean
> -
>
> Key: NUTCH-2853
> URL: https://issues.apache.org/jira/browse/NUTCH-2853
> Project: Nutch
>  Issue Type: Improvement
>  Components: bin
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Priority: Major
>  Labels: help-wanted
> Fix For: 1.20
>
>
> The commands "solrindex", "solrdedup" and "solrclean" have been deprecated for 
> 7 years and should be removed to avoid any confusion (one example: 
> https://stackoverflow.com/questions/66376609/nutch-solr-index-is-failing).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-2897) Do not supress deprecated API warnings

2023-10-03 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2897.

Resolution: Fixed

> Do not supress deprecated API warnings
> --
>
> Key: NUTCH-2897
> URL: https://issues.apache.org/jira/browse/NUTCH-2897
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 1.18
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> We suppress deprecated warnings in three places
> # 
> [Plugin.java#L92-L96|https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/plugin/Plugin.java#L92-L96]
> # 
> [NutchJob.java#L35-L38|https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/NutchJob.java#L35-L38],
>  and
> # 
> [TikaParser.java#L92-L95|https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java#L92-L95]
> Instead of suppressing the warnings, we should use the correct 
> *@Deprecated* annotation and *@deprecated* Javadoc. This is not difficult to 
> do and should have been done the first time around.
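
For illustration, a small generic example of the requested pattern, pairing the @Deprecated annotation with an explanatory @deprecated Javadoc tag instead of suppressing the warning; the class and method names are made up, not the Nutch classes listed above.

{code:java}
// Generic example of the requested pattern: pair the @Deprecated annotation
// with an explanatory @deprecated Javadoc tag instead of suppressing the
// warning. Class and method names are illustrative only.
public class DeprecationExample {

  /**
   * @deprecated use {@link #newApi()} instead; kept only for callers compiled
   *             against older releases.
   */
  @Deprecated
  public void oldApi() {
    newApi();
  }

  public void newApi() {
    // current implementation
  }
}
{code}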



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-3010) Injector: count unique number of injected URLs

2023-10-02 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3010.

Resolution: Fixed

> Injector: count unique number of injected URLs
> --
>
> Key: NUTCH-3010
> URL: https://issues.apache.org/jira/browse/NUTCH-3010
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Injector uses two counters: one for the total number of injected URLs, the 
> other for the number of URLs "merged", that is already in CrawlDb. There is 
> no counter for the number of unique URLs injected, which may lead to wrong 
> counts if the seed files contain duplicates:
> Suppose the following seed file which contains a duplicated URL:
> {noformat}
> $> cat seeds_with_duplicates.txt 
> https://www.example.org/page1.html
> https://www.example.org/page2.html
> https://www.example.org/page2.html
> $> $NUTCH_HOME/bin/nutch inject /tmp/crawldb seeds_with_duplicates.txt
> ...
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls 
> rejected by filters: 0
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls 
> injected after normalization and filtering: 3
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls 
> injected but already in CrawlDb: 0
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total new urls 
> injected: 3
> ...
> {noformat}
> However, because of the duplicated URL, only 2 URLs were injected into the 
> CrawlDb:
> {noformat}
> $> $NUTCH_HOME/bin/nutch readdb /tmp/crawldb -stats
> ...
> 2023-09-30 07:39:43,945 INFO o.a.n.c.CrawlDbReader [main] TOTAL urls:   2
> ...
> {noformat}
> If the Injector job is run again with the same input, we get the erroneous 
> output, that still one "new URL" was injected:
> {noformat}
> 2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls 
> rejected by filters: 0
> 2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls 
> injected after normalization and filtering: 3
> 2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total urls 
> injected but already in CrawlDb: 2
> 2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total new urls 
> injected: 1
> {noformat}
> This is because the urls_merged counter counts unique items, while 
> url_injected does not, and the shown number is the difference between both 
> counters.
> Adding a counter for the number of unique injected URLs will make it possible 
> to get the correct count of newly injected URLs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3011) HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)

2023-10-01 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3011:
--

 Summary: HttpRobotRulesParser: handle HTTP 429 Too Many Requests 
same as server errors (HTTP 5xx)
 Key: NUTCH-3011
 URL: https://issues.apache.org/jira/browse/NUTCH-3011
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.19
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.20


HttpRobotRulesParser should handle HTTP 429 Too Many Requests same as server 
errors (HTTP 5xx), that is if configured signalize Fetcher to delay requests. 
See also NUTCH-2573 and 
https://support.google.com/webmasters/answer/9679690#robots_details



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (NUTCH-1373) Implement consistent execution of normalising and filtering in Generator

2023-10-01 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-1373.
--

> Implement consistent execution of normalising and filtering in Generator
> 
>
> Key: NUTCH-1373
> URL: https://issues.apache.org/jira/browse/NUTCH-1373
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.4
>Reporter: Lewis John McGibbney
>Priority: Minor
>
> As per discussion here [0] this issue should address the inconsistencies we 
> see in the scheduled execution of normalising and filtering between Nutchgora 
> Generator Mapper and trunk Generator mapper/reducer.
> Hopefully we can come to some consensus as to the best approach across both 
> dists. 
> [0] http://www.mail-archive.com/user%40nutch.apache.org/msg06360.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-1373) Implement consistent execution of normalising and filtering in Generator

2023-10-01 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1373.

Resolution: Abandoned

Closing as Nutch 2.x (aka. nutchgora) isn't maintained anymore.

> Implement consistent execution of normalising and filtering in Generator
> 
>
> Key: NUTCH-1373
> URL: https://issues.apache.org/jira/browse/NUTCH-1373
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.4
>Reporter: Lewis John McGibbney
>Priority: Minor
>
> As per discussion here [0] this issue should address the inconsistencies we 
> see in the scheduled execution of normalising and filtering between Nutchgora 
> Generator Mapper and trunk Generator mapper/reducer.
> Hopefully we can come to some consensus as to the best approach across both 
> dists. 
> [0] http://www.mail-archive.com/user%40nutch.apache.org/msg06360.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-1374) Workaround for license headers

2023-10-01 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770833#comment-17770833
 ] 

Sebastian Nagel commented on NUTCH-1374:


The package.html files were replaced by package-info.java containing a license 
header in NUTCH-2849

> Workaround for license headers
> --
>
> Key: NUTCH-1374
> URL: https://issues.apache.org/jira/browse/NUTCH-1374
> Project: Nutch
>  Issue Type: Task
>  Components: documentation
>Affects Versions: 1.4, nutchgora
>Reporter: Lewis John McGibbney
>Priority: Major
>
> Currently in both versions of Nutch we have two types of files which DO NOT 
> contain license headers; namely all package.html files and the test files 
> within the language detection plugin. In my initial tests, adding license 
> headers to the language test files breaks the tests, so we need to find a 
> workaround (or the correct syntax) to add commented-out license headers to 
> these files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-1635) New crawldb sometimes ends up in current

2023-10-01 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770831#comment-17770831
 ] 

Sebastian Nagel commented on NUTCH-1635:


Hi [~markus17], did this continue to happen in recent years, especially after 
upgrading to the new MapReduce API (NUTCH-2375)?

> New crawldb sometimes ends up in current
> 
>
> Key: NUTCH-1635
> URL: https://issues.apache.org/jira/browse/NUTCH-1635
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Markus Jelsma
>Priority: Major
>
> In some weird cases the crawldb newly created by updatedb ends up in 
> crawl/crawldb/current//. So instead of replacing current/, it ends up 
> inside current/! This causes the generator to fail.
> It's impossible to reliably reproduce the problem. It only happened a couple 
> of times in the last few years.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-1947) Overhaul o.a.n.parse.OutlinkExtractor.java

2023-10-01 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1947.

Resolution: Abandoned

Closing because OutlinkExtractor has seen many updates since then: the upgrade to 
Java 8, the replacement of Apache ORO by java.util.regex, etc.

> Overhaul o.a.n.parse.OutlinkExtractor.java 
> ---
>
> Key: NUTCH-1947
> URL: https://issues.apache.org/jira/browse/NUTCH-1947
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 2.3, 1.9
>Reporter: Lewis John McGibbney
>Priority: Major
>
> Right now in both trunk and 2.X, the 
> [OutlinkExtractor.java|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java]
> class needs a bit of TLC. It is referencing JDK 1.5 in a few places, there are 
> misleading URL entries, and it boasts some interesting @Deprecated methods 
> which we could ideally remove.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (NUTCH-1947) Overhaul o.a.n.parse.OutlinkExtractor.java

2023-10-01 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-1947.
--

> Overhaul o.a.n.parse.OutlinkExtractor.java 
> ---
>
> Key: NUTCH-1947
> URL: https://issues.apache.org/jira/browse/NUTCH-1947
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 2.3, 1.9
>Reporter: Lewis John McGibbney
>Priority: Major
>
> Right now in both trunk and 2.X, the 
> [OutlinkExtractor.java|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java]
> class needs a bit of TLC. It is referencing JDK 1.5 in a few places, there are 
> misleading URL entries, and it boasts some interesting @Deprecated methods 
> which we could ideally remove.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-2053) Uncessary dependencies included in ivy.xml (post NUTCH-2038)

2023-10-01 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2053.

Resolution: Abandoned

Closing this old issue (8 years old), assuming that the dependencies have been 
updated and cleaned up multiple times since then.

> Uncessary dependencies included in ivy.xml (post NUTCH-2038)
> 
>
> Key: NUTCH-2053
> URL: https://issues.apache.org/jira/browse/NUTCH-2053
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.11
>Reporter: Lewis John McGibbney
>Priority: Major
>
> Currently in trunk we have an unnecessary dependency included within 
> ivy/ivy.xml
> https://github.com/apache/nutch/blob/trunk/ivy/ivy.xml#L99-L101
> This needs to be removed.
> [~asitang] can you please provide context as to why this is OK? I don't want 
> to break your code so sorry for lack of understanding. Thanks



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (NUTCH-2053) Uncessary dependencies included in ivy.xml (post NUTCH-2038)

2023-10-01 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-2053.
--

> Uncessary dependencies included in ivy.xml (post NUTCH-2038)
> 
>
> Key: NUTCH-2053
> URL: https://issues.apache.org/jira/browse/NUTCH-2053
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.11
>Reporter: Lewis John McGibbney
>Priority: Major
>
> Currently in trunk we have an unnecessary dependency included within 
> ivy/ivy.xml
> https://github.com/apache/nutch/blob/trunk/ivy/ivy.xml#L99-L101
> This needs to be removed.
> [~asitang] can you please provide context as to why this is OK? I don't want 
> to break your code so sorry for lack of understanding. Thanks



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-2423) Update contributor info page

2023-10-01 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2423.

Fix Version/s: (was: 1.20)
   Resolution: Fixed

The wiki pages were updated in 2020 and 2021. Thanks for reporting, [~krichter]!

> Update contributor info page
> 
>
> Key: NUTCH-2423
> URL: https://issues.apache.org/jira/browse/NUTCH-2423
> Project: Nutch
>  Issue Type: Task
>  Components: documentation, wiki
>Reporter: Karl-Philipp Richter
>Priority: Major
>  Labels: easytask, help-wanted
>
> The [contributor info 
> page](https://wiki.apache.org/nutch/Becoming_A_Nutch_Developer) still 
> mentions subversion as SCM which I assume is obsolete because there's 
> git://git.apache.org/nutch.git. It should mention how the devs with write 
> access deal with pull/merge requests in general or on different popular 
> platforms (the information that they're not accepted is valuable as well).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-2820) Review sample files used in any23 unit tests

2023-09-30 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2820.

Resolution: Resolved

Resolved with the removal of the any23 plugin (NUTCH-2998).

> Review sample files used in any23 unit tests
> 
>
> Key: NUTCH-2820
> URL: https://issues.apache.org/jira/browse/NUTCH-2820
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin
>Affects Versions: 1.17
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.20
>
>
> The sample files used by unit tests of the any23 plugin include content not 
> applicable to the Apache license. These should be removed or stripped down to a 
> minimal snippet (mostly HTML markup).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-2888) Selenium Protocol: Support for Selenium 4

2023-09-30 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2888.

Resolution: Duplicate

Thanks, [~mmkivist]! This issue was resolved by NUTCH-2980 and will be included 
in the 1.20 release of Nutch.

> Selenium Protocol: Support for Selenium 4
> -
>
> Key: NUTCH-2888
> URL: https://issues.apache.org/jira/browse/NUTCH-2888
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.18
>Reporter: Mikko Kivistoe
>Priority: Minor
> Fix For: 1.20
>
>
> Hi,
> Selenium 4 is out and its Grid version now supports HTTPS traffic between 
> the Hub and Nodes. The Selenium 4 API has changed, and it would be good to 
> have Nutch compatible with it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-2888) Selenium Protocol: Support for Selenium 4

2023-09-30 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2888:
---
Affects Version/s: 1.18

> Selenium Protocol: Support for Selenium 4
> -
>
> Key: NUTCH-2888
> URL: https://issues.apache.org/jira/browse/NUTCH-2888
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.18
>Reporter: Mikko Kivistoe
>Priority: Minor
> Fix For: 1.20
>
>
> Hi,
> Selenium 4 is out and its Grid version now supports HTTPS traffic between 
> the Hub and Nodes. The Selenium 4 API has changed, and it would be good to 
> have Nutch compatible with it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-2888) Selenium Protocol: Support for Selenium 4

2023-09-30 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2888:
---
Fix Version/s: 1.20

> Selenium Protocol: Support for Selenium 4
> -
>
> Key: NUTCH-2888
> URL: https://issues.apache.org/jira/browse/NUTCH-2888
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Reporter: Mikko Kivistoe
>Priority: Minor
> Fix For: 1.20
>
>
> Hi,
> Selenium 4 is out and its Grid version now supports HTTPS traffic between 
> the Hub and Nodes. The Selenium 4 API has changed, and it would be good to 
> have Nutch compatible with it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-3007) Fix impossible casts

2023-09-30 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3007.

Resolution: Fixed

Thanks for the review, [~markus17]!

> Fix impossible casts
> 
>
> Key: NUTCH-3007
> URL: https://issues.apache.org/jira/browse/NUTCH-3007
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Spotbugs reports two occurrences of
>   Impossible cast from java.util.ArrayList to String[] in 
> org.apache.nutch.fetcher.Fetcher.run(Map, String)
> Both were introduced later into the {{run(Map args, String 
> crawlId)}} method and obviously never used (would throw a 
> ClassCastException). The code blocks should be removed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-2852) Method invokes System.exit(...) 9 bugs

2023-09-30 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2852.

Resolution: Fixed

> Method invokes System.exit(...) 9 bugs
> --
>
> Key: NUTCH-2852
> URL: https://issues.apache.org/jira/browse/NUTCH-2852
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.18
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> org.apache.nutch.indexer.IndexingFiltersChecker since first historized release
> In class org.apache.nutch.indexer.IndexingFiltersChecker
> In method org.apache.nutch.indexer.IndexingFiltersChecker.run(String[])
> At IndexingFiltersChecker.java:[line 96]
> Another occurrence at IndexingFiltersChecker.java:[line 129]
> org.apache.nutch.indexer.IndexingFiltersChecker.run(String[]) invokes 
> System.exit(...), which shuts down the entire virtual machine
> Invoking System.exit shuts down the entire Java virtual machine. This should 
> only be done when it is appropriate. Such calls make it hard or impossible 
> for your code to be invoked by other code. Consider throwing a 
> RuntimeException instead.
> Also occurs in
>org.apache.nutch.net.URLFilterChecker since first historized release
>org.apache.nutch.net.URLNormalizerChecker since first historized release
>org.apache.nutch.parse.ParseSegment since first historized release
>org.apache.nutch.parse.ParserChecker since first historized release
>org.apache.nutch.service.NutchServer since first historized release
>org.apache.nutch.tools.CommonCrawlDataDumper since first historized release
>org.apache.nutch.tools.DmozParser since first historized release
>org.apache.nutch.util.AbstractChecker since first historized release 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3010) Injector: count unique number of injected URLs

2023-09-30 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3010:
---
Description: 
Injector uses two counters: one for the total number of injected URLs, the 
other for the number of URLs "merged", that is already in CrawlDb. There is no 
counter for the number of unique URLs injected, which may lead to wrong counts 
if the seed files contain duplicates:

Suppose the following seed file which contains a duplicated URL:
{noformat}
$> cat seeds_with_duplicates.txt 
https://www.example.org/page1.html
https://www.example.org/page2.html
https://www.example.org/page2.html

$> $NUTCH_HOME/bin/nutch inject /tmp/crawldb seeds_with_duplicates.txt
...
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls 
rejected by filters: 0
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls 
injected after normalization and filtering: 3
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls 
injected but already in CrawlDb: 0
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total new urls 
injected: 3
...
{noformat}
However, because of the duplicated URL, only 2 URLs were injected into the 
CrawlDb:
{noformat}
$> $NUTCH_HOME/bin/nutch readdb /tmp/crawldb -stats
...
2023-09-30 07:39:43,945 INFO o.a.n.c.CrawlDbReader [main] TOTAL urls:   2
...
{noformat}
If the Injector job is run again with the same input, we get the erroneous 
output, that still one "new URL" was injected:
{noformat}
2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls 
rejected by filters: 0
2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls 
injected after normalization and filtering: 3
2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total urls 
injected but already in CrawlDb: 2
2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total new urls 
injected: 1
{noformat}
This is because the urls_merged counter counts unique items, while url_injected 
does not, and the shown number is the difference between both counters.

Adding a counter for the number of unique injected URLs will make it possible to 
get the correct count of newly injected URLs.

  was:
Injector uses two counters: one for the total number of injected URLs, the 
other for the number of URLs "merged", that is already in CrawlDb. There is no 
counter for the number of unique URLs injected, which may lead to wrong counts 
if the seed files contain duplicates:

Suppose the following seed file which contains a duplicated URL:

{noformat}
$> cat seeds_with_duplicates.txt 
https://www.example.org/page1.html
https://www.example.org/page2.html
https://www.example.org/page2.html

$> $NUTCH_HOME/bin/nutch inject /tmp/crawldb seeds_with_duplicates.txt
...
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls 
rejected by filters: 0
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls 
injected after normalization and filtering: 3
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls 
injected but already in CrawlDb: 0
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total new urls 
injected: 3
...
{noformat}

However, because of the duplicated URL, only 2 URLs were injected into the 
CrawlDb:

{noformat}
$> $NUTCH_HOME/bin/nutch readdb /tmp/crawldb -stats
...
2023-09-30 07:39:43,945 INFO o.a.n.c.CrawlDbReader [main] TOTAL urls:   2
...
{noformat}

If the Injector job is run again with the same input, we get the erroneous 
output, that still one "new URL" was injected:

{noformat}
2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls 
rejected by filters: 0
2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls 
injected after normalization and filtering: 3
2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total urls 
injected but already in CrawlDb: 2
2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total new urls 
injected: 1
{noformat}

This is because the urls_merged counter counts unique items, while url_injected 
does not.

Adding a counter for the number of unique injected URLs will make it possible to 
get the correct count of newly injected URLs.


> Injector: count unique number of injected URLs
> --
>
> Key: NUTCH-3010
> URL: https://issues.apache.org/jira/browse/NUTCH-3010
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Injector uses two counters: one for the total number of injected URLs, the 
> other for the number of URLs "merged", that is already in CrawlDb. There is 
> no counter for the number of 

[jira] [Created] (NUTCH-3010) Injector: count unique number of injected URLs

2023-09-30 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3010:
--

 Summary: Injector: count unique number of injected URLs
 Key: NUTCH-3010
 URL: https://issues.apache.org/jira/browse/NUTCH-3010
 Project: Nutch
  Issue Type: Improvement
  Components: injector
Affects Versions: 1.19
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.20


Injector uses two counters: one for the total number of injected URLs, the 
other for the number of URLs "merged", that is already in CrawlDb. There is no 
counter for the number of unique URLs injected, which may lead to wrong counts 
if the seed files contain duplicates:

Suppose the following seed file which contains a duplicated URL:

{noformat}
$> cat seeds_with_duplicates.txt 
https://www.example.org/page1.html
https://www.example.org/page2.html
https://www.example.org/page2.html

$> $NUTCH_HOME/bin/nutch inject /tmp/crawldb seeds_with_duplicates.txt
...
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls 
rejected by filters: 0
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls 
injected after normalization and filtering: 3
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls 
injected but already in CrawlDb: 0
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total new urls 
injected: 3
...
{noformat}

However, because of the duplicated URL, only 2 URLs were injected into the 
CrawlDb:

{noformat}
$> $NUTCH_HOME/bin/nutch readdb /tmp/crawldb -stats
...
2023-09-30 07:39:43,945 INFO o.a.n.c.CrawlDbReader [main] TOTAL urls:   2
...
{noformat}

If the Injector job is run again with the same input, we get the erroneous 
output, that still one "new URL" was injected:

{noformat}
2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls 
rejected by filters: 0
2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls 
injected after normalization and filtering: 3
2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total urls 
injected but already in CrawlDb: 2
2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total new urls 
injected: 1
{noformat}

This is because the urls_merged counter counts unique items, while url_injected 
does not.

Adding a counter for the number of unique injected URLs will make it possible to 
get the correct count of newly injected URLs.
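
For illustration, a hedged sketch (assuming a plain MapReduce reducer, not the actual Injector implementation) of how a counter incremented once per reducer key yields the number of unique injected URLs; the counter group and names are made up for the example.

{code:java}
// Hedged sketch, not the actual Injector code: a counter incremented once per
// reducer key counts unique injected URLs, independent of duplicates in the
// seed files. The counter group and names are made up for the illustration.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class UniqueUrlCounterSketch extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void reduce(Text url, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // one key == one unique URL, however often it appeared in the seed files
    context.getCounter("injector", "urls_injected_unique").increment(1);
    for (Text value : values) {
      // counts every occurrence, i.e. duplicates are counted multiple times
      context.getCounter("injector", "urls_injected_total").increment(1);
    }
    context.write(url, new Text("injected"));
  }
}
{code}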



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3006) Downgrade Tika dependency to 2.2.1 (core and parse-tika)

2023-09-29 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770320#comment-17770320
 ] 

Sebastian Nagel commented on NUTCH-3006:


> revert CloseShieldInputStream.wrap(), which I think was the only conflict

Yes, looks like it was the only conflict. If it's an option to revert this, 
yes, why not.

The idea of the downgrade was more to avoid this issue blocking any release. 
And downgrading from 2.3.0 (current master) to 2.2.1 sounds less dramatic.

> how far out Hadoop 3.4.0 is

Even if it's released, it takes some time (a couple of months) until Hadoop 
distributions (for example, Apache Bigtop) pick up the release and/or users 
deploy it.

> Downgrade Tika dependency to 2.2.1 (core and parse-tika)
> 
>
> Key: NUTCH-3006
> URL: https://issues.apache.org/jira/browse/NUTCH-3006
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Tika 2.3.0 and upwards depend on commons-io 2.11.0 (or even higher), which 
> is not available when Nutch is used on Hadoop. Only Hadoop 3.4.0 is expected 
> to ship with commons-io 2.11.0 (HADOOP-18301), all currently released 
> versions provide commons-io 2.8.0. Because Hadoop-required dependencies are 
> enforced in (pseudo)distributed mode, using Tika may cause issues, see 
> NUTCH-2937 and NUTCH-2959.
> [~lewismc] suggested in the discussion of [GitHub PR 
> #776|https://github.com/apache/nutch/pull/776] to downgrade to Tika 2.2.1 to 
> resolve these issues for now and until Hadoop 3.4.0 becomes available.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2979) Upgrade Commons Text to 1.10.0

2023-09-28 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770041#comment-17770041
 ] 

Sebastian Nagel commented on NUTCH-2979:


Note: upgrading to Hadoop 3.3.6 (NUTCH-3009) will update the core dependency to 
commons-text 1.10.0

> Upgrade Commons Text to 1.10.0
> --
>
> Key: NUTCH-2979
> URL: https://issues.apache.org/jira/browse/NUTCH-2979
> Project: Nutch
>  Issue Type: Bug
>  Components: build, plugin
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Priority: Major
>  Labels: help-wanted
> Fix For: 1.20
>
>
> In order to address 
> [CVE-2022-42889|https://nvd.nist.gov/vuln/detail/CVE-2022-42889] we should 
> upgrade to commons-text 1.10.0:
> - Nutch core depends on 1.4 which is not affected by the CVE
> - the plugins lib-htmlunit and any23 depend on a vulnerable commons-text 
> version (1.5 - 1.9)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3009) Upgrade to Hadoop 3.3.6

2023-09-28 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3009:
--

 Summary: Upgrade to Hadoop 3.3.6
 Key: NUTCH-3009
 URL: https://issues.apache.org/jira/browse/NUTCH-3009
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.19
Reporter: Sebastian Nagel
 Fix For: 1.20


Upgrade to [Hadoop 3.3.6|https://hadoop.apache.org/release/3.3.6.html], the 
latest available release of Hadoop (release date: 2023-06-23).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-2979) Upgrade Commons Text to 1.10.0

2023-09-28 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2979.

Resolution: Fixed

Resolved, so far, without any direct action:
- Nutch core still depends on 1.4 which is not affected by the CVE
- the plugin any23 was removed (NUTCH-2998)
- the plugin lib-htmlunit now depends on commons-text 1.10.0 after the Selenium 
dependency was upgraded by NUTCH-2980


> Upgrade Commons Text to 1.10.0
> --
>
> Key: NUTCH-2979
> URL: https://issues.apache.org/jira/browse/NUTCH-2979
> Project: Nutch
>  Issue Type: Bug
>  Components: build, plugin
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Priority: Major
>  Labels: help-wanted
> Fix For: 1.20
>
>
> In order to address 
> [CVE-2022-42889|https://nvd.nist.gov/vuln/detail/CVE-2022-42889] we should 
> upgrade to commons-text 1.10.0:
> - Nutch core depends on 1.4 which is not affected by the CVE
> - the plugins lib-htmlunit and any23 depend on a vulnerable commons-text 
> version (1.5 - 1.9)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3008) indexer-elastic: downgrade to ES 7.10.2 to address licensing issues

2023-09-28 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3008:
--

 Summary: indexer-elastic: downgrade to ES 7.10.2 to address 
licensing issues
 Key: NUTCH-3008
 URL: https://issues.apache.org/jira/browse/NUTCH-3008
 Project: Nutch
  Issue Type: Bug
  Components: indexer, plugin
Affects Versions: 1.19
Reporter: Sebastian Nagel
 Fix For: 1.20


Downgrade to ES 7.10.2 (licensed under the Apache License 2.0) as an alternative 
solution to address the licensing issues of the indexer-elastic plugin.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3007) Fix impossible casts

2023-09-28 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3007:
--

 Summary: Fix impossible casts
 Key: NUTCH-3007
 URL: https://issues.apache.org/jira/browse/NUTCH-3007
 Project: Nutch
  Issue Type: Sub-task
Affects Versions: 1.19
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.20


Spotbugs reports two occurrences of

  Impossible cast from java.util.ArrayList to String[] in 
org.apache.nutch.fetcher.Fetcher.run(Map, String)

Both were introduced later into the {{run(Map args, String 
crawlId)}} method and obviously never used (would throw a ClassCastException). 
The code blocks should be removed.
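
For illustration, a small generic example (not the actual Fetcher code) of why such a cast fails at runtime and the usual replacement via List.toArray():

{code:java}
// Generic illustration, not the Fetcher code: a value whose runtime type is
// ArrayList can never be cast to String[]; converting via toArray() works.
import java.util.ArrayList;
import java.util.List;

public class ImpossibleCastExample {

  public static void main(String[] args) {
    List<String> list = new ArrayList<>();
    list.add("https://example.org/");

    // String[] urls = (String[]) (Object) list; // compiles, but throws
    //                                           // ClassCastException at runtime

    String[] urls = list.toArray(new String[0]); // the correct conversion
    System.out.println(urls.length);
  }
}
{code}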



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2852) Method invokes System.exit(...) 9 bugs

2023-09-28 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769977#comment-17769977
 ] 

Sebastian Nagel commented on NUTCH-2852:


The PR addresses all corresponding issues in the checker tools. That's 
everything that can be done without investing too much time: DmozParser and 
CommonCrawlDataDumper would need a closer look, and for NutchServer I don't know 
how to stop it gracefully. 
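
For illustration, a hedged sketch of the general pattern behind these fixes: return an exit code from run() and call System.exit only in main(), so the tool remains usable when invoked from other code. The class below is illustrative, not one of the actual checker tools.

{code:java}
// Hedged sketch of the general pattern behind these fixes, not one of the
// actual checker tools: run() returns an exit code and only main() calls
// System.exit, so the class stays usable when invoked from other code.
public class CheckerToolSketch {

  public int run(String[] args) {
    if (args.length == 0) {
      System.err.println("Usage: CheckerToolSketch <url>...");
      return -1; // previously: System.exit(-1)
    }
    // ... perform the actual check ...
    return 0;
  }

  public static void main(String[] args) {
    int res = new CheckerToolSketch().run(args);
    System.exit(res); // exiting the JVM is fine at the outermost entry point
  }
}
{code}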

> Method invokes System.exit(...) 9 bugs
> --
>
> Key: NUTCH-2852
> URL: https://issues.apache.org/jira/browse/NUTCH-2852
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.18
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> org.apache.nutch.indexer.IndexingFiltersChecker since first historized release
> In class org.apache.nutch.indexer.IndexingFiltersChecker
> In method org.apache.nutch.indexer.IndexingFiltersChecker.run(String[])
> At IndexingFiltersChecker.java:[line 96]
> Another occurrence at IndexingFiltersChecker.java:[line 129]
> org.apache.nutch.indexer.IndexingFiltersChecker.run(String[]) invokes 
> System.exit(...), which shuts down the entire virtual machine
> Invoking System.exit shuts down the entire Java virtual machine. This should 
> only be done when it is appropriate. Such calls make it hard or impossible 
> for your code to be invoked by other code. Consider throwing a 
> RuntimeException instead.
> Also occurs in
>org.apache.nutch.net.URLFilterChecker since first historized release
>org.apache.nutch.net.URLNormalizerChecker since first historized release
>org.apache.nutch.parse.ParseSegment since first historized release
>org.apache.nutch.parse.ParserChecker since first historized release
>org.apache.nutch.service.NutchServer since first historized release
>org.apache.nutch.tools.CommonCrawlDataDumper since first historized release
>org.apache.nutch.tools.DmozParser since first historized release
>org.apache.nutch.util.AbstractChecker since first historized release 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3006) Downgrade Tika dependency to 2.2.1 (core and parse-tika)

2023-09-26 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3006:
--

 Summary: Downgrade Tika dependency to 2.2.1 (core and parse-tika)
 Key: NUTCH-3006
 URL: https://issues.apache.org/jira/browse/NUTCH-3006
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.20
Reporter: Sebastian Nagel
 Fix For: 1.20


Tika 2.3.0 and upwards depend on commons-io 2.11.0 (or even higher), which is 
not available when Nutch is used on Hadoop. Only Hadoop 3.4.0 is expected to 
ship with commons-io 2.11.0 (HADOOP-18301), all currently released versions 
provide commons-io 2.8.0. Because Hadoop-required dependencies are enforced in 
(pseudo)distributed mode, using Tika may cause issues, see NUTCH-2937 and 
NUTCH-2959.

[~lewismc] suggested in the discussion of [GitHub PR 
#776|https://github.com/apache/nutch/pull/776] to downgrade to Tika 2.2.1 to 
resolve these issues for now and until Hadoop 3.4.0 becomes available.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3004) Avoid NPE in HttpResponse

2023-09-26 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3004:
---
Fix Version/s: 1.20

> Avoid NPE in HttpResponse
> -
>
> Key: NUTCH-3004
> URL: https://issues.apache.org/jira/browse/NUTCH-3004
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 1.20
>
>
> I recently deployed Nutch on a FIPS-enabled RHEL 8 instance, and I got an NPE 
> in HttpResponse.  When I set the log level to debug, I could see what was 
> happening, but it would have been better to get a meaningful exception rather 
> than an NPE.
> The issue is that in the catch clause, the exception is propagated only if 
> the message is "handshake alert..." and then the reconnect fails.  If the 
> message is not that, then the ssl socket remains null, and we get an NPE 
> below the source I quote here.
> I think we should throw the same HTTPException that we do throw in the nested 
> try if the message is not "handshake alert..."
> {code:java}
> try {
>   sslsocket = getSSLSocket(socket, sockHost, sockPort);
>   sslsocket.startHandshake();
> } catch (Exception e) {
>   Http.LOG.debug("SSL connection to {} failed with: {}", url,
>   e.getMessage());
>   if ("handshake alert:  unrecognized_name".equals(e.getMessage())) {
> try {
>   // Reconnect, see NUTCH-2447
>   socket = new Socket();
>   socket.setSoTimeout(http.getTimeout());
>   socket.connect(sockAddr, http.getTimeout());
>   sslsocket = getSSLSocket(socket, "", sockPort);
>   sslsocket.startHandshake();
> } catch (Exception ex) {
>   String msg = "SSL reconnect to " + url + " failed with: "
>   + e.getMessage();
>   throw new HttpException(msg);
> }
>   }
> }
> socket = sslsocket;
>   }
> {code}
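
Based only on the snippet quoted above, a hedged sketch of the suggested control flow: any handshake failure other than the "unrecognized_name" case is propagated as an HttpException instead of being swallowed, so the SSL socket can no longer silently stay null. The HttpException type and the reconnect step stand in for the quoted code and are not verified against the current source.

{code:java}
// Hedged sketch of the suggested control flow, not the committed fix: any
// handshake failure other than the "unrecognized_name" case is propagated,
// so the SSL socket can no longer silently stay null. HttpException and the
// reconnect step stand in for the snippet quoted above.
import javax.net.ssl.SSLSocket;

public class SslHandshakeSketch {

  static class HttpException extends Exception {
    HttpException(String msg) { super(msg); }
  }

  static SSLSocket handshakeOrThrow(SSLSocket sslsocket, String url)
      throws HttpException {
    try {
      sslsocket.startHandshake();
    } catch (Exception e) {
      if ("handshake alert:  unrecognized_name".equals(e.getMessage())) {
        // reconnect without SNI as in the quoted snippet (NUTCH-2447) and
        // throw HttpException if that also fails
      } else {
        throw new HttpException(
            "SSL connection to " + url + " failed with: " + e.getMessage());
      }
    }
    return sslsocket;
  }
}
{code}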



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3004) Avoid NPE in HttpResponse

2023-09-26 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3004:
---
Component/s: plugin
 protocol

> Avoid NPE in HttpResponse
> -
>
> Key: NUTCH-3004
> URL: https://issues.apache.org/jira/browse/NUTCH-3004
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, protocol
>Affects Versions: 1.19
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 1.20
>
>
> I recently deployed Nutch on a FIPS-enabled RHEL 8 instance, and I got an NPE 
> in HttpResponse.  When I set the log level to debug, I could see what was 
> happening, but it would have been better to get a meaningful exception rather 
> than an NPE.
> The issue is that in the catch clause, the exception is propagated only if 
> the message is "handshake alert..." and then the reconnect fails.  If the 
> message is not that, then the ssl socket remains null, and we get an NPE 
> below the source I quote here.
> I think we should throw the same HTTPException that we do throw in the nested 
> try if the message is not "handshake alert..."
> {code:java}
> try {
>   sslsocket = getSSLSocket(socket, sockHost, sockPort);
>   sslsocket.startHandshake();
> } catch (Exception e) {
>   Http.LOG.debug("SSL connection to {} failed with: {}", url,
>   e.getMessage());
>   if ("handshake alert:  unrecognized_name".equals(e.getMessage())) {
> try {
>   // Reconnect, see NUTCH-2447
>   socket = new Socket();
>   socket.setSoTimeout(http.getTimeout());
>   socket.connect(sockAddr, http.getTimeout());
>   sslsocket = getSSLSocket(socket, "", sockPort);
>   sslsocket.startHandshake();
> } catch (Exception ex) {
>   String msg = "SSL reconnect to " + url + " failed with: "
>   + e.getMessage();
>   throw new HttpException(msg);
> }
>   }
> }
> socket = sslsocket;
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3004) Avoid NPE in HttpResponse

2023-09-26 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3004:
---
Affects Version/s: 1.19

> Avoid NPE in HttpResponse
> -
>
> Key: NUTCH-3004
> URL: https://issues.apache.org/jira/browse/NUTCH-3004
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Tim Allison
>Priority: Trivial
>
> I recently deployed Nutch on a FIPS-enabled RHEL 8 instance, and I got an NPE 
> in HttpResponse.  When I set the log level to debug, I could see what was 
> happening, but it would have been better to get a meaningful exception rather 
> than an NPE.
> The issue is that in the catch clause, the exception is propagated only if 
> the message is "handshake alert..." and then the reconnect fails.  If the 
> message is not that, then the ssl socket remains null, and we get an NPE 
> below the source I quote here.
> I think we should throw the same HTTPException that we do throw in the nested 
> try if the message is not "handshake alert..."
> {code:java}
> try {
>   sslsocket = getSSLSocket(socket, sockHost, sockPort);
>   sslsocket.startHandshake();
> } catch (Exception e) {
>   Http.LOG.debug("SSL connection to {} failed with: {}", url,
>   e.getMessage());
>   if ("handshake alert:  unrecognized_name".equals(e.getMessage())) {
> try {
>   // Reconnect, see NUTCH-2447
>   socket = new Socket();
>   socket.setSoTimeout(http.getTimeout());
>   socket.connect(sockAddr, http.getTimeout());
>   sslsocket = getSSLSocket(socket, "", sockPort);
>   sslsocket.startHandshake();
> } catch (Exception ex) {
>   String msg = "SSL reconnect to " + url + " failed with: "
>   + e.getMessage();
>   throw new HttpException(msg);
> }
>   }
> }
> socket = sslsocket;
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2023-09-23 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-585:
--
Priority: Major  (was: Minor)

> [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
> ---
>
> Key: NUTCH-585
> URL: https://issues.apache.org/jira/browse/NUTCH-585
> Project: Nutch
>  Issue Type: Improvement
>  Components: HTML, parse-filter, parser, plugin
>Affects Versions: 0.9.0
> Environment: All operating systems
>Reporter: Andrea Spinelli
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
> Attachments: blacklist_whitelist_plugin.patch, 
> nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch
>
>
> We are using nutch to index our own web sites; we would like not to index 
> certain parts of our pages, because we know they are not relevant (for 
> instance, there are several links to change the background color) and 
> generate spurious matches.
> We have modified the plugin so that it ignores HTML code between certain HTML 
> comments, like
> 
> ... ignored part ...
> 
> We feel this might be useful to someone else, maybe factorizing the comment 
> strings as constants in the configuration files (say parser.html.ignore.start 
> and parser.html.ignore.stop in nutch-site.xml).
> We are almost ready to contribute our code snippet. Looking forward to any 
> expression of interest - or to an explanation why what we are doing is 
> plain wrong!
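
For illustration, a hedged sketch of the general idea from the description: drop everything between two configurable comment markers before parsing/indexing. The property names parser.html.ignore.start/stop come from the description above; the marker values and the regex-based implementation are assumptions for the sketch.

{code:java}
// Hedged sketch of the idea described above: drop everything between two
// configurable comment markers before parsing/indexing. The property names
// parser.html.ignore.start/stop come from the description; the marker values
// and the regex-based implementation are assumptions for the illustration.
import java.util.regex.Pattern;

public class IgnoreMarkedSectionsSketch {

  public static String strip(String html, String startMarker, String stopMarker) {
    Pattern p = Pattern.compile(
        Pattern.quote(startMarker) + ".*?" + Pattern.quote(stopMarker),
        Pattern.DOTALL);
    return p.matcher(html).replaceAll(""); // remove all marked sections
  }

  public static void main(String[] args) {
    String html = "<p>keep</p><!--ignore-start--><p>drop</p><!--ignore-stop--><p>keep</p>";
    // values that would come from parser.html.ignore.start / parser.html.ignore.stop
    System.out.println(strip(html, "<!--ignore-start-->", "<!--ignore-stop-->"));
  }
}
{code}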



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2023-09-23 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-585:
--
Component/s: parse-filter
 HTML
 parser
 plugin

> [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
> ---
>
> Key: NUTCH-585
> URL: https://issues.apache.org/jira/browse/NUTCH-585
> Project: Nutch
>  Issue Type: Improvement
>  Components: HTML, parse-filter, parser, plugin
>Affects Versions: 0.9.0
> Environment: All operating systems
>Reporter: Andrea Spinelli
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.20
>
> Attachments: blacklist_whitelist_plugin.patch, 
> nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch
>
>
> We are using nutch to index our own web sites; we would like not to index 
> certain parts of our pages, because we know they are not relevant (for 
> instance, there are several links to change the background color) and 
> generate spurious matches.
> We have modified the plugin so that it ignores HTML code between certain HTML 
> comments, like
> 
> ... ignored part ...
> 
> We feel this might be useful to someone else, maybe factorizing the comment 
> strings as constants in the configuration files (say parser.html.ignore.start 
> and parser.html.ignore.stop in nutch-site.xml).
> We are almost ready to contribute our code snippet. Looking forward to any 
> expression of interest - or to an explanation why what we are doing is 
> plain wrong!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2023-09-23 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-585:
-

Assignee: Sebastian Nagel  (was: Markus Jelsma)

> [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
> ---
>
> Key: NUTCH-585
> URL: https://issues.apache.org/jira/browse/NUTCH-585
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 0.9.0
> Environment: All operating systems
>Reporter: Andrea Spinelli
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.20
>
> Attachments: blacklist_whitelist_plugin.patch, 
> nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch
>
>
> We are using nutch to index our own web sites; we would like not to index 
> certain parts of our pages, because we know they are not relevant (for 
> instance, there are several links to change the background color) and 
> generate spurious matches.
> We have modified the plugin so that it ignores HTML code between certain HTML 
> comments, like
> 
> ... ignored part ...
> 
> We feel this might be useful to someone else, maybe factorizing the comment 
> strings as constants in the configuration files (say parser.html.ignore.start 
> and parser.html.ignore.stop in nutch-site.xml).
> We are almost ready to contribute our code snippet. Looking forward to any 
> expression of interest - or to an explanation of why what we are doing is 
> plain wrong!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2023-09-23 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-585:
--
Fix Version/s: 1.20

> [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
> ---
>
> Key: NUTCH-585
> URL: https://issues.apache.org/jira/browse/NUTCH-585
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 0.9.0
> Environment: All operating systems
>Reporter: Andrea Spinelli
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: blacklist_whitelist_plugin.patch, 
> nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch
>
>
> We are using nutch to index our own web sites; we would like not to index 
> certain parts of our pages, because we know they are not relevant (for 
> instance, there are several links to change the background color) and 
> generate spurious matches.
> We have modified the plugin so that it ignores HTML code between certain HTML 
> comments, like
> 
> ... ignored part ...
> 
> We feel this might be useful to someone else, maybe factorizing the comment 
> strings as constants in the configuration files (say parser.html.ignore.start 
> and parser.html.ignore.stop in nutch-site.xml).
> We are almost ready to contribute our code snippet. Looking forward to any 
> expression of interest - or to an explanation of why what we are doing is 
> plain wrong!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3002) Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive

2023-09-18 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3002:
--

 Summary: Protocol-okhttp HttpResponse: HTTP header metadata lookup 
should be case-insensitive
 Key: NUTCH-3002
 URL: https://issues.apache.org/jira/browse/NUTCH-3002
 Project: Nutch
  Issue Type: Bug
  Components: metadata, plugin, protocol
Affects Versions: 1.19
Reporter: Sebastian Nagel
 Fix For: 1.20


Lookup of HTTP headers in the class HttpResponse should be case-insensitive - 
for example, any "Location" header should be returned independent of the 
casing used by the sender.
While protocol-http uses the class SpellCheckedMetadata, which provides 
case-insensitive lookups (as part of its spell-checking functionality), 
protocol-okhttp relies on the class Metadata, which stores metadata values 
case-sensitively.

It's a good question whether we still need to spell-check HTTP headers. 
However, case-insensitive look-ups are definitely required, especially since 
HTTP header names are case-insensitive in HTTP/2.
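
As an illustration of what case-insensitive lookup means here (a sketch, not 
the actual change to protocol-okhttp): backing the header map with a 
case-insensitive comparator makes "Location", "location" and "LOCATION" 
resolve to the same entry.

{noformat}
import java.util.Map;
import java.util.TreeMap;

public class CaseInsensitiveHeaders {

  // Header names compare case-insensitively, as HTTP requires (and HTTP/2
  // sends them lower-cased on the wire).
  private final Map<String, String> headers =
      new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

  public void add(String name, String value) {
    headers.put(name, value);
  }

  public String get(String name) {
    return headers.get(name);
  }

  public static void main(String[] args) {
    CaseInsensitiveHeaders h = new CaseInsensitiveHeaders();
    h.add("location", "https://example.com/");
    System.out.println(h.get("Location")); // prints https://example.com/
  }
}
{noformat}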



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3000) protocol-selenium returns only the body, strips off the <head> element

2023-09-13 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764692#comment-17764692
 ] 

Sebastian Nagel commented on NUTCH-3000:


+1 Yes, the full HTML seems the best choice for the default.

> protocol-selenium returns only the body, strips off the <head> element
> --
>
> Key: NUTCH-3000
> URL: https://issues.apache.org/jira/browse/NUTCH-3000
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Reporter: Tim Allison
>Priority: Major
>
> The selenium protocol returns only the body portion of the HTML, which means 
> that neither the title nor the other page metadata in the <head> section 
> gets extracted.
> {noformat}
> String innerHtml = driver.findElement(By.tagName("body"))
> .getAttribute("innerHTML");
> {noformat}
> We should return the full html, no?
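
For comparison, two ways to obtain the full document instead of only the 
body's innerHTML - a sketch of the alternatives, not necessarily the committed 
fix:

{noformat}
// Option 1: the serialized page source as seen by the browser
String fullHtml = driver.getPageSource();

// Option 2: the outerHTML of the root element, which keeps <head> and <body>
String fullHtml2 = driver.findElement(By.tagName("html"))
    .getAttribute("outerHTML");
{noformat}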



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (NUTCH-2998) Remove the Any23 plugin

2023-09-13 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764672#comment-17764672
 ] 

Sebastian Nagel edited comment on NUTCH-2998 at 9/13/23 1:26 PM:
-

+1

> Are there other lists/communications channels I should pursue with this?

A short notice to user@/dev@ might be good.


was (Author: wastl-nagel):
+1

> Remove the Any23 plugin
> ---
>
> Key: NUTCH-2998
> URL: https://issues.apache.org/jira/browse/NUTCH-2998
> Project: Nutch
>  Issue Type: Task
>  Components: any23
>Reporter: Tim Allison
>Priority: Major
>
> I'm not sure how we want to handle this.  Any23 moved to the Attic in June 
> 2023.  We should probably remove it from Nutch?  I'm not sure how abruptly we 
> want to do that.
> We could deprecate it for 1.20 and then remove it in 1.21 or later?  Or we 
> could choose to remove it for 1.20.
> What do you think?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2998) Remove the Any23 plugin

2023-09-13 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764672#comment-17764672
 ] 

Sebastian Nagel commented on NUTCH-2998:


+1

> Remove the Any23 plugin
> ---
>
> Key: NUTCH-2998
> URL: https://issues.apache.org/jira/browse/NUTCH-2998
> Project: Nutch
>  Issue Type: Task
>  Components: any23
>Reporter: Tim Allison
>Priority: Major
>
> I'm not sure how we want to handle this.  Any23 moved to the Attic in June 
> 2023.  We should probably remove it from Nutch?  I'm not sure how abruptly we 
> want to do that.
> We could deprecate it for 1.20 and then remove it in 1.21 or later?  Or we 
> could choose to remove it for 1.20.
> What do you think?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-2997) Add Override annotations where applicable

2023-08-22 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2997.

Resolution: Implemented

> Add Override annotations where applicable
> -
>
> Key: NUTCH-2997
> URL: https://issues.apache.org/jira/browse/NUTCH-2997
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.20
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (NUTCH-2997) Add Override annotations where applicable

2023-08-22 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2997:
--

Assignee: Sebastian Nagel

> Add Override annotations where applicable
> -
>
> Key: NUTCH-2997
> URL: https://issues.apache.org/jira/browse/NUTCH-2997
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.20
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-2996) Use new SimpleRobotRulesParser API entry point (crawler-commons 1.4)

2023-08-22 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2996.

Resolution: Implemented

> Use new SimpleRobotRulesParser API entry point (crawler-commons 1.4)
> 
>
> Key: NUTCH-2996
> URL: https://issues.apache.org/jira/browse/NUTCH-2996
> Project: Nutch
>  Issue Type: Improvement
>  Components: robots
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Crawler-commons 1.4 (#1085) robots.txt parser (SimpleRobotRulesParser) 
> introduces a new [API entry point to parse the robots.txt 
> content|https://crawler-commons.github.io/crawler-commons/1.4/crawlercommons/robots/SimpleRobotRulesParser.html#parseContent(java.lang.String,byte%5B%5D,java.lang.String,java.util.Collection)]:
> - it is more efficient: it accepts a collection of lower-cased, single-word 
> user-agent product tokens, so there is no need to tokenize a (comma-separated) 
> list of user-agent strings again for every robots.txt
> - user-agent matching is compliant with [RFC 9309 (section 
> 2.2.1)|https://www.rfc-editor.org/rfc/rfc9309.html#name-the-user-agent-line] 
> only if the new API method is used
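
For illustration, a small usage sketch of the new entry point (parameter order 
taken from the Javadoc linked above; the product token "nutch" is an example 
value):

{noformat}
import java.nio.charset.StandardCharsets;
import java.util.List;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class ParseContentExample {
  public static void main(String[] args) {
    SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
    byte[] content = "User-agent: *\nDisallow: /private/\n"
        .getBytes(StandardCharsets.UTF_8);

    // lower-cased, single-word product tokens, matched per RFC 9309, section 2.2.1
    List<String> agentNames = List.of("nutch");

    BaseRobotRules rules = parser.parseContent(
        "https://www.example.com/robots.txt", content, "text/plain", agentNames);

    System.out.println(rules.isAllowed("https://www.example.com/private/x.html")); // false
  }
}
{noformat}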



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-2995) Upgrade to crawler-commons 1.4

2023-08-22 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2995.

Resolution: Implemented

> Upgrade to crawler-commons 1.4
> --
>
> Key: NUTCH-2995
> URL: https://issues.apache.org/jira/browse/NUTCH-2995
> Project: Nutch
>  Issue Type: Improvement
>  Components: robots
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-08-22 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2993:
---
  Component/s: plugin
   scoring
Affects Version/s: 1.19

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, scoring
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15-1.patch, NUTCH-2993-1.15.patch
>
>
> We do not want some crawls to go deep and broad, but instead want to focus 
> them on a narrow section of sites.
> This patch overrides maxDepth for the outlinks of URLs matching a configured 
> pattern. URLs not matching the pattern get the default max depth value 
> configured.
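
A hedged sketch of the idea (pattern and depth values are illustrative, not 
taken from the patch): outlinks of URLs matching the configured pattern get a 
larger depth budget, all other URLs keep the default.

{noformat}
import java.util.regex.Pattern;

public class DepthOverrideSketch {

  // illustrative values only
  private final Pattern focusPattern =
      Pattern.compile("^https?://www\\.example\\.com/docs/.*");
  private final int defaultMaxDepth = 3;
  private final int overrideMaxDepth = 10;

  /** Maximum depth to apply to the outlinks of the given parent URL. */
  public int maxDepthFor(String fromUrl) {
    return focusPattern.matcher(fromUrl).matches()
        ? overrideMaxDepth : defaultMaxDepth;
  }
}
{noformat}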



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-08-22 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2993.

Resolution: Implemented

Committed/merged. Thanks, [~markus17]!

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, scoring
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15-1.patch, NUTCH-2993-1.15.patch
>
>
> We do not want some crawls to go deep and broad, but instead want to focus 
> them on a narrow section of sites.
> This patch overrides maxDepth for the outlinks of URLs matching a configured 
> pattern. URLs not matching the pattern get the default max depth value 
> configured.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-2997) Add Override annotations where applicable

2023-08-16 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2997:
--

 Summary: Add Override annotations where applicable
 Key: NUTCH-2997
 URL: https://issues.apache.org/jira/browse/NUTCH-2997
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.19
Reporter: Sebastian Nagel
 Fix For: 1.20






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

