[jira] [Resolved] (NUTCH-3043) Generator: count URLs rejected by URL filters
[ https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3043. Resolution: Implemented > Generator: count URLs rejected by URL filters > - > > Key: NUTCH-3043 > URL: https://issues.apache.org/jira/browse/NUTCH-3043 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.21 > > > Generator already counts URLs rejected by the (re)fetch scheduler, by fetch > interval or status. It should also count the number of URLs rejected by URL > filters. > See also [Generator > metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
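The counting described in NUTCH-3043 can be sketched as follows. This is a minimal illustration, not the actual Generator code: the counter name `URLS_REJECTED_BY_FILTERS` and the `filter` helper are hypothetical, and a plain map stands in for Hadoop's `context.getCounter(...).increment(1)`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

public class FilterCounterSketch {
    // stand-in for Hadoop's job counters
    static final Map<String, Long> counters = new HashMap<>();

    static void increment(String counter) {
        counters.merge(counter, 1L, Long::sum);
    }

    // apply a URL filter; count rejections instead of dropping URLs silently
    static String filter(String url, Predicate<String> urlFilter) {
        if (!urlFilter.test(url)) {
            increment("URLS_REJECTED_BY_FILTERS");
            return null;
        }
        return url;
    }

    public static void main(String[] args) {
        Predicate<String> httpsOnly = u -> u.startsWith("https://");
        filter("https://example.com/", httpsOnly);
        filter("ftp://example.com/file", httpsOnly);
        // one rejection recorded for the ftp:// URL
        System.out.println(counters.getOrDefault("URLS_REJECTED_BY_FILTERS", 0L));
    }
}
```

The point of the issue is exactly this pattern: a rejection that previously vanished without a trace now shows up in the job's metrics.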
[jira] [Resolved] (NUTCH-3039) Failure to handle ftp:// URLs
[ https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3039. Resolution: Fixed > Failure to handle ftp:// URLs > - > > Key: NUTCH-3039 > URL: https://issues.apache.org/jira/browse/NUTCH-3039 > Project: Nutch > Issue Type: Bug > Components: plugin, protocol >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.21 > > > Nutch fails to handle ftp:// URLs: > - URLNormalizerBasic returns the empty string because creating the URL > instance fails with a MalformedURLException: > {noformat} > echo "ftp://ftp.example.com/path/file.txt" \ > | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat} > - fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due > to a MalformedURLException: > {noformat} > bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \ >"ftp://ftp.example.com/path/file.txt" > ... > Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: > java.net.MalformedURLException > at > org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113) > ...{noformat} > The issue is caused by NUTCH-2429: > - we do not provide a dedicated URL stream handler for ftp URLs > - but also do not pass ftp:// URLs to the standard JVM handler -- This message was sent by Atlassian Jira (v8.20.10#820010)
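The fix direction named in NUTCH-3039 (pass ftp:// URLs through to the standard JVM handler when no plugin handler exists) can be sketched like this. The lookup logic and names here are hypothetical illustrations, not Nutch's actual ProtocolFactory code; the key JDK behavior is that `new URL(context, spec, handler)` with a null handler falls back to the JVM's built-in stream handlers, which do support ftp://.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLStreamHandler;
import java.util.Map;

public class FtpUrlSketch {
    // plugin-provided handlers; empty here, i.e. no dedicated ftp handler
    static final Map<String, URLStreamHandler> pluginHandlers = Map.of();

    static URL toUrlOrNull(String spec) {
        try {
            String proto = spec.substring(0, Math.max(spec.indexOf(':'), 0));
            // a null handler makes the JVM use its built-in one (e.g. for ftp)
            return new URL(null, spec, pluginHandlers.get(proto));
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        URL u = toUrlOrNull("ftp://ftp.example.com/path/file.txt");
        // resolves via the JDK's built-in ftp handler instead of failing
        System.out.println(u.getProtocol());
    }
}
```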
[jira] [Created] (NUTCH-3055) README: fix Github "hub" commands
Sebastian Nagel created NUTCH-3055: -- Summary: README: fix Github "hub" commands Key: NUTCH-3055 URL: https://issues.apache.org/jira/browse/NUTCH-3055 Project: Nutch Issue Type: Bug Components: documentation Affects Versions: 1.20 Reporter: Sebastian Nagel Assignee: Sebastian Nagel Fix For: 1.21 The [README.md|https://github.com/apache/nutch/blob/master/README.md] contains [Github hub|https://hub.github.com/] commands, but with "git" as the command (executable) name (maybe an alias or some other magic). However, if hub isn't installed, these commands fail with {{git: 'pull-request' is not a git command. See 'git --help'.}} or similar. We should use the command "hub" instead. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842291#comment-17842291 ] Sebastian Nagel commented on NUTCH-3028: +1 lgtm. One question: if there is no parseData, the JEXL expression is not evaluated. Since WARC files may include only the raw HTML plus fetch/capture metadata, successfully parsing a document is not a requirement to archive it in a WARC file. Might be useful to have the JEXL filtering also available for unparsed docs. > WARCExported to support filtering by JEXL > - > > Key: NUTCH-3028 > URL: https://issues.apache.org/jira/browse/NUTCH-3028 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.21 > > Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch > > > Filtering segment data to WARC is now possible using JEXL expressions. In the > next example, all records with SOME_KEY=SOME_VALUE in their parseData > metadata are exported to WARC. > -expr > 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")' -- This message was sent by Atlassian Jira (v8.20.10#820010)
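The reviewer's suggestion (evaluate the filter even for unparsed documents instead of skipping them) can be sketched as follows. This is a hypothetical illustration, not the patch's code: a plain `Predicate` stands in for the JEXL engine, and treating a missing parseData as an empty metadata map is one possible design, not necessarily the one the patch adopts.

```java
import java.util.Collections;
import java.util.Map;
import java.util.function.Predicate;

public class WarcFilterSketch {
    static boolean accept(Map<String, String> parseMeta,
                          Predicate<Map<String, String>> expr) {
        // evaluate against an empty map instead of short-circuiting
        // when the document was never parsed
        Map<String, String> meta = (parseMeta == null)
                ? Collections.emptyMap() : parseMeta;
        return expr.test(meta);
    }

    public static void main(String[] args) {
        Predicate<Map<String, String>> expr =
                m -> "SOME_VALUE".equals(m.get("SOME_KEY"));
        System.out.println(accept(Map.of("SOME_KEY", "SOME_VALUE"), expr));
        // unparsed doc: expression is still evaluated (and here rejects it)
        System.out.println(accept(null, expr));
    }
}
```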
[jira] [Commented] (NUTCH-3045) Upgrade from Java 11 to 17
[ https://issues.apache.org/jira/browse/NUTCH-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842284#comment-17842284 ] Sebastian Nagel commented on NUTCH-3045: See also NUTCH-2987. Until HADOOP-17177 / HADOOP-18887 are done, we might be forced to maintain JDK 11 runtime compatibility, so that Nutch runs on recent Hadoop versions and distributions. I fully agree that Java 17 offers some nice syntax improvements, though. :) > Upgrade from Java 11 to 17 > -- > > Key: NUTCH-3045 > URL: https://issues.apache.org/jira/browse/NUTCH-3045 > Project: Nutch > Issue Type: Task > Components: build, ci/cd >Reporter: Lewis John McGibbney >Priority: Critical > Fix For: 1.21 > > > This parent issue will track and organize work pertaining to upgrading Nutch > to JDK 17. > Premier support for Oracle JDK 11 ended 7 months ago (30 Sep 2023). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails
Sebastian Nagel created NUTCH-3044: -- Summary: Generator: NPE when extracting the host part of a URL fails Key: NUTCH-3044 URL: https://issues.apache.org/jira/browse/NUTCH-3044 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 1.20 Reporter: Sebastian Nagel Fix For: 1.21 When extracting the host part of a URL fails, the Generator job fails because of an NPE in the SelectorReducer. This issue is reproducible if the CrawlDb contains a malformed URL, for example, a URL with an unsupported scheme (smb://). {noformat} Caused by: java.lang.NullPointerException at org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:439) at org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:300) {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
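The failure mode behind NUTCH-3044 and the obvious defensive fix can be sketched as follows. The helper name is hypothetical and this is not the actual SelectorReducer code; it only shows why an unsupported scheme like smb:// yields no host, which must be checked before use.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class HostExtractionSketch {
    // extract the host part, or null if the URL cannot be parsed;
    // a caller (e.g. a reducer) must skip null results instead of
    // dereferencing them and triggering an NPE
    static String hostOrNull(String url) {
        try {
            return new URL(url).getHost();
        } catch (MalformedURLException e) {
            // e.g. "unknown protocol: smb" for smb:// URLs
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOrNull("https://example.com/path")); // example.com
        System.out.println(hostOrNull("smb://fileserver/share"));   // null
    }
}
```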
[jira] [Created] (NUTCH-3043) Generator: count URLs rejected by URL filters
Sebastian Nagel created NUTCH-3043: -- Summary: Generator: count URLs rejected by URL filters Key: NUTCH-3043 URL: https://issues.apache.org/jira/browse/NUTCH-3043 Project: Nutch Issue Type: Improvement Components: generator Affects Versions: 1.20 Reporter: Sebastian Nagel Assignee: Sebastian Nagel Fix For: 1.21 Generator already counts URLs rejected by the (re)fetch scheduler, by fetch interval or status. It should also count the number of URLs rejected by URL filters. See also [Generator metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3040) Upgrade to Hadoop 3.4.0
Sebastian Nagel created NUTCH-3040: -- Summary: Upgrade to Hadoop 3.4.0 Key: NUTCH-3040 URL: https://issues.apache.org/jira/browse/NUTCH-3040 Project: Nutch Issue Type: Improvement Components: build Affects Versions: 1.20 Reporter: Sebastian Nagel Fix For: 1.21 [Hadoop 3.4.0|https://hadoop.apache.org/release/3.4.0.html] has been released. Many dependencies are upgraded, including commons-io 2.14.0 which would have saved us a lot of work in NUTCH-2959. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (NUTCH-3039) Failure to handle ftp:// URLs
[ https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-3039: -- Assignee: Sebastian Nagel > Failure to handle ftp:// URLs > - > > Key: NUTCH-3039 > URL: https://issues.apache.org/jira/browse/NUTCH-3039 > Project: Nutch > Issue Type: Bug > Components: plugin, protocol >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.21 > > > Nutch fails to handle ftp:// URLs: > - URLNormalizerBasic returns the empty string because creating the URL > instance fails with a MalformedURLException: > {noformat} > echo "ftp://ftp.example.com/path/file.txt" \ > | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat} > - fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due > to a MalformedURLException: > {noformat} > bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \ >"ftp://ftp.example.com/path/file.txt" > ... > Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: > java.net.MalformedURLException > at > org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113) > ...{noformat} > The issue is caused by NUTCH-2429: > - we do not provide a dedicated URL stream handler for ftp URLs > - but also do not pass ftp:// URLs to the standard JVM handler -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3039) Failure to handle ftp:// URLs
Sebastian Nagel created NUTCH-3039: -- Summary: Failure to handle ftp:// URLs Key: NUTCH-3039 URL: https://issues.apache.org/jira/browse/NUTCH-3039 Project: Nutch Issue Type: Bug Components: plugin, protocol Affects Versions: 1.19 Reporter: Sebastian Nagel Fix For: 1.21 Nutch fails to handle ftp:// URLs: - URLNormalizerBasic returns the empty string because creating the URL instance fails with a MalformedURLException: {noformat} echo "ftp://ftp.example.com/path/file.txt" \ | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat} - fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due to a MalformedURLException: {noformat} bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \ "ftp://ftp.example.com/path/file.txt" ... Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113) ...{noformat} The issue is caused by NUTCH-2429: - we do not provide a dedicated URL stream handler for ftp URLs - but also do not pass ftp:// URLs to the standard JVM handler -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode
[ https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2937. Resolution: Fixed Fixed NUTCH-2959 by using the shaded Tika package. Thanks, [~tallison]! > parse-tika: review dependency exclusions and avoid dependency conflicts in > distributed mode > --- > > Key: NUTCH-2937 > URL: https://issues.apache.org/jira/browse/NUTCH-2937 > Project: Nutch > Issue Type: Bug > Components: parser, plugin >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > While testing NUTCH-2919 I've seen the following error caused by a > conflicting dependency to commons-io: > - 2.11.0 Nutch core > - 2.11.0 parse-tika (excluded to avoid duplicated dependencies) > - 2.5 provided by Hadoop > This causes errors parsing some office and other documents (but not all), for > example: > {noformat} > 2022-01-15 01:36:31,365 WARN [FetcherThread] > org.apache.nutch.parse.ParseUtil: Error parsing > http://kurskrun.ru/privacypolicy with org.apache.nutch.parse.tika.TikaParser > java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: > 'org.apache.commons.io.input.CloseShieldInputStream > org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)' > at > java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205) > at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188) > at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92) > at > org.apache.nutch.fetcher.FetcherThread.output(FetcherThread.java:715) > at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:431) > Caused by: java.lang.NoSuchMethodError: > 'org.apache.commons.io.input.CloseShieldInputStream > org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)' > at > org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120) > 
at > org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) > at > org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151) > at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90) > at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34) > at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode
[ https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2937: -- Assignee: Tim Allison > parse-tika: review dependency exclusions and avoid dependency conflicts in > distributed mode > --- > > Key: NUTCH-2937 > URL: https://issues.apache.org/jira/browse/NUTCH-2937 > Project: Nutch > Issue Type: Bug > Components: parser, plugin >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Tim Allison >Priority: Major > Fix For: 1.20 > > > While testing NUTCH-2919 I've seen the following error caused by a > conflicting dependency to commons-io: > - 2.11.0 Nutch core > - 2.11.0 parse-tika (excluded to avoid duplicated dependencies) > - 2.5 provided by Hadoop > This causes errors parsing some office and other documents (but not all), for > example: > {noformat} > 2022-01-15 01:36:31,365 WARN [FetcherThread] > org.apache.nutch.parse.ParseUtil: Error parsing > http://kurskrun.ru/privacypolicy with org.apache.nutch.parse.tika.TikaParser > java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: > 'org.apache.commons.io.input.CloseShieldInputStream > org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)' > at > java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205) > at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188) > at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92) > at > org.apache.nutch.fetcher.FetcherThread.output(FetcherThread.java:715) > at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:431) > Caused by: java.lang.NoSuchMethodError: > 'org.apache.commons.io.input.CloseShieldInputStream > org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)' > at > org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120) > at > 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) > at > org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151) > at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90) > at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34) > at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode
[ https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2937: --- Fix Version/s: 1.20 (was: 1.21) > parse-tika: review dependency exclusions and avoid dependency conflicts in > distributed mode > --- > > Key: NUTCH-2937 > URL: https://issues.apache.org/jira/browse/NUTCH-2937 > Project: Nutch > Issue Type: Bug > Components: parser, plugin >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > While testing NUTCH-2919 I've seen the following error caused by a > conflicting dependency to commons-io: > - 2.11.0 Nutch core > - 2.11.0 parse-tika (excluded to avoid duplicated dependencies) > - 2.5 provided by Hadoop > This causes errors parsing some office and other documents (but not all), for > example: > {noformat} > 2022-01-15 01:36:31,365 WARN [FetcherThread] > org.apache.nutch.parse.ParseUtil: Error parsing > http://kurskrun.ru/privacypolicy with org.apache.nutch.parse.tika.TikaParser > java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: > 'org.apache.commons.io.input.CloseShieldInputStream > org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)' > at > java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205) > at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188) > at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92) > at > org.apache.nutch.fetcher.FetcherThread.output(FetcherThread.java:715) > at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:431) > Caused by: java.lang.NoSuchMethodError: > 'org.apache.commons.io.input.CloseShieldInputStream > org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)' > at > org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120) > at > 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) > at > org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151) > at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90) > at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34) > at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3005) Upgrade selenium as needed
[ https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3005. Resolution: Implemented Done by [~lewismc] as part of NUTCH-3036, commit [1563396|https://github.com/apache/nutch/blob/1563396d952393462fffab1f686e9ffd5d006cf6/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java#L151] . > Upgrade selenium as needed > -- > > Key: NUTCH-3005 > URL: https://issues.apache.org/jira/browse/NUTCH-3005 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Tim Allison >Priority: Trivial > Fix For: 1.20 > > > When we choose to upgrade selenium, we should take note of this blog about > changes in headless chromium: > https://www.selenium.dev/blog/2023/headless-is-going-away/ > ChromeOptions options = new ChromeOptions(); > options.addArguments("--headless=new"); > WebDriver driver = new ChromeDriver(options); > driver.get("https://selenium.dev"); > driver.quit(); -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3016) Upgrade Apache Ivy to 2.5.2
[ https://issues.apache.org/jira/browse/NUTCH-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3016. Resolution: Duplicate > Upgrade Apache Ivy to 2.5.2 > --- > > Key: NUTCH-3016 > URL: https://issues.apache.org/jira/browse/NUTCH-3016 > Project: Nutch > Issue Type: Task > Components: build, ivy >Affects Versions: 1.19 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > [Apache Ivy > v2.5.2|https://ant.apache.org/ivy/history/2.5.2/release-notes.html] was > released on August 20 2023! > We should upgrade. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3016) Upgrade Apache Ivy to 2.5.2
[ https://issues.apache.org/jira/browse/NUTCH-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3016: --- Fix Version/s: 1.20 (was: 1.21) > Upgrade Apache Ivy to 2.5.2 > --- > > Key: NUTCH-3016 > URL: https://issues.apache.org/jira/browse/NUTCH-3016 > Project: Nutch > Issue Type: Task > Components: build, ivy >Affects Versions: 1.19 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > [Apache Ivy > v2.5.2|https://ant.apache.org/ivy/history/2.5.2/release-notes.html] was > released on August 20 2023! > We should upgrade. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3005) Upgrade selenium as needed
[ https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3005: --- Affects Version/s: 1.19 > Upgrade selenium as needed > -- > > Key: NUTCH-3005 > URL: https://issues.apache.org/jira/browse/NUTCH-3005 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Tim Allison >Priority: Trivial > > When we choose to upgrade selenium, we should take note of this blog about > changes in headless chromium: > https://www.selenium.dev/blog/2023/headless-is-going-away/ > ChromeOptions options = new ChromeOptions(); > options.addArguments("--headless=new"); > WebDriver driver = new ChromeDriver(options); > driver.get("https://selenium.dev"); > driver.quit(); -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3005) Upgrade selenium as needed
[ https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3005: --- Fix Version/s: 1.20 > Upgrade selenium as needed > -- > > Key: NUTCH-3005 > URL: https://issues.apache.org/jira/browse/NUTCH-3005 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Tim Allison >Priority: Trivial > Fix For: 1.20 > > > When we choose to upgrade selenium, we should take note of this blog about > changes in headless chromium: > https://www.selenium.dev/blog/2023/headless-is-going-away/ > ChromeOptions options = new ChromeOptions(); > options.addArguments("--headless=new"); > WebDriver driver = new ChromeDriver(options); > driver.get("https://selenium.dev"); > driver.quit(); -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3028: --- Affects Version/s: 1.19 > WARCExported to support filtering by JEXL > - > > Key: NUTCH-3028 > URL: https://issues.apache.org/jira/browse/NUTCH-3028 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch > > > Filtering segment data to WARC is now possible using JEXL expressions. In the > next example, all records with SOME_KEY=SOME_VALUE in their parseData > metadata are exported to WARC. > -expr > 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")' -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3028: --- Fix Version/s: 1.21 > WARCExported to support filtering by JEXL > - > > Key: NUTCH-3028 > URL: https://issues.apache.org/jira/browse/NUTCH-3028 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.21 > > Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch > > > Filtering segment data to WARC is now possible using JEXL expressions. In the > next example, all records with SOME_KEY=SOME_VALUE in their parseData > metadata are exported to WARC. > -expr > 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")' -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-2960) indexer-elastic: remove plugin from binary package to address licensing issues
[ https://issues.apache.org/jira/browse/NUTCH-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2960. Resolution: Won't Fix The license issue is addressed by NUTCH-3008. > indexer-elastic: remove plugin from binary package to address licensing issues > -- > > Key: NUTCH-2960 > URL: https://issues.apache.org/jira/browse/NUTCH-2960 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Priority: Major > > The license of Elasticsearch has changed with v7.11.0 and upwards and is (if > correct) no longer compatible with the Apache license. Accordingly, we should > no longer ship Elastic jars with the binary package. > It should be possible to keep the indexer-elastic plugin in the source > package as an [optional|https://www.apache.org/legal/resolved.html#optional] > dependency (indexer-solr is the default indexing backend and more are > available). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (NUTCH-2960) indexer-elastic: remove plugin from binary package to address licensing issues
[ https://issues.apache.org/jira/browse/NUTCH-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-2960. -- > indexer-elastic: remove plugin from binary package to address licensing issues > -- > > Key: NUTCH-2960 > URL: https://issues.apache.org/jira/browse/NUTCH-2960 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Priority: Major > > The license of Elasticsearch has changed with v7.11.0 and upwards and is (if > correct) no longer compatible with the Apache license. Accordingly, we should > no longer ship Elastic jars with the binary package. > It should be possible to keep the indexer-elastic plugin in the source > package as an [optional|https://www.apache.org/legal/resolved.html#optional] > dependency (indexer-solr is the default indexing backend and more are > available). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2960) indexer-elastic: remove plugin from binary package to address licensing issues
[ https://issues.apache.org/jira/browse/NUTCH-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2960: --- Fix Version/s: (was: 1.20) > indexer-elastic: remove plugin from binary package to address licensing issues > -- > > Key: NUTCH-2960 > URL: https://issues.apache.org/jira/browse/NUTCH-2960 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Priority: Major > > The license of Elasticsearch has changed with v7.11.0 and upwards and is (if > correct) no longer compatible with the Apache license. Accordingly, we should > no longer ship Elastic jars with the binary package. > It should be possible to keep the indexer-elastic plugin in the source > package as an [optional|https://www.apache.org/legal/resolved.html#optional] > dependency (indexer-solr is the default indexing backend and more are > available). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3008) indexer-elastic: downgrade to ES 7.10.2 to address licensing issues
[ https://issues.apache.org/jira/browse/NUTCH-3008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3008. Resolution: Fixed > indexer-elastic: downgrade to ES 7.10.2 to address licensing issues > --- > > Key: NUTCH-3008 > URL: https://issues.apache.org/jira/browse/NUTCH-3008 > Project: Nutch > Issue Type: Bug > Components: indexer, plugin >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > Downgrade to ES 7.10.2 (licensed under ASF 2.0) as an alternative solution to > address the licensing issues of the indexer-elastic plugin. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3029. Resolution: Implemented > Host specific max. and min. intervals in adaptive scheduler > --- > > Key: NUTCH-3029 > URL: https://issues.apache.org/jira/browse/NUTCH-3029 > Project: Nutch > Issue Type: New Feature >Affects Versions: 1.19, 1.20 >Reporter: Martin Djukanovic >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.20 > > Attachments: adaptive-host-specific-intervals.txt.template, > new_adaptive_fetch_schedule-1.patch > > > This patch implements custom max. and min. refetching intervals for specific > hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt > configuration file (template also attached). -- This message was sent by Atlassian Jira (v8.20.10#820010)
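The host-specific interval configuration described in NUTCH-3029 could be read along these lines. This is a hypothetical sketch only: the "host min max" line format (intervals in seconds) is an assumption for illustration, not the actual template attached to the issue.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HostIntervalsSketch {
    record Interval(long minSeconds, long maxSeconds) {}

    // parse "host min max" lines; '#' lines and blank lines are ignored
    static Map<String, Interval> parse(List<String> lines) {
        Map<String, Interval> map = new HashMap<>();
        for (String line : lines) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;
            String[] parts = line.split("\\s+");
            map.put(parts[0], new Interval(Long.parseLong(parts[1]),
                                           Long.parseLong(parts[2])));
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, Interval> m = parse(List.of(
                "# host min max",
                "example.com 3600 604800"));
        System.out.println(m.get("example.com").maxSeconds());
    }
}
```

A scheduler would then clamp its computed refetch interval to the per-host bounds when the host is present in the map, and fall back to the global defaults otherwise.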
[jira] [Closed] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-3029. -- > Host specific max. and min. intervals in adaptive scheduler > --- > > Key: NUTCH-3029 > URL: https://issues.apache.org/jira/browse/NUTCH-3029 > Project: Nutch > Issue Type: New Feature >Affects Versions: 1.19, 1.20 >Reporter: Martin Djukanovic >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.20 > > Attachments: adaptive-host-specific-intervals.txt.template, > new_adaptive_fetch_schedule-1.patch > > > This patch implements custom max. and min. refetching intervals for specific > hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt > configuration file (template also attached). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reopened NUTCH-3029: Assignee: Sebastian Nagel (was: Markus Jelsma) Reopen to update "Fix version(s)" - add 1.20, to make it appear in the release notes. > Host specific max. and min. intervals in adaptive scheduler > --- > > Key: NUTCH-3029 > URL: https://issues.apache.org/jira/browse/NUTCH-3029 > Project: Nutch > Issue Type: New Feature >Affects Versions: 1.19, 1.20 >Reporter: Martin Djukanovic >Assignee: Sebastian Nagel >Priority: Minor > Attachments: adaptive-host-specific-intervals.txt.template, > new_adaptive_fetch_schedule-1.patch > > > This patch implements custom max. and min. refetching intervals for specific > hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt > configuration file (template also attached). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3029: --- Fix Version/s: 1.20 > Host specific max. and min. intervals in adaptive scheduler > --- > > Key: NUTCH-3029 > URL: https://issues.apache.org/jira/browse/NUTCH-3029 > Project: Nutch > Issue Type: New Feature >Affects Versions: 1.19, 1.20 >Reporter: Martin Djukanovic >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.20 > > Attachments: adaptive-host-specific-intervals.txt.template, > new_adaptive_fetch_schedule-1.patch > > > This patch implements custom max. and min. refetching intervals for specific > hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt > configuration file (template also attached). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3035) Update license and notice file for release of 1.20
Sebastian Nagel created NUTCH-3035: -- Summary: Update license and notice file for release of 1.20 Key: NUTCH-3035 URL: https://issues.apache.org/jira/browse/NUTCH-3035 Project: Nutch Issue Type: Bug Components: documentation Affects Versions: 1.20 Reporter: Sebastian Nagel Assignee: Sebastian Nagel Fix For: 1.20 Close to the release of 1.20 the license and notice files should be updated to contain all (third-party) licenses of all dependencies. Cf. NUTCH-2290 and NUTCH-2981. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL
[ https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3025. Resolution: Implemented > urlfilter-fast to filter based on the length of the URL > --- > > Key: NUTCH-3025 > URL: https://issues.apache.org/jira/browse/NUTCH-3025 > Project: Nutch > Issue Type: Improvement > Components: plugin, urlfilter >Affects Versions: 1.19 >Reporter: Julien Nioche >Priority: Major > Fix For: 1.20 > > > There currently is no filter implementation to remove URLs based on their > length or the length of their path / query. > Doing so with the regex filter would be inefficient; instead we could > implement it in _urlfilter-fast_ -- This message was sent by Atlassian Jira (v8.20.10#820010)
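A length-based rule of the kind NUTCH-3025 describes could look roughly like the sketch below. This is an illustrative standalone class, not the actual urlfilter-fast implementation; the class name, parameters, and the null-means-rejected convention are assumptions (the null convention does match how Nutch URL filters signal rejection).

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Minimal sketch of a length-based URL filter (illustrative only,
 *  not the actual urlfilter-fast code). */
public class UrlLengthFilterSketch {

  /** Returns the URL if it passes the length checks, or null to reject it. */
  public static String filter(String url, int maxUrlLength, int maxQueryLength) {
    if (url == null || url.length() > maxUrlLength) {
      return null; // reject over-long URLs outright
    }
    try {
      URL u = new URL(url);
      String query = u.getQuery();
      if (query != null && query.length() > maxQueryLength) {
        return null; // reject URLs with an over-long query string
      }
    } catch (MalformedURLException e) {
      return null; // unparsable URLs are rejected as well
    }
    return url;
  }
}
```

Plain length comparisons like these are O(1) per URL, which is why implementing the check directly is cheaper than routing it through the regex filter.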
[jira] [Updated] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL
[ https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3025: --- Component/s: plugin urlfilter > urlfilter-fast to filter based on the length of the URL > --- > > Key: NUTCH-3025 > URL: https://issues.apache.org/jira/browse/NUTCH-3025 > Project: Nutch > Issue Type: Improvement > Components: plugin, urlfilter >Affects Versions: 1.19 >Reporter: Julien Nioche >Priority: Major > Fix For: 1.20 > > > There currently is no filter implementation to remove URLs based on their > length or the length of their path / query. > Doing so with the regex filter would be inefficient, instead we could > implement it in _urlfilter-fast _ -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
[ https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784030#comment-17784030 ] Sebastian Nagel commented on NUTCH-3017: Thanks, [~jnioche] > Allow fast-urlfilter to load from HDFS/S3 and support gzipped input > --- > > Key: NUTCH-3017 > URL: https://issues.apache.org/jira/browse/NUTCH-3017 > Project: Nutch > Issue Type: Improvement > Components: plugin, urlfilter >Affects Versions: 1.19 >Reporter: Julien Nioche >Priority: Minor > Fix For: 1.20 > > > This provides an easier way to refresh the resources since no rebuild of the > jar will be needed. The path can point to either HDFS or S3. Additionally, > .gz files should be handled automatically. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
[ https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3017. Resolution: Implemented > Allow fast-urlfilter to load from HDFS/S3 and support gzipped input > --- > > Key: NUTCH-3017 > URL: https://issues.apache.org/jira/browse/NUTCH-3017 > Project: Nutch > Issue Type: Improvement > Components: plugin, urlfilter >Affects Versions: 1.19 >Reporter: Julien Nioche >Priority: Minor > Fix For: 1.20 > > > This provides an easier way to refresh the resources since no rebuild of the > jar will be needed. The path can point to either HDFS or S3. Additionally, > .gz files should be handled automatically. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
[ https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3017: --- Component/s: plugin urlfilter > Allow fast-urlfilter to load from HDFS/S3 and support gzipped input > --- > > Key: NUTCH-3017 > URL: https://issues.apache.org/jira/browse/NUTCH-3017 > Project: Nutch > Issue Type: Improvement > Components: plugin, urlfilter >Affects Versions: 1.19 >Reporter: Julien Nioche >Priority: Minor > Fix For: 1.20 > > > This provides an easier way to refresh the resources since no rebuild of the > jar will be needed. The path can point to either HDFS or S3. Additionally, > .gz files should be handled automatically. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
[ https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3017: --- Fix Version/s: 1.20 > Allow fast-urlfilter to load from HDFS/S3 and support gzipped input > --- > > Key: NUTCH-3017 > URL: https://issues.apache.org/jira/browse/NUTCH-3017 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Julien Nioche >Priority: Minor > Fix For: 1.20 > > > This provides an easier way to refresh the resources since no rebuild of the > jar will be needed. The path can point to either HDFS or S3. Additionally, > .gz files should be handled automatically. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on unparsed documents
[ https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3012. Resolution: Fixed > SegmentReader when dumping with option -recode: NPE on unparsed documents > - > > Key: NUTCH-3012 > URL: https://issues.apache.org/jira/browse/NUTCH-3012 > Project: Nutch > Issue Type: Bug > Components: segment >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > SegmentReader when called with the flag {{-recode}} fails with an NPE when > trying to stringify the raw content of unparsed documents: > {noformat} > $> bin/nutch readseg -dump crawl/segments/20231009065431 > crawl/segreader/20231009065431 -recode > ... > 2023-10-09 07:55:18,451 INFO mapreduce.Job: Task Id : > attempt_1696825862783_0005_r_00_0, Status : FAILED > Error: java.lang.NullPointerException: charset > at java.base/java.lang.String.<init>(String.java:504) > at java.base/java.lang.String.<init>(String.java:561) > at org.apache.nutch.protocol.Content.toString(Content.java:297) > at > org.apache.nutch.segment.SegmentReader$InputCompatReducer.reduce(SegmentReader.java:189) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
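The stack trace above comes from passing a null charset name into the String constructor. One plausible shape of the fix, sketched below, is to fall back to a default charset when none was detected (the case for unparsed documents). The class and method names here are illustrative, not the actual Content.toString() code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Sketch of the fix idea: never hand a null charset to the String
 *  constructor; fall back to UTF-8 when no charset was detected
 *  (illustrative, not the actual Nutch Content.toString() code). */
public class RecodeSketch {

  public static String toString(byte[] content, String detectedCharset) {
    Charset cs = StandardCharsets.UTF_8; // safe default
    if (detectedCharset != null) {
      try {
        cs = Charset.forName(detectedCharset);
      } catch (Exception e) {
        // unknown or illegal charset name: keep the UTF-8 fallback
      }
    }
    return new String(content, cs);
  }
}
```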
[jira] [Resolved] (NUTCH-3011) HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)
[ https://issues.apache.org/jira/browse/NUTCH-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3011. Resolution: Implemented > HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors > (HTTP 5xx) > > > Key: NUTCH-3011 > URL: https://issues.apache.org/jira/browse/NUTCH-3011 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > HttpRobotRulesParser should handle HTTP 429 Too Many Requests same as server > errors (HTTP 5xx), that is, if configured, signal the Fetcher to delay requests. > See also NUTCH-2573 and > https://support.google.com/webmasters/answer/9679690#robots_details -- This message was sent by Atlassian Jira (v8.20.10#820010)
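The classification the issue asks for amounts to grouping status 429 with the 5xx range. A minimal sketch, with an assumed method name (not Nutch's actual API):

```java
/** Sketch of the robots.txt status classification described in NUTCH-3011:
 *  HTTP 429 is treated like a 5xx server error, so that (if configured)
 *  the fetcher defers further requests. Names are illustrative. */
public class RobotsStatusSketch {

  /** True if the robots.txt response status should be treated as a
   *  (possibly temporary) server error. */
  public static boolean isDeferrableError(int status) {
    return status == 429 || (status >= 500 && status < 600);
  }
}
```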
[jira] [Resolved] (NUTCH-2990) HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
[ https://issues.apache.org/jira/browse/NUTCH-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2990. Resolution: Implemented Thanks, everybody! > HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309 > --- > > Key: NUTCH-2990 > URL: https://issues.apache.org/jira/browse/NUTCH-2990 > Project: Nutch > Issue Type: Improvement > Components: protocol, robots >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > The robots.txt parser > ([HttpRobotRulesParser|https://nutch.apache.org/documentation/javadoc/apidocs/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.html]) > follows only one redirect when fetching the robots.txt while the robots.txt > RFC 9309 recommends to follow 5 redirects: > {quote} 2.3.1.2. Redirects > It's possible that a server responds to a robots.txt fetch request with a > redirect, such as HTTP 301 or HTTP 302 in the case of HTTP. The crawlers > SHOULD follow at least five consecutive redirects, even across authorities > (for example, hosts in the case of HTTP). > If a robots.txt file is reached within five consecutive redirects, the > robots.txt file MUST be fetched, parsed, and its rules followed in the > context of the initial authority. If there are more than five consecutive > redirects, crawlers MAY assume that the robots.txt file is unavailable. > (https://datatracker.ietf.org/doc/html/rfc9309#name-redirects){quote} > While following redirects, the parser should check whether the redirect > location is itself a "/robots.txt" on a different host and then try to read > it from the cache. -- This message was sent by Atlassian Jira (v8.20.10#820010)
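The RFC 9309 rule quoted above (follow at least five consecutive redirects, give up beyond that) can be sketched as a bounded loop. The `Function` here stands in for the real HTTP fetch, returning a redirect target or null for a terminal response; this is an assumption for illustration, not the HttpRobotRulesParser implementation.

```java
import java.util.function.Function;

/** Sketch of the RFC 9309 redirect rule for robots.txt: follow at most
 *  five consecutive redirects; beyond that, treat robots.txt as
 *  unavailable. The fetch function is a stand-in for the real HTTP call. */
public class RobotsRedirectSketch {

  static final int MAX_REDIRECTS = 5;

  /** Follows redirects returned by {@code fetch} (null = terminal response).
   *  Returns the final URL, or null if the chain exceeds five redirects. */
  public static String resolve(String url, Function<String, String> fetch) {
    for (int i = 0; i <= MAX_REDIRECTS; i++) {
      String next = fetch.apply(url);
      if (next == null) {
        return url; // terminal response reached within the limit
      }
      url = next;
    }
    return null; // more than five consecutive redirects: unavailable
  }
}
```

As the issue notes, a real implementation would additionally check whether a redirect target is itself a "/robots.txt" on another host and consult the cache before fetching it.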
[jira] [Assigned] (NUTCH-3009) Upgrade to Hadoop 3.3.6
[ https://issues.apache.org/jira/browse/NUTCH-3009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-3009: -- Assignee: Sebastian Nagel > Upgrade to Hadoop 3.3.6 > --- > > Key: NUTCH-3009 > URL: https://issues.apache.org/jira/browse/NUTCH-3009 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > Upgrade to [Hadoop 3.3.6|https://hadoop.apache.org/release/3.3.6.html], the > latest available release of Hadoop (release date: 2023-06-23). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3009) Upgrade to Hadoop 3.3.6
[ https://issues.apache.org/jira/browse/NUTCH-3009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3009. Resolution: Implemented > Upgrade to Hadoop 3.3.6 > --- > > Key: NUTCH-3009 > URL: https://issues.apache.org/jira/browse/NUTCH-3009 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > Upgrade to [Hadoop 3.3.6|https://hadoop.apache.org/release/3.3.6.html], the > latest available release of Hadoop (release date: 2023-06-23). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3006) Downgrade Tika dependency to 2.2.1 (core and parse-tika)
[ https://issues.apache.org/jira/browse/NUTCH-3006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3006. Fix Version/s: (was: 1.20) Resolution: Abandoned > Downgrade Tika dependency to 2.2.1 (core and parse-tika) > > > Key: NUTCH-3006 > URL: https://issues.apache.org/jira/browse/NUTCH-3006 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Priority: Major > > Tika 2.3.0 and upwards depend on commons-io 2.11.0 (or even higher), which > is not available when Nutch is used on Hadoop. Only Hadoop 3.4.0 is expected > to ship with commons-io 2.11.0 (HADOOP-18301); all currently released > versions provide commons-io 2.8.0. Because Hadoop-required dependencies are > enforced in (pseudo)distributed mode, using Tika may cause issues, see > NUTCH-2937 and NUTCH-2959. > [~lewismc] suggested in the discussion of [GitHub PR > #776|https://github.com/apache/nutch/pull/776] to downgrade to Tika 2.2.1 to > resolve these issues for now and until Hadoop 3.4.0 becomes available. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (NUTCH-3002) Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive
[ https://issues.apache.org/jira/browse/NUTCH-3002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-3002: -- Assignee: Sebastian Nagel > Protocol-okhttp HttpResponse: HTTP header metadata lookup should be > case-insensitive > > > Key: NUTCH-3002 > URL: https://issues.apache.org/jira/browse/NUTCH-3002 > Project: Nutch > Issue Type: Bug > Components: metadata, plugin, protocol >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > Lookup of HTTP headers in the class HttpResponse should be case-insensitive - > for example, any "Location" header should be returned independent from the > casing sent by the sender. > While protocol-http uses the class SpellCheckedMetadata which provides > case-insensitive lookups (as part of the spell-checking functionality), > protocol-okhttp relies on the class Metadata which stores the metadata values > case-sensitive. > It's a good question, whether we still need to spell-check HTTP headers. > However, case-insensitive look-ups are definitely required. Especially, since > HTTP header names are case-insensitive in HTTP/2. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3002) Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive
[ https://issues.apache.org/jira/browse/NUTCH-3002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3002. Resolution: Fixed > Protocol-okhttp HttpResponse: HTTP header metadata lookup should be > case-insensitive > > > Key: NUTCH-3002 > URL: https://issues.apache.org/jira/browse/NUTCH-3002 > Project: Nutch > Issue Type: Bug > Components: metadata, plugin, protocol >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > Lookup of HTTP headers in the class HttpResponse should be case-insensitive - > for example, any "Location" header should be returned independent from the > casing sent by the sender. > While protocol-http uses the class SpellCheckedMetadata which provides > case-insensitive lookups (as part of the spell-checking functionality), > protocol-okhttp relies on the class Metadata which stores the metadata values > case-sensitive. > It's a good question, whether we still need to spell-check HTTP headers. > However, case-insensitive look-ups are definitely required. Especially, since > HTTP header names are case-insensitive in HTTP/2. -- This message was sent by Atlassian Jira (v8.20.10#820010)
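The behaviour NUTCH-3002 asks for — lookups that succeed regardless of header-name casing — is what a `TreeMap` with `String.CASE_INSENSITIVE_ORDER` provides. The sketch below illustrates the idea only; the actual fix lives in protocol-okhttp's metadata handling, and this class name is hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch of a case-insensitive header store: a "location" header set by
 *  the server is found under "Location" too (relevant for HTTP/2, where
 *  header names are lower-cased). Illustrative, not the Nutch Metadata class. */
public class HeaderMetadataSketch {

  private final Map<String, String> headers =
      new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

  public void add(String name, String value) {
    headers.put(name, value);
  }

  public String get(String name) {
    return headers.get(name); // case-insensitive lookup via the comparator
  }
}
```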
[jira] [Commented] (NUTCH-3014) Standardize NutchJob job names
[ https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17778103#comment-17778103 ] Sebastian Nagel commented on NUTCH-3014: If there is a single data name/directory (CrawlDb, segment, etc.), using it as part of the additional info would make the job name more unique. Imagine a long list of generate - fetch - updatedb jobs: adding the segment for the "generator partition" and fetcher jobs makes it easier to figure out where in the crawl workflow a job was located. If there are multiple workflows running concurrently, the CrawlDb name/path would also be a useful discriminating component. > Standardize NutchJob job names > -- > > Key: NUTCH-3014 > URL: https://issues.apache.org/jira/browse/NUTCH-3014 > Project: Nutch > Issue Type: Improvement > Components: configuration, runtime >Affects Versions: 1.19 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.20 > > > There is a large degree of variability when we set the job name. > > {{Job job = NutchJob.getInstance(getConf());}} > {{job.setJobName("read " + segment);}} > > Some examples mention the job name, others don't. Some use upper case, others > don't, etc. > I think we can standardize the NutchJob job names. This would help when > filtering jobs in YARN ResourceManager UI as well. > I propose we implement the following convention > * *Nutch* (mandatory) - static value which prepends the job name, assists > with distinguishing the Job as a NutchJob and making it easily findable. > * *${ClassName}* (mandatory) - literally the name of the Class the job is > encoded in > * *${additional info}* (optional) - value could further distinguish the type > of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _*Nutch ${ClassName}* *${additional info}*_ > _Examples:_ > * _Nutch LinkRank Inverter_ > * _Nutch CrawlDb + $crawldb_ > * _Nutch LinkDbReader + $linkdb_ > Thanks for any suggestions/comments. -- This message was sent by Atlassian Jira (v8.20.10#820010)
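The proposed "Nutch ${ClassName} ${additional info}" convention could be centralized in a small helper like the one below. The helper name and signature are assumptions for illustration, not actual Nutch code.

```java
/** Sketch of the proposed job-name convention from NUTCH-3014:
 *  "Nutch ${ClassName} ${additional info}". Illustrative helper only. */
public class JobNameSketch {

  public static String jobName(String className, String additionalInfo) {
    String name = "Nutch " + className;
    if (additionalInfo != null && !additionalInfo.isEmpty()) {
      name += " " + additionalInfo; // optional discriminator, e.g. a segment path
    }
    return name;
  }
}
```

A job would then call something like `job.setJobName(jobName("LinkDbReader", linkdb.toString()))`, yielding names that are easy to filter in the YARN ResourceManager UI.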
[jira] [Updated] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on unparsed documents
[ https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3012: --- Description: SegmentReader when called with the flag {{-recode}} fails with an NPE when trying to stringify the raw content of unparsed documents: {noformat} $> bin/nutch readseg -dump crawl/segments/20231009065431 crawl/segreader/20231009065431 -recode ... 2023-10-09 07:55:18,451 INFO mapreduce.Job: Task Id : attempt_1696825862783_0005_r_00_0, Status : FAILED Error: java.lang.NullPointerException: charset at java.base/java.lang.String.<init>(String.java:504) at java.base/java.lang.String.<init>(String.java:561) at org.apache.nutch.protocol.Content.toString(Content.java:297) at org.apache.nutch.segment.SegmentReader$InputCompatReducer.reduce(SegmentReader.java:189) {noformat} > SegmentReader when dumping with option -recode: NPE on unparsed documents > - > > Key: NUTCH-3012 > URL: https://issues.apache.org/jira/browse/NUTCH-3012 > Project: Nutch > Issue Type: Bug > Components: segment >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > SegmentReader when called with the flag {{-recode}} fails with an NPE when > trying to stringify the raw content of unparsed documents: > {noformat} > $> bin/nutch readseg -dump crawl/segments/20231009065431 > crawl/segreader/20231009065431 -recode > ... > 2023-10-09 07:55:18,451 INFO mapreduce.Job: Task Id : > attempt_1696825862783_0005_r_00_0, Status : FAILED > Error: java.lang.NullPointerException: charset > at java.base/java.lang.String.<init>(String.java:504) > at java.base/java.lang.String.<init>(String.java:561) > at org.apache.nutch.protocol.Content.toString(Content.java:297) > at > org.apache.nutch.segment.SegmentReader$InputCompatReducer.reduce(SegmentReader.java:189) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on unparsed documents
[ https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3012: --- Summary: SegmentReader when dumping with option -recode: NPE on unparsed documents (was: SegmentReader when dumping with option -recode: NPE on documents without charset defined) > SegmentReader when dumping with option -recode: NPE on unparsed documents > - > > Key: NUTCH-3012 > URL: https://issues.apache.org/jira/browse/NUTCH-3012 > Project: Nutch > Issue Type: Bug > Components: segment >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on documents without charset defined
Sebastian Nagel created NUTCH-3012: -- Summary: SegmentReader when dumping with option -recode: NPE on documents without charset defined Key: NUTCH-3012 URL: https://issues.apache.org/jira/browse/NUTCH-3012 Project: Nutch Issue Type: Bug Components: segment Affects Versions: 1.19 Reporter: Sebastian Nagel Assignee: Sebastian Nagel Fix For: 1.20 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.9.0
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17771445#comment-17771445 ] Sebastian Nagel commented on NUTCH-2959: Hi [~tallison], it's your decision whether it's worth spending the time. If you want to continue, I'm happy to test it. Personally, I'd just wait for Hadoop 3.4.0, which means downgrading to Tika 2.2.1 for now and likely a huge jump forward in the included Tika version only with Nutch 1.21. We have the open PR: anybody working in local mode and relying on a newer Tika version can pick it up. Unfortunately (or not?), Hadoop is sometimes conservative or slow with upgrades, same for the support of Java 17 (NUTCH-2987 / HADOOP-17177). > Upgrade to Apache Tika 2.9.0 > > > Key: NUTCH-2959 > URL: https://issues.apache.org/jira/browse/NUTCH-2959 > Project: Nutch > Issue Type: Task >Affects Versions: 1.19 >Reporter: Markus Jelsma >Priority: Major > Fix For: 1.20 > > Attachments: NUTCH-2959.patch > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-1130) JUnit test for Any23 RDF plugin
[ https://issues.apache.org/jira/browse/NUTCH-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1130. Resolution: Won't Do Closing - the any23 project has retired and the any23 plugin was removed from Nutch (NUTCH-2998). > JUnit test for Any23 RDF plugin > --- > > Key: NUTCH-1130 > URL: https://issues.apache.org/jira/browse/NUTCH-1130 > Project: Nutch > Issue Type: Sub-task > Components: build >Affects Versions: 1.4 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > > The JUnit test should be written prior to the progression of the Any23 Nutch > plugin -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (NUTCH-1130) JUnit test for Any23 RDF plugin
[ https://issues.apache.org/jira/browse/NUTCH-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-1130. -- > JUnit test for Any23 RDF plugin > --- > > Key: NUTCH-1130 > URL: https://issues.apache.org/jira/browse/NUTCH-1130 > Project: Nutch > Issue Type: Sub-task > Components: build >Affects Versions: 1.4 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > > The JUnit test should be written prior to the progression of the Any23 Nutch > plugin -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-2938) Use Any23's RepositoryWriter to write structured data to Rdf4j repository
[ https://issues.apache.org/jira/browse/NUTCH-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2938. Resolution: Won't Do Closing - the any23 project has retired and the any23 plugin was removed from Nutch (NUTCH-2998). See also the comment in the linked PR. > Use Any23's RepositoryWriter to write structured data to Rdf4j repository > - > > Key: NUTCH-2938 > URL: https://issues.apache.org/jira/browse/NUTCH-2938 > Project: Nutch > Issue Type: Improvement > Components: any23, plugin >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > > I have been running a patch which leverages [Any23's > RepositoryWriter|https://any23.apache.org/apidocs/org/apache/any23/writer/RepositoryWriter.html] > (implemented as one of a number of TripleHandler's via > [CompositeTripleHandler|https://any23.apache.org/apidocs/org/apache/any23/writer/CompositeTripleHandler.html]) > to write Any23 extractions to > [GraphDB|https://www.ontotext.com/products/graphdb/]. This enables us to > build a content graph from data across the enterprise. > This feature is turned off by default so will not change existing Any23 > behaviour. I have concerns about the performance of this patch because right > now we need to create a new repository connection for each URL. This is not > great so I will definitely improve on it. > PR coming up. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (NUTCH-2938) Use Any23's RepositoryWriter to write structured data to Rdf4j repository
[ https://issues.apache.org/jira/browse/NUTCH-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-2938. -- > Use Any23's RepositoryWriter to write structured data to Rdf4j repository > - > > Key: NUTCH-2938 > URL: https://issues.apache.org/jira/browse/NUTCH-2938 > Project: Nutch > Issue Type: Improvement > Components: any23, plugin >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > > I have been running a patch which leverages [Any23's > RepositoryWriter|https://any23.apache.org/apidocs/org/apache/any23/writer/RepositoryWriter.html] > (implemented as one of a number of TripleHandler's via > [CompositeTripleHandler|https://any23.apache.org/apidocs/org/apache/any23/writer/CompositeTripleHandler.html]) > to write Any23 extractions to > [GraphDB|https://www.ontotext.com/products/graphdb/]. This enables us to > build a content graph from data across the enterprise. > This feature is turned off by default so will not change existing Any23 > behaviour. I have concerns about the performance of this patch because right > now we need to create a new repository connection for each URL. This is not > great so I will definitely improve on it. > PR coming up. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2938) Use Any23's RepositoryWriter to write structured data to Rdf4j repository
[ https://issues.apache.org/jira/browse/NUTCH-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2938: --- Fix Version/s: (was: 1.20) > Use Any23's RepositoryWriter to write structured data to Rdf4j repository > - > > Key: NUTCH-2938 > URL: https://issues.apache.org/jira/browse/NUTCH-2938 > Project: Nutch > Issue Type: Improvement > Components: any23, plugin >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > > I have been running a patch which leverages [Any23's > RepositoryWriter|https://any23.apache.org/apidocs/org/apache/any23/writer/RepositoryWriter.html] > (implemented as one of a number of TripleHandler's via > [CompositeTripleHandler|https://any23.apache.org/apidocs/org/apache/any23/writer/CompositeTripleHandler.html]) > to write Any23 extractions to > [GraphDB|https://www.ontotext.com/products/graphdb/]. This enables us to > build a content graph from data across the enterprise. > This feature is turned off by default so will not change existing Any23 > behaviour. I have concerns about the performance of this patch because right > now we need to create a new repository connection for each URL. This is not > great so I will definitely improve on it. > PR coming up. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-2853) bin/nutch: remove deprecated commands solrindex, solrdedup, solrclean
[ https://issues.apache.org/jira/browse/NUTCH-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2853. Resolution: Fixed > bin/nutch: remove deprecated commands solrindex, solrdedup, solrclean > - > > Key: NUTCH-2853 > URL: https://issues.apache.org/jira/browse/NUTCH-2853 > Project: Nutch > Issue Type: Improvement > Components: bin >Affects Versions: 1.18 >Reporter: Sebastian Nagel >Priority: Major > Labels: help-wanted > Fix For: 1.20 > > > The commands "solrindex", "solrdedup" and "solrclean" are deprecated since 7 > years and should be removed to avoid any confusions (one example: > https://stackoverflow.com/questions/66376609/nutch-solr-index-is-failing). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-2897) Do not suppress deprecated API warnings
[ https://issues.apache.org/jira/browse/NUTCH-2897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2897. Resolution: Fixed > Do not suppress deprecated API warnings > -- > > Key: NUTCH-2897 > URL: https://issues.apache.org/jira/browse/NUTCH-2897 > Project: Nutch > Issue Type: Improvement > Components: documentation >Affects Versions: 1.18 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > We suppress deprecated warnings in three places > # > [Plugin.java#L92-L96|https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/plugin/Plugin.java#L92-L96] > # > [NutchJob.java#L35-L38|https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/NutchJob.java#L35-L38], > and > # > [TikaParser.java#L92-L95|https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java#L92-L95] > Instead of suppressing the warnings we should instead use the correct > *@Deprecated* annotation and *@deprecated* Javadoc. This is not difficult to > do and should have been done first time around. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3010) Injector: count unique number of injected URLs
[ https://issues.apache.org/jira/browse/NUTCH-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3010. Resolution: Fixed > Injector: count unique number of injected URLs > -- > > Key: NUTCH-3010 > URL: https://issues.apache.org/jira/browse/NUTCH-3010 > Project: Nutch > Issue Type: Improvement > Components: injector >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > Injector uses two counters: one for the total number of injected URLs, the > other for the number of URLs "merged", that is already in CrawlDb. There is > no counter for the number of unique URLs injected, which may lead to wrong > counts if the seed files contain duplicates: > Suppose the following seed file which contains a duplicated URL: > {noformat} > $> cat seeds_with_duplicates.txt > https://www.example.org/page1.html > https://www.example.org/page2.html > https://www.example.org/page2.html > $> $NUTCH_HOME/bin/nutch inject /tmp/crawldb seeds_with_duplicates.txt > ... > 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls > rejected by filters: 0 > 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls > injected after normalization and filtering: 3 > 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls > injected but already in CrawlDb: 0 > 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total new urls > injected: 3 > ... > {noformat} > However, because of the duplicated URL, only 2 URLs were injected into the > CrawlDb: > {noformat} > $> $NUTCH_HOME/bin/nutch readdb /tmp/crawldb -stats > ... > 2023-09-30 07:39:43,945 INFO o.a.n.c.CrawlDbReader [main] TOTAL urls: 2 > ...
> {noformat} > If the Injector job is run again with the same input, we get the erroneous > output that one "new URL" was still injected: > {noformat} > 2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls > rejected by filters: 0 > 2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls > injected after normalization and filtering: 3 > 2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total urls > injected but already in CrawlDb: 2 > 2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total new urls > injected: 1 > {noformat} > This is because the urls_merged counter counts unique items, while > url_injected does not, and the shown number is the difference between both > counters. > Adding a counter for the number of unique injected URLs makes it possible to > get the correct count of newly injected URLs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
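The counter arithmetic described in NUTCH-3010 can be reproduced outside Hadoop with a small sketch. The class and method names below are invented for illustration, not the actual Nutch counter or class names; the idea is that a reducer sees each URL key exactly once, so counting per key yields the number of unique injected URLs, and "new URLs" becomes unique minus merged.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical sketch of the fixed counting logic. A MapReduce reducer
 * groups records by URL key, so a Set models the same deduplication.
 */
public class InjectorCountSketch {

  /** Number of unique URLs injected (the counter the issue adds). */
  public static long uniqueInjected(List<String> seedUrls) {
    return new LinkedHashSet<>(seedUrls).size();
  }

  /** Number of unique injected URLs already present in the CrawlDb ("merged"). */
  public static long merged(List<String> seedUrls, Set<String> crawlDb) {
    Set<String> unique = new LinkedHashSet<>(seedUrls);
    unique.retainAll(crawlDb);
    return unique.size();
  }

  /** Correct "new URLs" count: unique injected minus merged. */
  public static long newlyInjected(List<String> seedUrls, Set<String> crawlDb) {
    return uniqueInjected(seedUrls) - merged(seedUrls, crawlDb);
  }
}
```

With the seed file from the report (page1, page2, page2 against an empty CrawlDb), this yields 2 unique and 2 new URLs on the first run, and 0 new URLs on a repeated run, instead of the erroneous 3 and 1.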
[jira] [Created] (NUTCH-3011) HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)
Sebastian Nagel created NUTCH-3011: -- Summary: HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx) Key: NUTCH-3011 URL: https://issues.apache.org/jira/browse/NUTCH-3011 Project: Nutch Issue Type: Improvement Affects Versions: 1.19 Reporter: Sebastian Nagel Assignee: Sebastian Nagel Fix For: 1.20 HttpRobotRulesParser should handle HTTP 429 Too Many Requests the same as server errors (HTTP 5xx), that is, if configured, signal to the Fetcher to delay requests. See also NUTCH-2573 and https://support.google.com/webmasters/answer/9679690#robots_details -- This message was sent by Atlassian Jira (v8.20.10#820010)
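The proposed policy reduces to a pure status-code check. The sketch below is a simplified illustration with an invented class and method name, not the actual HttpRobotRulesParser logic:

```java
/**
 * Hypothetical sketch of the handling proposed in NUTCH-3011: when
 * fetching robots.txt, HTTP 429 Too Many Requests is treated like a
 * 5xx server error, i.e. the fetcher is signaled to defer/delay
 * requests (subject to configuration, as introduced by NUTCH-2573).
 */
public class RobotsStatusPolicy {
  public static boolean shouldDeferVisits(int status) {
    // 5xx: temporary server failure -> defer further visits
    // 429: explicit rate limiting -> treat the same way
    return status == 429 || (status >= 500 && status < 600);
  }
}
```

This mirrors Google's documented robots.txt handling, where 429 is grouped with 5xx responses rather than with 4xx "not found" responses.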
[jira] [Closed] (NUTCH-1373) Implement consistent execution of normalising and filtering in Generator
[ https://issues.apache.org/jira/browse/NUTCH-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-1373. -- > Implement consistent execution of normalising and filtering in Generator > > > Key: NUTCH-1373 > URL: https://issues.apache.org/jira/browse/NUTCH-1373 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.4 >Reporter: Lewis John McGibbney >Priority: Minor > > As per discussion here [0] this issue should address the inconsistencies we > see in the scheduled execution of normalising and filtering between Nutchgora > Generator Mapper and trunk Generator mapper/reducer. > Hopefully we can come to some consensus as to the best approach across both > dists. > [0] http://www.mail-archive.com/user%40nutch.apache.org/msg06360.html -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-1373) Implement consistent execution of normalising and filtering in Generator
[ https://issues.apache.org/jira/browse/NUTCH-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1373. Resolution: Abandoned Closing as Nutch 2.x (a.k.a. nutchgora) isn't maintained anymore. > Implement consistent execution of normalising and filtering in Generator > > > Key: NUTCH-1373 > URL: https://issues.apache.org/jira/browse/NUTCH-1373 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.4 >Reporter: Lewis John McGibbney >Priority: Minor > > As per discussion here [0] this issue should address the inconsistencies we > see in the scheduled execution of normalising and filtering between Nutchgora > Generator Mapper and trunk Generator mapper/reducer. > Hopefully we can come to some consensus as to the best approach across both > dists. > [0] http://www.mail-archive.com/user%40nutch.apache.org/msg06360.html -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-1374) Workaround for license headers
[ https://issues.apache.org/jira/browse/NUTCH-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770833#comment-17770833 ] Sebastian Nagel commented on NUTCH-1374: The package.html files were replaced by package-info.java containing a license header in NUTCH-2849 > Workaround for license headers > -- > > Key: NUTCH-1374 > URL: https://issues.apache.org/jira/browse/NUTCH-1374 > Project: Nutch > Issue Type: Task > Components: documentation >Affects Versions: 1.4, nutchgora >Reporter: Lewis John McGibbney >Priority: Major > > Currently in both versions of Nutch we have two types of files which DO NOT > contain license headers; namely all package.html files and the test files > within the language detection plugin. On my initial tests, adding license > headers to the language test files breaks the tests so we need to find a > workaround (or the correct syntax) to add commented-out license headers to > these files. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-1635) New crawldb sometimes ends up in current
[ https://issues.apache.org/jira/browse/NUTCH-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770831#comment-17770831 ] Sebastian Nagel commented on NUTCH-1635: Hi [~markus17], did this continue to happen in the last years? Especially after upgrading the MapReduce API (NUTCH-2375). > New crawldb sometimes ends up in current > > > Key: NUTCH-1635 > URL: https://issues.apache.org/jira/browse/NUTCH-1635 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.7 >Reporter: Markus Jelsma >Priority: Major > > In some weird cases the newly created crawldb by updatedb ends up in > crawl/crawldb/current//. So instead of replacing current/, it ends up > inside current/! This causes the generator to fail. > It's impossible to reliably reproduce the problem. It only happened a couple > of times in the last few years. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-1947) Overhaul o.a.n.parse.OutlinkExtractor.java
[ https://issues.apache.org/jira/browse/NUTCH-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1947. Resolution: Abandoned Closing because OutlinkExtractor has seen many updates since then: upgrade to Java 8, replacement of Apache ORO with java.util.regex, etc. > Overhaul o.a.n.parse.OutlinkExtractor.java > --- > > Key: NUTCH-1947 > URL: https://issues.apache.org/jira/browse/NUTCH-1947 > Project: Nutch > Issue Type: Improvement > Components: parser >Affects Versions: 2.3, 1.9 >Reporter: Lewis John McGibbney >Priority: Major > > Right now in both trunk and 2.X, the > [OutlinkExtractor.java|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java] > class needs a bit of TLC. It is referencing JDK1.5 in a few places, there are > misleading URL entries and it boasts some interesting @Deprecated methods > which we could ideally remove. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (NUTCH-1947) Overhaul o.a.n.parse.OutlinkExtractor.java
[ https://issues.apache.org/jira/browse/NUTCH-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-1947. -- > Overhaul o.a.n.parse.OutlinkExtractor.java > --- > > Key: NUTCH-1947 > URL: https://issues.apache.org/jira/browse/NUTCH-1947 > Project: Nutch > Issue Type: Improvement > Components: parser >Affects Versions: 2.3, 1.9 >Reporter: Lewis John McGibbney >Priority: Major > > Right now in both trunk and 2.X, the > [OutlinkExtractor.java|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java] > class needs a bit of TLC. It is referencing JDK1.5 in a few places, there are > misleading URL entries and it boasts some interesting @Deprecated methods > which we could ideally remove. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-2053) Unnecessary dependencies included in ivy.xml (post NUTCH-2038)
[ https://issues.apache.org/jira/browse/NUTCH-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2053. Resolution: Abandoned Closing this old issue (8 years), assuming that dependencies have been updated and cleaned up multiple times since then. > Unnecessary dependencies included in ivy.xml (post NUTCH-2038) > > > Key: NUTCH-2053 > URL: https://issues.apache.org/jira/browse/NUTCH-2053 > Project: Nutch > Issue Type: Bug > Components: build >Affects Versions: 1.11 >Reporter: Lewis John McGibbney >Priority: Major > > Currently in trunk we have an unnecessary dependency included within > ivy/ivy.xml > https://github.com/apache/nutch/blob/trunk/ivy/ivy.xml#L99-L101 > This needs to be removed. > [~asitang] can you please provide context as to why this is OK? I don't want > to break your code so sorry for lack of understanding. Thanks -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (NUTCH-2053) Unnecessary dependencies included in ivy.xml (post NUTCH-2038)
[ https://issues.apache.org/jira/browse/NUTCH-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-2053. -- > Unnecessary dependencies included in ivy.xml (post NUTCH-2038) > > > Key: NUTCH-2053 > URL: https://issues.apache.org/jira/browse/NUTCH-2053 > Project: Nutch > Issue Type: Bug > Components: build >Affects Versions: 1.11 >Reporter: Lewis John McGibbney >Priority: Major > > Currently in trunk we have an unnecessary dependency included within > ivy/ivy.xml > https://github.com/apache/nutch/blob/trunk/ivy/ivy.xml#L99-L101 > This needs to be removed. > [~asitang] can you please provide context as to why this is OK? I don't want > to break your code so sorry for lack of understanding. Thanks -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-2423) Update contributor info page
[ https://issues.apache.org/jira/browse/NUTCH-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2423. Fix Version/s: (was: 1.20) Resolution: Fixed The wiki pages were updated in 2020 and 2021. Thanks for reporting, [~krichter] ! > Update contributor info page > > > Key: NUTCH-2423 > URL: https://issues.apache.org/jira/browse/NUTCH-2423 > Project: Nutch > Issue Type: Task > Components: documentation, wiki >Reporter: Karl-Philipp Richter >Priority: Major > Labels: easytask, help-wanted > > The [contributor info > page](https://wiki.apache.org/nutch/Becoming_A_Nutch_Developer) still > mentions subversion as SCM which I assume is obsolete because there's > git://git.apache.org/nutch.git. It should mention how the devs with write > access deal with pull/merge requests in general or on different popular > platforms (the information that they're not accepted is valuable as well). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-2820) Review sample files used in any23 unit tests
[ https://issues.apache.org/jira/browse/NUTCH-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2820. Resolution: Resolved Resolved with the removal of the any23 plugin (NUTCH-2998). > Review sample files used in any23 unit tests > > > Key: NUTCH-2820 > URL: https://issues.apache.org/jira/browse/NUTCH-2820 > Project: Nutch > Issue Type: Bug > Components: plugin >Affects Versions: 1.17 >Reporter: Sebastian Nagel >Priority: Minor > Fix For: 1.20 > > > The sample files used by unit tests of the any23 plugin include content not > applicable to the Apache license. These should be removed or stripped to a > minimal snippet (mostly HTML markup). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-2888) Selenium Protocol: Support for Selenium 4
[ https://issues.apache.org/jira/browse/NUTCH-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2888. Resolution: Duplicate Thanks, [~mmkivist]! This issue was resolved by NUTCH-2980 and will be included in the 1.20 release of Nutch. > Selenium Protocol: Support for Selenium 4 > - > > Key: NUTCH-2888 > URL: https://issues.apache.org/jira/browse/NUTCH-2888 > Project: Nutch > Issue Type: New Feature > Components: protocol >Affects Versions: 1.18 >Reporter: Mikko Kivistoe >Priority: Minor > Fix For: 1.20 > > > Hi, > Selenium 4 is out and its Grid version now supports HTTPS traffic between > the Hub and Nodes. The Selenium 4 API has changed, and it would be good to > have Nutch compatible with it -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2888) Selenium Protocol: Support for Selenium 4
[ https://issues.apache.org/jira/browse/NUTCH-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2888: --- Affects Version/s: 1.18 > Selenium Protocol: Support for Selenium 4 > - > > Key: NUTCH-2888 > URL: https://issues.apache.org/jira/browse/NUTCH-2888 > Project: Nutch > Issue Type: New Feature > Components: protocol >Affects Versions: 1.18 >Reporter: Mikko Kivistoe >Priority: Minor > Fix For: 1.20 > > > Hi, > Selenium 4 is out and its Grid version now supports HTTPS traffic between > the Hub and Nodes. The Selenium 4 API has changed, and it would be good to > have Nutch compatible with it -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2888) Selenium Protocol: Support for Selenium 4
[ https://issues.apache.org/jira/browse/NUTCH-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2888: --- Fix Version/s: 1.20 > Selenium Protocol: Support for Selenium 4 > - > > Key: NUTCH-2888 > URL: https://issues.apache.org/jira/browse/NUTCH-2888 > Project: Nutch > Issue Type: New Feature > Components: protocol >Reporter: Mikko Kivistoe >Priority: Minor > Fix For: 1.20 > > > Hi, > Selenium 4 is out and its Grid version now supports HTTPS traffic between > the Hub and Nodes. The Selenium 4 API has changed, and it would be good to > have Nutch compatible with it -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3007) Fix impossible casts
[ https://issues.apache.org/jira/browse/NUTCH-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3007. Resolution: Fixed Thanks for the review, [~markus17]! > Fix impossible casts > > > Key: NUTCH-3007 > URL: https://issues.apache.org/jira/browse/NUTCH-3007 > Project: Nutch > Issue Type: Sub-task >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > Spotbugs reports two occurrences of > Impossible cast from java.util.ArrayList to String[] in > org.apache.nutch.fetcher.Fetcher.run(Map, String) > Both were introduced later into the {{run(Map args, String > crawlId)}} method and obviously never used (would throw a > ClassCastException). The code blocks should be removed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
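For illustration, the bug pattern behind the Spotbugs finding in NUTCH-3007, and the safe alternative, can be shown with a hypothetical stand-in (this is not the Fetcher code itself):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Demonstrates the "impossible cast" Spotbugs warns about: an
 * ArrayList can never be cast to String[], so the cast always throws
 * ClassCastException at runtime even though it compiles when the value
 * is typed as Object. List.toArray is the safe conversion.
 */
public class ImpossibleCastDemo {

  /** Compiles, but always throws ClassCastException for a List argument. */
  public static String[] impossibleCast(Object arg) {
    return (String[]) arg;
  }

  /** Correct conversion from a list of strings to an array. */
  public static String[] safeConvert(List<String> arg) {
    return arg.toArray(new String[0]);
  }
}
```

Because the affected code blocks in Fetcher could never have executed successfully, removing them (as the issue concludes) is equivalent to removing dead code.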
[jira] [Resolved] (NUTCH-2852) Method invokes System.exit(...) 9 bugs
[ https://issues.apache.org/jira/browse/NUTCH-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2852. Resolution: Fixed > Method invokes System.exit(...) 9 bugs > -- > > Key: NUTCH-2852 > URL: https://issues.apache.org/jira/browse/NUTCH-2852 > Project: Nutch > Issue Type: Sub-task >Affects Versions: 1.18 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > org.apache.nutch.indexer.IndexingFiltersChecker since first historized release > In class org.apache.nutch.indexer.IndexingFiltersChecker > In method org.apache.nutch.indexer.IndexingFiltersChecker.run(String[]) > At IndexingFiltersChecker.java:[line 96] > Another occurrence at IndexingFiltersChecker.java:[line 129] > org.apache.nutch.indexer.IndexingFiltersChecker.run(String[]) invokes > System.exit(...), which shuts down the entire virtual machine > Invoking System.exit shuts down the entire Java virtual machine. This should > only be done when it is appropriate. Such calls make it hard or impossible > for your code to be invoked by other code. Consider throwing a > RuntimeException instead. > Also occurs in >org.apache.nutch.net.URLFilterChecker since first historized release >org.apache.nutch.net.URLNormalizerChecker since first historized release >org.apache.nutch.parse.ParseSegment since first historized release >org.apache.nutch.parse.ParserChecker since first historized release >org.apache.nutch.service.NutchServer since first historized release >org.apache.nutch.tools.CommonCrawlDataDumper since first historized release >org.apache.nutch.tools.DmozParser since first historized release >org.apache.nutch.util.AbstractChecker since first historized release -- This message was sent by Atlassian Jira (v8.20.10#820010)
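A sketch of the refactoring direction the report describes: signal failure through a return status (or an exception) instead of calling System.exit, so the method remains callable from other code. The tool name and usage text below are invented for illustration:

```java
/**
 * Hypothetical checker tool showing the fix for the Spotbugs finding:
 * instead of System.exit(-1) deep inside run(), failure is reported
 * via the return value (the Hadoop ToolRunner convention), and only
 * the outermost main() translates that status into a JVM exit.
 */
public class CheckerToolSketch {

  public static int run(String[] args) {
    if (args.length == 0) {
      // Before: System.exit(-1) -- would also kill the JVM of any
      // embedding caller. After: report failure and let the caller decide.
      System.err.println("Usage: CheckerToolSketch <url>...");
      return -1;
    }
    // ... actual checking would happen here ...
    return 0;
  }

  public static void main(String[] args) {
    // The top-level entry point is the only place that exits the JVM.
    System.exit(run(args));
  }
}
```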
[jira] [Updated] (NUTCH-3010) Injector: count unique number of injected URLs
[ https://issues.apache.org/jira/browse/NUTCH-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3010: --- Description: Injector uses two counters: one for the total number of injected URLs, the other for the number of URLs "merged", that is already in CrawlDb. There is no counter for the number of unique URLs injected, which may lead to wrong counts if the seed files contain duplicates: Suppose the following seed file which contains a duplicated URL: {noformat} $> cat seeds_with_duplicates.txt https://www.example.org/page1.html https://www.example.org/page2.html https://www.example.org/page2.html $> $NUTCH_HOME/bin/nutch inject /tmp/crawldb seeds_with_duplicates.txt ... 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 0 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 3 ... {noformat} However, because of the duplicated URL, only 2 URLs were injected into the CrawlDb: {noformat} $> $NUTCH_HOME/bin/nutch readdb /tmp/crawldb -stats ... 2023-09-30 07:39:43,945 INFO o.a.n.c.CrawlDbReader [main] TOTAL urls: 2 ... 
{noformat} If the Injector job is run again with the same input, we get the erroneous output that one "new URL" was still injected: {noformat} 2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0 2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3 2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 2 2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 1 {noformat} This is because the urls_merged counter counts unique items, while url_injected does not, and the shown number is the difference between both counters. Adding a counter for the number of unique injected URLs makes it possible to get the correct count of newly injected URLs. was: Injector uses two counters: one for the total number of injected URLs, the other for the number of URLs "merged", that is already in CrawlDb. There is no counter for the number of unique URLs injected, which may lead to wrong counts if the seed files contain duplicates: Suppose the following seed file which contains a duplicated URL: {noformat} $> cat seeds_with_duplicates.txt https://www.example.org/page1.html https://www.example.org/page2.html https://www.example.org/page2.html $> $NUTCH_HOME/bin/nutch inject /tmp/crawldb seeds_with_duplicates.txt ... 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 0 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 3 ... {noformat} However, because of the duplicated URL, only 2 URLs were injected into the CrawlDb: {noformat} $> $NUTCH_HOME/bin/nutch readdb /tmp/crawldb -stats ... 
2023-09-30 07:39:43,945 INFO o.a.n.c.CrawlDbReader [main] TOTAL urls: 2 ... {noformat} If the Injector job is run again with the same input, we get the erroneous output that one "new URL" was still injected: {noformat} 2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0 2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3 2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 2 2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 1 {noformat} This is because the urls_merged counter counts unique items, while url_injected does not. Adding a counter for the number of unique injected URLs makes it possible to get the correct count of newly injected URLs. > Injector: count unique number of injected URLs > -- > > Key: NUTCH-3010 > URL: https://issues.apache.org/jira/browse/NUTCH-3010 > Project: Nutch > Issue Type: Improvement > Components: injector >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > Injector uses two counters: one for the total number of injected URLs, the > other for the number of URLs "merged", that is already in CrawlDb. There is > no counter for the number of
[jira] [Created] (NUTCH-3010) Injector: count unique number of injected URLs
Sebastian Nagel created NUTCH-3010: -- Summary: Injector: count unique number of injected URLs Key: NUTCH-3010 URL: https://issues.apache.org/jira/browse/NUTCH-3010 Project: Nutch Issue Type: Improvement Components: injector Affects Versions: 1.19 Reporter: Sebastian Nagel Assignee: Sebastian Nagel Fix For: 1.20 Injector uses two counters: one for the total number of injected URLs, the other for the number of URLs "merged", that is already in CrawlDb. There is no counter for the number of unique URLs injected, which may lead to wrong counts if the seed files contain duplicates: Suppose the following seed file which contains a duplicated URL: {noformat} $> cat seeds_with_duplicates.txt https://www.example.org/page1.html https://www.example.org/page2.html https://www.example.org/page2.html $> $NUTCH_HOME/bin/nutch inject /tmp/crawldb seeds_with_duplicates.txt ... 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 0 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 3 ... {noformat} However, because of the duplicated URL, only 2 URLs were injected into the CrawlDb: {noformat} $> $NUTCH_HOME/bin/nutch readdb /tmp/crawldb -stats ... 2023-09-30 07:39:43,945 INFO o.a.n.c.CrawlDbReader [main] TOTAL urls: 2 ... 
{noformat} If the Injector job is run again with the same input, we get the erroneous output that one "new URL" was still injected: {noformat} 2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0 2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3 2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 2 2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 1 {noformat} This is because the urls_merged counter counts unique items, while url_injected does not. Adding a counter for the number of unique injected URLs makes it possible to get the correct count of newly injected URLs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3006) Downgrade Tika dependency to 2.2.1 (core and parse-tika)
[ https://issues.apache.org/jira/browse/NUTCH-3006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770320#comment-17770320 ] Sebastian Nagel commented on NUTCH-3006: > revert CloseShieldInputStream.wrap(), which I think was the only conflict Yes, looks like it was the only conflict. If it's an option to revert this, yes, why not. The idea of the downgrade was more to avoid this issue blocking any release. And downgrading from 2.3.0 (current master) to 2.2.1 sounds less dramatic. > how far out Hadoop 3.4.0 is Even if it's released, it takes some time (a couple of months) until Hadoop distributions (for example Apache Bigtop) pick up the release and/or users deploy it. > Downgrade Tika dependency to 2.2.1 (core and parse-tika) > > > Key: NUTCH-3006 > URL: https://issues.apache.org/jira/browse/NUTCH-3006 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > Tika 2.3.0 and upwards depend on commons-io 2.11.0 (or even higher) which > is not available when Nutch is used on Hadoop. Only Hadoop 3.4.0 is expected > to ship with commons-io 2.11.0 (HADOOP-18301); all currently released > versions provide commons-io 2.8.0. Because Hadoop-required dependencies are > enforced in (pseudo)distributed mode, using Tika may cause issues, see > NUTCH-2937 and NUTCH-2959. > [~lewismc] suggested in the discussion of [GitHub PR > #776|https://github.com/apache/nutch/pull/776] to downgrade to Tika 2.2.1 to > resolve these issues for now and until Hadoop 3.4.0 becomes available. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2979) Upgrade Commons Text to 1.10.0
[ https://issues.apache.org/jira/browse/NUTCH-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770041#comment-17770041 ] Sebastian Nagel commented on NUTCH-2979: Note: upgrading to Hadoop 3.3.6 (NUTCH-3009) will update the core dependency to commons-text 1.10.0 > Upgrade Commons Text to 1.10.0 > -- > > Key: NUTCH-2979 > URL: https://issues.apache.org/jira/browse/NUTCH-2979 > Project: Nutch > Issue Type: Bug > Components: build, plugin >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Priority: Major > Labels: help-wanted > Fix For: 1.20 > > > In order to address > [CVE-2022-42889|https://nvd.nist.gov/vuln/detail/CVE-2022-42889] we should > upgrade to commons-text 1.10.0: > - Nutch core depends on 1.4 which is not affected by the CVE > - the plugins lib-htmlunit and any23 depend on a vulnerable commons-text > version (1.5 - 1.9) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3009) Upgrade to Hadoop 3.3.6
Sebastian Nagel created NUTCH-3009: -- Summary: Upgrade to Hadoop 3.3.6 Key: NUTCH-3009 URL: https://issues.apache.org/jira/browse/NUTCH-3009 Project: Nutch Issue Type: Improvement Affects Versions: 1.19 Reporter: Sebastian Nagel Fix For: 1.20 Upgrade to [Hadoop 3.3.6|https://hadoop.apache.org/release/3.3.6.html], the latest available release of Hadoop (release date: 2023-06-23). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-2979) Upgrade Commons Text to 1.10.0
[ https://issues.apache.org/jira/browse/NUTCH-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2979. Resolution: Fixed Resolved, so far, without any direct action: - Nutch core still depends on 1.4 which is not affected by the CVE - the plugin any23 was removed (NUTCH-2998) - the plugin lib-htmlunit now depends on commons-text 1.10.0 after the Selenium dependency was upgraded by NUTCH-2980 > Upgrade Commons Text to 1.10.0 > -- > > Key: NUTCH-2979 > URL: https://issues.apache.org/jira/browse/NUTCH-2979 > Project: Nutch > Issue Type: Bug > Components: build, plugin >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Priority: Major > Labels: help-wanted > Fix For: 1.20 > > > In order to address > [CVE-2022-42889|https://nvd.nist.gov/vuln/detail/CVE-2022-42889] we should > upgrade to commons-text 1.10.0: > - Nutch core depends on 1.4 which is not affected by the CVE > - the plugins lib-htmlunit and any23 depend on a vulnerable commons-text > version (1.5 - 1.9) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3008) indexer-elastic: downgrade to ES 7.10.2 to address licensing issues
Sebastian Nagel created NUTCH-3008: -- Summary: indexer-elastic: downgrade to ES 7.10.2 to address licensing issues Key: NUTCH-3008 URL: https://issues.apache.org/jira/browse/NUTCH-3008 Project: Nutch Issue Type: Bug Components: indexer, plugin Affects Versions: 1.19 Reporter: Sebastian Nagel Fix For: 1.20 Downgrade to ES 7.10.2 (licensed under ASF 2.0) as an alternative solution to address the licensing issues of the indexer-elastic plugin. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3007) Fix impossible casts
Sebastian Nagel created NUTCH-3007: -- Summary: Fix impossible casts Key: NUTCH-3007 URL: https://issues.apache.org/jira/browse/NUTCH-3007 Project: Nutch Issue Type: Sub-task Affects Versions: 1.19 Reporter: Sebastian Nagel Assignee: Sebastian Nagel Fix For: 1.20 Spotbugs reports two occurrences of Impossible cast from java.util.ArrayList to String[] in org.apache.nutch.fetcher.Fetcher.run(Map, String) Both were introduced later into the {{run(Map args, String crawlId)}} method and obviously never used (would throw a ClassCastException). The code blocks should be removed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2852) Method invokes System.exit(...) 9 bugs
[ https://issues.apache.org/jira/browse/NUTCH-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769977#comment-17769977 ] Sebastian Nagel commented on NUTCH-2852: The PR addresses all corresponding issues in the checker tools. That's everything that can be done without investing too much time: DmozParser and CommonCrawlDataDumper would need a closer look, and for NutchServer I don't know how to stop it gracefully. > Method invokes System.exit(...) 9 bugs > -- > > Key: NUTCH-2852 > URL: https://issues.apache.org/jira/browse/NUTCH-2852 > Project: Nutch > Issue Type: Sub-task >Affects Versions: 1.18 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > org.apache.nutch.indexer.IndexingFiltersChecker since first historized release > In class org.apache.nutch.indexer.IndexingFiltersChecker > In method org.apache.nutch.indexer.IndexingFiltersChecker.run(String[]) > At IndexingFiltersChecker.java:[line 96] > Another occurrence at IndexingFiltersChecker.java:[line 129] > org.apache.nutch.indexer.IndexingFiltersChecker.run(String[]) invokes > System.exit(...), which shuts down the entire virtual machine > Invoking System.exit shuts down the entire Java virtual machine. This should > only be done when it is appropriate. Such calls make it hard or impossible > for your code to be invoked by other code. Consider throwing a > RuntimeException instead. 
> Also occurs in >org.apache.nutch.net.URLFilterChecker since first historized release >org.apache.nutch.net.URLNormalizerChecker since first historized release >org.apache.nutch.parse.ParseSegment since first historized release >org.apache.nutch.parse.ParserChecker since first historized release >org.apache.nutch.service.NutchServer since first historized release >org.apache.nutch.tools.CommonCrawlDataDumper since first historized release >org.apache.nutch.tools.DmozParser since first historized release >org.apache.nutch.util.AbstractChecker since first historized release -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3006) Downgrade Tika dependency to 2.2.1 (core and parse-tika)
Sebastian Nagel created NUTCH-3006: -- Summary: Downgrade Tika dependency to 2.2.1 (core and parse-tika) Key: NUTCH-3006 URL: https://issues.apache.org/jira/browse/NUTCH-3006 Project: Nutch Issue Type: Bug Affects Versions: 1.20 Reporter: Sebastian Nagel Fix For: 1.20 Tika 2.3.0 and upwards depend on commons-io 2.11.0 (or even higher), which is not available when Nutch is used on Hadoop. Only Hadoop 3.4.0 is expected to ship with commons-io 2.11.0 (HADOOP-18301); all currently released versions provide commons-io 2.8.0. Because Hadoop-required dependencies are enforced in (pseudo)distributed mode, using Tika may cause issues, see NUTCH-2937 and NUTCH-2959. [~lewismc] suggested in the discussion of [GitHub PR #776|https://github.com/apache/nutch/pull/776] to downgrade to Tika 2.2.1 to resolve these issues for now, until Hadoop 3.4.0 becomes available. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3004) Avoid NPE in HttpResponse
[ https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3004: --- Fix Version/s: 1.20 > Avoid NPE in HttpResponse > - > > Key: NUTCH-3004 > URL: https://issues.apache.org/jira/browse/NUTCH-3004 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Tim Allison >Priority: Trivial > Fix For: 1.20 > > > I recently deployed Nutch on a FIPS-enabled RHEL 8 instance, and I got an NPE > in HttpResponse. When I set the log level to debug, I could see what was > happening, but it would have been better to get a meaningful exception rather > than an NPE. > The issue is that in the catch clause, the exception is propagated only if > the message is "handshake alert..." and then the reconnect fails. If the > message is not that, then the SSL socket remains null, and we get an NPE > below the source I quote here. > I think we should throw the same HttpException that we throw in the nested > try if the message is not "handshake alert..." > {code:java} > try { > sslsocket = getSSLSocket(socket, sockHost, sockPort); > sslsocket.startHandshake(); > } catch (Exception e) { > Http.LOG.debug("SSL connection to {} failed with: {}", url, > e.getMessage()); > if ("handshake alert: unrecognized_name".equals(e.getMessage())) { > try { > // Reconnect, see NUTCH-2447 > socket = new Socket(); > socket.setSoTimeout(http.getTimeout()); > socket.connect(sockAddr, http.getTimeout()); > sslsocket = getSSLSocket(socket, "", sockPort); > sslsocket.startHandshake(); > } catch (Exception ex) { > String msg = "SSL reconnect to " + url + " failed with: " > + e.getMessage(); > throw new HttpException(msg); > } > } > } > socket = sslsocket; > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
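The proposed change can be sketched in isolation: if the handshake fails with anything other than the unrecognized_name alert, throw a meaningful exception right away instead of falling through with a null socket. This is a self-contained illustration, not the actual HttpResponse code; Nutch's real HttpException is a checked exception, made unchecked here for brevity:

```java
// Standalone sketch of the proposed fix (not Nutch's HttpResponse).
class SslHandshakeSketch {

  // Stand-in for Nutch's HttpException, which is checked; unchecked here
  // to keep the sketch short.
  static class HttpException extends RuntimeException {
    HttpException(String msg) { super(msg); }
  }

  interface Handshake { void start() throws Exception; }

  static String connect(String url, Handshake handshake) {
    String sslsocket = null;
    try {
      handshake.start();
      sslsocket = "connected";
    } catch (Exception e) {
      if ("handshake alert: unrecognized_name".equals(e.getMessage())) {
        // The NUTCH-2447 reconnect attempt would happen here.
        sslsocket = "reconnected";
      } else {
        // Proposed addition: fail with a meaningful exception instead of
        // leaving sslsocket null, which later causes the NPE.
        throw new HttpException(
            "SSL connect to " + url + " failed with: " + e.getMessage());
      }
    }
    return sslsocket; // never null any more
  }
}
```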
[jira] [Updated] (NUTCH-3004) Avoid NPE in HttpResponse
[ https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3004: --- Component/s: plugin protocol > Avoid NPE in HttpResponse > - > > Key: NUTCH-3004 > URL: https://issues.apache.org/jira/browse/NUTCH-3004 > Project: Nutch > Issue Type: Improvement > Components: plugin, protocol >Affects Versions: 1.19 >Reporter: Tim Allison >Priority: Trivial > Fix For: 1.20 > > > I recently deployed Nutch on a FIPS-enabled RHEL 8 instance, and I got an NPE > in HttpResponse. When I set the log level to debug, I could see what was > happening, but it would have been better to get a meaningful exception rather > than an NPE. > The issue is that in the catch clause, the exception is propagated only if > the message is "handshake alert..." and then the reconnect fails. If the > message is not that, then the SSL socket remains null, and we get an NPE > below the source I quote here. > I think we should throw the same HttpException that we throw in the nested > try if the message is not "handshake alert..." > {code:java} > try { > sslsocket = getSSLSocket(socket, sockHost, sockPort); > sslsocket.startHandshake(); > } catch (Exception e) { > Http.LOG.debug("SSL connection to {} failed with: {}", url, > e.getMessage()); > if ("handshake alert: unrecognized_name".equals(e.getMessage())) { > try { > // Reconnect, see NUTCH-2447 > socket = new Socket(); > socket.setSoTimeout(http.getTimeout()); > socket.connect(sockAddr, http.getTimeout()); > sslsocket = getSSLSocket(socket, "", sockPort); > sslsocket.startHandshake(); > } catch (Exception ex) { > String msg = "SSL reconnect to " + url + " failed with: " > + e.getMessage(); > throw new HttpException(msg); > } > } > } > socket = sslsocket; > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3004) Avoid NPE in HttpResponse
[ https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3004: --- Affects Version/s: 1.19 > Avoid NPE in HttpResponse > - > > Key: NUTCH-3004 > URL: https://issues.apache.org/jira/browse/NUTCH-3004 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Tim Allison >Priority: Trivial > > I recently deployed Nutch on a FIPS-enabled RHEL 8 instance, and I got an NPE > in HttpResponse. When I set the log level to debug, I could see what was > happening, but it would have been better to get a meaningful exception rather > than an NPE. > The issue is that in the catch clause, the exception is propagated only if > the message is "handshake alert..." and then the reconnect fails. If the > message is not that, then the SSL socket remains null, and we get an NPE > below the source I quote here. > I think we should throw the same HttpException that we throw in the nested > try if the message is not "handshake alert..." > {code:java} > try { > sslsocket = getSSLSocket(socket, sockHost, sockPort); > sslsocket.startHandshake(); > } catch (Exception e) { > Http.LOG.debug("SSL connection to {} failed with: {}", url, > e.getMessage()); > if ("handshake alert: unrecognized_name".equals(e.getMessage())) { > try { > // Reconnect, see NUTCH-2447 > socket = new Socket(); > socket.setSoTimeout(http.getTimeout()); > socket.connect(sockAddr, http.getTimeout()); > sslsocket = getSSLSocket(socket, "", sockPort); > sslsocket.startHandshake(); > } catch (Exception ex) { > String msg = "SSL reconnect to " + url + " failed with: " > + e.getMessage(); > throw new HttpException(msg); > } > } > } > socket = sslsocket; > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-585: -- Priority: Major (was: Minor) > [PARSE-HTML plugin] Block certain parts of HTML code from being indexed > --- > > Key: NUTCH-585 > URL: https://issues.apache.org/jira/browse/NUTCH-585 > Project: Nutch > Issue Type: Improvement > Components: HTML, parse-filter, parser, plugin >Affects Versions: 0.9.0 > Environment: All operating systems >Reporter: Andrea Spinelli >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > Attachments: blacklist_whitelist_plugin.patch, > nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch > > > We are using Nutch to index our own web sites; we would like not to index > certain parts of our pages, because we know they are not relevant (for > instance, there are several links to change the background color) and > generate spurious matches. > We have modified the plugin so that it ignores HTML code between certain HTML > comments, like > > ... ignored part ... > > We feel this might be useful to someone else, maybe factorizing the comment > strings as constants in the configuration files (say parser.html.ignore.start > and parser.html.ignore.stop in nutch-site.xml). > We are almost ready to contribute our code snippet. Looking forward to any > expression of interest - or to an explanation of why what we are doing is > plain wrong! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-585: -- Component/s: parse-filter HTML parser plugin > [PARSE-HTML plugin] Block certain parts of HTML code from being indexed > --- > > Key: NUTCH-585 > URL: https://issues.apache.org/jira/browse/NUTCH-585 > Project: Nutch > Issue Type: Improvement > Components: HTML, parse-filter, parser, plugin >Affects Versions: 0.9.0 > Environment: All operating systems >Reporter: Andrea Spinelli >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.20 > > Attachments: blacklist_whitelist_plugin.patch, > nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch > > > We are using Nutch to index our own web sites; we would like not to index > certain parts of our pages, because we know they are not relevant (for > instance, there are several links to change the background color) and > generate spurious matches. > We have modified the plugin so that it ignores HTML code between certain HTML > comments, like > > ... ignored part ... > > We feel this might be useful to someone else, maybe factorizing the comment > strings as constants in the configuration files (say parser.html.ignore.start > and parser.html.ignore.stop in nutch-site.xml). > We are almost ready to contribute our code snippet. Looking forward to any > expression of interest - or to an explanation of why what we are doing is > plain wrong! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-585: - Assignee: Sebastian Nagel (was: Markus Jelsma) > [PARSE-HTML plugin] Block certain parts of HTML code from being indexed > --- > > Key: NUTCH-585 > URL: https://issues.apache.org/jira/browse/NUTCH-585 > Project: Nutch > Issue Type: Improvement >Affects Versions: 0.9.0 > Environment: All operating systems >Reporter: Andrea Spinelli >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.20 > > Attachments: blacklist_whitelist_plugin.patch, > nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch > > > We are using Nutch to index our own web sites; we would like not to index > certain parts of our pages, because we know they are not relevant (for > instance, there are several links to change the background color) and > generate spurious matches. > We have modified the plugin so that it ignores HTML code between certain HTML > comments, like > > ... ignored part ... > > We feel this might be useful to someone else, maybe factorizing the comment > strings as constants in the configuration files (say parser.html.ignore.start > and parser.html.ignore.stop in nutch-site.xml). > We are almost ready to contribute our code snippet. Looking forward to any > expression of interest - or to an explanation of why what we are doing is > plain wrong! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-585: -- Fix Version/s: 1.20 > [PARSE-HTML plugin] Block certain parts of HTML code from being indexed > --- > > Key: NUTCH-585 > URL: https://issues.apache.org/jira/browse/NUTCH-585 > Project: Nutch > Issue Type: Improvement >Affects Versions: 0.9.0 > Environment: All operating systems >Reporter: Andrea Spinelli >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.20 > > Attachments: blacklist_whitelist_plugin.patch, > nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch > > > We are using Nutch to index our own web sites; we would like not to index > certain parts of our pages, because we know they are not relevant (for > instance, there are several links to change the background color) and > generate spurious matches. > We have modified the plugin so that it ignores HTML code between certain HTML > comments, like > > ... ignored part ... > > We feel this might be useful to someone else, maybe factorizing the comment > strings as constants in the configuration files (say parser.html.ignore.start > and parser.html.ignore.stop in nutch-site.xml). > We are almost ready to contribute our code snippet. Looking forward to any > expression of interest - or to an explanation of why what we are doing is > plain wrong! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3002) Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive
Sebastian Nagel created NUTCH-3002: -- Summary: Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive Key: NUTCH-3002 URL: https://issues.apache.org/jira/browse/NUTCH-3002 Project: Nutch Issue Type: Bug Components: metadata, plugin, protocol Affects Versions: 1.19 Reporter: Sebastian Nagel Fix For: 1.20 Lookup of HTTP headers in the class HttpResponse should be case-insensitive - for example, any "Location" header should be returned independently of the casing sent by the sender. While protocol-http uses the class SpellCheckedMetadata, which provides case-insensitive lookups (as part of the spell-checking functionality), protocol-okhttp relies on the class Metadata, which stores metadata values case-sensitively. It's a good question whether we still need to spell-check HTTP headers; however, case-insensitive look-ups are definitely required, especially since HTTP header names are case-insensitive in HTTP/2. -- This message was sent by Atlassian Jira (v8.20.10#820010)
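Case-insensitive lookup needs no spell-checking; in plain Java it can be had from a TreeMap ordered by String.CASE_INSENSITIVE_ORDER. The sketch below only illustrates the required behavior and is not the Nutch Metadata API (which maps a name to multiple values):

```java
import java.util.Map;
import java.util.TreeMap;

// Illustration of case-insensitive HTTP header lookup; not Nutch's Metadata.
class CaseInsensitiveHeaders {
  private final Map<String, String> headers =
      new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

  void set(String name, String value) {
    headers.put(name, value);
  }

  // "location", "Location" and "LOCATION" all hit the same entry.
  String get(String name) {
    return headers.get(name);
  }
}
```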
[jira] [Commented] (NUTCH-3000) protocol-selenium returns only the body, strips off the head element
[ https://issues.apache.org/jira/browse/NUTCH-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764692#comment-17764692 ] Sebastian Nagel commented on NUTCH-3000: +1 Yes, the full HTML seems the best choice for the default. > protocol-selenium returns only the body, strips off the head element > -- > > Key: NUTCH-3000 > URL: https://issues.apache.org/jira/browse/NUTCH-3000 > Project: Nutch > Issue Type: Bug > Components: protocol >Reporter: Tim Allison >Priority: Major > > The selenium protocol returns only the body portion of the HTML, which means > that neither the title nor the other page metadata in the head section > gets extracted. > {noformat} > String innerHtml = driver.findElement(By.tagName("body")) > .getAttribute("innerHTML"); > {noformat} > We should return the full HTML, no? -- This message was sent by Atlassian Jira (v8.20.10#820010)
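For reference, Selenium's WebDriver already exposes the complete rendered document via getPageSource(). A sketch of the suggested change (illustrative only; it needs a live browser session to run):

```
// Before: only the <body> content; <title> and <head> metadata are lost.
String innerHtml = driver.findElement(By.tagName("body")).getAttribute("innerHTML");

// Suggested: the full HTML of the rendered page.
String fullHtml = driver.getPageSource();
```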
[jira] [Comment Edited] (NUTCH-2998) Remove the Any23 plugin
[ https://issues.apache.org/jira/browse/NUTCH-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764672#comment-17764672 ] Sebastian Nagel edited comment on NUTCH-2998 at 9/13/23 1:26 PM: - +1 > Are there other lists/communications channels I should pursue with this? A short notice to user@/dev@ might be good. was (Author: wastl-nagel): +1 > Remove the Any23 plugin > --- > > Key: NUTCH-2998 > URL: https://issues.apache.org/jira/browse/NUTCH-2998 > Project: Nutch > Issue Type: Task > Components: any23 >Reporter: Tim Allison >Priority: Major > > I'm not sure how we want to handle this. Any23 moved to the Attic in June > 2023. We should probably remove it from Nutch? I'm not sure how abruptly we > want to do that. > We could deprecate it for 1.20 and then remove it in 1.21 or later? Or we > could choose to remove it for 1.20. > What do you think? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2998) Remove the Any23 plugin
[ https://issues.apache.org/jira/browse/NUTCH-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764672#comment-17764672 ] Sebastian Nagel commented on NUTCH-2998: +1 > Remove the Any23 plugin > --- > > Key: NUTCH-2998 > URL: https://issues.apache.org/jira/browse/NUTCH-2998 > Project: Nutch > Issue Type: Task > Components: any23 >Reporter: Tim Allison >Priority: Major > > I'm not sure how we want to handle this. Any23 moved to the Attic in June > 2023. We should probably remove it from Nutch? I'm not sure how abruptly we > want to do that. > We could deprecate it for 1.20 and then remove it in 1.21 or later? Or we > could choose to remove it for 1.20. > What do you think? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-2997) Add Override annotations where applicable
[ https://issues.apache.org/jira/browse/NUTCH-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2997. Resolution: Implemented > Add Override annotations where applicable > - > > Key: NUTCH-2997 > URL: https://issues.apache.org/jira/browse/NUTCH-2997 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Trivial > Fix For: 1.20 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (NUTCH-2997) Add Override annotations where applicable
[ https://issues.apache.org/jira/browse/NUTCH-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2997: -- Assignee: Sebastian Nagel > Add Override annotations where applicable > - > > Key: NUTCH-2997 > URL: https://issues.apache.org/jira/browse/NUTCH-2997 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Trivial > Fix For: 1.20 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-2996) Use new SimpleRobotRulesParser API entry point (crawler-commons 1.4)
[ https://issues.apache.org/jira/browse/NUTCH-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2996. Resolution: Implemented > Use new SimpleRobotRulesParser API entry point (crawler-commons 1.4) > > > Key: NUTCH-2996 > URL: https://issues.apache.org/jira/browse/NUTCH-2996 > Project: Nutch > Issue Type: Improvement > Components: robots >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > Crawler-commons 1.4 (#1085) robots.txt parser (SimpleRobotRulesParser) > introduces a new [API entry point to parse the robots.txt > content|https://crawler-commons.github.io/crawler-commons/1.4/crawlercommons/robots/SimpleRobotRulesParser.html#parseContent(java.lang.String,byte%5B%5D,java.lang.String,java.util.Collection)]: > - it's more efficient by accepting a collection of lower-cased, single-word > user-agent product tokens, without the need to tokenize a (comma-separated) > list of user-agent strings again with every robots.txt > - user-agent matching is compliant with [RFC 9309 (section > 2.2.1)|https://www.rfc-editor.org/rfc/rfc9309.html#name-the-user-agent-line] > only if the new API method is used -- This message was sent by Atlassian Jira (v8.20.10#820010)
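The efficiency point can be illustrated without crawler-commons: derive the lower-cased product tokens once from the configured agent string, then match each robots.txt "User-agent:" line by exact, case-insensitive token comparison as RFC 9309 (section 2.2.1) requires. The helper below is an invented sketch, not the SimpleRobotRulesParser implementation; the new crawler-commons parseContent entry point accepts such a pre-built collection of tokens:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Locale;
import java.util.stream.Collectors;

// Illustrative sketch (not crawler-commons code): build the token collection
// once, instead of re-tokenizing a comma-separated agent string for every
// robots.txt fetched.
class RobotTokens {

  // e.g. "MyCrawler/1.2, Nutch" -> ["mycrawler", "nutch"]
  static Collection<String> toProductTokens(String agentNames) {
    return Arrays.stream(agentNames.split(","))
        .map(String::trim)
        .map(s -> s.split("[/ ]")[0])        // keep the product token only
        .map(s -> s.toLowerCase(Locale.ROOT))
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toList());
  }

  // Exact, case-insensitive token match per RFC 9309, section 2.2.1.
  static boolean matches(Collection<String> tokens, String userAgentLine) {
    return tokens.contains(userAgentLine.trim().toLowerCase(Locale.ROOT));
  }
}
```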
[jira] [Resolved] (NUTCH-2995) Upgrade to crawler-commons 1.4
[ https://issues.apache.org/jira/browse/NUTCH-2995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2995. Resolution: Implemented > Upgrade to crawler-commons 1.4 > -- > > Key: NUTCH-2995 > URL: https://issues.apache.org/jira/browse/NUTCH-2995 > Project: Nutch > Issue Type: Improvement > Components: robots >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2993: --- Component/s: plugin scoring Affects Version/s: 1.19 > ScoringDepth plugin to skip depth check based on URL Pattern > > > Key: NUTCH-2993 > URL: https://issues.apache.org/jira/browse/NUTCH-2993 > Project: Nutch > Issue Type: Improvement > Components: plugin, scoring >Affects Versions: 1.19 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.20 > > Attachments: NUTCH-2993-1.15-1.patch, NUTCH-2993-1.15.patch > > > We do not want some crawl to go deep and broad, but instead focus it on a > narrow section of sites. > This patch overrides maxDepth for outlinks of URLs matching a configured > pattern. URLs not matching the pattern get the default max depth value > configured. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2993. Resolution: Implemented Committed/merged. Thanks, [~markus17]! > ScoringDepth plugin to skip depth check based on URL Pattern > > > Key: NUTCH-2993 > URL: https://issues.apache.org/jira/browse/NUTCH-2993 > Project: Nutch > Issue Type: Improvement > Components: plugin, scoring >Affects Versions: 1.19 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.20 > > Attachments: NUTCH-2993-1.15-1.patch, NUTCH-2993-1.15.patch > > > We do not want some crawl to go deep and broad, but instead focus it on a > narrow section of sites. > This patch overrides maxDepth for outlinks of URLs matching a configured > pattern. URLs not matching the pattern get the default max depth value > configured. -- This message was sent by Atlassian Jira (v8.20.10#820010)
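The patch's idea, as described above, can be sketched as a tiny helper: outlinks of URLs matching a configured pattern get an overriding max depth, everything else keeps the default. Class, field, and parameter names here are invented; the real plugin reads its pattern and depth values from the Nutch configuration:

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the depth-override idea (names invented here).
class DepthOverrideSketch {
  private final Pattern focusPattern;
  private final int defaultMaxDepth;
  private final int overrideMaxDepth;

  DepthOverrideSketch(String regex, int defaultMaxDepth, int overrideMaxDepth) {
    this.focusPattern = Pattern.compile(regex);
    this.defaultMaxDepth = defaultMaxDepth;
    this.overrideMaxDepth = overrideMaxDepth;
  }

  // URLs matching the pattern may be crawled deeper; all others keep the
  // configured default.
  int maxDepthFor(String url) {
    return focusPattern.matcher(url).find() ? overrideMaxDepth : defaultMaxDepth;
  }
}
```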
[jira] [Created] (NUTCH-2997) Add Override annotations where applicable
Sebastian Nagel created NUTCH-2997: -- Summary: Add Override annotations where applicable Key: NUTCH-2997 URL: https://issues.apache.org/jira/browse/NUTCH-2997 Project: Nutch Issue Type: Improvement Affects Versions: 1.19 Reporter: Sebastian Nagel Fix For: 1.20 -- This message was sent by Atlassian Jira (v8.20.10#820010)