(nutch) branch master updated: NUTCH-3055 README: fix Github "hub" commands - replace "git" with "hub" were necessary - improve formatting of "contributing" steps

2024-05-28 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new ca03d9b76 NUTCH-3055 README: fix Github "hub" commands - replace "git" 
with "hub" were necessary - improve formatting of "contributing" steps
ca03d9b76 is described below

commit ca03d9b76485b7c9d50dff2c3946bb8189daf5e1
Author: Sebastian Nagel 
AuthorDate: Tue Apr 30 11:01:45 2024 +0200

NUTCH-3055 README: fix Github "hub" commands
- replace "git" with "hub" were necessary
- improve formatting of "contributing" steps
---
 README.md | 23 +++
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/README.md b/README.md
index 28acfe8c7..f1322aa5e 100644
--- a/README.md
+++ b/README.md
@@ -22,22 +22,21 @@ Contributing
 To contribute a patch, follow these instructions (note that installing
 [Hub](https://hub.github.com/) is not strictly required, but is recommended).
 
-```
 0. Download and install hub.github.com
 1. File JIRA issue for your fix at https://issues.apache.org/jira/projects/NUTCH/issues
-- you will get issue id NUTCH-xxx where xxx is the issue ID.
-2. git clone https://github.com/apache/nutch.git
-3. cd nutch
-4. git checkout -b NUTCH-xxx
+   - you will get issue id NUTCH- where  is the issue ID.
+2. `git clone https://github.com/apache/nutch.git`
+3. `cd nutch`
+4. `git checkout -b NUTCH-`
 5. edit files (please try and include a test case if possible)
-6. git status (make sure it shows what files you expected to edit)
+6. `git status` (make sure it shows what files you expected to edit)
 7. Make sure that your code complies with the [Nutch codeformatting template](https://raw.githubusercontent.com/apache/nutch/master/eclipse-codeformat.xml), which is basially two space indents
-8. git add 
-9. git commit -m “fix for NUTCH-xxx contributed by ”
-10. git fork
-11. git push -u  NUTCH-xxx
-12. git pull-request
-```
+8. `git add `
+9. `git commit -m "fix for NUTCH-xxx contributed by "`
+10. `hub fork` (if hub is not installed, you can fork the project using the "fork" button on the [Nutch Github project page](https://github.com/apache/nutch))
+11. `git push -u  NUTCH-`
+12. `hub pull-request` (if hub is not installed, please follow the instructions how to [create a pull-request from a fork](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request-from-a-fork))
+
 
 IDE setup
 =



(nutch) branch master updated (8abc78a65 -> bfa07df29)

2024-05-28 Thread snagel

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


from 8abc78a65 NUTCH-3041 Address confusing logging in 
o.a.n.net.URLExemptionFilters (#813)
 add 4b263533a NUTCH-3044 Generator: NPE when extracting the host part of a 
URL fails
 add 4729786e4 NUTCH-3044 Generator: NPE when extracting the host part of a 
URL fails - add unit test to proof that URLs without a host part do not cause   
errors
 add b153279ad NUTCH-3044 Generator: NPE when extracting the host part of a 
URL fails - replace deprecated method call - improve and format Javadoc
 new bfa07df29 Merge pull request #815 from 
sebastian-nagel/NUTCH-3044-generator-npe

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 src/java/org/apache/nutch/crawl/Generator.java | 140 ++---
 src/test/org/apache/nutch/crawl/TestGenerator.java |  55 +++-
 2 files changed, 150 insertions(+), 45 deletions(-)



(nutch) 01/01: Merge pull request #815 from sebastian-nagel/NUTCH-3044-generator-npe

2024-05-28 Thread snagel

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit bfa07df29f7b810365620abff06680eac9bcddf9
Merge: 8abc78a65 b153279ad
Author: Sebastian Nagel 
AuthorDate: Tue May 28 13:55:23 2024 +0200

Merge pull request #815 from sebastian-nagel/NUTCH-3044-generator-npe

NUTCH-3044 Generator: NPE when extracting the host part of a URL fails

 src/java/org/apache/nutch/crawl/Generator.java | 140 ++---
 src/test/org/apache/nutch/crawl/TestGenerator.java |  55 +++-
 2 files changed, 150 insertions(+), 45 deletions(-)




(nutch) branch master updated: NUTCH-3043 Generator: count URLs rejected by URL filters (#814)

2024-05-14 Thread snagel

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 5f1330a03 NUTCH-3043 Generator: count URLs rejected by URL filters 
(#814)
5f1330a03 is described below

commit 5f1330a03d136440a167a85da6cfe8ac4b3f61b9
Author: Sebastian Nagel 
AuthorDate: Tue May 14 17:38:25 2024 +0200

NUTCH-3043 Generator: count URLs rejected by URL filters (#814)

- add counters URL_FILTERS_REJECTED and URL_FILTER_EXCEPTION
- simplify logging statement
- remove unnecessary cast
- use parameterized logging
---
 src/java/org/apache/nutch/crawl/Generator.java | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/java/org/apache/nutch/crawl/Generator.java b/src/java/org/apache/nutch/crawl/Generator.java
index 33f743a37..f57642a65 100644
--- a/src/java/org/apache/nutch/crawl/Generator.java
+++ b/src/java/org/apache/nutch/crawl/Generator.java
@@ -224,9 +224,12 @@ public class Generator extends NutchTool implements Tool {
 // If filtering is on don't generate URLs that don't pass
 // URLFilters
 try {
-  if (filters.filter(url.toString()) == null)
+  if (filters.filter(url.toString()) == null) {
+context.getCounter("Generator", "URL_FILTERS_REJECTED").increment(1);
 return;
+  }
 } catch (URLFilterException e) {
+  context.getCounter("Generator", "URL_FILTER_EXCEPTION").increment(1);
   LOG.warn("Couldn't filter url: {} ({})", url, e.getMessage());
 }
   }
@@ -253,10 +256,7 @@ public class Generator extends NutchTool implements Tool {
   try {
 sort = scfilters.generatorSortValue(key, crawlDatum, sort);
   } catch (ScoringFilterException sfe) {
-if (LOG.isWarnEnabled()) {
-  LOG.warn(
-  "Couldn't filter generatorSortValue for " + key + ": " + sfe);
-}
+LOG.warn("Couldn't filter generatorSortValue for {}: {}", key, sfe);
   }
 
   // check expr
@@ -625,7 +625,7 @@ public class Generator extends NutchTool implements Tool {
   // make later bytes more significant in hash code, so that sorting
   // by hashcode correlates less with by-host ordering.
   for (int i = length - 1; i >= 0; i--)
-hash = (31 * hash) + (int) bytes[start + i];
+hash = (31 * hash) + bytes[start + i];
   return hash;
 }
   }
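
Note on the last hunk: the `(int)` cast can be dropped because Java widens a `byte` operand to `int` automatically in arithmetic (binary numeric promotion), so both forms compute the same value. A minimal sketch, assuming a stand-alone helper class `ReverseByteHash` (a hypothetical name, not part of Nutch) that reproduces only the loop shown in the hunk; the initial hash value of 1 is an assumption, since the hunk does not show it:

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Nutch: reproduces only the loop from the hunk. */
final class ReverseByteHash {

  /** Variant with the explicit cast, as before the commit. */
  static int hashWithCast(byte[] bytes, int start, int length) {
    int hash = 1; // assumed initial value, not shown in the diff
    // iterate from the last byte to the first, so later bytes end up
    // multiplied by higher powers of 31 (they are "more significant")
    for (int i = length - 1; i >= 0; i--)
      hash = (31 * hash) + (int) bytes[start + i];
    return hash;
  }

  /** Variant without the cast, as after the commit. */
  static int hashWithoutCast(byte[] bytes, int start, int length) {
    int hash = 1; // assumed initial value, not shown in the diff
    for (int i = length - 1; i >= 0; i--)
      hash = (31 * hash) + bytes[start + i]; // byte is promoted to int here
    return hash;
  }

  /** Returns true if both variants agree for the UTF-8 bytes of the given text. */
  static boolean sameResult(String text) {
    byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
    return hashWithCast(bytes, 0, bytes.length)
        == hashWithoutCast(bytes, 0, bytes.length);
  }
}
```

Negative byte values (for example from non-ASCII characters in UTF-8) are sign-extended in both variants, so the results agree there as well.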



(nutch) branch master updated: NUTCH-3039 Failure to handle ftp:// URLs

2024-05-14 Thread snagel

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new ea9c7ee5d NUTCH-3039 Failure to handle ftp:// URLs
ea9c7ee5d is described below

commit ea9c7ee5d6635405b31b4a1d462cca746478b040
Author: Sebastian Nagel 
AuthorDate: Thu Apr 11 13:28:37 2024 +0200

NUTCH-3039 Failure to handle ftp:// URLs

Pass ftp:// URLs to the standard JVM URLStreamHandler
---
 src/java/org/apache/nutch/plugin/URLStreamHandlerFactory.java | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/src/java/org/apache/nutch/plugin/URLStreamHandlerFactory.java b/src/java/org/apache/nutch/plugin/URLStreamHandlerFactory.java
index bd7e377d0..0916f4c9d 100644
--- a/src/java/org/apache/nutch/plugin/URLStreamHandlerFactory.java
+++ b/src/java/org/apache/nutch/plugin/URLStreamHandlerFactory.java
@@ -72,9 +72,13 @@ public class URLStreamHandlerFactory
* Protocols covered by standard JVM URL handlers. These protocols must not be
* handled by Nutch plugins, in order to avoid that basic actions (eg. loading
* of classes and configuration files) break.
+   * 
+   * Also the "ftp" protocol is included: it's usually supported by the standard
+   * JVM URL handler and Nutch does not yet provide a dedicated URL stream
+   * handler.
*/
   public static final String[] SYSTEM_PROTOCOLS = { //
-  "http", "https", "file", "jar" };
+  "http", "https", "file", "jar", "ftp" };
 
   static {
 instance = new URLStreamHandlerFactory();
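
Note on the mechanism: a `java.net.URLStreamHandlerFactory` that returns `null` from `createURLStreamHandler` leaves the protocol to the JVM's built-in handlers. The sketch below illustrates that idea with a protocol list matching the one in the diff. The class name `DelegatingHandlerFactory` and the placeholder handler are hypothetical, and this is not Nutch's actual implementation:

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch, not Nutch code: defers "system" protocols to the JVM. */
final class DelegatingHandlerFactory implements URLStreamHandlerFactory {

  /** Same protocol names as in the diff above, including "ftp". */
  static final Set<String> SYSTEM_PROTOCOLS = new HashSet<>(
      Arrays.asList("http", "https", "file", "jar", "ftp"));

  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    if (SYSTEM_PROTOCOLS.contains(protocol)) {
      // null means "no handler from this factory": the JVM then falls
      // back to its built-in handler for the protocol
      return null;
    }
    // placeholder for a plugin-provided handler (assumption for this sketch)
    return new URLStreamHandler() {
      @Override
      protected URLConnection openConnection(URL u) throws IOException {
        throw new IOException("No plugin handler in this sketch for " + u);
      }
    };
  }
}
```

The sketch deliberately does not call `URL.setURLStreamHandlerFactory`, which can be invoked at most once per JVM; the factory method can be exercised directly instead.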



(nutch-site) branch asf-site updated: Revert incorrect change in doap.rdf (see #2)

2024-05-11 Thread snagel

snagel pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 8456fb5  Revert incorrect change in doap.rdf (see #2)
8456fb5 is described below

commit 8456fb597e2dc3147312032298ac24d25a8a5632
Author: Sebastian Nagel 
AuthorDate: Sat May 11 20:30:51 2024 +0200

Revert incorrect change in doap.rdf (see #2)
---
 content/doap.rdf | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/doap.rdf b/content/doap.rdf
index 186fe9a..0799f8f 100644
--- a/content/doap.rdf
+++ b/content/doap.rdf
@@ -33,7 +33,7 @@
 https://nutch.apache.org/community/mailing-lists/; />
 https://www.apache.org/dyn/closer.cgi/nutch/; 
/>
 Java
-https://projects.apache.org/projects.html?category#web-framework; 
/>
+http://projects.apache.org/category/web-framework; 
/>
 
   
 Apache Nutch 1.20



(nutch-site) branch asf-staging updated: Revert incorrect change in doap.rdf (see #2)

2024-05-11 Thread snagel

snagel pushed a commit to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


The following commit(s) were added to refs/heads/asf-staging by this push:
 new d7ac03a  Revert incorrect change in doap.rdf (see #2)
d7ac03a is described below

commit d7ac03a033e1db8f161e7dea236d482a2c2460ce
Author: Sebastian Nagel 
AuthorDate: Sat May 11 20:27:43 2024 +0200

Revert incorrect change in doap.rdf (see #2)
---
 content/doap.rdf | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/doap.rdf b/content/doap.rdf
index 186fe9a..0799f8f 100644
--- a/content/doap.rdf
+++ b/content/doap.rdf
@@ -33,7 +33,7 @@
 https://nutch.apache.org/community/mailing-lists/; />
 https://www.apache.org/dyn/closer.cgi/nutch/; 
/>
 Java
-https://projects.apache.org/projects.html?category#web-framework; 
/>
+http://projects.apache.org/category/web-framework; 
/>
 
   
 Apache Nutch 1.20



(nutch-site) branch main updated: Revert incorrect change (#2)

2024-05-11 Thread snagel

snagel pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


The following commit(s) were added to refs/heads/main by this push:
 new c011a7e  Revert incorrect change (#2)
c011a7e is described below

commit c011a7eec90ad4ded0ea3a028419f63666da3aa8
Author: Sebb 
AuthorDate: Sat May 11 19:24:51 2024 +0100

Revert incorrect change (#2)
---
 content/doap.rdf | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/doap.rdf b/content/doap.rdf
index 186fe9a..0799f8f 100644
--- a/content/doap.rdf
+++ b/content/doap.rdf
@@ -33,7 +33,7 @@
 https://nutch.apache.org/community/mailing-lists/; />
 https://www.apache.org/dyn/closer.cgi/nutch/; 
/>
 Java
-https://projects.apache.org/projects.html?category#web-framework; 
/>
+http://projects.apache.org/category/web-framework; 
/>
 
   
 Apache Nutch 1.20



(nutch) branch master updated: NUTCH-3008 indexer-elastic: downgrade to ES 7.10.2 to address licensing issues

2024-03-14 Thread snagel

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 367988dfd NUTCH-3008 indexer-elastic: downgrade to ES 7.10.2 to 
address licensing issues
367988dfd is described below

commit 367988dfd63751e05e10c93c4c32bd9f7c47b634
Author: Sebastian Nagel 
AuthorDate: Wed Mar 13 15:55:55 2024 +0100

NUTCH-3008 indexer-elastic: downgrade to ES 7.10.2 to address licensing 
issues
---
 src/plugin/indexer-elastic/howto_upgrade_es.md |  4 +--
 src/plugin/indexer-elastic/ivy.xml |  2 +-
 src/plugin/indexer-elastic/plugin.xml  | 37 +++---
 .../indexwriter/elastic/ElasticIndexWriter.java|  2 +-
 4 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/src/plugin/indexer-elastic/howto_upgrade_es.md 
b/src/plugin/indexer-elastic/howto_upgrade_es.md
index b57e0c02f..ca58639d1 100644
--- a/src/plugin/indexer-elastic/howto_upgrade_es.md
+++ b/src/plugin/indexer-elastic/howto_upgrade_es.md
@@ -37,7 +37,7 @@
  (eventually with different versions)
- duplicated libs can be added to the exclusions of transitive dependencies 
in
build/plugins/indexer-elastic/ivy.xml
-   - but it should be made sure that the library versions in ivy/ivy.xml 
correspend to
+   - but it should be made sure that the library versions in ivy/ivy.xml 
correspond to
  those required by Tika
 
 5. Remove the locally "installed" dependencies in 
src/plugin/indexer-elastic/lib/:
@@ -47,4 +47,4 @@
 6. Build Nutch and run all unit tests:
 
 $ cd ../../../
-$ ant clean runtime test
\ No newline at end of file
+$ ant clean runtime test
diff --git a/src/plugin/indexer-elastic/ivy.xml 
b/src/plugin/indexer-elastic/ivy.xml
index de59711a2..2a52fc62b 100644
--- a/src/plugin/indexer-elastic/ivy.xml
+++ b/src/plugin/indexer-elastic/ivy.xml
@@ -36,7 +36,7 @@
   
 
   
-
+
   
   
   
diff --git a/src/plugin/indexer-elastic/plugin.xml 
b/src/plugin/indexer-elastic/plugin.xml
index fc3723a60..b4f872375 100644
--- a/src/plugin/indexer-elastic/plugin.xml
+++ b/src/plugin/indexer-elastic/plugin.xml
@@ -22,18 +22,17 @@
 
 
 
-
-
+
 
-
-
-
-
-
-
-
-
-
+
+
+
+
+
+
+
+
+
 
 
 
@@ -43,10 +42,10 @@
 
 
 
-
-
+
+
 
-
+
 
 
 
@@ -58,12 +57,12 @@
 
 
 
-
 
+
 
-
-
-
+
+
+
 
 
 
@@ -74,4 +73,4 @@
   
 
   
-
\ No newline at end of file
+
diff --git a/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java b/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
index 290d9dfca..0cb267463 100644
--- a/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
+++ b/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
@@ -149,7 +149,7 @@ public class ElasticIndexWriter implements IndexWriter {
 .builder(
 (request, bulkListener) -> client.bulkAsync(request,
 RequestOptions.DEFAULT, bulkListener),
-bulkProcessorListener(), "nutch-indexer-elastic")
+bulkProcessorListener())
 .setBulkActions(maxBulkDocs)
 .setBulkSize(new ByteSizeValue(maxBulkLength, ByteSizeUnit.BYTES))
 .setConcurrentRequests(1)



(nutch) branch master updated: Update crawl documentation

2024-03-10 Thread snagel

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 83acd501e Update crawl documentation
83acd501e is described below

commit 83acd501e0a873c906fdb542e2c5ee86787a15a2
Author: Jakob Berlin 
AuthorDate: Thu Dec 14 16:23:11 2023 +0100

Update crawl documentation

Show --dedup-group instead of -dedup-group which have lead to 
misunderstanding output
---
 src/bin/crawl | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/crawl b/src/bin/crawl
index db4221868..409f72799 100755
--- a/src/bin/crawl
+++ b/src/bin/crawl
@@ -48,7 +48,7 @@
 #   --time-limit-fetch  Number of minutes allocated to the fetching [default: 180]
 #   --num-threadsNumber of threads for fetching / sitemap processing [default: 50]
 #
-#   -dedup-groupDeduplication group method [default: none]
+#   --dedup-groupDeduplication group method [default: none]
 #
 
 function __to_seconds() {
@@ -109,7 +109,7 @@ function __print_usage {
   echo -e "  \t\t\t\t\t  - never [default]"
   echo -e "  \t\t\t\t\t  - always (processing takes place in every iteration)"
   echo -e "  \t\t\t\t\t  - once (processing only takes place in the first iteration)"
-  echo -e "  -dedup-group \tDeduplication group method [default: none]"
+  echo -e "  --dedup-group \tDeduplication group method [default: none]"
 
   exit 1
 }



(nutch) branch master updated (adadc43fb -> 7ad382d95)

2023-11-08 Thread snagel

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


from adadc43fb Merge branch 'NUTCH-3017', closes #793
 new d8e66ce87 [NUTCH-3025] urlfilter-fast to filter based on the length of 
the URL
 new d764e4c16 Added filtering on whole string + documented config in 
nutch-default + fixed tests
 new 49d85eac7 Merged changes from master; improved Javadoc and exception 
handling
 new 7ad382d95 Merge pull request #796 from DigitalPebble/NUTCH-3025

The 4 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 conf/nutch-default.xml | 24 
 src/plugin/urlfilter-fast/README.md|  6 ++
 .../apache/nutch/urlfilter/fast/FastURLFilter.java | 65 +-
 .../nutch/urlfilter/fast/TestFastURLFilter.java| 38 -
 4 files changed, 129 insertions(+), 4 deletions(-)



(nutch) 02/02: Merge branch 'NUTCH-3017', closes #793

2023-11-08 Thread snagel

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit adadc43fb169793c47ab25a0eba99a5f20eda763
Merge: 90849124d ac383fc51
Author: Sebastian Nagel 
AuthorDate: Wed Nov 8 13:35:43 2023 +0100

Merge branch 'NUTCH-3017', closes #793

 conf/nutch-default.xml | 10 ++--
 .../apache/nutch/urlfilter/fast/FastURLFilter.java | 27 +++---
 2 files changed, 32 insertions(+), 5 deletions(-)



(nutch) 01/02: [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 and support gzipped input - use Hadoop-provided compression codecs - update description of property urlfilter.fast.file

2023-11-08 Thread snagel

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit ac383fc5125b6c114a23ef996558ead57e873970
Author: Sebastian Nagel 
AuthorDate: Wed Nov 8 12:24:24 2023 +0100

[NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 and support gzipped 
input
- use Hadoop-provided compression codecs
- update description of property urlfilter.fast.file
---
 conf/nutch-default.xml | 10 --
 .../org/apache/nutch/urlfilter/fast/FastURLFilter.java | 14 --
 2 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index d8bf76486..b20afdfe3 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -1872,8 +1872,14 @@ CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this
 
   urlfilter.fast.file
   fast-urlfilter.txt
-  Name of file on CLASSPATH containing regular expressions
-  used by urlfilter-fast (FastURLFilter) plugin.
+  Name of file containing rules and regular expressions
+  used by urlfilter-fast (FastURLFilter) plugin. If the filename
+  includes a scheme (for example, hdfs://) it is loaded using the
+  Hadoop FileSystem implementation supporting that scheme. If the
+  filename does not contain a scheme, the file is loaded from
+  CLASSPATH. If indicated by file extension (.gz, .bzip2, .zst),
+  the file is decompressed while reading using Hadoop-provided
+  compression codecs.
 
 
 
diff --git a/src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java b/src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java
index 79ad7b6ca..bb4a11b7c 100644
--- a/src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java
+++ b/src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java
@@ -21,6 +21,8 @@ import com.google.common.collect.Multimap;
 import org.apache.commons.lang.StringUtils;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.compress.CompressionCodec;
+import org.apache.hadoop.io.compress.CompressionCodecFactory;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.nutch.net.URLFilter;
 import org.slf4j.Logger;
@@ -35,7 +37,6 @@ import java.io.Reader;
 import java.net.URL;
 import java.util.regex.Pattern;
 import java.util.regex.PatternSyntaxException;
-import java.util.zip.GZIPInputStream;
 
 /**
  * Filters URLs based on a file of regular expressions using host/domains
@@ -120,7 +121,7 @@ public class FastURLFilter implements URLFilter {
 try {
   reloadRules();
 } catch (Exception e) {
-  LOG.error(e.getMessage());
+  LOG.error("Failed to load rules: {}", e.getMessage()  );
   throw new RuntimeException(e.getMessage(), e);
 }
   }
@@ -193,13 +194,14 @@ public class FastURLFilter implements URLFilter {
 if (fileRulesPath.toUri().getScheme() != null) {
   FileSystem fs = fileRulesPath.getFileSystem(conf);
   is = fs.open(fileRulesPath);
-}
-else {
+} else {
   is = conf.getConfResourceAsInputStream(fileRules);
 }
 
-if (fileRules.endsWith(".gz")) {
-  is = new GZIPInputStream(is);
+CompressionCodec codec = new CompressionCodecFactory(conf)
+.getCodec(fileRulesPath);
+if (codec != null) {
+  is = codec.createInputStream(is);
 }
 
 reloadRules(new InputStreamReader(is));
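
Note on the last hunk: the hard-coded `.gz` check is replaced by a codec lookup driven by the file name, so further compression formats are handled without extra branches. The sketch below shows the same select-by-suffix idea using only JDK classes. It does not use Hadoop's `CompressionCodecFactory`; the class `SuffixDecompressor` and its methods are hypothetical names for illustration, and only gzip is registered because the JDK ships no bzip2 or zstd decoder:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical JDK-only sketch of "pick a decompressor by file suffix". */
final class SuffixDecompressor {

  /** Wraps a raw stream into a decompressing stream. */
  interface Wrapper {
    InputStream wrap(InputStream raw) throws IOException;
  }

  private final Map<String, Wrapper> bySuffix = new LinkedHashMap<>();

  SuffixDecompressor() {
    bySuffix.put(".gz", GZIPInputStream::new); // only gzip in this sketch
  }

  /** Returns a decompressing stream if the suffix is known, else the raw stream. */
  InputStream open(String fileName, InputStream raw) throws IOException {
    for (Map.Entry<String, Wrapper> e : bySuffix.entrySet()) {
      if (fileName.endsWith(e.getKey())) {
        return e.getValue().wrap(raw);
      }
    }
    return raw; // no matching suffix: read uncompressed
  }

  /** Demo helper: gzips the text if the name ends with .gz, then reads it back. */
  static String roundTrip(String fileName, String text) {
    try {
      byte[] data = text.getBytes(StandardCharsets.UTF_8);
      if (fileName.endsWith(".gz")) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
          gz.write(data);
        }
        data = bos.toByteArray();
      }
      try (InputStream in = new SuffixDecompressor().open(fileName,
          new ByteArrayInputStream(data))) {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```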



(nutch) branch master updated (90849124d -> adadc43fb)

2023-11-08 Thread snagel

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


from 90849124d NUTCH-3020 -- ParseSegment should check for okhttp's 
truncation flag (#794)
 add d1025fd63 [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 and 
support gzipped input
 new ac383fc51 [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 and 
support gzipped input - use Hadoop-provided compression codecs - update 
description of property urlfilter.fast.file
 new adadc43fb Merge branch 'NUTCH-3017', closes #793

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 conf/nutch-default.xml | 10 ++--
 .../apache/nutch/urlfilter/fast/FastURLFilter.java | 27 +++---
 2 files changed, 32 insertions(+), 5 deletions(-)



[nutch] branch master updated: NUTCH-3012 SegmentReader when dumping with option -recode: NPE on unparsed documents - fall back to UTF-8 when stringifying the content of unparsed documents

2023-10-21 Thread snagel

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new d2c3e96d8 NUTCH-3012 SegmentReader when dumping with option -recode: 
NPE on unparsed documents - fall back to UTF-8 when stringifying the content of 
unparsed documents
d2c3e96d8 is described below

commit d2c3e96d88818d8107f320c49e007329b020e090
Author: Sebastian Nagel 
AuthorDate: Mon Oct 9 10:21:01 2023 +0200

NUTCH-3012 SegmentReader when dumping with option -recode: NPE on unparsed 
documents
- fall back to UTF-8 when stringifying the content of unparsed documents
---
 src/java/org/apache/nutch/segment/SegmentReader.java | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/java/org/apache/nutch/segment/SegmentReader.java 
b/src/java/org/apache/nutch/segment/SegmentReader.java
index 14546af54..ee5c266fd 100644
--- a/src/java/org/apache/nutch/segment/SegmentReader.java
+++ b/src/java/org/apache/nutch/segment/SegmentReader.java
@@ -163,13 +163,16 @@ public class SegmentReader extends Configured implements Tool {
   dump.append("\nRecno:: ").append(recNo++).append("\n");
   dump.append("URL:: " + key.toString() + "\n");
   Content content = null;
-  Charset charset = null;
+  // fall-back encoding for content of unparsed documents
+  Charset charset = StandardCharsets.UTF_8;
   for (NutchWritable val : values) {
 Writable value = val.get(); // unwrap
 if (value instanceof CrawlDatum) {
   dump.append("\nCrawlDatum::\n").append(((CrawlDatum) value).toString());
 } else if (value instanceof Content) {
   if (recodeContent) {
+// output recoded content later when charset is extracted from HTML
+// metadata hold in ParseData
 content = (Content) value;
   } else {
 dump.append("\nContent::\n").append(((Content) value).toString());
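
Note on the fix: with `charset` initialized to `null`, decoding the content of a document without parse data would pass a null `Charset` to the decoder, and `new String(byte[], Charset)` throws a `NullPointerException` in that case. Starting from UTF-8 avoids this. A minimal sketch of the fall-back pattern, assuming a hypothetical helper `RecodeSketch.decode` that receives the charset name as a plain string (in Nutch the charset comes from the parse metadata, which is not modelled here):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of the UTF-8 fall-back, not the SegmentReader code. */
final class RecodeSketch {

  /**
   * Decodes content with the declared charset if one is given and supported,
   * otherwise with UTF-8.
   */
  static String decode(byte[] content, String declaredCharset) {
    // fall-back encoding for content of unparsed documents
    Charset charset = StandardCharsets.UTF_8;
    if (declaredCharset != null) {
      try {
        charset = Charset.forName(declaredCharset);
      } catch (IllegalArgumentException e) {
        // illegal or unsupported charset name: keep the UTF-8 fall-back
      }
    }
    return new String(content, charset);
  }
}
```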



[nutch] branch master updated: NUTCH-3011 HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)

2023-10-21 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new b081c75d8 NUTCH-3011 HttpRobotRulesParser: handle HTTP 429 Too Many 
Requests same as server errors (HTTP 5xx)
b081c75d8 is described below

commit b081c75d87be61e42297c952298b72eb7ff2a6dc
Author: Sebastian Nagel 
AuthorDate: Sun Oct 1 14:08:39 2023 +0200

NUTCH-3011 HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as 
server errors (HTTP 5xx)
---
 conf/nutch-default.xml| 11 ++-
 .../apache/nutch/protocol/http/api/HttpRobotRulesParser.java  |  3 ++-
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 18ed56b03..d8bf76486 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -141,8 +141,9 @@
   http.robots.503.defer.visits
   true
   Temporarily suspend fetching from a host if the
-  robots.txt response is HTTP 503 or any other 5xx server error. See
-  also http.robots.503.defer.visits.delay and
+  robots.txt response is HTTP 503 or any other 5xx server error
+  and HTTP 429 Too Many Requests. See also
+  http.robots.503.defer.visits.delay and
   http.robots.503.defer.visits.retries
 
 
@@ -150,7 +151,7 @@
   http.robots.503.defer.visits.delay
   30
   Time in milliseconds to suspend crawling a host if the
-  robots.txt response is HTTP 5xx - see
+  robots.txt response is HTTP 5xx or 429 Too Many Requests - see
   http.robots.503.defer.visits.
 
 
@@ -158,8 +159,8 @@
   http.robots.503.defer.visits.retries
   3
   Number of retries crawling a host if the robots.txt
-  response is HTTP 5xx - see http.robots.503.defer.visits. After n
-  retries the host queue is dropped for this segment/cycle.
+  response is HTTP 5xx or 429 - see http.robots.503.defer.visits.
+  After n retries the host queue is dropped for this segment/cycle.
   
 
 
diff --git a/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java b/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
index 8d7263e3e..ec5e77e43 100644
--- a/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
+++ b/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
@@ -229,7 +229,8 @@ public class HttpRobotRulesParser extends RobotRulesParser {
 else if ((code == 403) && (!allowForbidden))
   robotRules = FORBID_ALL_RULES; // use forbid all
 
-else if (code >= 500) {
+else if (code >= 500 || code == 429) {
+  // 5xx server errors or 429 Too Many Requests
   cacheRule = false; // try again later to fetch robots.txt
   if (deferVisits503) {
 // signal fetcher to suspend crawling for this host
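
Note on the condition: after this change, a robots.txt response with HTTP 429 takes the same branch as a 5xx response (no caching of the rule, optional deferral of visits). The sketch below isolates just that decision, together with the 403 branch visible in the hunk. The enum and class names are hypothetical, and every other status code is lumped into `OTHER` because the diff does not show how it is handled:

```java
/** Hypothetical sketch of the status-code branches visible in the hunk above. */
final class RobotsStatusSketch {

  enum Outcome {
    FORBID_ALL,   // 403 and forbidden responses are not allowed
    RETRY_LATER,  // 5xx server errors or 429 Too Many Requests
    OTHER         // anything else: not covered by this sketch
  }

  static Outcome classify(int code, boolean allowForbidden) {
    if ((code == 403) && (!allowForbidden)) {
      return Outcome.FORBID_ALL;
    } else if (code >= 500 || code == 429) {
      // do not cache the rule; try again later to fetch robots.txt
      return Outcome.RETRY_LATER;
    }
    return Outcome.OTHER;
  }
}
```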



[nutch] branch master updated: NUTCH-2990 HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309 (#779)

2023-10-21 Thread snagel

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new ecdd19dbd NUTCH-2990 HttpRobotRulesParser to follow 5 redirects as 
specified by RFC 9309 (#779)
ecdd19dbd is described below

commit ecdd19dbdd4424bf9b9bce206f23992140ee43fe
Author: Sebastian Nagel 
AuthorDate: Sat Oct 21 15:53:25 2023 +0200

NUTCH-2990 HttpRobotRulesParser to follow 5 redirects as specified by RFC 
9309 (#779)

- follow multiple redirects when fetching robots.txt
- number of followed redirects is configurable by the property
  http.robots.redirect.max (default: 5)

Improvements to RobotRulesParser's robots.txt test utility
- bug fix: the passed agent names need to be transferred
  to the property http.robots.agents earlier, before the
  protocol plugins are configured
- more verbose debug logging
---
 conf/nutch-default.xml |  10 ++
 .../apache/nutch/protocol/RobotRulesParser.java|  32 +++--
 .../protocol/http/api/HttpRobotRulesParser.java| 141 -
 3 files changed, 143 insertions(+), 40 deletions(-)

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 58455b338..18ed56b03 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -163,6 +163,16 @@
   
 
 
+
+  http.robots.redirect.max
+  5
+  Maximum number of redirects followed when fetching
+  a robots.txt file. RFC 9309 specifies that crawlers SHOULD
+  follow at least five consecutive redirects, even across authorities
+  (for example, hosts in the case of HTTP).
+  
+
+
 
   http.agent.description
   
diff --git a/src/java/org/apache/nutch/protocol/RobotRulesParser.java 
b/src/java/org/apache/nutch/protocol/RobotRulesParser.java
index 562c2c694..d73c07506 100644
--- a/src/java/org/apache/nutch/protocol/RobotRulesParser.java
+++ b/src/java/org/apache/nutch/protocol/RobotRulesParser.java
@@ -98,6 +98,7 @@ public abstract class RobotRulesParser implements Tool {
 
   protected Configuration conf;
   protected Set agentNames;
+  protected int maxNumRedirects = 5;
 
   /** set of host names or IPs to be explicitly excluded from robots.txt checking */
   protected Set allowList = new HashSet<>();
@@ -149,6 +150,10 @@ public abstract class RobotRulesParser implements Tool {
 }
   }
 }
+LOG.info("Checking robots.txt for the following agent names: {}", agentNames);
+
+maxNumRedirects = conf.getInt("http.robots.redirect.max", 5);
+LOG.info("Following max. {} robots.txt redirects", maxNumRedirects);
 
 String[] confAllowList = conf.getStrings("http.robot.rules.allowlist");
 if (confAllowList == null) {
@@ -294,8 +299,11 @@ public abstract class RobotRulesParser implements Tool {
   "",
   "\tlocal file or URL parsed as robots.txt file",
   "\tIf  starts with a protocol specification",
-  "\t(`http', `https', `ftp' or `file'), robots.txt it is fetched",
-  "\tusing the specified protocol. Otherwise, a local file is assumed.",
+  "\t(`http', `https', `ftp' or `file'), the URL is parsed, URL path",
+  "\tand query are removed and the path \"/robots.txt\" is appended.",
+  "\tThe resulting URL (the canonical robots.txt location) is then",
+  "\tfetched using the specified protocol.",
+  "\tIf the URL does not include a protocol, a local file is assumed.",
   "",
   "\tlocal file with URLs (one per line), for every URL",
   "\tthe path part (including the query) is checked whether",
@@ -323,6 +331,16 @@ public abstract class RobotRulesParser implements Tool {
   return -1;
 }
 
+if (args.length > 2) {
+  // set agent name from command-line in configuration
+  // Note: when fetching via protocol this must be done
+  // before the protocol is configured
+  String agents = args[2];
+  conf.set("http.robots.agents", agents);
+  conf.set("http.agent.name", agents.split(",")[0]);
+  setConf(conf);
+}
+
 Protocol protocol = null;
 URL robotsTxtUrl = null;
 if (args[0].matches("^(?:https?|ftp|file)://?.*")) {
@@ -334,6 +352,7 @@ public abstract class RobotRulesParser implements Tool {
   ProtocolFactory factory = new ProtocolFactory(conf);
   try {
 protocol = factory.getProtocol(robotsTxtUrl);
+LOG.debug("Using protocol {} to fetch robots.txt", protocol.getClass());
   } catch (ProtocolNotFound e) {
 LOG.error("No protocol found for {}: {}", args[0],
 StringUtils.stringifyException(e));
@@ -357
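
The updated help text above says that the URL path and query are removed and "/robots.txt" is appended to obtain the canonical robots.txt location. A minimal sketch of that derivation follows, assuming plain java.net.URI rather than whatever RobotRulesParser does internally; the class RobotsTxtLocation and its method canonical() are hypothetical names used only for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Nutch: derives the canonical
// robots.txt location for an arbitrary URL on the same host.
class RobotsTxtLocation {

  /** Keep scheme, host and port; replace path and query by "/robots.txt". */
  static String canonical(String url) throws URISyntaxException {
    URI u = new URI(url);
    // user info, query and fragment are dropped (null); a port of -1 is omitted
    URI robots = new URI(u.getScheme(), null, u.getHost(), u.getPort(),
        "/robots.txt", null, null);
    return robots.toString();
  }

  public static void main(String[] args) throws URISyntaxException {
    System.out.println(canonical("https://example.com/a/b.html?x=1"));
    System.out.println(canonical("http://example.com:8080/"));
  }
}
```

Redirects of the resulting robots.txt URL would then be followed up to the limit configured by http.robots.redirect.max (default 5).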

[nutch] branch master updated: NUTCH-3009 Upgrade to Hadoop 3.3.6

2023-10-21 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new bb68385f9 NUTCH-3009 Upgrade to Hadoop 3.3.6
bb68385f9 is described below

commit bb68385f9601b37c61ef5a2baac58740c975bddb
Author: Sebastian Nagel 
AuthorDate: Thu Sep 28 14:53:02 2023 +0200

NUTCH-3009 Upgrade to Hadoop 3.3.6
---
 default.properties | 2 +-
 ivy/ivy.xml| 8 
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/default.properties b/default.properties
index 17e0bffbb..06f2ed009 100644
--- a/default.properties
+++ b/default.properties
@@ -44,7 +44,7 @@ test.junit.output.format = plain
 javadoc.proxy.host=-J-DproxyHost=
 javadoc.proxy.port=-J-DproxyPort=
 javadoc.link.java=https://docs.oracle.com/en/java/javase/11/docs/api/
-javadoc.link.hadoop=https://hadoop.apache.org/docs/r3.3.4/api/
+javadoc.link.hadoop=https://hadoop.apache.org/docs/r3.3.6/api/
 javadoc.packages=org.apache.nutch.*
 
 dist.dir=./dist
diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index 6f3926244..e5ae3882f 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -53,19 +53,19 @@

 

-   
+   



-   
+   



-   
+   



-   
+   






[nutch] branch master updated: NUTCH-3002 Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive - implement class CaseInsensitiveMetadata providing case-insensitive me

2023-10-21 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new e96cfc56e NUTCH-3002 Protocol-okhttp HttpResponse: HTTP header 
metadata lookup should be case-insensitive - implement class 
CaseInsensitiveMetadata providing case-insensitive   metadata look-ups (but no 
spell-checking) - use CaseInsensitiveMetadata to hold HTTP header metadata in   
in the class OkHttpResponse of protocol-okhttp - add unit tests to prove the 
fix (and also case-insensitive look-ups   and spell-checking in protocol-http)
e96cfc56e is described below

commit e96cfc56ee04c8e7e07e11d4eef521b4674a9ec6
Author: Sebastian Nagel 
AuthorDate: Tue Sep 19 08:10:14 2023 +0200

NUTCH-3002 Protocol-okhttp HttpResponse: HTTP header metadata lookup should 
be case-insensitive
- implement class CaseInsensitiveMetadata providing case-insensitive
  metadata look-ups (but no spell-checking)
- use CaseInsensitiveMetadata to hold HTTP header metadata in
  in the class OkHttpResponse of protocol-okhttp
- add unit tests to prove the fix (and also case-insensitive look-ups
  and spell-checking in protocol-http)
---
 .../nutch/metadata/CaseInsensitiveMetadata.java|  33 +
 src/java/org/apache/nutch/metadata/Metadata.java   |   4 +-
 .../nutch/metadata/SpellCheckedMetadata.java   |   8 +-
 .../org/apache/nutch/net/protocols/Response.java   |   2 +-
 .../apache/nutch/protocol/http/TestResponse.java   | 152 
 .../nutch/protocol/okhttp/OkHttpResponse.java  |   3 +-
 .../apache/nutch/protocol/okhttp/TestResponse.java | 154 +
 7 files changed, 348 insertions(+), 8 deletions(-)

diff --git a/src/java/org/apache/nutch/metadata/CaseInsensitiveMetadata.java b/src/java/org/apache/nutch/metadata/CaseInsensitiveMetadata.java
new file mode 100644
index 0..92e848ca2
--- /dev/null
+++ b/src/java/org/apache/nutch/metadata/CaseInsensitiveMetadata.java
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.metadata;
+
+import java.util.TreeMap;
+
+/**
+ * A decorator to Metadata that adds for case-insensitive lookup of keys.
+ */
+public class CaseInsensitiveMetadata extends Metadata {
+
+  /**
+   * Constructs a new, empty metadata.
+   */
+  public CaseInsensitiveMetadata() {
+metadata = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
+  }
+
+}
diff --git a/src/java/org/apache/nutch/metadata/Metadata.java b/src/java/org/apache/nutch/metadata/Metadata.java
index 5c37911fb..7fa0bb12c 100644
--- a/src/java/org/apache/nutch/metadata/Metadata.java
+++ b/src/java/org/apache/nutch/metadata/Metadata.java
@@ -36,7 +36,7 @@ public class Metadata implements Writable, CreativeCommons, DublinCore,
   /**
* A map of all metadata attributes.
*/
-  private Map<String, String[]> metadata = null;
+  protected Map<String, String[]> metadata = null;
 
   /**
* Constructs a new, empty metadata.
@@ -66,7 +66,7 @@ public class Metadata implements Writable, CreativeCommons, DublinCore,
   }
 
   /**
-   * Get the value associated to a metadata name. If many values are assiociated
+   * Get the value associated to a metadata name. If many values are associated
* to the specified name, then the first one is returned.
* 
* @param name
diff --git a/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java b/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java
index fdbf1b62c..be161440e 100644
--- a/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java
+++ b/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java
@@ -25,7 +25,7 @@ import org.apache.commons.lang.StringUtils;
 
 /**
  * A decorator to Metadata that adds spellchecking capabilities to property
- * names. Currently used spelling vocabulary contains just the httpheaders from
+ * names. Currently used spelling vocabulary contains just the HTTP headers from
  * {@link HttpHeaders} class.
  * 
  */
@@ -94,7 +94,7 @@ public class SpellCheckedMetadata extends Metadata {
   /**
* Get the normalized name of metadata attribu
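
The new CaseInsensitiveMetadata class obtains case-insensitive look-ups by backing the metadata with a TreeMap ordered by String.CASE_INSENSITIVE_ORDER. The stand-alone sketch below shows what that comparator does to key look-ups; it uses a plain Map<String, String> instead of the Nutch Metadata class, and the class name CaseInsensitiveLookup is a hypothetical one chosen for this illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class, not part of Nutch: shows the effect of
// String.CASE_INSENSITIVE_ORDER on TreeMap key look-ups.
class CaseInsensitiveLookup {

  /** Build a header map whose keys compare without regard to case. */
  static Map<String, String> headers() {
    Map<String, String> m = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    m.put("Content-Type", "text/html");
    return m;
  }

  public static void main(String[] args) {
    Map<String, String> m = headers();
    // both spellings resolve to the same entry
    System.out.println(m.get("content-type"));
    System.out.println(m.get("CONTENT-TYPE"));
    // a put with a differently cased key replaces the value, the map keeps one entry
    m.put("content-type", "text/plain");
    System.out.println(m.size() + " entry, value " + m.get("Content-Type"));
  }
}
```

As the commit message notes, this gives case-insensitive look-ups only; it does not correct misspelled header names the way SpellCheckedMetadata does.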

[nutch] branch master updated (a1ab4333e -> a74b57b90)

2023-10-03 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


from a1ab4333e NUTCH-2897 Do not supress deprecated API warnings - 
deprecate constructor of NutchJob - remove deprocated call to Object.finalize() 
from Plugin.finalize()
 add a74b57b90 NUTCH-2853 bin/nutch: remove deprecated commands solrindex, 
solrdedup, solrclean

No new revisions were added by this update.

Summary of changes:
 src/bin/nutch | 16 
 1 file changed, 4 insertions(+), 12 deletions(-)



[nutch] branch master updated: NUTCH-2897 Do not supress deprecated API warnings - deprecate constructor of NutchJob - remove deprocated call to Object.finalize() from Plugin.finalize()

2023-10-03 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new a1ab4333e NUTCH-2897 Do not supress deprecated API warnings - 
deprecate constructor of NutchJob - remove deprocated call to Object.finalize() 
from Plugin.finalize()
a1ab4333e is described below

commit a1ab4333e0a1a28ac2e0f9c75871f7feeb5f2f81
Author: Sebastian Nagel 
AuthorDate: Sat Sep 30 11:12:07 2023 +0200

NUTCH-2897 Do not supress deprecated API warnings
- deprecate constructor of NutchJob
- remove deprocated call to Object.finalize() from Plugin.finalize()
---
 src/java/org/apache/nutch/plugin/Plugin.java |  2 --
 src/java/org/apache/nutch/util/NutchJob.java | 13 -
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/src/java/org/apache/nutch/plugin/Plugin.java b/src/java/org/apache/nutch/plugin/Plugin.java
index b2e717d20..3a0fb2e91 100644
--- a/src/java/org/apache/nutch/plugin/Plugin.java
+++ b/src/java/org/apache/nutch/plugin/Plugin.java
@@ -90,9 +90,7 @@ public class Plugin {
   }
 
   @Override
-  @SuppressWarnings("deprecation")
   protected void finalize() throws Throwable {
-super.finalize();
 shutDown();
   }
 }
diff --git a/src/java/org/apache/nutch/util/NutchJob.java b/src/java/org/apache/nutch/util/NutchJob.java
index 478b24f89..068c64fef 100644
--- a/src/java/org/apache/nutch/util/NutchJob.java
+++ b/src/java/org/apache/nutch/util/NutchJob.java
@@ -35,7 +35,18 @@ public class NutchJob extends Job {
 
   private static final String JOB_FAILURE_LOG_FORMAT = "%s job did not succeed, job id: %s, job status: %s, reason: %s";
 
-  @SuppressWarnings("deprecation")
+  /**
+   * @deprecated, use instead {@link #getInstance(Configuration)} or
+   * {@link Job#getInstance(Configuration, String)}.
+   * 
+   * @param conf
+   *  configuration for the job
+   * @param jobName
+   *  name of the job
+   * @throws IOException
+   *   see {@link Job#Job(Configuration, String)}
+   */
+  @Deprecated
   public NutchJob(Configuration conf, String jobName) throws IOException {
 super(conf, jobName);
 if (conf != null) {



[nutch] branch master updated: NUTCH-3010 Injector: count unique number of injected URLs - add counter urls_injected_unique - improve log messages reporting the counts of injected/merged URLs

2023-10-02 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 810b1d6ad NUTCH-3010 Injector: count unique number of injected URLs - 
add counter urls_injected_unique - improve log messages reporting the counts of 
injected/merged URLs
810b1d6ad is described below

commit 810b1d6ad50fa9021469b4ca5e1db9050a3263c5
Author: Sebastian Nagel 
AuthorDate: Sat Sep 30 08:09:18 2023 +0200

NUTCH-3010 Injector: count unique number of injected URLs
- add counter urls_injected_unique
- improve log messages reporting the counts of injected/merged URLs
---
 src/java/org/apache/nutch/crawl/Injector.java | 31 ---
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/src/java/org/apache/nutch/crawl/Injector.java b/src/java/org/apache/nutch/crawl/Injector.java
index b93e8ca76..9fca719f6 100644
--- a/src/java/org/apache/nutch/crawl/Injector.java
+++ b/src/java/org/apache/nutch/crawl/Injector.java
@@ -341,8 +341,11 @@ public class Injector extends NutchTool implements Tool {
   ? injected.getFetchInterval() : old.getFetchInterval());
 }
   }
-  if (injectedSet && oldSet) {
-context.getCounter("injector", "urls_merged").increment(1);
+  if (injectedSet) {
+context.getCounter("injector", "urls_injected_unique").increment(1);
+if (oldSet) {
+  context.getCounter("injector", "urls_merged").increment(1);
+}
   }
   context.write(key, result);
 }
@@ -448,22 +451,24 @@ public class Injector extends NutchTool implements Tool {
   if (LOG.isInfoEnabled()) {
 long urlsInjected = job.getCounters()
 .findCounter("injector", "urls_injected").getValue();
+long urlsInjectedUniq = job.getCounters()
+.findCounter("injector", "urls_injected_unique").getValue();
 long urlsFiltered = job.getCounters()
 .findCounter("injector", "urls_filtered").getValue();
 long urlsMerged = job.getCounters()
 .findCounter("injector", "urls_merged").getValue();
-long urlsPurged404= job.getCounters()
+long urlsPurged404 = job.getCounters()
 .findCounter("injector", "urls_purged_404").getValue();
-long urlsPurgedFilter= job.getCounters()
+long urlsPurgedFilter = job.getCounters()
 .findCounter("injector", "urls_purged_filter").getValue();
-LOG.info("Injector: Total urls rejected by filters: " + urlsFiltered);
+LOG.info("Injector: Total urls rejected by filters: {}", urlsFiltered);
 LOG.info(
-"Injector: Total urls injected after normalization and filtering: "
-+ urlsInjected);
-LOG.info("Injector: Total urls injected but already in CrawlDb: "
-+ urlsMerged);
-LOG.info("Injector: Total new urls injected: "
-+ (urlsInjected - urlsMerged));
+"Injector: Total urls injected after normalization and filtering: {} (unique URLs: {})",
+urlsInjected, urlsInjectedUniq);
+LOG.info("Injector: Total urls injected but already in CrawlDb: {}",
+urlsMerged);
+LOG.info("Injector: Total new urls injected: {}",
+(urlsInjectedUniq - urlsMerged));
 if (filterNormalizeAll) {
   LOG.info("Injector: Total urls removed from CrawlDb by filters: {}",
   urlsPurgedFilter);
@@ -475,8 +480,8 @@ public class Injector extends NutchTool implements Tool {
 }
 
 long end = System.currentTimeMillis();
-LOG.info("Injector: finished at " + sdf.format(end) + ", elapsed: "
-+ TimingUtil.elapsedTime(start, end));
+LOG.info("Injector: finished at {}, elapsed: {}", sdf.format(end),
+TimingUtil.elapsedTime(start, end));
   }
 } catch (IOException | InterruptedException | ClassNotFoundException | NullPointerException e) {
   LOG.error("Injector job failed: {}", e.getMessage());
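
The change above computes the number of new URLs as urls_injected_unique minus urls_merged instead of urls_injected minus urls_merged. A small worked example of the difference follows, assuming a seed list that contains duplicate entries; the counter values and the class name InjectorCounts are hypothetical and chosen only to illustrate the arithmetic.

```java
// Hypothetical arithmetic demo, not part of Nutch.
class InjectorCounts {

  /** New URLs = unique injected URLs minus those already in the CrawlDb. */
  static long newUrls(long injectedUnique, long merged) {
    return injectedUnique - merged;
  }

  public static void main(String[] args) {
    long injected = 10;      // assumed: seed entries that passed normalization and filtering
    long injectedUnique = 8; // assumed: two of the ten entries are duplicates
    long merged = 3;         // assumed: three unique URLs already exist in the CrawlDb
    System.out.println("old formula: " + (injected - merged));
    System.out.println("new formula: " + newUrls(injectedUnique, merged));
  }
}
```

With these numbers the old formula would report 7 new URLs although only 5 distinct URLs are actually added to the CrawlDb.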



[nutch] branch master updated (417b87732 -> a72a53a32)

2023-09-30 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


from 417b87732 NUTCH-2852 SpotBugs: Method invokes System.exit(...) - 
remove all calls of System.exit(...) in methods   except main(args) of various 
"checker" tools
 add a72a53a32 NUTCH-3007 Fix impossible casts - remove code blocks (else 
clauses) unneeded and containing   impossible casts

No new revisions were added by this update.

Summary of changes:
 src/java/org/apache/nutch/fetcher/Fetcher.java| 13 ++---
 src/java/org/apache/nutch/parse/ParseSegment.java | 13 ++---
 2 files changed, 4 insertions(+), 22 deletions(-)



[nutch] branch master updated: NUTCH-2852 SpotBugs: Method invokes System.exit(...) - remove all calls of System.exit(...) in methods except main(args) of various "checker" tools

2023-09-30 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 417b87732 NUTCH-2852 SpotBugs: Method invokes System.exit(...) - 
remove all calls of System.exit(...) in methods   except main(args) of various 
"checker" tools
417b87732 is described below

commit 417b8773231136eb48957f743c2bc3c21f624d4e
Author: Sebastian Nagel 
AuthorDate: Thu Sep 28 12:05:50 2023 +0200

NUTCH-2852 SpotBugs: Method invokes System.exit(...)
- remove all calls of System.exit(...) in methods
  except main(args) of various "checker" tools
---
 src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java | 4 ++--
 src/java/org/apache/nutch/net/URLFilterChecker.java   | 4 ++--
 src/java/org/apache/nutch/net/URLNormalizerChecker.java   | 4 ++--
 src/java/org/apache/nutch/parse/ParserChecker.java| 4 ++--
 src/java/org/apache/nutch/util/AbstractChecker.java   | 9 -
 5 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java b/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
index 3aa7a05cb..1931c360d 100644
--- a/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
+++ b/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
@@ -93,7 +93,7 @@ public class IndexingFiltersChecker extends AbstractChecker {
 // Print help when no args given
 if (args.length < 1) {
   System.err.println(usage);
-  System.exit(-1);
+  return -1;
 }
 
 // read property "doIndex" for back-ward compatibility
@@ -126,7 +126,7 @@ public class IndexingFiltersChecker extends AbstractChecker {
   } else if (i != args.length - 1) {
 System.err.println("ERR: Not a recognized argument: " + args[i]);
 System.err.println(usage);
-System.exit(-1);
+return -1;
   } else {
 url = args[i];
   }
diff --git a/src/java/org/apache/nutch/net/URLFilterChecker.java b/src/java/org/apache/nutch/net/URLFilterChecker.java
index 7916cc579..821f2e926 100644
--- a/src/java/org/apache/nutch/net/URLFilterChecker.java
+++ b/src/java/org/apache/nutch/net/URLFilterChecker.java
@@ -41,7 +41,7 @@ public class URLFilterChecker extends AbstractChecker {
 // Print help when no args given
 if (args.length < 1) {
   System.err.println(usage);
-  System.exit(-1);
+  return -1;
 }
 
 int numConsumed;
@@ -53,7 +53,7 @@ public class URLFilterChecker extends AbstractChecker {
   } else {
 System.err.println("ERROR: Not a recognized argument: " + args[i]);
 System.err.println(usage);
-System.exit(-1);
+return -1;
   }
 }
 
diff --git a/src/java/org/apache/nutch/net/URLNormalizerChecker.java b/src/java/org/apache/nutch/net/URLNormalizerChecker.java
index 586c7b246..46fdd38cf 100644
--- a/src/java/org/apache/nutch/net/URLNormalizerChecker.java
+++ b/src/java/org/apache/nutch/net/URLNormalizerChecker.java
@@ -44,7 +44,7 @@ public class URLNormalizerChecker extends AbstractChecker {
 // Print help when no args given
 if (args.length < 1) {
   System.err.println(usage);
-  System.exit(-1);
+  return -1;
 }
 
 int numConsumed;
@@ -58,7 +58,7 @@ public class URLNormalizerChecker extends AbstractChecker {
   } else {
 System.err.println("ERROR: Not a recognized argument: " + args[i]);
 System.err.println(usage);
-System.exit(-1);
+return -1;
   }
 }
 
diff --git a/src/java/org/apache/nutch/parse/ParserChecker.java b/src/java/org/apache/nutch/parse/ParserChecker.java
index 1533ab57c..10eec4b24 100644
--- a/src/java/org/apache/nutch/parse/ParserChecker.java
+++ b/src/java/org/apache/nutch/parse/ParserChecker.java
@@ -104,7 +104,7 @@ public class ParserChecker extends AbstractChecker {
 // Print help when no args given
 if (args.length < 1) {
   System.err.println(usage);
-  System.exit(-1);
+  return -1;
 }
 
 // initialize plugins early to register URL stream handlers to support
@@ -138,7 +138,7 @@ public class ParserChecker extends AbstractChecker {
   } else if (i != args.length - 1) {
 System.err.println("ERR: Not a recognized argument: " + args[i]);
 System.err.println(usage);
-System.exit(-1);
+return -1;
   } else {
 url = args[i];
   }
diff --git a/src/java/org/apache/nutch/util/AbstractChecker.java b/src/java/org/apache/nutch/util/AbstractChecker.java
index 3116ede14..137481225 100644
--- a/src/java/org/apache/nutch/util/AbstractChecker.java
+++ b/src/java/org/apache/nutch/util/AbstractChecker.java
@@ -72,8 +72,7 @@ public abstract class AbstractChecker extends Configured 
imp
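
The diffs above replace System.exit(-1) inside the checker tools by "return -1", so that only main(args) terminates the JVM. A minimal sketch of that pattern follows, assuming a plain class without the Hadoop Tool interface; the class name CheckerSketch and its usage string are hypothetical.

```java
// Hypothetical sketch, not part of Nutch: run() reports failure through
// its return value and never terminates the JVM itself.
class CheckerSketch {

  static final String USAGE = "Usage: CheckerSketch <url>";

  /** Returns 0 on success and -1 on invalid arguments. */
  int run(String[] args) {
    if (args.length < 1) {
      System.err.println(USAGE);
      return -1; // previously: System.exit(-1)
    }
    System.out.println("checking " + args[0]);
    return 0;
  }

  public static void main(String[] args) {
    CheckerSketch checker = new CheckerSketch();
    // A real tool's main(args) would end with System.exit(checker.run(args));
    // here both outcomes are shown without leaving the JVM.
    System.out.println("no arguments -> " + checker.run(new String[0]));
    System.out.println("one argument -> "
        + checker.run(new String[] { "https://example.com/" }));
  }
}
```

Returning the status code keeps run() usable from unit tests and from other callers, which is the concern SpotBugs raises about System.exit(...) outside of main.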

[nutch] branch master updated: NUTCH-2997 Add Override annotations

2023-08-22 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 0fae6b59f NUTCH-2997 Add Override annotations
0fae6b59f is described below

commit 0fae6b59fd85f2ec894a28089c1d086b2604660a
Author: Sebastian Nagel 
AuthorDate: Mon Aug 14 16:08:58 2023 +0200

NUTCH-2997 Add Override annotations
---
 src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java   | 8 
 src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java   | 1 +
 src/java/org/apache/nutch/crawl/CrawlDatum.java  | 8 
 src/java/org/apache/nutch/crawl/CrawlDbReducer.java  | 1 +
 src/java/org/apache/nutch/crawl/Generator.java   | 5 +
 src/java/org/apache/nutch/crawl/Inlink.java  | 5 +
 src/java/org/apache/nutch/crawl/Inlinks.java | 3 +++
 src/java/org/apache/nutch/crawl/LinkDbReader.java| 1 +
 src/java/org/apache/nutch/crawl/MD5Signature.java| 1 +
 src/java/org/apache/nutch/crawl/MimeAdaptiveFetchSchedule.java   | 1 +
 src/java/org/apache/nutch/crawl/Signature.java   | 2 ++
 src/java/org/apache/nutch/crawl/SignatureComparator.java | 1 +
 src/java/org/apache/nutch/crawl/TextMD5Signature.java| 1 +
 src/java/org/apache/nutch/crawl/TextProfileSignature.java| 3 +++
 src/java/org/apache/nutch/crawl/URLPartitioner.java  | 1 +
 src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java   | 2 ++
 src/java/org/apache/nutch/fetcher/FetcherThread.java | 1 +
 src/java/org/apache/nutch/fetcher/QueueFeeder.java   | 1 +
 src/java/org/apache/nutch/hostdb/ResolverThread.java | 1 +
 src/java/org/apache/nutch/indexer/IndexerOutputFormat.java   | 2 ++
 src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java| 1 +
 src/java/org/apache/nutch/indexer/NutchDocument.java | 4 
 src/java/org/apache/nutch/indexer/NutchIndexAction.java  | 2 ++
 src/java/org/apache/nutch/metadata/MetaWrapper.java  | 2 ++
 src/java/org/apache/nutch/metadata/Metadata.java | 3 +++
 src/java/org/apache/nutch/net/URLFilterChecker.java  | 1 +
 src/java/org/apache/nutch/net/URLNormalizerChecker.java  | 1 +
 src/java/org/apache/nutch/parse/HTMLMetaTags.java| 1 +
 src/java/org/apache/nutch/parse/Outlink.java | 4 
 src/java/org/apache/nutch/parse/ParseData.java   | 4 
 src/java/org/apache/nutch/parse/ParseImpl.java   | 5 +
 src/java/org/apache/nutch/parse/ParseOutputFormat.java   | 3 +++
 src/java/org/apache/nutch/parse/ParseResult.java | 1 +
 src/java/org/apache/nutch/parse/ParseStatus.java | 7 +++
 src/java/org/apache/nutch/parse/ParseText.java   | 2 ++
 src/java/org/apache/nutch/parse/ParserChecker.java   | 1 +
 src/java/org/apache/nutch/plugin/Extension.java  | 1 +
 src/java/org/apache/nutch/plugin/Plugin.java | 1 +
 src/java/org/apache/nutch/plugin/PluginClassLoader.java  | 3 +++
 src/java/org/apache/nutch/plugin/PluginRepository.java   | 2 ++
 src/java/org/apache/nutch/protocol/Content.java  | 4 
 src/java/org/apache/nutch/protocol/ProtocolStatus.java   | 4 
 src/java/org/apache/nutch/scoring/ScoringFilters.java| 9 +
 src/java/org/apache/nutch/scoring/webgraph/LinkDatum.java| 3 +++
 src/java/org/apache/nutch/scoring/webgraph/LinkDumper.java   | 4 
 src/java/org/apache/nutch/scoring/webgraph/Node.java | 3 +++
 src/java/org/apache/nutch/segment/ContentAsTextInputFormat.java  | 6 ++
 src/java/org/apache/nutch/segment/SegmentMerger.java | 1 +
 src/java/org/apache/nutch/segment/SegmentPart.java   | 1 +
 src/java/org/apache/nutch/segment/SegmentReader.java | 6 ++
 src/java/org/apache/nutch/service/impl/ConfManagerImpl.java  | 6 ++
 src/java/org/apache/nutch/service/impl/SeedManagerImpl.java  | 4 
 src/java/org/apache/nutch/service/resources/AdminResource.java   | 1 +
 src/java/org/apache/nutch/tools/AbstractCommonCrawlFormat.java   | 4 
 src/java/org/apache/nutch/tools/CommonCrawlFormat.java   | 1 +
 src/java/org/apache/nutch/tools/CommonCrawlFormatSimple.java | 9 ++---
 src/java/org/apache/nutch/tools/CommonCrawlFormatWARC.java   | 1 +
 src/java/org/apache/nutch/tools/DmozParser.java  | 2 ++
 src/java/org/apache/nutch/tools/ResolveUrls.java | 1 +
 src/java/org/apache/nutch/tools/arc/ArcInputFormat.java  | 1 +
 src/java/org/apache/nutch/tools/arc/ArcRecordReader.java | 6 ++
 src

[nutch] branch master updated: NUTCH-2996 Use new SimpleRobotRulesParser API entry point crawler-commons 1.4

2023-08-22 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 070c115cf NUTCH-2996 Use new SimpleRobotRulesParser API entry point 
crawler-commons 1.4
070c115cf is described below

commit 070c115cfadbc937a8ad0add6447461983e92028
Author: Sebastian Nagel 
AuthorDate: Tue Aug 22 11:39:22 2023 +0200

NUTCH-2996 Use new SimpleRobotRulesParser API entry point crawler-commons 
1.4

- split and lowercase agent names (if multiple) at configuration time
  and pass as collection to SimpleRobotRulesParser
- update RobotRulesParser command-line help
- update unit tests to use new API
- update description of Nutch properties to reflect the changes due to
  the usage of the new API entry point and the upgrade to crawler-commons 
1.4
---
 conf/nutch-default.xml | 34 +
 .../apache/nutch/protocol/RobotRulesParser.java| 71 +-
 .../protocol/http/api/TestRobotRulesParser.java| 87 --
 3 files changed, 135 insertions(+), 57 deletions(-)

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 379b5ef5d..e98bd5570 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -72,9 +72,18 @@
 <property>
   <name>http.agent.name</name>
   <value></value>
-  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
+  <description>'User-Agent' name: a single word uniquely identifying your crawler.
+
+  The value is used to select the group of robots.txt rules addressing your
+  crawler. It is also sent as part of the HTTP 'User-Agent' request header.
+
+  This property MUST NOT be empty -
   please set this to a single word uniquely related to your organization.
 
+  Following RFC 9309 the 'User-Agent' name (aka. 'product token')
+  MUST contain only uppercase and lowercase letters ('a-z' and
+  'A-Z'), underscores ('_'), and hyphens ('-').
+
   NOTE: You should also check other related properties:
 
 http.robots.agents
@@ -84,7 +93,6 @@
 http.agent.version
 
   and set their values appropriately.
-
   </description>
 </property>
 
@@ -95,13 +103,13 @@
   parser would look for in robots.txt. Multiple agents can be provided using
   comma as a delimiter. eg. mybot,foo-spider,bar-crawler
 
-  The ordering of agents does NOT matter and the robots parser would make
-  decision based on the agent which matches first to the robots rules.
-  Also, there is NO need to add a wildcard (ie. "*") to this string as the
-  robots parser would smartly take care of a no-match situation.
+  The ordering of agents does NOT matter and the robots.txt parser combines
+  all rules to any of the agent names.  Also, there is NO need to add
+  a wildcard (ie. "*") to this string as the robots parser would smartly
+  take care of a no-match situation.
 
   If no value is specified, by default HTTP agent (ie. 'http.agent.name')
-  would be used for user agent matching by the robots parser.
+  is used for user-agent matching by the robots parser.
   </description>
 </property>
 
@@ -166,9 +174,9 @@
 <property>
   <name>http.agent.url</name>
   <value></value>
-  <description>A URL to advertise in the User-Agent header.  This will
+  <description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
-   should be a URL of a page explaining the purpose and behavior of this
+   should be a URL to a page that explains the purpose and behavior of this
crawler.
   </description>
 </property>
@@ -176,9 +184,9 @@
 <property>
   <name>http.agent.email</name>
   <value></value>
-  <description>An email address to advertise in the HTTP 'From' request
-   header and User-Agent header. A good practice is to mangle this
-   address (e.g. 'info at example dot com') to avoid spamming.
+  <description>An email address to advertise in the HTTP 'User-Agent' (and
+   'From') request headers. A good practice is to mangle this address
+   (e.g. 'info at example dot com') to avoid spamming.
   </description>
 </property>
 
@@ -202,7 +210,7 @@
   <name>http.agent.rotate.file</name>
   <value>agents.txt</value>
   <description>
-File containing alternative user agent names to be used instead of
+File containing alternative user-agent names to be used instead of
 http.agent.name on a rotating basis if http.agent.rotate is true.
 Each line of the file should contain exactly one agent
 specification including name, version, description, URL, etc.
diff --git a/src/java/org/apache/nutch/protocol/RobotRulesParser.java b/src/java/org/apache/nutch/protocol/RobotRulesParser.java
index 1493bc292..562c2c694 100644
--- a/src/java/org/apache/nutch/protocol/RobotRulesParser.java
+++ b/src/java/org/apache/nutch/protocol/RobotRulesParser.java
@@ -24,12 +24,13 @@ import java.io.LineNumberReader;
 import java.lang.invoke.MethodHandles;
 import java.net.MalformedURLException;
 import java.net.URL;
+import java.util.Collection;
 import java.util.HashSet;
 import java.util.Hashtable;
+import java.util.LinkedHashSet;
 import java.util.LinkedList;
 import java.util.List;
 import ja
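
The commit message above states that agent names are split and lowercased at configuration time and passed on as a collection. A minimal sketch of such a step follows, assuming a comma-separated value as described for http.robots.agents; the class AgentNames and its parse() method are hypothetical and do not reproduce the actual RobotRulesParser code.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: normalizes a comma-separated
// list of agent names into a duplicate-free, lowercased collection.
class AgentNames {

  /** Split at commas, trim, lowercase, skip empty names, keep first-seen order. */
  static Collection<String> parse(String configured) {
    Collection<String> names = new LinkedHashSet<>();
    for (String name : configured.split(",")) {
      String normalized = name.trim().toLowerCase(Locale.ROOT);
      if (!normalized.isEmpty()) {
        names.add(normalized);
      }
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println(parse("Agent2, Agent1, agent2"));
  }
}
```

A LinkedHashSet is used here because it removes duplicates while keeping the configured order, which matches the imports added in the diff; whether the order matters is a separate question, since the property description says it does not.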

[nutch] branch master updated: NUTCH-2995 Upgrade to crawler-commons 1.4

2023-08-22 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new a24ec5c5b  NUTCH-2995 Upgrade to crawler-commons 1.4
a24ec5c5b is described below

commit a24ec5c5b761476897c7fff0bfd3d5107995fedc
Author: Sebastian Nagel 
AuthorDate: Tue Aug 22 10:36:45 2023 +0200

 NUTCH-2995 Upgrade to crawler-commons 1.4

- upgrade to crawler-commons from 1.3 to 1.4
- update Javadoc and improve code formatting of robots.txt unit tests
- fix robots.txt unit tests to reflect changes in
  crawler-commons due to RFC 9309 compliance and merging of rule groups
  (see https://www.rfc-editor.org/rfc/rfc9309.html#section-2.2.1)
- mark unit tests for deprecated API endpoints as deprecated
---
 ivy/ivy.xml|   2 +-
 .../protocol/http/api/TestRobotRulesParser.java| 102 +++--
 2 files changed, 74 insertions(+), 30 deletions(-)

diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index 269f521c8..18a6df230 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -65,7 +65,7 @@
 

 
-   
+   
 


diff --git a/src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java b/src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java
index 93bb51b22..265abf934 100644
--- a/src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java
+++ b/src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java
@@ -22,32 +22,37 @@ import org.junit.Test;
 import crawlercommons.robots.BaseRobotRules;
 
 /**
- * JUnit test case which tests 1. that robots filtering is performed correctly
- * as per the agent name 2. that crawl delay is extracted correctly from the
- * robots file
- * 
+ * JUnit test case which tests
+ * 
+ * that robots filtering is performed correctly as per the agent name
+ * that crawl delay is extracted correctly from the robots.txt file
+ * 
  */
 public class TestRobotRulesParser {
 
   private static final String CONTENT_TYPE = "text/plain";
-  private static final String SINGLE_AGENT = "Agent1";
-  private static final String MULTIPLE_AGENTS = "Agent2, Agent1";
+  private static final String SINGLE_AGENT1 = "Agent1";
+  private static final String SINGLE_AGENT2 = "Agent2";
+  private static final String MULTIPLE_AGENTS = "Agent2, Agent1"; // rules are merged for both agents
   private static final String UNKNOWN_AGENT = "AgentABC";
   private static final String CR = "\r";
 
-  private static final String ROBOTS_STRING = "User-Agent: Agent1 #foo" + CR
-  + "Disallow: /a" + CR + "Disallow: /b/a" + CR + "#Disallow: /c"
-  + CR
-  + "Crawl-delay: 10"
-  + CR // set crawl delay for Agent1 as 10 sec
-  + "" + CR + "" + CR + "User-Agent: Agent2" + CR + "Disallow: /a/bloh"
-  + CR + "Disallow: /c" + CR + "Disallow: /foo" + CR + "Crawl-delay: 20"
-  + CR + "" + CR + "User-Agent: *" + CR + "Disallow: /foo/bar/" + CR; // no
-  // crawl
-  // delay
-  // for
-  // other
-  // agents
+  private static final String ROBOTS_STRING = //
+  "User-Agent: Agent1 #foo" + CR //
+  + "Disallow: /a" + CR //
+  + "Disallow: /b/a" + CR //
+  + "#Disallow: /c" + CR //
+  + "Crawl-delay: 10" + CR // set crawl delay for Agent1 as 10 seconds
+  + "" + CR //
+  + "" + CR //
+  + "User-Agent: Agent2" + CR //
+  + "Disallow: /a/bloh" + CR //
+  + "Disallow: /c" + CR //
+  + "Disallow: /foo" + CR //
+  + "Crawl-delay: 20" + CR // Agent2: 20 seconds
+  + "" + CR //
+  + "User-Agent: *" + CR //
+  + "Disallow: /foo/bar/" + CR; // no crawl delay for other agents
 
   private static final String[] TEST_PATHS = new String[] {
   "http://example.com/a", "http://example.com/a/bloh/foo.html",
@@ -55,7 +60,8 @@ public class TestRobotRulesParser {
   "http://example.com/b/a/index.html",
   "http://example.com/foo/bar/baz.html
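
The merging of rule groups mentioned in the commit message (RFC 9309, section 2.2.1) means that several groups naming the same user agent are treated as one combined group. The following is a minimal sketch of that idea only, not the crawler-commons implementation: the class `RobotsGroupMergeSketch` and its methods are hypothetical, only `User-agent` and `Disallow` lines are interpreted, and paths are matched by plain prefix comparison.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration only; not the crawler-commons parser. */
class RobotsGroupMergeSketch {

  /** Collects Disallow prefixes per lower-cased agent, merging repeated groups. */
  static Map<String, List<String>> parse(String robotsTxt) {
    Map<String, List<String>> rules = new HashMap<>();
    List<String> currentAgents = new ArrayList<>();
    boolean inRules = false; // true once a rule line followed the User-agent lines
    for (String raw : robotsTxt.split("\\r\\n|\\r|\\n")) {
      int hash = raw.indexOf('#'); // strip comments
      String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
      int colon = line.indexOf(':');
      if (colon < 0) {
        continue; // blank or unparseable line
      }
      String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
      String value = line.substring(colon + 1).trim();
      if (key.equals("user-agent")) {
        if (inRules) { // a User-agent line after rules starts a new group
          currentAgents.clear();
          inRules = false;
        }
        String agent = value.toLowerCase(Locale.ROOT);
        currentAgents.add(agent);
        // an agent seen in an earlier group keeps its existing rule list
        rules.computeIfAbsent(agent, k -> new ArrayList<>());
      } else {
        inRules = true;
        if (key.equals("disallow") && !value.isEmpty()) {
          for (String agent : currentAgents) {
            rules.get(agent).add(value); // appended to the merged group
          }
        }
      }
    }
    return rules;
  }

  /** Prefix match against the merged group, falling back to the "*" group. */
  static boolean isAllowed(Map<String, List<String>> rules, String agent,
      String path) {
    List<String> group = rules.get(agent.toLowerCase(Locale.ROOT));
    if (group == null) {
      group = rules.getOrDefault("*", new ArrayList<>());
    }
    for (String prefix : group) {
      if (path.startsWith(prefix)) {
        return false;
      }
    }
    return true;
  }
}
```

With this sketch, a robots.txt that lists `User-agent: a` in two separate groups yields a single rule list for `a` that contains the `Disallow` entries of both groups.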

[nutch] branch master updated: NUTCH-2993 ScoringDepth plugin to skip depth check based on URL Pattern - apply patch contributed by Markus Jelsma

2023-08-22 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new eae3c52a8 NUTCH-2993 ScoringDepth plugin to skip depth check based on 
URL Pattern - apply patch contributed by Markus Jelsma
eae3c52a8 is described below

commit eae3c52a8140344dff46c448664a2467d631cefc
Author: Sebastian Nagel 
AuthorDate: Thu Jul 20 13:44:26 2023 +0200

NUTCH-2993 ScoringDepth plugin to skip depth check based on URL Pattern
- apply patch contributed by Markus Jelsma
---
 conf/nutch-default.xml | 16 ++
 .../nutch/scoring/depth/DepthScoringFilter.java| 25 ++
 2 files changed, 41 insertions(+)

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 273cfccc5..379b5ef5d 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -1918,6 +1918,22 @@ CAUTION: Set the parser.timeout to -1 or a bigger value 
than 30, when using this
   
 
 
+
+  scoring.depth.override.pattern
+  
+  URLs matching this pattern pass a different max depth value
+  to their outlinks configured in scoring.depth.max.override.
+  
+
+
+
+  scoring.depth.max.override
+  
+  This max depth value is passed to outlinks matching the pattern
+  configured in scoring.depth.override.pattern.
+  
+
+
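
The two property descriptions above can be read as a simple rule: if the URL of a page matches `scoring.depth.override.pattern`, its outlinks receive the value of `scoring.depth.max.override` as their max depth; otherwise they receive the max depth the page would pass on anyway. The sketch below illustrates that reading. It is not the `DepthScoringFilter` code: the class `DepthOverrideSketch`, the method name, and the use of `Matcher.find()` (rather than a full match) are assumptions made for illustration.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration of the two properties; not the plugin code. */
class DepthOverrideSketch {

  /**
   * Returns the max depth handed on to the outlinks of {@code url}.
   *
   * @param overridePattern compiled scoring.depth.override.pattern, or null if unset
   * @param overrideMaxDepth value of scoring.depth.max.override
   * @param inheritedMaxDepth max depth the page would pass on without an override
   */
  static int maxDepthForOutlinks(String url, Pattern overridePattern,
      int overrideMaxDepth, int inheritedMaxDepth) {
    // assumption: a partial match (find) is enough to trigger the override
    if (overridePattern != null && overridePattern.matcher(url).find()) {
      return overrideMaxDepth;
    }
    return inheritedMaxDepth;
  }
}
```

For example, with a pattern such as `^https?://example\.com/archive/` and an override of 2, outlinks of archive pages would be limited to depth 2 while all other pages keep their inherited limit.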
 

[nutch-site] branch asf-staging updated: Add logo on URL path where requested README.md in source code repository

2023-08-04 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


The following commit(s) were added to refs/heads/asf-staging by this push:
 new e1f939c  Add logo on URL path where requested README.md in source code 
repository
e1f939c is described below

commit e1f939cb5820423eb00331d783f6934656d2e37c
Author: Sebastian Nagel 
AuthorDate: Fri Aug 4 20:07:37 2023 +0200

Add logo on URL path where requested README.md in source code repository
---
 content/assets/img/nutch_logo_tm.png | Bin 0 -> 9984 bytes
 1 file changed, 0 insertions(+), 0 deletions(-)

diff --git a/content/assets/img/nutch_logo_tm.png 
b/content/assets/img/nutch_logo_tm.png
new file mode 100644
index 000..67b0eba
Binary files /dev/null and b/content/assets/img/nutch_logo_tm.png differ



[nutch-site] branch main updated: Add logo on URL path where requested README.md in source code repository

2023-08-04 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


The following commit(s) were added to refs/heads/main by this push:
 new c80dcca  Add logo on URL path where requested README.md in source code 
repository
c80dcca is described below

commit c80dccaaab9e5084d0229a9916b51d93e9590b3a
Author: Sebastian Nagel 
AuthorDate: Fri Aug 4 20:07:04 2023 +0200

Add logo on URL path where requested README.md in source code repository
---
 content/assets/img/nutch_logo_tm.png | Bin 0 -> 9984 bytes
 1 file changed, 0 insertions(+), 0 deletions(-)

diff --git a/content/assets/img/nutch_logo_tm.png 
b/content/assets/img/nutch_logo_tm.png
new file mode 100644
index 000..67b0eba
Binary files /dev/null and b/content/assets/img/nutch_logo_tm.png differ



[nutch-site] branch asf-site updated: Add logo on URL path where requested README.md in source code repository

2023-08-04 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 3962502  Add logo on URL path where requested README.md in source code 
repository
3962502 is described below

commit 3962502176832a616931fa9bff41f3e119071928
Author: Sebastian Nagel 
AuthorDate: Fri Aug 4 20:05:34 2023 +0200

Add logo on URL path where requested README.md in source code repository
---
 content/assets/img/nutch_logo_tm.png | Bin 0 -> 9984 bytes
 1 file changed, 0 insertions(+), 0 deletions(-)

diff --git a/content/assets/img/nutch_logo_tm.png 
b/content/assets/img/nutch_logo_tm.png
new file mode 100644
index 000..67b0eba
Binary files /dev/null and b/content/assets/img/nutch_logo_tm.png differ



[nutch-site] branch asf-site updated: Add link to ASF privacy policies

2023-07-20 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new c252fd7  Add link to ASF privacy policies
c252fd7 is described below

commit c252fd76668ec9d30c3a1b8ede341ed83e9fb203
Author: Sebastian Nagel 
AuthorDate: Thu Jul 20 10:56:16 2023 +0200

Add link to ASF privacy policies
---
 content/apache/index.html | 1 +
 content/index.xml | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/content/apache/index.html b/content/apache/index.html
index 01641dc..6e832f5 100644
--- a/content/apache/index.html
+++ b/content/apache/index.html
@@ -112,6 +112,7 @@
 The <a href="https://www.apache.org/security/">Apache Security Team</a>
 The <a href="https://www.apache.org/foundation/sponsorship.html">Apache Software Foundation Sponsorship Program</a>
 <a href="https://www.apache.org/foundation/thanks.html">Sponsors and Thanks</a>
+<a href="https://privacy.apache.org/policies/privacy-policy-public.html">ASF Privacy Policies</a>
 
 
 
diff --git a/content/index.xml b/content/index.xml
index d4296bc..dee6d72 100644
--- a/content/index.xml
+++ b/content/index.xml
@@ -55,7 +55,7 @@ As usual in the 1.X series, release artifacts are made 
available as both source
   Mon, 01 Jan 0001 00:00:00 +
   
   /apache/
-   Visit the Apache Software Foundation Homepage Information 
about the Apache Licenses The Apache Security Team The Apache Software 
Foundation Sponsorship Program Sponsors and Thanks 
+   Visit the Apache Software Foundation Homepage Information 
about the Apache Licenses The Apache Security Team The Apache Software 
Foundation Sponsorship Program Sponsors and Thanks ASF Privacy Policies 

 
 
 



[nutch-site] branch main updated: Add link to ASF privacy policies

2023-07-20 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


The following commit(s) were added to refs/heads/main by this push:
 new d0832c1  Add link to ASF privacy policies
d0832c1 is described below

commit d0832c177981842bc7c67e019bfa1a6eb07ff39d
Author: Sebastian Nagel 
AuthorDate: Thu Jul 20 10:55:31 2023 +0200

Add link to ASF privacy policies
---
 content/apache.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/content/apache.md b/content/apache.md
index b27f09c..717e854 100644
--- a/content/apache.md
+++ b/content/apache.md
@@ -12,5 +12,6 @@ bref = ""
 * The [Apache Security Team](https://www.apache.org/security/)
 * The [Apache Software Foundation Sponsorship 
Program](https://www.apache.org/foundation/sponsorship.html)
 * [Sponsors and Thanks](https://www.apache.org/foundation/thanks.html)
+* [ASF Privacy 
Policies](https://privacy.apache.org/policies/privacy-policy-public.html)
 * 
 



[nutch-site] branch asf-staging updated: Add link to ASF privacy policies

2023-07-20 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


The following commit(s) were added to refs/heads/asf-staging by this push:
 new 3ff0ddb  Add link to ASF privacy policies
3ff0ddb is described below

commit 3ff0ddb731690693fba8db14465173ea578d61d2
Author: Sebastian Nagel 
AuthorDate: Thu Jul 20 10:56:16 2023 +0200

Add link to ASF privacy policies
---
 content/apache/index.html | 1 +
 content/index.xml | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/content/apache/index.html b/content/apache/index.html
index 01641dc..6e832f5 100644
--- a/content/apache/index.html
+++ b/content/apache/index.html
@@ -112,6 +112,7 @@
 The <a href="https://www.apache.org/security/">Apache Security Team</a>
 The <a href="https://www.apache.org/foundation/sponsorship.html">Apache Software Foundation Sponsorship Program</a>
 <a href="https://www.apache.org/foundation/thanks.html">Sponsors and Thanks</a>
+<a href="https://privacy.apache.org/policies/privacy-policy-public.html">ASF Privacy Policies</a>
 
 
 
diff --git a/content/index.xml b/content/index.xml
index d4296bc..dee6d72 100644
--- a/content/index.xml
+++ b/content/index.xml
@@ -55,7 +55,7 @@ As usual in the 1.X series, release artifacts are made 
available as both source
   Mon, 01 Jan 0001 00:00:00 +
   
   /apache/
-   Visit the Apache Software Foundation Homepage Information 
about the Apache Licenses The Apache Security Team The Apache Software 
Foundation Sponsorship Program Sponsors and Thanks 
+   Visit the Apache Software Foundation Homepage Information 
about the Apache Licenses The Apache Security Team The Apache Software 
Foundation Sponsorship Program Sponsors and Thanks ASF Privacy Policies 

 
 
 



[nutch-site] 01/03: - add link / banner of Apache conferences or events - rename and move link to ASF

2023-07-20 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/nutch-site.git

commit 7cd1d1cce957346615a0cb1efbfd875932764d70
Author: Sebastian Nagel 
AuthorDate: Thu Jul 20 10:32:50 2023 +0200

- add link / banner of Apache conferences or events
- rename and move link to ASF
---
 config.toml  | 2 +-
 content/apache.md| 4 +++-
 themes/kube/layouts/_default/baseof.html | 1 +
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/config.toml b/config.toml
index a78ef2d..290476d 100644
--- a/config.toml
+++ b/config.toml
@@ -39,7 +39,7 @@ unsafe = true # allow raw HTML in markdown content
 weight = -100
 url = "/news/"
 [[menu.main]]
-name = "Apache"
+name = "The Apache Software Foundation"
 weight = -100
 url = "/apache/"
 
diff --git a/content/apache.md b/content/apache.md
index e4aef9c..b27f09c 100644
--- a/content/apache.md
+++ b/content/apache.md
@@ -11,4 +11,6 @@ bref = ""
 * Information about the [Apache Licenses](https://www.apache.org/licenses/)
 * The [Apache Security Team](https://www.apache.org/security/)
 * The [Apache Software Foundation Sponsorship 
Program](https://www.apache.org/foundation/sponsorship.html)
-* [Sponsors and Thanks](https://www.apache.org/foundation/thanks.html)
\ No newline at end of file
+* [Sponsors and Thanks](https://www.apache.org/foundation/thanks.html)
+* 
+
diff --git a/themes/kube/layouts/_default/baseof.html 
b/themes/kube/layouts/_default/baseof.html
index 3f7ec06..fec7378 100644
--- a/themes/kube/layouts/_default/baseof.html
+++ b/themes/kube/layouts/_default/baseof.html
@@ -46,6 +46,7 @@
   
 
   
+  <script src="https://www.apachecon.com/event-images/snippet.js"></script>
 
 
 



[nutch-site] 03/03: Add new committer / PMC

2023-07-20 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/nutch-site.git

commit db7208f4333d1208516db09b3ac4309d9402881c
Author: Sebastian Nagel 
AuthorDate: Thu Jul 20 10:36:26 2023 +0200

Add new committer / PMC
---
 content/community/people-credits.md | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/content/community/people-credits.md 
b/content/community/people-credits.md
index 9d66c7f..b17d5ec 100644
--- a/content/community/people-credits.md
+++ b/content/community/people-credits.md
@@ -169,6 +169,13 @@ bref = ""
Committer, PMC Member
Microsoft
   
+  
+   tallison
+   https://www.linkedin.com/in/tim-allison-5a6722/;>Tim 
Allison
+   tallison[at]apache[dot]org
+   Committer, PMC Member
+   NASA JPL
+  
  
 
 



[nutch-site] 02/03: Update copyright year 2022 -> 2023

2023-07-20 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/nutch-site.git

commit 44463bd9e75c654d775d9337989e46e75359ed1a
Author: Sebastian Nagel 
AuthorDate: Thu Jul 20 10:35:46 2023 +0200

Update copyright year 2022 -> 2023
---
 themes/kube/layouts/partials/footer.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/themes/kube/layouts/partials/footer.html 
b/themes/kube/layouts/partials/footer.html
index 2081d5f..59fe554 100644
--- a/themes/kube/layouts/partials/footer.html
+++ b/themes/kube/layouts/partials/footer.html
@@ -1,3 +1,3 @@
   
- 2004-2022 The Apache Software Foundation. Built using the https://github.com/jeblister/kube; target="_blank" rel="noopener 
noreferrer">kube Theme for Hugo. Apache Nutch, Nutch, Apache, the Apache 
feather logo, and the Apache Nutch project logo are trademarks of The Apache 
Software Foundation.
+ 2004-2023 The Apache Software Foundation. Built using the https://github.com/jeblister/kube; target="_blank" rel="noopener 
noreferrer">kube Theme for Hugo. Apache Nutch, Nutch, Apache, the Apache 
feather logo, and the Apache Nutch project logo are trademarks of The Apache 
Software Foundation.
   
\ No newline at end of file



[nutch-site] branch asf-site updated: - add new committer / PMC - update copyright year 2022 -> 2023 - add link / banner of Apache conferences or events - rename and move link to ASF

2023-07-20 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 773089d  - add new committer / PMC - update copyright year 2022 -> 
2023 - add link / banner of Apache conferences or events - rename and move link 
to ASF
773089d is described below

commit 773089d37c2f5a8a112275a71a8698f941562391
Author: Sebastian Nagel 
AuthorDate: Thu Jul 20 10:43:11 2023 +0200

- add new committer / PMC
- update copyright year 2022 -> 2023
- add link / banner of Apache conferences or events
- rename and move link to ASF
---
 content/apache/index.html | 11 ++-
 content/categories/index.html | 10 +-
 content/categories/news/index.html| 10 +-
 content/categories/news/page/1/index.html | 11 ++-
 content/categories/page/1/index.html  | 11 ++-
 content/categories/releases/index.html| 10 +-
 content/categories/releases/page/1/index.html | 11 ++-
 content/community/board-reporting/index.html  | 10 +-
 content/community/bot/index.html  | 12 ++--
 content/community/contributing/index.html | 10 +-
 content/community/index.html  | 10 +-
 content/community/index.xml   |  8 
 content/community/mailing-lists/index.html| 10 +-
 content/community/merchandise/index.html  | 10 +-
 content/community/people-credits/index.html   | 17 -
 content/development/index.html| 10 +-
 content/development/issue-tracker/index.html  | 10 +-
 content/development/nightly-builds/index.html | 10 +-
 content/development/source-code-management/index.html | 10 +-
 content/documentation/about/index.html| 10 +-
 content/documentation/faqs/index.html | 10 +-
 content/documentation/index.html  | 10 +-
 content/documentation/javadoc/index.html  | 10 +-
 content/documentation/tutorials/index.html| 10 +-
 content/documentation/wiki/index.html | 10 +-
 content/download/index.html   | 10 +-
 content/index.html| 10 +-
 content/index.xml | 12 ++--
 content/news/index.html   | 10 +-
 content/news/legacy-nutch-news/index.html | 10 +-
 content/news/nutch-1.18-release/index.html| 10 +-
 content/news/nutch-1.19-release/index.html| 10 +-
 content/news/page/1/index.html| 11 ++-
 content/tags/1.18/index.html  | 10 +-
 content/tags/1.18/page/1/index.html   | 11 ++-
 content/tags/1.19/index.html  | 10 +-
 content/tags/1.19/page/1/index.html   | 11 ++-
 content/tags/index.html   | 10 +-
 content/tags/legacy/index.html| 10 +-
 content/tags/legacy/page/1/index.html | 11 ++-
 content/tags/news/index.html  | 10 +-
 content/tags/news/page/1/index.html   | 11 ++-
 content/tags/page/1/index.html| 11 ++-
 content/tags/page/2/index.html| 10 +-
 content/tags/release/index.html   | 10 +-
 content/tags/release/page/1/index.html| 11 ++-
 46 files changed, 289 insertions(+), 191 deletions(-)

diff --git a/content/apache/index.html b/content/apache/index.html
index 3d84f47..01641dc 100644
--- a/content/apache/index.html
+++ b/content/apache/index.html
@@ -2,7 +2,7 @@
 
 
 
-  
+  
   
   
   
@@ -28,7 +28,6 @@
 
 
 
-
   
 
 
@@ -58,6 +57,7 @@
   
 
   
+  <script src="https://www.apachecon.com/event-images/snippet.js"></script>
 
 
 
@@ -77,8 +77,6 @@
   


-Apache
-
 Community
 
 Development
@@ -89,6 +87,8 @@
 
 News
 
+The Apache Software Foundation
+
   
 
  
@@ -112,6 +112,7 @@
 The <a href="https://www.apache.org/security/">Apache Security Team</a>
 The <a href="https://www.apache.org/foundation/sponsorship.html">Apache Software Foundation Sponsorship Program</a>
 <a href="https://www.apache.org/foundation/thanks.html">Sponsors and Thanks</a>
+
 
 
   
@@ -119,7 +120,7 @@
 

 
- 2004-2022 The Apache Software Foundation. Built using the 

[nutch-site] branch main updated (aa45c17 -> db7208f)

2023-07-20 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch main
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


from aa45c17  Announce release of Nutch 1.19 - fix release data in 
announcement
 new 7cd1d1c  - add link / banner of Apache conferences or events - rename 
and move link to ASF
 new 44463bd  Update copyright year 2022 -> 2023
 new db7208f  Add new committer / PMC

The 3 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 config.toml  | 2 +-
 content/apache.md| 4 +++-
 content/community/people-credits.md  | 7 +++
 themes/kube/layouts/_default/baseof.html | 1 +
 themes/kube/layouts/partials/footer.html | 2 +-
 5 files changed, 13 insertions(+), 3 deletions(-)



[nutch-site] branch asf-staging updated: - add new committer / PMC - update copyright year 2022 -> 2023 - add link / banner of Apache conferences or events - rename and move link to ASF

2023-07-20 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


The following commit(s) were added to refs/heads/asf-staging by this push:
 new a864887  - add new committer / PMC - update copyright year 2022 -> 
2023 - add link / banner of Apache conferences or events - rename and move link 
to ASF
a864887 is described below

commit a8648878452d3e14d000f3b33558c21fa7ee766c
Author: Sebastian Nagel 
AuthorDate: Thu Jul 20 10:43:11 2023 +0200

- add new committer / PMC
- update copyright year 2022 -> 2023
- add link / banner of Apache conferences or events
- rename and move link to ASF
---
 content/apache/index.html | 11 ++-
 content/categories/index.html | 10 +-
 content/categories/news/index.html| 10 +-
 content/categories/news/page/1/index.html | 11 ++-
 content/categories/page/1/index.html  | 11 ++-
 content/categories/releases/index.html| 10 +-
 content/categories/releases/page/1/index.html | 11 ++-
 content/community/board-reporting/index.html  | 10 +-
 content/community/bot/index.html  | 12 ++--
 content/community/contributing/index.html | 10 +-
 content/community/index.html  | 10 +-
 content/community/index.xml   |  8 
 content/community/mailing-lists/index.html| 10 +-
 content/community/merchandise/index.html  | 10 +-
 content/community/people-credits/index.html   | 17 -
 content/development/index.html| 10 +-
 content/development/issue-tracker/index.html  | 10 +-
 content/development/nightly-builds/index.html | 10 +-
 content/development/source-code-management/index.html | 10 +-
 content/documentation/about/index.html| 10 +-
 content/documentation/faqs/index.html | 10 +-
 content/documentation/index.html  | 10 +-
 content/documentation/javadoc/index.html  | 10 +-
 content/documentation/tutorials/index.html| 10 +-
 content/documentation/wiki/index.html | 10 +-
 content/download/index.html   | 10 +-
 content/index.html| 10 +-
 content/index.xml | 12 ++--
 content/news/index.html   | 10 +-
 content/news/legacy-nutch-news/index.html | 10 +-
 content/news/nutch-1.18-release/index.html| 10 +-
 content/news/nutch-1.19-release/index.html| 10 +-
 content/news/page/1/index.html| 11 ++-
 content/tags/1.18/index.html  | 10 +-
 content/tags/1.18/page/1/index.html   | 11 ++-
 content/tags/1.19/index.html  | 10 +-
 content/tags/1.19/page/1/index.html   | 11 ++-
 content/tags/index.html   | 10 +-
 content/tags/legacy/index.html| 10 +-
 content/tags/legacy/page/1/index.html | 11 ++-
 content/tags/news/index.html  | 10 +-
 content/tags/news/page/1/index.html   | 11 ++-
 content/tags/page/1/index.html| 11 ++-
 content/tags/page/2/index.html| 10 +-
 content/tags/release/index.html   | 10 +-
 content/tags/release/page/1/index.html| 11 ++-
 46 files changed, 289 insertions(+), 191 deletions(-)

diff --git a/content/apache/index.html b/content/apache/index.html
index 3d84f47..01641dc 100644
--- a/content/apache/index.html
+++ b/content/apache/index.html
@@ -2,7 +2,7 @@
 
 
 
-  
+  
   
   
   
@@ -28,7 +28,6 @@
 
 
 
-
   
 
 
@@ -58,6 +57,7 @@
   
 
   
+  <script src="https://www.apachecon.com/event-images/snippet.js"></script>
 
 
 
@@ -77,8 +77,6 @@
   


-Apache
-
 Community
 
 Development
@@ -89,6 +87,8 @@
 
 News
 
+The Apache Software Foundation
+
   
 
  
@@ -112,6 +112,7 @@
 The <a href="https://www.apache.org/security/">Apache Security Team</a>
 The <a href="https://www.apache.org/foundation/sponsorship.html">Apache Software Foundation Sponsorship Program</a>
 <a href="https://www.apache.org/foundation/thanks.html">Sponsors and Thanks</a>
+
 
 
   
@@ -119,7 +120,7 @@
 

 
- 2004-2022 The Apache Software Foundation. Built 

[nutch] branch master updated: NUTCH-2991 Support HTTP/S Header Authorization for Solr connections (#763)

2023-06-06 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 9109bdd74 NUTCH-2991 Support HTTP/S Header Authorization for Solr 
connections (#763)
9109bdd74 is described below

commit 9109bdd740ba578fc17745ebc9f53f464667
Author: Sebastian Nagel 
AuthorDate: Tue Jun 6 14:51:20 2023 +0200

NUTCH-2991 Support HTTP/S Header Authorization for Solr connections (#763)

NUTCH-2991 Support HTTP/S Header Authorization for Solr connections
(patch contributed by Marcos Gomez)
- adds params auth.header.name and auth.header.value for JWT Authentication
  with Bearer Tokens sent via the HTTP Authorization header connections
- also document basic authentication and improve error message when reading 
the configuration fails
---
 conf/index-writers.xml.template| 19 -
 .../org/apache/nutch/indexer/IndexWriters.java |  2 +-
 .../nutch/indexwriter/solr/SolrConstants.java  |  4 +
 .../nutch/indexwriter/solr/SolrIndexWriter.java| 47 ---
 .../apache/nutch/indexwriter/solr/SolrUtils.java   | 94 +-
 5 files changed, 153 insertions(+), 13 deletions(-)

diff --git a/conf/index-writers.xml.template b/conf/index-writers.xml.template
index 549ebd4c9..6ed341cb7 100644
--- a/conf/index-writers.xml.template
+++ b/conf/index-writers.xml.template
@@ -26,9 +26,24 @@
   
   
   
+  
   
-  
-  
+  
+  
+  
+  
+  
+  
+  
 
 
   
diff --git a/src/java/org/apache/nutch/indexer/IndexWriters.java 
b/src/java/org/apache/nutch/indexer/IndexWriters.java
index a8ab0ec9c..f8ae8ee86 100644
--- a/src/java/org/apache/nutch/indexer/IndexWriters.java
+++ b/src/java/org/apache/nutch/indexer/IndexWriters.java
@@ -137,7 +137,7 @@ public class IndexWriters {
 
   return indexWriterConfigs;
 } catch (SAXException | IOException | ParserConfigurationException e) {
-  LOG.error(e.toString());
+  LOG.error("Failed to read index writers configuration: {}", e.getMessage());
   return new IndexWriterConfig[0];
 }
   }
diff --git a/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java b/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java
index 302ed75ed..ee6d5d623 100644
--- a/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java
+++ b/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java
@@ -34,4 +34,8 @@ public interface SolrConstants {
 
   String PASSWORD = "password";
 
+  String AUTH_HEADER_NAME = "auth.header.name";
+
+  String AUTH_HEADER_VALUE = "auth.header.value";
+
 }
diff --git a/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java b/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
index 12d3ff6b7..ec2ab46d2 100644
--- a/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
+++ b/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
@@ -16,8 +16,8 @@
  */
 package org.apache.nutch.indexwriter.solr;
 
-import java.lang.invoke.MethodHandles;
 import java.io.IOException;
+import java.lang.invoke.MethodHandles;
 import java.time.format.DateTimeFormatter;
 import java.util.AbstractMap;
 import java.util.ArrayList;
@@ -72,6 +72,8 @@ public class SolrIndexWriter implements IndexWriter {
   private boolean auth;
   private String username;
   private String password;
+  private String authHeaderName;
+  private String authHeaderValue;
 
   @Override
   public void open(Configuration conf, String name) {
@@ -99,20 +101,40 @@ public class SolrIndexWriter implements IndexWriter {
 this.auth = parameters.getBoolean(SolrConstants.USE_AUTH, false);
 this.username = parameters.get(SolrConstants.USERNAME);
 this.password = parameters.get(SolrConstants.PASSWORD);
+this.authHeaderName = parameters.get(SolrConstants.AUTH_HEADER_NAME, "");
+this.authHeaderValue = parameters.get(SolrConstants.AUTH_HEADER_VALUE, "");
 
 this.solrClients = new ArrayList<>();
 
 switch (type) {
 case "http":
   for (String url : urls) {
-solrClients.add(SolrUtils.getHttpSolrClient(url));
+if (this.auth && !StringUtil.isEmpty(this.authHeaderName)
+&& !StringUtil.isEmpty(this.authHeaderValue)) {
+  solrClients.add(SolrUtils.getHttpSolrClientHeaderAuthorization(url,
+  this.authHeaderName, this.authHeaderValue));
+} else if (this.auth && !StringUtil.isEmpty(this.username)
+&& !StringUtil.isEmpty(this.password)) {
+  solr
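
The hunk above is cut off in the middle of the client selection, but the visible conditions suggest an order of precedence: header-based authorization is used when authentication is enabled and both `auth.header.name` and `auth.header.value` are set, and basic authentication is tried next when a username and password are set. The sketch below models only that decision. The class `SolrAuthSelectionSketch`, the enum, and the local `isEmpty` helper are hypothetical, and the final fallback to an unauthenticated client is an assumption because the diff is truncated at that point.

```java
/** Hypothetical illustration of the selection order; not the SolrIndexWriter code. */
class SolrAuthSelectionSketch {

  enum AuthMode { HEADER, BASIC, NONE }

  /** Local stand-in for an emptiness check (assumption: null or "" counts as empty). */
  private static boolean isEmpty(String s) {
    return s == null || s.isEmpty();
  }

  static AuthMode select(boolean auth, String headerName, String headerValue,
      String username, String password) {
    if (auth && !isEmpty(headerName) && !isEmpty(headerValue)) {
      // e.g. name "Authorization" with a bearer token as value
      return AuthMode.HEADER;
    } else if (auth && !isEmpty(username) && !isEmpty(password)) {
      return AuthMode.BASIC;
    }
    // assumed fallback: no authentication configured or auth disabled
    return AuthMode.NONE;
  }
}
```

Under this reading, configuring both a header and a username/password pair would result in header authorization being used, since it is checked first.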

[nutch] branch master updated: NUTCH-2992 Fetcher: always block fetch queues when exceptions threshold is reached - if QueueFeeder is still alive, also block queues which are empty right now

2023-05-23 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 98d02e70f NUTCH-2992 Fetcher: always block fetch queues when 
exceptions threshold is reached - if QueueFeeder is still alive, also block 
queues which are empty right now
98d02e70f is described below

commit 98d02e70f6d83f4fb99abf89a990a3e13a933076
Author: Sebastian Nagel 
AuthorDate: Tue May 16 17:30:49 2023 +0200

NUTCH-2992 Fetcher: always block fetch queues when exceptions threshold is 
reached
- if QueueFeeder is still alive, also block queues which are empty right now
---
 .../org/apache/nutch/fetcher/FetchItemQueues.java  | 25 --
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/src/java/org/apache/nutch/fetcher/FetchItemQueues.java 
b/src/java/org/apache/nutch/fetcher/FetchItemQueues.java
index 9dfbeb277..cec272b45 100644
--- a/src/java/org/apache/nutch/fetcher/FetchItemQueues.java
+++ b/src/java/org/apache/nutch/fetcher/FetchItemQueues.java
@@ -303,19 +303,22 @@ public class FetchItemQueues {
   "* queue: {} >> delayed next fetch by {} ms after {} exceptions in queue",
   queueid, exceptionDelay, excCount);
 }
-if (fiq.getQueueSize() == 0) {
-  return 0;
-}
-if (maxExceptions!= -1 && excCount >= maxExceptions) {
+if (maxExceptions != -1 && excCount >= maxExceptions) {
   // too many exceptions for items in this queue - purge it
   int deleted = fiq.emptyQueue();
-  LOG.info(
-  "* queue: {} >> removed {} URLs from queue because {} exceptions occurred",
-  queueid, deleted, excCount);
-  totalSize.getAndAdd(-deleted);
-  // keep queue IDs to ensure that these queues aren't created and filled
-  // again, see addFetchItem(FetchItem)
-  queuesMaxExceptions.add(queueid);
+  if (deleted > 0) {
+LOG.info(
+"* queue: {} >> removed {} URLs from queue because {} exceptions occurred",
+queueid, deleted, excCount);
+totalSize.getAndAdd(-deleted);
+  }
+  if (feederAlive) {
+LOG.info("* queue: {} >> blocked after {} exceptions", queueid,
+excCount);
+// keep queue IDs to ensure that these queues aren't created and filled
+// again, see addFetchItem(FetchItem)
+queuesMaxExceptions.add(queueid);
+  }
   return deleted;
 }
 return 0;
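
The behavioral change in this hunk is that the early return for an empty queue is gone: once the exception threshold is reached, the queue is purged whether or not it currently holds items, and its ID is remembered as blocked as long as the QueueFeeder is still alive and could refill it. The following is a reduced model of that logic, assuming a single queue per object. The class `QueueBlockingSketch` and its fields are hypothetical and leave out logging, delays, and the shared size counter of the real `FetchItemQueues`.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, reduced model of the patched logic; not the Nutch class. */
class QueueBlockingSketch {

  final Deque<String> queue = new ArrayDeque<>();        // pending URLs of one queue
  final Set<String> blockedQueueIds = new HashSet<>();   // queues never to be refilled
  int exceptionCount = 0;

  /**
   * Registers one more exception for the queue.
   *
   * @return number of URLs purged from the queue (0 if the threshold is not reached)
   */
  int onException(String queueId, int maxExceptions, boolean feederAlive) {
    exceptionCount++;
    if (maxExceptions != -1 && exceptionCount >= maxExceptions) {
      int deleted = queue.size();
      queue.clear(); // purge, even if the queue happens to be empty right now
      if (feederAlive) {
        // the feeder could still add items, so remember the queue as blocked
        blockedQueueIds.add(queueId);
      }
      return deleted;
    }
    return 0;
  }
}
```

In this model a queue that is empty when it hits the threshold still ends up in `blockedQueueIds`, which is the case the removed `getQueueSize() == 0` check used to skip.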



[nutch] branch master updated: NUTCH-2596 Upgrade from org.mortbay.jetty to org.eclipse.jetty - upgrade from org.mortbay.jetty 6.1.26 to org.eclipse.jetty 9.4.50 (Hadoop depends on 9.4.43) - remove

2023-03-17 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 215993bc6 NUTCH-2596 Upgrade from org.mortbay.jetty to 
org.eclipse.jetty - upgrade from org.mortbay.jetty 6.1.26 to org.eclipse.jetty 
9.4.50   (Hadoop depends on 9.4.43) - remove obsolete dependency exclusions of 
hadoop-common - upgrade Fetcher unit tests to use org.eclipse.jetty
215993bc6 is described below

commit 215993bc6fbc58c050251410d5a7b02e601d99b3
Author: Sebastian Nagel 
AuthorDate: Thu Feb 23 15:46:28 2023 +0100

NUTCH-2596 Upgrade from org.mortbay.jetty to org.eclipse.jetty
- upgrade from org.mortbay.jetty 6.1.26 to org.eclipse.jetty 9.4.50
  (Hadoop depends on 9.4.43)
- remove obsolete dependency exclusions of hadoop-common
- upgrade Fetcher unit tests to use org.eclipse.jetty
---
 ivy/ivy.xml  | 12 +++-
 src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java | 17 +
 src/test/org/apache/nutch/fetcher/TestFetcher.java   |  5 ++---
 3 files changed, 14 insertions(+), 20 deletions(-)

diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index 36a32a809..269f521c8 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -50,14 +50,7 @@

 

-   
-   
-   
-   
-   
-   
-   
-   
+   



@@ -112,7 +105,8 @@


 
-   
+   
+   
 


diff --git a/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java b/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java
index e271e88cf..87da8faf2 100644
--- a/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java
+++ b/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java
@@ -48,10 +48,10 @@ import org.apache.hadoop.mapreduce.Partitioner;
 import org.apache.hadoop.mapreduce.TaskAttemptID;
 import org.apache.hadoop.mapreduce.Reducer.Context;
 import org.apache.hadoop.security.Credentials;
-import org.mortbay.jetty.Server;
-import org.mortbay.jetty.bio.SocketConnector;
-import org.mortbay.jetty.handler.ContextHandler;
-import org.mortbay.jetty.handler.ResourceHandler;
+import org.eclipse.jetty.server.Server;
+import org.eclipse.jetty.server.ServerConnector;
+import org.eclipse.jetty.server.handler.ContextHandler;
+import org.eclipse.jetty.server.handler.ResourceHandler;
 
 public class CrawlDBTestUtil {
 
@@ -435,16 +435,17 @@ public class CrawlDBTestUtil {
*/
   public static Server getServer(int port, String staticContent)
   throws UnknownHostException {
-Server webServer = new org.mortbay.jetty.Server();
-SocketConnector listener = new SocketConnector();
+Server webServer = new Server();
+
+ServerConnector listener = new ServerConnector(webServer);
 listener.setPort(port);
 listener.setHost("127.0.0.1");
 webServer.addConnector(listener);
 ContextHandler staticContext = new ContextHandler();
 staticContext.setContextPath("/");
 staticContext.setResourceBase(staticContent);
-staticContext.addHandler(new ResourceHandler());
-webServer.addHandler(staticContext);
+staticContext.insertHandler(new ResourceHandler());
+webServer.insertHandler(staticContext);
 return webServer;
   }
 }
diff --git a/src/test/org/apache/nutch/fetcher/TestFetcher.java b/src/test/org/apache/nutch/fetcher/TestFetcher.java
index 245353fad..ecc135c52 100644
--- a/src/test/org/apache/nutch/fetcher/TestFetcher.java
+++ b/src/test/org/apache/nutch/fetcher/TestFetcher.java
@@ -36,7 +36,7 @@ import org.junit.After;
 import org.junit.Assert;
 import org.junit.Before;
 import org.junit.Test;
-import org.mortbay.jetty.Server;
+import org.eclipse.jetty.server.Server;
 
 /**
  * Basic fetcher test 1. generate seedlist 2. inject 3. generate 3. fetch 4.
@@ -180,8 +180,7 @@ public class TestFetcher {
   }
 
   private void addUrl(ArrayList urls, String page) {
-urls.add("http://127.0.0.1:" + server.getConnectors()[0].getPort() + "/"
-+ page);
+urls.add("http://127.0.0.1:" + server.getURI().getPort() + "/" + page);
   }
 
   @Test



[nutch] branch master updated: NUTCH-2984 Drop test proxy server and benchmark tool

2023-03-17 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new b4cb5c1e3 NUTCH-2984 Drop test proxy server and benchmark tool
b4cb5c1e3 is described below

commit b4cb5c1e30a37b7eceed477fe2d71011bde042ed
Author: Sebastian Nagel 
AuthorDate: Fri Feb 24 15:27:35 2023 +0100

NUTCH-2984 Drop test proxy server and benchmark tool
---
 build.xml  |  33 ---
 ivy/ivy.xml|   1 -
 src/java/org/apache/nutch/tools/Benchmark.java | 289 -
 .../nutch/tools/proxy/AbstractTestbedHandler.java  |  49 
 .../org/apache/nutch/tools/proxy/DelayHandler.java |  55 
 .../org/apache/nutch/tools/proxy/FakeHandler.java  | 101 ---
 .../apache/nutch/tools/proxy/LogDebugHandler.java  |  64 -
 .../apache/nutch/tools/proxy/NotFoundHandler.java  |  39 ---
 .../org/apache/nutch/tools/proxy/ProxyTestbed.java | 157 ---
 .../apache/nutch/tools/proxy/SegmentHandler.java   | 255 --
 .../org/apache/nutch/tools/proxy/package-info.java |  22 --
 11 files changed, 1065 deletions(-)

diff --git a/build.xml b/build.xml
index cc88493f3..9326a8ba2 100644
--- a/build.xml
+++ b/build.xml
@@ -468,39 +468,6 @@
 
   
 
-  
-  
-  
-
-  
-
-  
-  
-
-  
-
-  
-
-  
-  
-  
-
-  
-
-  
-  
-  
-  
-  
-  
-  
-  
-
-  
-
   
   
   
diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index 0e7e25160..36a32a809 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -112,7 +112,6 @@


 
-   

 

diff --git a/src/java/org/apache/nutch/tools/Benchmark.java b/src/java/org/apache/nutch/tools/Benchmark.java
deleted file mode 100644
index d7c3b74ae..0
--- a/src/java/org/apache/nutch/tools/Benchmark.java
+++ /dev/null
@@ -1,289 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-package org.apache.nutch.tools;
-
-import java.io.OutputStream;
-import java.lang.invoke.MethodHandles;
-import java.text.SimpleDateFormat;
-import java.util.ArrayList;
-import java.util.Date;
-import java.util.HashMap;
-import java.util.List;
-import java.util.Map;
-
-import org.apache.hadoop.conf.Configuration;
-import org.apache.hadoop.conf.Configured;
-import org.apache.hadoop.fs.FileSystem;
-import org.apache.hadoop.fs.Path;
-import org.apache.hadoop.util.Tool;
-import org.apache.hadoop.mapreduce.Job;
-import org.apache.hadoop.util.ToolRunner;
-import org.apache.nutch.crawl.CrawlDb;
-import org.apache.nutch.crawl.CrawlDbReader;
-import org.apache.nutch.crawl.Generator;
-import org.apache.nutch.crawl.Injector;
-import org.apache.nutch.crawl.LinkDb;
-import org.apache.nutch.fetcher.Fetcher;
-import org.apache.nutch.parse.ParseSegment;
-import org.apache.nutch.util.NutchConfiguration;
-import org.apache.nutch.util.NutchJob;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-
-public class Benchmark extends Configured implements Tool {
-  
-  private static final Logger LOG = LoggerFactory
- .getLogger(MethodHandles.lookup().lookupClass());
-
-  public static void main(String[] args) throws Exception {
-Configuration conf = NutchConfiguration.create();
-int res = ToolRunner.run(conf, new Benchmark(), args);
-System.exit(res);
-  }
-
-  @SuppressWarnings("unused")
-  private static String getDate() {
-return new SimpleDateFormat("MMddHHmmss").format(new Date(System
-.currentTimeMillis()));
-  }
-
-  private void createSeeds(FileSystem fs, Path seedsDir, int count)
-  throws Exception {
-OutputStream os = fs.create(new Path(seedsDir, "seeds"));
-for (int i = 0; i < count; i++) {
-  String url = "http://www.test-" + i + ".com/\r\n";
-  os.write(url.getBytes());
-}
-os.flush();
-os.close();
-  }
-
-  public static final class BenchmarkResults {
-Map> timings = new HashMap<>();
-List runs = new ArrayList<>();

[nutch] branch master updated: NUTCH-2985 Disable plugin urlfilter-validator by default

2023-03-06 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 1999b1e11 NUTCH-2985 Disable plugin urlfilter-validator by default
1999b1e11 is described below

commit 1999b1e1199b773c8d08e4765cfa1824e99a9287
Author: Sebastian Nagel 
AuthorDate: Fri Feb 24 16:24:21 2023 +0100

NUTCH-2985 Disable plugin urlfilter-validator by default
---
 conf/nutch-default.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 69351c843..273cfccc5 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -1590,7 +1590,7 @@
 
 
   plugin.includes
-  protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)
+  protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)
   Regular expression naming plugin directory names to
   include.  Any plugin not matching this expression is excluded.
   By default Nutch includes plugins to crawl HTML and various other
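
To see the effect of the changed `plugin.includes` value, here is a small sketch that checks plugin directory names against the old and new expressions. It assumes each name must match the whole expression, as the property description suggests; the class and method names are hypothetical and this is not how Nutch's plugin loader is written.

```java
import java.util.regex.Pattern;

// Hypothetical helper comparing the old and new default of plugin.includes.
class PluginIncludesSketch {
  static final String OLD_DEFAULT =
      "protocol-http|urlfilter-(regex|validator)|parse-(html|tika)"
      + "|index-(basic|anchor)|indexer-solr|scoring-opic"
      + "|urlnormalizer-(pass|regex|basic)";

  static final String NEW_DEFAULT =
      "protocol-http|urlfilter-regex|parse-(html|tika)"
      + "|index-(basic|anchor)|indexer-solr|scoring-opic"
      + "|urlnormalizer-(pass|regex|basic)";

  // Assumption: a plugin is included only if its name matches the whole regex.
  static boolean isIncluded(String includesRegex, String pluginName) {
    return Pattern.compile(includesRegex).matcher(pluginName).matches();
  }

  public static void main(String[] args) {
    for (String name : new String[] {"urlfilter-regex", "urlfilter-validator"}) {
      System.out.println(name + ": old=" + isIncluded(OLD_DEFAULT, name)
          + ", new=" + isIncluded(NEW_DEFAULT, name));
    }
  }
}
```

Sites that rely on urlfilter-validator would need to add it back to `plugin.includes` in their own configuration.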



[nutch] branch master updated: NUTCH-2983 nutch-default.xml improvements - remove property "hadoop.job.history.user.location", obsolete since Hadoop 0.21.0 - normalize spelling (case) of URL and Crawl

2023-03-06 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new c8aecfa5d NUTCH-2983 nutch-default.xml improvements - remove property "hadoop.job.history.user.location", obsolete since Hadoop 0.21.0 - normalize spelling (case) of URL and CrawlDb - trim trailing space - fix typos - improve description of properties {db,linkdb}.ignore.{ex,in}ternal.links
c8aecfa5d is described below

commit c8aecfa5de609f8d7f0744bc1a1dea525e09ebe9
Author: Sebastian Nagel 
AuthorDate: Fri Feb 17 17:18:32 2023 +0100

NUTCH-2983 nutch-default.xml improvements
- remove property "hadoop.job.history.user.location", obsolete since Hadoop 0.21.0
- normalize spelling (case) of URL and CrawlDb
- trim trailing space
- fix typos
- improve description of properties {db,linkdb}.ignore.{ex,in}ternal.links
---
 conf/nutch-default.xml | 278 -
 1 file changed, 137 insertions(+), 141 deletions(-)

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index d05503d23..69351c843 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -33,7 +33,7 @@
   confuse this setting with the http.content.limit setting.
   
 
-  
+
 
   file.crawl.parent
   true
@@ -72,7 +72,7 @@
 
   http.agent.name
   
-  HTTP 'User-Agent' request header. MUST NOT be empty - 
+  HTTP 'User-Agent' request header. MUST NOT be empty -
   please set this to a single word uniquely related to your organization.
 
   NOTE: You should also check other related properties:
@@ -92,23 +92,23 @@
   http.robots.agents
   
   Any other agents, apart from 'http.agent.name', that the robots
-  parser would look for in robots.txt. Multiple agents can be provided using 
+  parser would look for in robots.txt. Multiple agents can be provided using
   comma as a delimiter. eg. mybot,foo-spider,bar-crawler
-  
-  The ordering of agents does NOT matter and the robots parser would make 
-  decision based on the agent which matches first to the robots rules.  
-  Also, there is NO need to add a wildcard (ie. "*") to this string as the 
-  robots parser would smartly take care of a no-match situation. 
-
-  If no value is specified, by default HTTP agent (ie. 'http.agent.name') 
-  would be used for user agent matching by the robots parser. 
+
+  The ordering of agents does NOT matter and the robots parser would make
+  decision based on the agent which matches first to the robots rules.
+  Also, there is NO need to add a wildcard (ie. "*") to this string as the
+  robots parser would smartly take care of a no-match situation.
+
+  If no value is specified, by default HTTP agent (ie. 'http.agent.name')
+  would be used for user agent matching by the robots parser.
   
 
 
 
   http.robot.rules.allowlist
   
-  Comma separated list of hostnames or IP addresses to ignore 
+  Comma separated list of hostnames or IP addresses to ignore
   robot rules parsing for. Use with care and only if you are explicitly
   allowed by the site owner to ignore the site's robots.txt!
   Also keep in mind: ignoring the robots.txt rules means that no robots.txt
@@ -166,7 +166,7 @@
 
   http.agent.url
   
-  A URL to advertise in the User-Agent header.  This will 
+  A URL to advertise in the User-Agent header.  This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
@@ -185,7 +185,7 @@
 
   http.agent.version
   Nutch-1.20-SNAPSHOT
-  A version string to advertise in the User-Agent 
+  A version string to advertise in the User-Agent
header.
 
 
@@ -346,7 +346,7 @@
 
   http.proxy.exception.list
   
-  A comma separated list of hosts that don't use the proxy 
+  A comma separated list of hosts that don't use the proxy
   (e.g. intranets). Example: www.apache.org
 
 
@@ -377,7 +377,7 @@
   Value of the "Accept-Language" request header field.
   This allows selecting non-English language as default one to retrieve.
   It is a useful setting for search engines build for certain national group.
-  To send requests without "Accept-Language" header field, thi  property must
+  To send requests without "Accept-Language" header field, this property must
   be configured to contain a space character because an empty property does
   not overwrite the default.
   
@@ -402,8 +402,8 @@
 
   http.store.responsetime
   true
-  Enables us to record the response time of the 
-  host which is the time period between start connection to end 
+  Enables us to record the response time of the
+  host which is the time period between start connection to end
   connection of a pages host. The response time in milliseconds
   is stored in CrawlDb in CrawlDatum's meta data

[nutch] branch master updated: NUTCH-2972 Javadoc build fails using JDK 17 - fix Javadoc issues when building with JDK 17

2023-03-06 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new a92878df1 NUTCH-2972 Javadoc build fails using JDK 17 - fix Javadoc issues when building with JDK 17
a92878df1 is described below

commit a92878df1ea586057dc8bc7e9ade376a9b8edc20
Author: Sebastian Nagel 
AuthorDate: Fri Feb 24 17:16:27 2023 +0100

NUTCH-2972 Javadoc build fails using JDK 17
- fix Javadoc issues when building with JDK 17
---
 src/java/org/apache/nutch/segment/SegmentMerger.java | 14 --
 src/java/org/apache/nutch/tools/arc/ArcRecordReader.java | 16 +++-
 .../apache/nutch/urlfilter/suffix/SuffixURLFilter.java   |  8 +---
 3 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/src/java/org/apache/nutch/segment/SegmentMerger.java b/src/java/org/apache/nutch/segment/SegmentMerger.java
index 056df3c88..6bb90e472 100644
--- a/src/java/org/apache/nutch/segment/SegmentMerger.java
+++ b/src/java/org/apache/nutch/segment/SegmentMerger.java
@@ -76,7 +76,9 @@ import org.apache.nutch.util.NutchJob;
  * 
  * Also, it's possible to slice the resulting segment into chunks of fixed size.
  * 
- * Important Notes Which parts are merged?
+ * 
+ * Important Notes
+ * Which parts are merged?
  * 
  * It doesn't make sense to merge data from segments, which are at different
  * stages of processing (e.g. one unfetched segment, one fetched but not parsed,
@@ -87,14 +89,14 @@ import org.apache.nutch.util.NutchJob;
  * fall back to just merging fetchlists, and it will skip all other data from
  * all segments.
  * 
- * Merging fetchlists
+ * Merging fetchlists
  * 
  * Merging segments, which contain just fetchlists (i.e. prior to fetching) is
  * not recommended, because this tool (unlike the
  * {@link org.apache.nutch.crawl.Generator} doesn't ensure that fetchlist parts
  * for each map task are disjoint.
  * 
- * Duplicate content
+ * Duplicate content
  * Merging segments removes older content whenever possible (see below).
  * However, this is NOT the same as de-duplication, which in addition removes
  * identical content found at different URL-s. In other words, running
@@ -108,15 +110,15 @@ import org.apache.nutch.util.NutchJob;
  * segments be named in an increasing lexicographic order as their creation time
  * increases.
  * 
- * Merging and indexes
+ * Merging and indexes
  * 
  * Merged segment gets a different name. Since Indexer embeds segment names in
  * indexes, any indexes originally created for the input segments will NOT work
  * with the merged segment. Newly created merged segment(s) need to be indexed
  * afresh. This tool doesn't use existing indexes in any way, so if you plan to
  * merge segments you don't have to index them prior to merging.
- * 
- * @author Andrzej Bialecki
+ * 
+ * 
  */
 public class SegmentMerger extends Configured implements Tool{
   private static final Logger LOG = LoggerFactory
diff --git a/src/java/org/apache/nutch/tools/arc/ArcRecordReader.java b/src/java/org/apache/nutch/tools/arc/ArcRecordReader.java
index 0a93947e4..b514a63fc 100644
--- a/src/java/org/apache/nutch/tools/arc/ArcRecordReader.java
+++ b/src/java/org/apache/nutch/tools/arc/ArcRecordReader.java
@@ -38,19 +38,17 @@ import org.apache.hadoop.util.ReflectionUtils;
 /**
  * The ArchRecordReader class provides a record reader which reads
  * records from arc files.
- * 
+ * 
  * Arc files are essentially tars of gzips. Each record in an arc file is a
  * compressed gzip. Multiple records are concatenated together to form a
- * complete arc. 
- * For more information on the arc file format 
- * @see ArcFileFormat.
- * 
+ * complete arc.
  * 
- * 
- * Arc files are used by the internet archive and grub projects.
- * 
+ * For more information on the arc file format 
+ * @see ArcFileFormat.
+
+ * Arc files are used by the Internet Archive and grub projects.
  * 
- * @see archive.org 
+ * @see archive.org
  * @see grub.org
  */
 public class ArcRecordReader extends RecordReader {
diff --git a/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java b/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java
index dd8605f79..5edf5fc38 100644
--- a/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java
+++ b/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java
@@ -78,6 +78,9 @@ import java.net.MalformedURLException;
  * expressions, it only accepts literal suffixes. I.e. a suffix "+*.jpg" is most
  * probably wrong, you should use "+.jpg" instead.
  * 
+ * 
+ * 
+ * Examples
  * Example 1
  * 
  * The configuration shown below will accept all URLs with '.html' or '.htm'
@@ -96,7 +99,7 @@ import java.net.MalformedURLException;
  *  .htm
  *

[nutch] branch master updated: NUTCH-2982 Generator: parameter for URL normalization not passed forward - pass forward params `norm` and `maxNumSegments` - fix typos in Javadoc

2023-03-06 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new ef2949691 NUTCH-2982 Generator: parameter for URL normalization not passed forward - pass forward params `norm` and `maxNumSegments` - fix typos in Javadoc
ef2949691 is described below

commit ef29496915d2c230466412d99ac4236a8e647932
Author: Sebastian Nagel 
AuthorDate: Fri Feb 17 16:12:26 2023 +0100

NUTCH-2982 Generator: parameter for URL normalization not passed forward
- pass forward params `norm` and `maxNumSegments`
- fix typos in Javadoc
---
 src/java/org/apache/nutch/crawl/Generator.java | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/java/org/apache/nutch/crawl/Generator.java b/src/java/org/apache/nutch/crawl/Generator.java
index 8a2f87ba4..8e085428d 100644
--- a/src/java/org/apache/nutch/crawl/Generator.java
+++ b/src/java/org/apache/nutch/crawl/Generator.java
@@ -750,7 +750,7 @@ public class Generator extends NutchTool implements Tool {
* @param curTime
*  Current time in milliseconds
* @param filter whether to apply filtering operation
-   * @param norm whether to apply normilization operation
+   * @param norm whether to apply normalization operation
   * @param force if true, and the target lockfile exists, consider it valid. If false
*  and the target file exists, throw an IOException.
* @param maxNumSegments maximum number of segments to generate
@@ -768,8 +768,8 @@ public class Generator extends NutchTool implements Tool {
   long curTime, boolean filter, boolean norm, boolean force,
   int maxNumSegments, String expr)
   throws IOException, InterruptedException, ClassNotFoundException {
-return generate(dbDir, segments, numLists, topN, curTime, filter, true,
-force, 1, expr, null);
+return generate(dbDir, segments, numLists, topN, curTime, filter, norm,
+force, maxNumSegments, expr, null);
   }
 
   /**
@@ -789,7 +789,7 @@ public class Generator extends NutchTool implements Tool {
* @param curTime
*  Current time in milliseconds
* @param filter whether to apply filtering operation
-   * @param norm whether to apply normilization operation
+   * @param norm whether to apply normalization operation
   * @param force if true, and the target lockfile exists, consider it valid. If false
*  and the target file exists, throw an IOException.
* @param maxNumSegments maximum number of segments to generate
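
The bug fixed by this commit is a convenience overload that accepted `norm` and `maxNumSegments` but forwarded hard-coded values instead. The sketch below reproduces that pattern with reduced, hypothetical signatures (the real `generate` methods take more arguments and return segment paths) to show why callers could not disable normalization or request more than one segment.

```java
// Hypothetical reduction of the overload chain; only the forwarding matters.
class GenerateOverloadSketch {

  // Stand-in for the full method: reports the settings it actually received.
  static String generate(boolean filter, boolean norm, boolean force,
      int maxNumSegments, String hostdb) {
    return "norm=" + norm + ", maxNumSegments=" + maxNumSegments;
  }

  // Before the fix: the caller's norm and maxNumSegments are silently dropped.
  static String generateBeforeFix(boolean filter, boolean norm, boolean force,
      int maxNumSegments) {
    return generate(filter, true, force, 1, null);
  }

  // After the fix: both parameters are passed forward unchanged.
  static String generateAfterFix(boolean filter, boolean norm, boolean force,
      int maxNumSegments) {
    return generate(filter, norm, force, maxNumSegments, null);
  }

  public static void main(String[] args) {
    System.out.println(generateBeforeFix(true, false, false, 4));
    System.out.println(generateAfterFix(true, false, false, 4));
  }
}
```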



[nutch] 01/07: NUTCH-2920 -- first working attempt at migrating ElasticsearchIndexWriter to OpenSearch

2023-03-06 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit ca3824fd98290dd7806752decfab6eb9e3b3b569
Author: tallison 
AuthorDate: Fri Feb 24 14:48:55 2023 -0500

NUTCH-2920 -- first working attempt at migrating ElasticsearchIndexWriter to OpenSearch
---
 LICENSE-binary |   1 +
 NOTICE-binary  |   4 +
 conf/index-writers.xml.template|  27 ++
 src/plugin/build.xml   |   1 +
 src/plugin/indexer-opensearch-1x/README.md |  44 +++
 src/plugin/indexer-opensearch-1x/build-ivy.xml |  47 +++
 src/plugin/indexer-opensearch-1x/build.xml |  32 ++
 src/plugin/indexer-opensearch-1x/ivy.xml   |  46 +++
 src/plugin/indexer-opensearch-1x/plugin.xml|  76 
 .../opensearch1x/OpenSearch1xConstants.java|  38 ++
 .../opensearch1x/OpenSearch1xIndexWriter.java  | 419 +
 .../indexwriter/opensearch1x/package-info.java |  22 ++
 12 files changed, 757 insertions(+)

diff --git a/LICENSE-binary b/LICENSE-binary
index d07d0a6a3..8e24a728e 100644
--- a/LICENSE-binary
+++ b/LICENSE-binary
@@ -505,6 +505,7 @@ org.jetbrains.kotlin:kotlin-stdlib-jdk8
 org.lz4:lz4-java
 org.mapdb:mapdb
 org.netpreserve.commons:webarchive-commons
+org.opensearch.client:opensearch-rest-high-level-client
 org.seleniumhq.selenium:htmlunit-driver
 org.seleniumhq.selenium:selenium-api
 org.seleniumhq.selenium:selenium-chrome-driver
diff --git a/NOTICE-binary b/NOTICE-binary
index 83d65ffaf..1aab2cb41 100644
--- a/NOTICE-binary
+++ b/NOTICE-binary
@@ -1021,6 +1021,10 @@ mapdb (http://www.mapdb.org)
 webarchive-commons (https://github.com/iipc/webarchive-commons)
 - license: The Apache Software License, Version 2.0
 
+# org.opensearch.client:opensearch-rest-high-level-client
+opensearch-rest-high-level-client (https://opensearch.org/)
+- license: The Apache Software License, Version 2.0
+
 # org.ow2.asm:asm
 asm (http://asm.ow2.io/)
 - license: BSD-3-Clause
diff --git a/conf/index-writers.xml.template b/conf/index-writers.xml.template
index 9f5d7916c..221f5affe 100644
--- a/conf/index-writers.xml.template
+++ b/conf/index-writers.xml.template
@@ -128,6 +128,33 @@
   
 
   
+  
+
+  
+  
+  
+  
+  
+  
+  
+  
+  
+  
+  
+  
+  
+  
+  
+  
+
+
+  
+
+  
+  
+  
+
+  
   
 
   
diff --git a/src/plugin/build.xml b/src/plugin/build.xml
index db7d4d560..4d900c390 100755
--- a/src/plugin/build.xml
+++ b/src/plugin/build.xml
@@ -54,6 +54,7 @@
 
 
 
+
 
 
 
diff --git a/src/plugin/indexer-opensearch-1x/README.md b/src/plugin/indexer-opensearch-1x/README.md
new file mode 100644
index 0..b68557fae
--- /dev/null
+++ b/src/plugin/indexer-opensearch-1x/README.md
@@ -0,0 +1,44 @@
+indexer-opensearch1x plugin for Nutch 
+
+
+**indexer-opensearch1x plugin** is used for sending documents from one or more segments to an OpenSearch server. The configuration for the index writers is on **conf/index-writers.xml** file, included in the official Nutch distribution and it's as follow:
+
+```xml
+
+  
+...
+  
+  
+...
+ 
+
+```
+
+Each `` element has two mandatory attributes:
+
+* `` is a unique identification for each configuration. This feature allows Nutch to distinguish each configuration, even when they are for the same index writer. In addition, it allows to have multiple instances for the same index writer, but with different configurations.
+
+* `org.apache.nutch.indexwriter.opensearch1x.OpenSearch1x.IndexWriter` corresponds to the canonical name of the class that implements the IndexWriter extension point. This value should not be modified for the **indexer-opensearch1x plugin**.
+
+## Mapping
+
+The mapping section is explained [here](https://cwiki.apache.org/confluence/display/NUTCH/IndexWriters#IndexWriters-Mappingsection). The structure of this section is general for all index writers.
+
+## Parameters
+
+Each parameter has the form `` and the parameters for this index writer are:
+
+Parameter Name | Description | Default value
+--|--|--
+host | Comma-separated list of hostnames to send documents to using [TransportClient](https://static.javadoc.io/org.opensearch/opensearch/1.3.8/org/opensearch/client/transport/TransportClient.html). Either host and port must be defined. |
+port | The port to connect to using [TransportClient](https://static.javadoc.io/org.opensearch/opensearch/1.3.8/org/opensearch/client/transport/TransportClient.html). | 9300
+scheme | The scheme (http or https) to connect to OpenSearch server. | https
+index | Default index to send documents to. | nutch
+username | Username for auth credentials | admin
+password | Password

[nutch] 06/07: fix template to include new key store info. remove unused auth

2023-03-06 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit e03cad3f42b9be16f45b2012fc738106894ac332
Author: tallison 
AuthorDate: Wed Mar 1 15:34:08 2023 -0500

fix template to include new key store info.  remove unused auth
---
 conf/index-writers.xml.template | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/conf/index-writers.xml.template b/conf/index-writers.xml.template
index 221f5affe..549ebd4c9 100644
--- a/conf/index-writers.xml.template
+++ b/conf/index-writers.xml.template
@@ -136,10 +136,12 @@
   
   
   
-  
   
   
   
+  
+  
+  
   
   
   



[nutch] 05/07: NUTCH-2920 -- improve username/pw logic and update README.md

2023-03-06 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 71fabb2a87ff81b78997133ab7c790afa4ea6157
Author: tallison 
AuthorDate: Wed Mar 1 13:48:57 2023 -0500

NUTCH-2920 -- improve username/pw logic and update README.md
---
 src/plugin/indexer-opensearch-1x/README.md | 24 +-
 .../opensearch1x/OpenSearch1xIndexWriter.java  | 10 ++---
 2 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/src/plugin/indexer-opensearch-1x/README.md b/src/plugin/indexer-opensearch-1x/README.md
index b68557fae..52e5844af 100644
--- a/src/plugin/indexer-opensearch-1x/README.md
+++ b/src/plugin/indexer-opensearch-1x/README.md
@@ -36,9 +36,31 @@ scheme | The scheme (http or https) to connect to OpenSearch server. | https
 index | Default index to send documents to. | nutch
 username | Username for auth credentials | admin
 password | Password for auth credentials | admin
-auth | Whether to enable HTTP basic authentication with OpenSearch. Use `username` and `password` properties to configure your credentials. | false
+trust.store.path | Path to the trust store |
+trust.store.password | Password for trust store |
+trust.store.type | Type of trust store | JKS
+key.store.path | Path to the key store |
+key.store.password | Password for the key and the key store |
+key.store.type | Type of key store | JKS
 max.bulk.docs | Maximum size of the bulk in number of documents. | 250
 max.bulk.size | Maximum size of the bulk in bytes. | 2500500
 exponential.backoff.millis | Initial delay for the [BulkProcessor](https://static.javadoc.io/org.opensearch/opensearch/1.3.8/org/opensearch/action/bulk/BulkProcessor.html) exponential backoff policy. | 100
 exponential.backoff.retries | Number of times the [BulkProcessor](https://static.javadoc.io/org.opensearch/opensearch/1.3.8/org/opensearch/action/bulk/BulkProcessor.html) exponential backoff policy should retry bulk operations. | 10
 bulk.close.timeout | Number of seconds allowed for the [BulkProcessor](https://static.javadoc.io/org.opensearch/opensearch/1.3.8/org/opensearch/action/bulk/BulkProcessor.html) to complete its last operation. | 600
+
+## Authentication and SSL/TLS
+
+It is highly recommended that users use at least basic authentication (modify the `username` and `password`!!!) and that they set up at least the trust store (1-way TLS).
+For a "getting started" level introduction to setting up a trust store, see: [Connecting java-high-level-rest-client](https://opensearch.org/blog/connecting-java-high-level-rest-client-with-opensearch-over-https/).
+For a more in depth treatment, see: [Configuring TLS certificates](https://opensearch.org/docs/latest/security/configuration/tls/).
+
+Users may opt for 2-way TLS and skip basic authentication (`username` and `password`).  
+To do this, specify both the `trust.store.*` parameters and the `key.store.*` parameters.
+
+If users do not specify at least 1-way TLS (trust-store), this indexer logs a warning that this is a bad idea(TM), and it will proceed by completely ignoring all the SSL security.
+
+## Design
+This index writer was built to be as close as possible to Nutch's existing indexer-elastic code. We
+therefore chose to use the to-be-deprecated-in-3.x `opensearch-rest-high-level-client`.
+We should plan to migrate to the `java client` for 2.x, whenever the BulkProcessor has been added.
+See the discussion on [NUTCH-2920](https://issues.apache.org/jira/projects/NUTCH/issues/NUTCH-2920).
\ No newline at end of file
diff --git a/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java b/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java
index ec516e250..878c55a09 100644
--- a/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java
+++ b/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java
@@ -194,6 +194,10 @@ public class OpenSearch1xIndexWriter implements IndexWriter {
 keyStorePath = parameters.get(OpenSearch1xConstants.KEY_STORE_PATH);
 keyStorePassword = parameters.get(OpenSearch1xConstants.KEY_STORE_PASSWORD);
 keyStoreType = parameters.get(OpenSearch1xConstants.KEY_STORE_TYPE, "JKS");
+
+if (! StringUtils.isAllBlank(user) && password == null) {
+  throw new IllegalArgumentException("Must specify a password, even if empty, if a 'user' is specified.");
+}
 boolean basicAuth = user != null && password != null;
 
 final CredentialsProvider credentialsProvider = new BasicCredentialsProvider();
@@ -262,9 +266,9 @@ public class OpenSearch1xIndexWriter implements IndexWriter {
   sslBuilder.loadTrustMaterial(trustStore.get(), null);
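
Note on the username/password check added in this hunk: the behavior can be reproduced in isolation with the sketch below. It is an illustration under assumptions, not the plugin's code. The class `BasicAuthCheck` and the method `useBasicAuth` are hypothetical names, and `isBlank` is a plain-Java stand-in for the commons-lang3 `StringUtils.isAllBlank` call used in the patch (applied to a single value).

```java
// Hypothetical stand-alone version of the user/password validation shown in the diff.
final class BasicAuthCheck {

  /** Stand-in for StringUtils.isAllBlank(user): null, empty, or whitespace only. */
  private static boolean isBlank(String s) {
    return s == null || s.chars().allMatch(Character::isWhitespace);
  }

  /**
   * Returns true if basic authentication should be configured.
   * Throws if a user is given but no password at all (null);
   * an empty password is accepted.
   */
  public static boolean useBasicAuth(String user, String password) {
    if (!isBlank(user) && password == null) {
      throw new IllegalArgumentException(
          "Must specify a password, even if empty, if a 'user' is specified.");
    }
    return user != null && password != null;
  }

  public static void main(String[] args) {
    System.out.println(useBasicAuth("admin", ""));   // user with empty password
    System.out.println(useBasicAuth(null, null));    // no credentials at all
    try {
      useBasicAuth("admin", null);                   // user without password
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

The check distinguishes a missing password (`null`) from an empty one, which is why the error message asks for a password "even if empty".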

[nutch] 07/07: Add indexer-opensearch-1x to 4 more targets...feedback from sebastian-nagel

2023-03-06 Thread snagel

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit e8fd21090c0a1e387ee3b5796b7a3be11cf91293
Author: tballison 
AuthorDate: Fri Mar 3 14:48:20 2023 -0500

Add indexer-opensearch-1x to 4 more targets...feedback from sebastian-nagel
---
 build.xml| 3 +++
 src/plugin/build.xml | 1 +
 2 files changed, 4 insertions(+)

diff --git a/build.xml b/build.xml
index 594fabc24..cc88493f3 100644
--- a/build.xml
+++ b/build.xml
@@ -221,6 +221,7 @@
   
   
   
+  
   
   
   
@@ -738,6 +739,7 @@
   
   
   
+  
   
   
   
@@ -1242,6 +1244,7 @@
 
 
 
+
 
 
 
diff --git a/src/plugin/build.xml b/src/plugin/build.xml
index 4d900c390..e83f25273 100755
--- a/src/plugin/build.xml
+++ b/src/plugin/build.xml
@@ -195,6 +195,7 @@
 
 
 
+
 
 
 



[nutch] branch master updated (383aeca5d -> e8fd21090)

2023-03-06 Thread snagel

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


from 383aeca5d NUTCH-2980: Upgraded Selenium to 4.7.2 + HTMLUnit
 new ca3824fd9 NUTCH-2920 -- first working attempt at migrating 
ElasticsearchIndexWriter to OpenSearch
 new 6e149f495 NUTCH-2920 -- fix imports
 new f6b17177a NUTCH-2920 -- add keystore for 2-way tls; add back in no-tls 
option with a stern warning and possibly helpful links.
 new 5fc2839c4 NUTCH-2920 -- improve handling for missing trust.store.path 
in the index-writers.xml
 new 71fabb2a8 NUTCH-2920 -- improve username/pw logic and update README.md
 new e03cad3f4 fix template to include new key store info.  remove unused 
auth
 new e8fd21090 Add indexer-opensearch-1x to 4 more targets...feedback from 
sebastian-nagel

The 7 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 LICENSE-binary |   1 +
 NOTICE-binary  |   4 +
 build.xml  |   3 +
 conf/index-writers.xml.template|  29 ++
 src/plugin/build.xml   |   2 +
 src/plugin/indexer-opensearch-1x/README.md |  66 +++
 .../{any23 => indexer-opensearch-1x}/build-ivy.xml |   2 +-
 .../build.xml  |   2 +-
 .../ivy.xml|   2 +-
 src/plugin/indexer-opensearch-1x/plugin.xml|  76 
 .../opensearch1x/OpenSearch1xConstants.java}   |  12 +-
 .../opensearch1x/OpenSearch1xIndexWriter.java  | 472 +
 .../indexwriter/opensearch1x}/package-info.java|   4 +-
 13 files changed, 666 insertions(+), 9 deletions(-)
 create mode 100644 src/plugin/indexer-opensearch-1x/README.md
 copy src/plugin/{any23 => indexer-opensearch-1x}/build-ivy.xml (95%)
 copy src/plugin/{indexer-elastic => indexer-opensearch-1x}/build.xml (94%)
 copy src/plugin/{indexer-elastic => indexer-opensearch-1x}/ivy.xml (94%)
 create mode 100644 src/plugin/indexer-opensearch-1x/plugin.xml
 copy 
src/plugin/{indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticConstants.java
 => 
indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xConstants.java}
 (76%)
 create mode 100644 
src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java
 copy src/{java/org/apache/nutch/parse => 
plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x}/package-info.java
 (86%)



[nutch] 03/07: NUTCH-2920 -- add keystore for 2-way tls; add back in no-tls option with a stern warning and possibly helpful links.

2023-03-06 Thread snagel

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit f6b17177ad6049b5642d9510cb60fe0a1d3b5f1c
Author: tallison 
AuthorDate: Wed Mar 1 12:16:17 2023 -0500

NUTCH-2920 -- add keystore for 2-way tls; add back in no-tls option with a 
stern warning and possibly helpful links.
---
 .../opensearch1x/OpenSearch1xConstants.java|   6 +-
 .../opensearch1x/OpenSearch1xIndexWriter.java  | 137 +++--
 2 files changed, 99 insertions(+), 44 deletions(-)

diff --git 
a/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xConstants.java
 
b/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xConstants.java
index 8ca5038dd..cb172bda2 100644
--- 
a/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xConstants.java
+++ 
b/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xConstants.java
@@ -20,14 +20,14 @@ public interface OpenSearch1xConstants {
   String HOSTS = "host";
   String PORT = "port";
   String SCHEME = "scheme";
-
   String USER = "username";
   String PASSWORD = "password";
-  String USE_AUTH = "auth";
-
   String TRUST_STORE_PATH = "trust.store.path";
   String TRUST_STORE_PASSWORD = "trust.store.password";
   String TRUST_STORE_TYPE = "trust.store.type";
+  String KEY_STORE_PATH = "key.store.path";
+  String KEY_STORE_PASSWORD = "key.store.password";
+  String KEY_STORE_TYPE = "key.store.type";
   String INDEX = "index";
   String MAX_BULK_DOCS = "max.bulk.docs";
   String MAX_BULK_LENGTH = "max.bulk.size";
diff --git 
a/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java
 
b/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java
index e796a69e4..a121f15a2 100644
--- 
a/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java
+++ 
b/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java
@@ -22,6 +22,7 @@ import org.apache.http.HttpHost;
 import org.apache.http.auth.AuthScope;
 import org.apache.http.auth.UsernamePasswordCredentials;
 import org.apache.http.client.CredentialsProvider;
+import org.apache.http.conn.ssl.TrustSelfSignedStrategy;
 import org.apache.http.impl.client.BasicCredentialsProvider;
 import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
 import org.apache.http.ssl.SSLContextBuilder;
@@ -52,18 +53,20 @@ import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
 import javax.net.ssl.SSLContext;
-import java.io.File;
 import java.io.IOException;
 import java.io.InputStream;
 import java.lang.invoke.MethodHandles;
 import java.nio.file.Files;
 import java.nio.file.Paths;
+import java.security.GeneralSecurityException;
 import java.security.KeyStore;
 import java.time.format.DateTimeFormatter;
+
 import java.util.AbstractMap;
 import java.util.LinkedHashMap;
 import java.util.List;
 import java.util.Map;
+import java.util.Optional;
 import java.util.concurrent.TimeUnit;
 
 /**
@@ -80,18 +83,19 @@ public class OpenSearch1xIndexWriter implements IndexWriter 
{
   private static final int DEFAULT_EXP_BACKOFF_RETRIES = 10;
   private static final int DEFAULT_BULK_CLOSE_TIMEOUT = 600;
   private static final String DEFAULT_INDEX = "nutch";
-  private static final String DEFAULT_USER = "elastic";
-
+  private static final String DEFAULT_USER = "admin";
+  private static final String DEFAULT_PASSWORD = "admin";
   private String[] hosts;
   private int port;
-  private String scheme = HttpHost.DEFAULT_SCHEME_NAME;
-  private String user = null;
-  private String password = null;
-  private boolean auth;
-
+  private String scheme = "https";
+  private String user;
+  private String password;
   private String trustStorePath;
   private String trustStorePassword;
   private String trustStoreType;
+  private String keyStorePath;
+  private String keyStorePassword;
+  private String keyStoreType;
   private int maxBulkDocs;
   private int maxBulkLength;
   private int expBackoffMillis;
@@ -105,6 +109,7 @@ public class OpenSearch1xIndexWriter implements IndexWriter 
{
 
   private Configuration config;
 
+
   @Override
   public void open(Configuration conf, String name) throws IOException {
 // Implementation not required
@@ -125,7 +130,7 @@ public class OpenSearch1xIndexWriter implements IndexWriter 
{
 String hosts = parameters.get(OpenSearch1xConstants.HOSTS);
 
 if (StringUtils.isBlank(hosts)) {
-  String message = &

[nutch] 04/07: NUTCH-2920 -- improve handling for missing trust.store.path in the index-writers.xml

2023-03-06 Thread snagel

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 5fc2839c447a1b3695b4bcb507d428d32ff27281
Author: tallison 
AuthorDate: Wed Mar 1 13:28:07 2023 -0500

NUTCH-2920 -- improve handling for missing trust.store.path in the 
index-writers.xml
---
 .../nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java| 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git 
a/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java
 
b/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java
index a121f15a2..ec516e250 100644
--- 
a/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java
+++ 
b/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java
@@ -253,9 +253,6 @@ public class OpenSearch1xIndexWriter implements IndexWriter 
{
   }
 
   private SSLContext createSSLContext() throws GeneralSecurityException, 
IOException {
-if (trustStorePath == null && keyStorePath == null) {
-  return SSLContexts.createDefault();
-}
 
 SSLContextBuilder sslBuilder = SSLContexts.custom();
 Optional<KeyStore> trustStore = loadStore(trustStorePath, 
trustStorePassword, trustStoreType);
@@ -283,8 +280,8 @@ public class OpenSearch1xIndexWriter implements IndexWriter 
{
 if (StringUtils.isAllBlank(storePath)) {
   return Optional.empty();
 }
-if (StringUtils.isAllBlank(storePassword)) {
-  throw new IllegalArgumentException("must include a password for store: " 
+ storePath);
+if (storePassword == null) {
+  throw new IllegalArgumentException("must include a non-null password for 
store: " + storePath);
 }
 
 KeyStore store = KeyStore.getInstance(storeType);
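
Note on the `loadStore` change in this commit: the method returns an empty `Optional` when no store path is configured and now rejects only a `null` password, so an empty password is allowed. The sketch below shows that contract in a self-contained form. It is an assumption-laden illustration: the class name `StoreLoader` is hypothetical, the blank check replaces `StringUtils.isAllBlank`, and the part after `KeyStore.getInstance` (reading the file and calling `load`) is not visible in the diff and is filled in with standard `java.security.KeyStore` usage.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.util.Optional;

// Hypothetical stand-alone sketch of the loadStore contract; not the plugin's code.
final class StoreLoader {

  public static Optional<KeyStore> loadStore(String storePath, String storePassword,
      String storeType) throws GeneralSecurityException, IOException {
    // No path configured: the caller simply skips this store.
    if (storePath == null || storePath.trim().isEmpty()) {
      return Optional.empty();
    }
    // A path without any password is a configuration error; "" is accepted.
    if (storePassword == null) {
      throw new IllegalArgumentException(
          "must include a non-null password for store: " + storePath);
    }
    KeyStore store = KeyStore.getInstance(storeType);
    // Assumed continuation: read the store file with the given password.
    try (InputStream in = Files.newInputStream(Paths.get(storePath))) {
      store.load(in, storePassword.toCharArray());
    }
    return Optional.of(store);
  }

  public static void main(String[] args) throws Exception {
    System.out.println(loadStore(null, null, "JKS").isPresent());
    try {
      loadStore("/path/to/truststore.jks", null, "JKS");
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Returning `Optional.empty()` lets the SSL context builder treat the trust store and the key store uniformly: each is loaded only if present.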



[nutch] 02/07: NUTCH-2920 -- fix imports

2023-03-06 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 6e149f4954a0b7b21120b8e1467a07a82c60e66e
Author: tallison 
AuthorDate: Fri Feb 24 15:22:16 2023 -0500

NUTCH-2920 -- fix imports
---
 .../apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java | 3 ---
 1 file changed, 3 deletions(-)

diff --git 
a/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java
 
b/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java
index c31fbf17d..e796a69e4 100644
--- 
a/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java
+++ 
b/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java
@@ -22,8 +22,6 @@ import org.apache.http.HttpHost;
 import org.apache.http.auth.AuthScope;
 import org.apache.http.auth.UsernamePasswordCredentials;
 import org.apache.http.client.CredentialsProvider;
-import org.apache.http.conn.ssl.NoopHostnameVerifier;
-import org.apache.http.conn.ssl.TrustSelfSignedStrategy;
 import org.apache.http.impl.client.BasicCredentialsProvider;
 import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
 import org.apache.http.ssl.SSLContextBuilder;
@@ -33,7 +31,6 @@ import org.apache.nutch.indexer.IndexWriterParams;
 import org.apache.nutch.indexer.NutchDocument;
 import org.apache.nutch.indexer.NutchField;
 import org.apache.nutch.util.StringUtil;
-import org.checkerframework.checker.units.qual.K;
 import org.opensearch.action.DocWriteRequest;
 import org.opensearch.action.bulk.BackoffPolicy;
 import org.opensearch.action.bulk.BulkProcessor;



[nutch] branch master updated: NUTCH-2980: Upgraded Selenium to 4.7.2 + HTMLUnit

2023-02-18 Thread snagel

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 383aeca5d NUTCH-2980: Upgraded Selenium to 4.7.2 + HTMLUnit
383aeca5d is described below

commit 383aeca5d30342b29b6ee6e05f8f3052c62d7303
Author: Kamil Mroczek 
AuthorDate: Thu Jan 19 23:05:05 2023 -0500

NUTCH-2980: Upgraded Selenium to 4.7.2 + HTMLUnit

- Removed phantomJS dependency as it wasn't being used and the project has 
been archived since 2018 - it was causing problems casting TakeScreenshot to 
HtmlUnitWebDriver
- Improved README setup instructions for IntelliJ
---
 README.md  |  44 -
 src/plugin/lib-htmlunit/ivy.xml|  12 +-
 src/plugin/lib-htmlunit/plugin.xml | 214 -
 src/plugin/lib-selenium/ivy.xml|   7 +-
 src/plugin/lib-selenium/plugin.xml | 170 ++--
 .../nutch/protocol/selenium/HttpWebClient.java |  28 ---
 .../handlers/DefaultClickAllAjaxLinksHandler.java  |   7 +-
 7 files changed, 361 insertions(+), 121 deletions(-)

diff --git a/README.md b/README.md
index a0ab67bd1..ffd04ae22 100644
--- a/README.md
+++ b/README.md
@@ -40,6 +40,8 @@ To contribute a patch, follow these instructions (note that 
installing
 IDE setup
 =
 
+### Eclipse
+
 Generate Eclipse project files
 
 ```
@@ -48,13 +50,45 @@ ant eclipse
 
 and follow the instructions in [Importing existing 
projects](https://help.eclipse.org/2019-06/topic/org.eclipse.platform.doc.user/tasks/tasks-importproject.htm).
 
-For Intellij IDEA, first install the [IvyIDEA 
Plugin](https://plugins.jetbrains.com/plugin/3612-ivyidea). then run ```ant 
eclipse```. 
+You must [configure the 
nutch-site.xml](https://cwiki.apache.org/confluence/display/NUTCH/RunNutchInEclipse)
 before running. Make sure, you've added ```http.agent.name``` and 
```plugin.folders``` properties. The plugin.folders normally points to 
```/build/plugins```.
+
+Now create a Java Application Configuration, choose 
org.apache.nutch.crawl.Injector, add two paths as arguments. First one is the 
crawldb directory, second one is the URL directory where, the injector can read 
urls. Now run your configuration.
 
-Then open the project in IntelliJ. You may see popups like "Ant build scripts 
found", "Frameworks detected - IvyIDEA Framework detected". Just follow the 
simple steps in these dialogs.  
+If we still see the ```No plugins found on paths of property 
plugin.folders="plugins"```, update the plugin.folders in the 
nutch-default.xml, this is a quick fix, but should not be used.
 
-You must [configure the 
nutch-site.xml](https://cwiki.apache.org/confluence/display/NUTCH/RunNutchInEclipse)
 before running. Make sure, you've added ```http.agent.name``` and 
```plugin.folders``` properties. The plugin.folders normally points to 
```/build/plugins```. 
 
-Now create a Java Application Configuration, choose 
org.apache.nutch.crawl.Injector, add two paths as arguments. First one is the 
crawldb directory, second one is the URL directory where, the injector can read 
urls. Now run your configuration. 
+### Intellij IDEA
 
-If we still see the ```No plugins found on paths of property 
plugin.folders="plugins"```, update the plugin.folders in the 
nutch-default.xml, this is a quick fix, but should not be used.
+First install the [IvyIDEA 
Plugin](https://plugins.jetbrains.com/plugin/3612-ivyidea). then run ```ant 
eclipse```. This will create the necessary
+.classpath and .project files so that Intellij can import the project in the 
next step.
+
+In Intellij IDEA, select File > New > Project from Existing Sources. Select 
the nutch home directory and click "Open".
+
+On the "Import Project" screen select the "Import project from external model" 
radio button and select "Eclipse".
+Click "Create". On the next screen the "Eclipse projects directory" should be 
already set to the nutch folder.
+Leave the "Create module files near .classpath files" radio button selected.
+Click "Next" on the next screens. On the project SDK screen select Java 11 and 
click "Create".
+
+Once the project is imported, you will see a popup saying "Ant build scripts 
found", "Frameworks detected - IvyIDEA Framework detected". Click "Import".
+If you don't get the pop-up, I'd suggest going through the steps again as this 
happens from time to time. There is another
+Ant popup that asks you to configure the project. Do NOT click "Configure".
+
+To import the code-style, Go to Intellij IDEA > Preferences > Editor > Code 
Style > Java.
+
+For the Scheme dropdown select "Project". Click 

[nutch] branch master updated: NUTCH-2974 Ant build fails with "Unparseable date" on certain platforms

2023-02-17 Thread snagel

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 541e6936d NUTCH-2974 Ant build fails with "Unparseable date" on 
certain platforms
 new 19dbe7866 Merge pull request #752 from sebastian-nagel/NUTCH-2974
541e6936d is described below

commit 541e6936dfb1a07fe4c915b8b95c6b5cfdf2aeb0
Author: Sebastian Nagel 
AuthorDate: Mon Jan 16 14:22:20 2023 +0100

NUTCH-2974 Ant build fails with "Unparseable date" on certain platforms
---
 build.xml | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/build.xml b/build.xml
index 004a12191..594fabc24 100644
--- a/build.xml
+++ b/build.xml
@@ -102,7 +102,12 @@
 
 
 
-
+
+  
   
 
 



[nutch] branch master updated: NUTCH-2634 Some links marked as "nofollow" are followed anyway - fix detection of nofollow in multi-valued rel attributes

2023-01-08 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new dfdd00f31 NUTCH-2634 Some links marked as "nofollow" are followed 
anyway - fix detection of nofollow in multi-valued rel attributes
 new 9a1ed4015 Merge pull request #751 from sebastian-nagel/NUTCH-2634
dfdd00f31 is described below

commit dfdd00f3189839b6ed7d60651e5daa33f0038265
Author: Sebastian Nagel 
AuthorDate: Thu Jan 5 22:53:00 2023 +0100

NUTCH-2634 Some links marked as "nofollow" are followed anyway
- fix detection of nofollow in multi-valued rel attributes
---
 .../org/apache/nutch/parse/html/DOMContentUtils.java   |  9 +++--
 .../apache/nutch/parse/html/TestDOMContentUtils.java   | 17 -
 .../org/apache/nutch/parse/tika/DOMContentUtils.java   |  6 +-
 .../apache/nutch/parse/tika/TestDOMContentUtils.java   | 18 --
 4 files changed, 36 insertions(+), 14 deletions(-)
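
Note on the fix described in this commit: an exact string comparison against "nofollow" misses `rel` attributes that carry several space-separated values, while a word-boundary regular expression finds the token anywhere in the value. The sketch below contrasts the two checks. The pattern is the one introduced by the patch; the class `NofollowCheck` and the methods `oldCheck` and `newCheck` are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of the old and new nofollow detection.
final class NofollowCheck {

  // Same expression as in the patch: "nofollow" as a whole word, any case.
  private static final Pattern NOFOLLOW_PATTERN =
      Pattern.compile("\\bnofollow\\b", Pattern.CASE_INSENSITIVE);

  /** Old behavior: true only if the whole attribute value is "nofollow". */
  public static boolean oldCheck(String rel) {
    return "nofollow".equalsIgnoreCase(rel);
  }

  /** New behavior: true if "nofollow" occurs as a word anywhere in the value. */
  public static boolean newCheck(String rel) {
    return NOFOLLOW_PATTERN.matcher(rel).find();
  }

  public static void main(String[] args) {
    String[] samples = {"nofollow", "noreferrer nofollow", "NoFollow ugc", "follow"};
    for (String rel : samples) {
      System.out.println(rel + " -> old=" + oldCheck(rel) + ", new=" + newCheck(rel));
    }
  }
}
```

One caveat of the word-boundary approach: `\b` also matches next to punctuation such as a hyphen, so a value like "x-nofollow" would be detected as well.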

diff --git 
a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
 
b/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
index 2415e8568..76685675b 100644
--- 
a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
+++ 
b/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
@@ -23,6 +23,7 @@ import java.util.ArrayList;
 import java.util.HashMap;
 import java.util.HashSet;
 import java.util.Set;
+import java.util.regex.Pattern;
 
 import org.apache.nutch.parse.Outlink;
 import org.apache.nutch.util.NodeWalker;
@@ -30,6 +31,7 @@ import org.apache.nutch.util.URLUtil;
 import org.w3c.dom.NamedNodeMap;
 import org.w3c.dom.Node;
 import org.w3c.dom.NodeList;
+
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.io.MapWritable;
 import org.apache.hadoop.io.Text;
@@ -42,7 +44,10 @@ import org.apache.hadoop.io.Text;
  * 
  */
 public class DOMContentUtils {
-  
+
+  private static Pattern NOFOLLOW_PATTERN = Pattern.compile("\\bnofollow\\b",
+  Pattern.CASE_INSENSITIVE);
+
   private String srcTagMetaName;
   private boolean keepNodenames;
   private Set blockNodes;
@@ -451,7 +456,7 @@ public class DOMContentUtils {
   if (params.attrName.equalsIgnoreCase(attrName)) {
 target = attr.getNodeValue();
   } else if ("rel".equalsIgnoreCase(attrName)
-  && "nofollow".equalsIgnoreCase(attr.getNodeValue())) {
+  && NOFOLLOW_PATTERN.matcher(attr.getNodeValue()).find()) {
 noFollow = true;
   } else if ("method".equalsIgnoreCase(attrName)
   && "post".equalsIgnoreCase(attr.getNodeValue())) {
diff --git 
a/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java
 
b/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java
index 0c1212a50..d50e9052d 100644
--- 
a/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java
+++ 
b/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java
@@ -103,6 +103,11 @@ public class TestDOMContentUtils {
   + "http://www.nutch.org\; rel=\"nofollow\"> ignore "
   + "http://www.nutch.org\;> ignore "
   + ""),
+  // multiple space-separated rel values (NUTCH-2634)
+  new String(""
+  + "http://www.nutch.org\; rel=\"noreferrer nofollow\"> 
ignore "
+  + "http://www.nutch.org\;> 
ignore "
+  + ""),
   // test that POST form actions are skipped
   new String(""
   + ""
@@ -132,13 +137,13 @@ public class TestDOMContentUtils {
   + ""
   + "" + ""), };
 
-  private static int SKIP = 9;
+  private static int SKIP = 10;
 
   private static String[] testBaseHrefs = { "http://www.nutch.org",
   "http://www.nutch.org/docs/foo.html", "http://www.nutch.org/docs/",
   "http://www.nutch.org/docs/", "http://www.nutch.org/frames/",
   "http://www.nutch.org/maps/", "http://www.nutch.org/whitespace/",
-  "http://www.nutch.org//", "http://www.nutch.org/",
+  "http://www.nutch.org//", "http://www.nutch.org//", 
"http://www.nutch.org/",
   "http://www.nutch.org/", "http://www.nutch.org/",
   "http://www.nutch.org/;something", "http://www.nutch.org/",
   "http://www.nutch.org/" };
@@ -159,12 +164,13 @@ public class TestDOMContentUtils {
   + "Tabs are spaces too. This is a break -> and the line after break 
. "
   + "

[nutch] branch master updated (85f7bcb63 -> ed7b6615b)

2022-09-11 Thread snagel

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


from 85f7bcb63 Prepare for new development after release of 1.19 - bump 
version number (-> 1.20-SNAPSHOT)
 new 989c2ca8d NUTCH-2883 Provide means to run server and webapp as 
persistent services in Docker container
 new 0bda1bded NUTCH-2883 Provide means to run server and webapp as 
persistent services in Docker container - move ARG instructions into FROM block 
they're used in (duplicate if   necessary)
 new 7c1a48cfa NUTCH-2883 Provide means to run server and webapp as 
persistent services in Docker container - install Nutch WebApp from separate 
repository (see NUTCH-2886) and run   it via `mvn jetty:run 
-Djetty.port= - sync log paths in supervisord config files
 new ed7b6615b Merge pull request #748 from 
sebastian-nagel/NUTCH-2883-docker

The 3338 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../.dockerfilelintrc  |  4 +-
 docker/Dockerfile  | 88 --
 docker/README.md   | 71 ++---
 docker/config/supervisord_startserver.conf | 47 
 docker/config/supervisord_startserver_webapp.conf  | 69 +
 5 files changed, 263 insertions(+), 16 deletions(-)
 copy conf/domain-urlfilter.txt.template => docker/.dockerfilelintrc (94%)
 create mode 100644 docker/config/supervisord_startserver.conf
 create mode 100644 docker/config/supervisord_startserver_webapp.conf



svn commit: r56776 - /release/nutch/1.18/

2022-09-10 Thread snagel
Author: snagel
Date: Sat Sep 10 13:19:52 2022
New Revision: 56776

Log:
Remove 1.18 after release of 1.19

Removed:
release/nutch/1.18/



[nutch] 02/02: Prepare for new development after release of 1.19 - bump version number (-> 1.20-SNAPSHOT)

2022-09-08 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 85f7bcb63ee801bdfb0b41ca2555160583105ea2
Author: Sebastian Nagel 
AuthorDate: Thu Sep 8 16:28:27 2022 +0200

Prepare for new development after release of 1.19
- bump version number (-> 1.20-SNAPSHOT)
---
 conf/nutch-default.xml | 2 +-
 default.properties | 2 +-
 src/bin/nutch  | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index a908bdb16..d05503d23 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -184,7 +184,7 @@
 
 
   <name>http.agent.version</name>
-  <value>Nutch-1.19</value>
+  <value>Nutch-1.20-SNAPSHOT</value>
   <description>A version string to advertise in the User-Agent 
header.</description>
 
diff --git a/default.properties b/default.properties
index 38a070e26..df96199c1 100644
--- a/default.properties
+++ b/default.properties
@@ -14,7 +14,7 @@
 # limitations under the License.
 
 name=apache-nutch
-version=1.19
+version=1.20-SNAPSHOT
 final.name=${name}-${version}
 year=2022
 
diff --git a/src/bin/nutch b/src/bin/nutch
index 3359c7be1..5b999fa6f 100755
--- a/src/bin/nutch
+++ b/src/bin/nutch
@@ -61,7 +61,7 @@ done
 
 # if no args specified, show usage
 if [ $# = 0 ]; then
-  echo "nutch 1.19"
+  echo "nutch 1.20-SNAPSHOT"
   echo "Usage: nutch COMMAND [-Dproperty=value]... [command-specific args]..."
   echo "where COMMAND is one of:"
   echo "  readdbread / dump crawl db"



[nutch] branch master updated (ffe059892 -> 85f7bcb63)

2022-09-08 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


from ffe059892 NUTCH-2969 Javadoc: Javascript search is not working when 
built on JDK 11 - pass --no-module-directories to javadoc target when building 
on JDK 11 - remove obsolete condition to fail javadoc builds on JDK 7u25 and 
earlier
 new 27cf929b8 Nutch 1.19 release - update current year in API docs etc. - 
update version number - add changes / release notes - update links to Hadoop 
API docs
 new 85f7bcb63 Prepare for new development after release of 1.19 - bump 
version number (-> 1.20-SNAPSHOT)

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 CHANGES.txt| 110 -
 conf/nutch-default.xml |   2 +-
 default.properties |   9 ++--
 src/bin/nutch  |   2 +-
 4 files changed, 114 insertions(+), 9 deletions(-)



[nutch] 01/02: Nutch 1.19 release - update current year in API docs etc. - update version number - add changes / release notes - update links to Hadoop API docs

2022-09-08 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 27cf929b83ba86b896762dd4970e445069e514ae
Author: Sebastian Nagel 
AuthorDate: Mon Aug 22 15:57:41 2022 +0200

Nutch 1.19 release
- update current year in API docs etc.
- update version number
- add changes / release notes
- update links to Hadoop API docs
---
 CHANGES.txt| 110 -
 conf/nutch-default.xml |   2 +-
 default.properties |   9 ++--
 src/bin/nutch  |   2 +-
 4 files changed, 114 insertions(+), 9 deletions(-)

diff --git a/CHANGES.txt b/CHANGES.txt
index 822bd4acf..adea4478f 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,9 +1,117 @@
 # Nutch Change Log
 
+Nutch 1.19 Release 22/08/2022 (dd/mm/yyyy)
+Release Report: https://s.apache.org/lf6li
+
 Breaking Changes
 
-- the plugin parse-swf for parsing Shockwave/Adobe Flash conent was 
removed (NUTCH-2861)
+- Nutch is built on JDK 11 (NUTCH-2857)
+- the Nutch WebApp was moved to a separate repository (NUTCH-2886)
+  see https://github.com/apache/nutch-webapp
+  https://gitbox.apache.org/repos/asf?p=nutch-webapp.git
+- the plugin parse-swf for parsing Shockwave/Adobe Flash content was 
removed (NUTCH-2861)
+
+Sub-task
+
+[NUTCH-2819] - Move spotbugs "installation" directory to avoid that 
spotbugs is shipped in Nutch runtime
+[NUTCH-2846] - Fix various bugs spotted by NUTCH-2815
+[NUTCH-2850] - Method ignores exceptional return value
+[NUTCH-2851] - Random object created and used only once
+[NUTCH-2855] - Update org.elasticsearch.client
+
+Bug
+
+[NUTCH-2290] - Update licenses of bundled libraries
+[NUTCH-2512] - Nutch does not build under JDK9
+[NUTCH-2821] - Deduplicate licenses in LICENSE.txt file
+[NUTCH-2822] - Split the LICENSE.txt file into two files for source resp. 
binary releases
+[NUTCH-2831] - Elastic indexer does not support SSL
+[NUTCH-2843] - Duplicate declaration of dependencies in ivy.xml
+[NUTCH-2858] - urlnormalizer-protocol: URL port is lost during normalization
+[NUTCH-2862] - Do not include Ivy jar in source release package
+[NUTCH-2863] - Injector to parse command-line flags case-insensitive
+[NUTCH-2866] - MetaData.toString() should return "key=value ..."
+[NUTCH-2868] - urlnormalizer-protocol fails with StringIndexOutOfBoundsException when reading invalid line in configuration file
+[NUTCH-2881] - bug in 'nutch' symlink in docker container
+[NUTCH-2889] - nutch indexer-elasticsearch plugin, doesn't work with https protocol
+[NUTCH-2890] - Protocol-okhttp: upgrade okhttp to 4.9.1 to address infinite connection retries
+[NUTCH-2894] - Java plugin compilation classpath: priorize plugin dependencies
+[NUTCH-2899] - Remove needless warning about missing o/a/rat/anttasks/antlib.xml
+[NUTCH-2902] - Jexl parsing error on statements
+[NUTCH-2905] - Mask sensitive strings in log output of index writers
+[NUTCH-2910] - FetchItemQueues overloaded constructor also interprets fetcher timeout as -1 e.g. no-timeout.
+[NUTCH-2915] - Upgrade to log4j 2.15.0
+[NUTCH-2916] - Fix log file rotation / rename default log file
+[NUTCH-2917] - Remove transitive dependency to log4j 1.x
+[NUTCH-2922] - Upgrade to log4j 2.17.0
+[NUTCH-2935] - DeduplicationJob: failure on URLs with invalid percent encoding
+[NUTCH-2936] - Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used
+[NUTCH-2945] - Solr Index Writer pluging schema.xml missing a copyToField
+[NUTCH-2947] - Fetcher: keep state of empty fetch queues unless queue feeder is finished
+[NUTCH-2949] - Tasks of a multi-threaded map runner may fail because of slow creation of URL stream handlers
+[NUTCH-2951] - Crawl datum with metadata WRITABLE_GENERATE_TIME_KEY awaits fetching forever
+[NUTCH-2955] - indexer-solr: replace deprecated/removed field type solr.LatLonType
+[NUTCH-2969] - Javadoc: Javascript search is not working when built on JDK 11
+
+New Feature
+
+[NUTCH-2901] - migrate to maven or gradle
+
+Improvement
+
+[NUTCH-1403] - Add default ScoringFilter for manipulating metadata
+[NUTCH-2429] - Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers
+[NUTCH-2449] - Usage of Tika LanguageIdentifier in language-identifier plugin
+[NUTCH-2573] - Suspend crawling if robots.txt fails to fetch with 5xx status
+[NUTCH-2795] - CrawlDbReader: compress CrawlDb dumps if configured
+[NUTCH-2807] - SitemapProcessor to warn that ignoring robots.txt affects detection of sitemaps
+[NUTCH-2808] - Document side effects of ignoring robots.txt
+[NUTCH-2840] - Fix 'report-vulnerabilities' ant target in b

[nutch-site] 02/02: Announce release of Nutch 1.19 - fix release data in announcement

2022-09-08 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/nutch-site.git

commit aa45c17bf678c601f4f691dfbdca77380aea5edd
Author: Sebastian Nagel 
AuthorDate: Thu Sep 8 15:25:32 2022 +0200

Announce release of Nutch 1.19
- fix release data in announcement
---
 content/news/nutch-1.19-release.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/news/nutch-1.19-release.md b/content/news/nutch-1.19-release.md
index 8a1d135..774345b 100644
--- a/content/news/nutch-1.19-release.md
+++ b/content/news/nutch-1.19-release.md
@@ -1,5 +1,5 @@
 +++
-date = "2021-09-08"
+date = "2022-08-22"
 title = "Nutch 1.19 Release"
 tags = ["1.19","release"]
 categories = ["releases"]



[nutch-site] branch main updated (4efc5a9 -> aa45c17)

2022-09-08 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch main
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


from 4efc5a9  NUTCH-1999 Add /robots.txt to Nutch site (#1)
 new 73e90d4  Announce release of Nutch 1.19
 new aa45c17  Announce release of Nutch 1.19 - fix release data in announcement

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 content/doap.rdf   | 7 +
 .../javadoc/apidocs/allclasses-frame.html  |   549 -
 .../javadoc/apidocs/allclasses-index.html  |  2845 +
 .../{allclasses-noframe.html => allclasses.html}   |83 +-
 .../javadoc/apidocs/allpackages-index.html |   826 ++
 .../javadoc/apidocs/constant-values.html   |  2415 ++--
 .../javadoc/apidocs/deprecated-list.html   |   155 +-
 .../javadoc/apidocs/{package-list => element-list} |18 +-
 .../documentation/javadoc/apidocs/help-doc.html|   169 +-
 .../documentation/javadoc/apidocs/index-all.html   |  7425 ++--
 content/documentation/javadoc/apidocs/index.html   |   885 +-
 .../apidocs/jquery/external/jquery/jquery.js   | 10872 ++
 .../jquery/images/ui-bg_glass_55_fbf9ee_1x400.png  |   Bin 0 -> 335 bytes
 .../jquery/images/ui-bg_glass_65_dadada_1x400.png  |   Bin 0 -> 262 bytes
 .../jquery/images/ui-bg_glass_75_dadada_1x400.png  |   Bin 0 -> 262 bytes
 .../jquery/images/ui-bg_glass_75_e6e6e6_1x400.png  |   Bin 0 -> 262 bytes
 .../jquery/images/ui-bg_glass_95_fef1ec_1x400.png  |   Bin 0 -> 332 bytes
 .../ui-bg_highlight-soft_75_cccccc_1x100.png   |   Bin 0 -> 280 bytes
 .../jquery/images/ui-icons_222222_256x240.png  |   Bin 0 -> 6922 bytes
 .../jquery/images/ui-icons_2e83ff_256x240.png  |   Bin 0 -> 4549 bytes
 .../jquery/images/ui-icons_454545_256x240.png  |   Bin 0 -> 6992 bytes
 .../jquery/images/ui-icons_888888_256x240.png  |   Bin 0 -> 6999 bytes
 .../jquery/images/ui-icons_cd0a0a_256x240.png  |   Bin 0 -> 4549 bytes
 .../javadoc/apidocs/jquery/jquery-3.5.1.js | 10872 ++
 .../javadoc/apidocs/jquery/jquery-ui.css   |   582 +
 .../javadoc/apidocs/jquery/jquery-ui.js|  2659 +
 .../javadoc/apidocs/jquery/jquery-ui.min.css   | 7 +
 .../javadoc/apidocs/jquery/jquery-ui.min.js| 6 +
 .../javadoc/apidocs/jquery/jquery-ui.structure.css |   156 +
 .../apidocs/jquery/jquery-ui.structure.min.css | 5 +
 .../jquery/jszip-utils/dist/jszip-utils-ie.js  |56 +
 .../jquery/jszip-utils/dist/jszip-utils-ie.min.js  |10 +
 .../apidocs/jquery/jszip-utils/dist/jszip-utils.js |   118 +
 .../jquery/jszip-utils/dist/jszip-utils.min.js |10 +
 .../javadoc/apidocs/jquery/jszip/dist/jszip.js | 11370 +++
 .../javadoc/apidocs/jquery/jszip/dist/jszip.min.js |13 +
 .../javadoc/apidocs/member-search-index.js | 1 +
 .../javadoc/apidocs/member-search-index.zip|   Bin 0 -> 40331 bytes
 .../nutch/analysis/lang/HTMLLanguageParser.html|   208 +-
 .../analysis/lang/LanguageIndexingFilter.html  |   211 +-
 .../lang/class-use/HTMLLanguageParser.html |94 +-
 .../lang/class-use/LanguageIndexingFilter.html |94 +-
 .../apache/nutch/analysis/lang/package-frame.html  |21 -
 .../nutch/analysis/lang/package-summary.html   |   116 +-
 .../apache/nutch/analysis/lang/package-tree.html   |98 +-
 .../apache/nutch/analysis/lang/package-use.html|90 +-
 .../apache/nutch/any23/Any23IndexingFilter.html|   257 +-
 .../org/apache/nutch/any23/Any23ParseFilter.html   |   266 +-
 .../nutch/any23/class-use/Any23IndexingFilter.html |94 +-
 .../nutch/any23/class-use/Any23ParseFilter.html|94 +-
 .../org/apache/nutch/any23/package-frame.html  |21 -
 .../org/apache/nutch/any23/package-summary.html|   124 +-
 .../org/apache/nutch/any23/package-tree.html   |98 +-
 .../org/apache/nutch/any23/package-use.html|90 +-
 .../apache/nutch/collection/CollectionManager.html |   272 +-
 .../org/apache/nutch/collection/Subcollection.html |   417 +-
 .../collection/class-use/CollectionManager.html|   121 +-
 .../nutch/collection/class-use/Subcollection.html  |   142 +-
 .../org/apache/nutch/collection/package-frame.html |21 -
 .../apache/nutch/collection/package-summary.html   |   170 +-
 .../org/apache/nutch/collection/package-tree.html  |   100 +-
 .../org/apache/nutch/collection/package-use.html   |   114 +-
 .../apache/nutch/crawl/AbstractFetchSchedule.html  |   321 +-
 .../apache/nutch/crawl/AdaptiveFetchSchedule.html  |   248 +-
 .../apache/nutch/crawl/CrawlDatum.Comparator.html  |   167 +-
 .../apidocs/

[nutch-site] branch asf-site updated: Announce release of Nutch 1.19 - fix release data in announcement

2022-09-08 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 956a142  Announce release of Nutch 1.19 - fix release data in announcement
956a142 is described below

commit 956a1425b97c07e2e7469296d810d28f70667a50
Author: Sebastian Nagel 
AuthorDate: Thu Sep 8 15:26:59 2022 +0200

Announce release of Nutch 1.19
- fix release data in announcement
---
 content/categories/index.xml   |  4 ++--
 content/categories/releases/index.xml  |  4 ++--
 content/index.xml  |  4 ++--
 content/news/index.xml |  4 ++--
 content/news/nutch-1.19-release/index.html |  4 ++--
 content/sitemap.xml| 16 
 content/tags/1.19/index.xml|  4 ++--
 content/tags/index.xml |  6 +++---
 content/tags/release/index.xml |  4 ++--
 9 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/content/categories/index.xml b/content/categories/index.xml
index cb64e1f..9f537a6 100644
--- a/content/categories/index.xml
+++ b/content/categories/index.xml
@@ -6,11 +6,11 @@
 Recent content in Categories on Apache Nutch™
 Hugo -- gohugo.io
 en-us
-Wed, 08 Sep 2021 00:00:00 +0000
+Mon, 22 Aug 2022 00:00:00 +0000
 
   releases
   /categories/releases/
-  Wed, 08 Sep 2021 00:00:00 +0000
+  Mon, 22 Aug 2022 00:00:00 +0000
   
   /categories/releases/
   
diff --git a/content/categories/releases/index.xml b/content/categories/releases/index.xml
index 5ea4121..80e9797 100644
--- a/content/categories/releases/index.xml
+++ b/content/categories/releases/index.xml
@@ -6,11 +6,11 @@
 Recent content in releases on Apache Nutch™
 Hugo -- gohugo.io
 en-us
-Wed, 08 Sep 2021 00:00:00 +0000
+Mon, 22 Aug 2022 00:00:00 +0000
 
   Nutch 1.19 Release
   /news/nutch-1.19-release/
-  Wed, 08 Sep 2021 00:00:00 +0000
+  Mon, 22 Aug 2022 00:00:00 +0000
   
   /news/nutch-1.19-release/
   The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.19, we advise all current users and developers of the 1.X series to upgrade to this release.
diff --git a/content/index.xml b/content/index.xml
index 8a0e2c4..f2fd700 100644
--- a/content/index.xml
+++ b/content/index.xml
@@ -6,11 +6,11 @@
 Recent content on Apache Nutch™
 Hugo -- gohugo.io
 en-us
-Wed, 08 Sep 2021 00:00:00 +0000
+Mon, 22 Aug 2022 00:00:00 +0000
 
   Nutch 1.19 Release
   /news/nutch-1.19-release/
-  Wed, 08 Sep 2021 00:00:00 +0000
+  Mon, 22 Aug 2022 00:00:00 +0000
   
   /news/nutch-1.19-release/
   The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.19, we advise all current users and developers of the 1.X series to upgrade to this release.
diff --git a/content/news/index.xml b/content/news/index.xml
index ff055e8..d129c73 100644
--- a/content/news/index.xml
+++ b/content/news/index.xml
@@ -6,11 +6,11 @@
 Recent content in Project News on Apache Nutch™
 Hugo -- gohugo.io
 en-us
-Wed, 08 Sep 2021 00:00:00 +0000
+Mon, 22 Aug 2022 00:00:00 +0000
 
   Nutch 1.19 Release
   /news/nutch-1.19-release/
-  Wed, 08 Sep 2021 00:00:00 +0000
+  Mon, 22 Aug 2022 00:00:00 +0000
   
   /news/nutch-1.19-release/
   The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.19, we advise all current users and developers of the 1.X series to upgrade to this release.
diff --git a/content/news/nutch-1.19-release/index.html b/content/news/nutch-1.19-release/index.html
index 10c420b..fb55985 100644
--- a/content/news/nutch-1.19-release/index.html
+++ b/content/news/nutch-1.19-release/index.html
@@ -25,8 +25,8 @@
 
 
 
-
-
+
+
 
 
   
diff --git a/content/sitemap.xml b/content/sitemap.xml
index e3fa139..5570667 100644
--- a/content/sitemap.xml
+++ b/content/sitemap.xml
@@ -5,7 +5,7 @@
   
 
   https://nutch.apache.org/news/nutch-1.19-release/
-  2021-09-08T00:00:00+00:00
+  2022-08-22T00:00:00+00:00
 
 
   https://nutch.apache.org/news/nutch-1.18-release/
@@ -65,31 +65,31 @@
 
 
   https://nutch.apache.org/tags/1.19/
-  2021-09-08T00:00:00+00:00
+  2022-08-22T00:00:00+00:00
 
 
   https://nutch.apache.org/
-  2021-09-08T00:00:00+00:00
+  2022-08-22T00:00:00+00:00
 
 
   https://nutch.apache.org/categories/
-  2021-09-08T00:00:00+00:00
+  2022-08-22T00:00:00+00:00
 
 
   https://nutch.apache.org/news/
-  2021-09-08T00:00:00+00:00
+  2022-08-22T00:00:00+00:00
 
 
   https://nutch.apache.org/tags/release/
-  2021-09-08T00:00:00+00:00
+  2022-08-22T00:00:00+00:00

[nutch-site] branch asf-staging updated: Announce release of Nutch 1.19 - fix release data in announcement

2022-09-08 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


The following commit(s) were added to refs/heads/asf-staging by this push:
 new 87176ac  Announce release of Nutch 1.19 - fix release data in announcement
87176ac is described below

commit 87176ac53fa8fc604abebf23fabf4ad5a77bfd6b
Author: Sebastian Nagel 
AuthorDate: Thu Sep 8 15:26:59 2022 +0200

Announce release of Nutch 1.19
- fix release data in announcement
---
 content/categories/index.xml   |  4 ++--
 content/categories/releases/index.xml  |  4 ++--
 content/index.xml  |  4 ++--
 content/news/index.xml |  4 ++--
 content/news/nutch-1.19-release/index.html |  4 ++--
 content/sitemap.xml| 16 
 content/tags/1.19/index.xml|  4 ++--
 content/tags/index.xml |  6 +++---
 content/tags/release/index.xml |  4 ++--
 9 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/content/categories/index.xml b/content/categories/index.xml
index cb64e1f..9f537a6 100644
--- a/content/categories/index.xml
+++ b/content/categories/index.xml
@@ -6,11 +6,11 @@
 Recent content in Categories on Apache Nutch™
 Hugo -- gohugo.io
 en-us
-Wed, 08 Sep 2021 00:00:00 +0000
+Mon, 22 Aug 2022 00:00:00 +0000
 
   releases
   /categories/releases/
-  Wed, 08 Sep 2021 00:00:00 +0000
+  Mon, 22 Aug 2022 00:00:00 +0000
   
   /categories/releases/
   
diff --git a/content/categories/releases/index.xml b/content/categories/releases/index.xml
index 5ea4121..80e9797 100644
--- a/content/categories/releases/index.xml
+++ b/content/categories/releases/index.xml
@@ -6,11 +6,11 @@
 Recent content in releases on Apache Nutch™
 Hugo -- gohugo.io
 en-us
-Wed, 08 Sep 2021 00:00:00 +0000
+Mon, 22 Aug 2022 00:00:00 +0000
 
   Nutch 1.19 Release
   /news/nutch-1.19-release/
-  Wed, 08 Sep 2021 00:00:00 +0000
+  Mon, 22 Aug 2022 00:00:00 +0000
   
   /news/nutch-1.19-release/
   The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.19, we advise all current users and developers of the 1.X series to upgrade to this release.
diff --git a/content/index.xml b/content/index.xml
index 8a0e2c4..f2fd700 100644
--- a/content/index.xml
+++ b/content/index.xml
@@ -6,11 +6,11 @@
 Recent content on Apache Nutch™
 Hugo -- gohugo.io
 en-us
-Wed, 08 Sep 2021 00:00:00 +0000
+Mon, 22 Aug 2022 00:00:00 +0000
 
   Nutch 1.19 Release
   /news/nutch-1.19-release/
-  Wed, 08 Sep 2021 00:00:00 +0000
+  Mon, 22 Aug 2022 00:00:00 +0000
   
   /news/nutch-1.19-release/
   The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.19, we advise all current users and developers of the 1.X series to upgrade to this release.
diff --git a/content/news/index.xml b/content/news/index.xml
index ff055e8..d129c73 100644
--- a/content/news/index.xml
+++ b/content/news/index.xml
@@ -6,11 +6,11 @@
 Recent content in Project News on Apache Nutch™
 Hugo -- gohugo.io
 en-us
-Wed, 08 Sep 2021 00:00:00 +0000
+Mon, 22 Aug 2022 00:00:00 +0000
 
   Nutch 1.19 Release
   /news/nutch-1.19-release/
-  Wed, 08 Sep 2021 00:00:00 +0000
+  Mon, 22 Aug 2022 00:00:00 +0000
   
   /news/nutch-1.19-release/
   The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.19, we advise all current users and developers of the 1.X series to upgrade to this release.
diff --git a/content/news/nutch-1.19-release/index.html b/content/news/nutch-1.19-release/index.html
index 10c420b..fb55985 100644
--- a/content/news/nutch-1.19-release/index.html
+++ b/content/news/nutch-1.19-release/index.html
@@ -25,8 +25,8 @@
 
 
 
-
-
+
+
 
 
   
diff --git a/content/sitemap.xml b/content/sitemap.xml
index e3fa139..5570667 100644
--- a/content/sitemap.xml
+++ b/content/sitemap.xml
@@ -5,7 +5,7 @@
   
 
   https://nutch.apache.org/news/nutch-1.19-release/
-  2021-09-08T00:00:00+00:00
+  2022-08-22T00:00:00+00:00
 
 
   https://nutch.apache.org/news/nutch-1.18-release/
@@ -65,31 +65,31 @@
 
 
   https://nutch.apache.org/tags/1.19/
-  2021-09-08T00:00:00+00:00
+  2022-08-22T00:00:00+00:00
 
 
   https://nutch.apache.org/
-  2021-09-08T00:00:00+00:00
+  2022-08-22T00:00:00+00:00
 
 
   https://nutch.apache.org/categories/
-  2021-09-08T00:00:00+00:00
+  2022-08-22T00:00:00+00:00
 
 
   https://nutch.apache.org/news/
-  2021-09-08T00:00:00+00:00
+  2022-08-22T00:00:00+00:00
 
 
   https://nutch.apache.org/tags/release/
-  2021-09-08T00:00:00+00:00
+  2022-08-22T00:00:00+00

[nutch-site] branch asf-site updated (a41c7ef -> 314b1b2)

2022-09-08 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch asf-site
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


from a41c7ef  Add doap.rdf (lost during CMS migration)
 new 1e7bf4e  - add README for branch asf-site - modify .asf.yaml to contain only instructions required in branch asf-site
 new 45468fc  Update content from Hugo build after adding Kube modified templates
 new 314b1b2  Announce release of Nutch 1.19

The 3 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .asf.yaml  |17 +-
 README.md  | 7 +
 content/apache/index.html  |11 +-
 content/categories/index.html  |16 +-
 content/categories/index.xml   | 4 +-
 content/categories/news/index.html |16 +-
 content/categories/releases/index.html |25 +-
 content/categories/releases/index.xml  |13 +-
 content/community/board-reporting/index.html   |11 +-
 content/community/bot/index.html   |11 +-
 content/community/contributing/index.html  |11 +-
 content/community/index.html   |14 +-
 content/community/index.xml| 4 +-
 content/community/mailing-lists/index.html |11 +-
 content/community/merchandise/index.html   |11 +-
 content/community/people-credits/index.html|11 +-
 content/development/index.html |14 +-
 content/development/index.xml  | 4 +-
 content/development/issue-tracker/index.html   |11 +-
 content/development/nightly-builds/index.html  |11 +-
 .../development/source-code-management/index.html  |11 +-
 content/doap.rdf   | 9 +-
 content/documentation/about/index.html |11 +-
 content/documentation/faqs/index.html  |11 +-
 content/documentation/index.html   |14 +-
 content/documentation/index.xml| 4 +-
 .../javadoc/apidocs/allclasses-frame.html  |   549 -
 .../javadoc/apidocs/allclasses-index.html  |  2845 +
 .../{allclasses-noframe.html => allclasses.html}   |83 +-
 .../javadoc/apidocs/allpackages-index.html |   826 ++
 .../javadoc/apidocs/constant-values.html   |  2415 ++--
 .../javadoc/apidocs/deprecated-list.html   |   155 +-
 .../javadoc/apidocs/{package-list => element-list} |18 +-
 .../documentation/javadoc/apidocs/help-doc.html|   169 +-
 .../documentation/javadoc/apidocs/index-all.html   |  7425 ++--
 content/documentation/javadoc/apidocs/index.html   |   885 +-
 .../apidocs/jquery/external/jquery/jquery.js   | 10872 ++
 .../jquery/images/ui-bg_glass_55_fbf9ee_1x400.png  |   Bin 0 -> 335 bytes
 .../jquery/images/ui-bg_glass_65_dadada_1x400.png  |   Bin 0 -> 262 bytes
 .../jquery/images/ui-bg_glass_75_dadada_1x400.png  |   Bin 0 -> 262 bytes
 .../jquery/images/ui-bg_glass_75_e6e6e6_1x400.png  |   Bin 0 -> 262 bytes
 .../jquery/images/ui-bg_glass_95_fef1ec_1x400.png  |   Bin 0 -> 332 bytes
 .../ui-bg_highlight-soft_75_cccccc_1x100.png   |   Bin 0 -> 280 bytes
 .../jquery/images/ui-icons_222222_256x240.png  |   Bin 0 -> 6922 bytes
 .../jquery/images/ui-icons_2e83ff_256x240.png  |   Bin 0 -> 4549 bytes
 .../jquery/images/ui-icons_454545_256x240.png  |   Bin 0 -> 6992 bytes
 .../jquery/images/ui-icons_888888_256x240.png  |   Bin 0 -> 6999 bytes
 .../jquery/images/ui-icons_cd0a0a_256x240.png  |   Bin 0 -> 4549 bytes
 .../javadoc/apidocs/jquery/jquery-3.5.1.js | 10872 ++
 .../javadoc/apidocs/jquery/jquery-ui.css   |   582 +
 .../javadoc/apidocs/jquery/jquery-ui.js|  2659 +
 .../javadoc/apidocs/jquery/jquery-ui.min.css   | 7 +
 .../javadoc/apidocs/jquery/jquery-ui.min.js| 6 +
 .../javadoc/apidocs/jquery/jquery-ui.structure.css |   156 +
 .../apidocs/jquery/jquery-ui.structure.min.css | 5 +
 .../jquery/jszip-utils/dist/jszip-utils-ie.js  |56 +
 .../jquery/jszip-utils/dist/jszip-utils-ie.min.js  |10 +
 .../apidocs/jquery/jszip-utils/dist/jszip-utils.js |   118 +
 .../jquery/jszip-utils/dist/jszip-utils.min.js |10 +
 .../javadoc/apidocs/jquery/jszip/dist/jszip.js | 11370 +++
 .../javadoc/apidocs/jquery/jszip/dist/jszip.min.js |13 +
 .../javadoc/apidocs/member-search-index.js | 1 +
 .../javadoc/apidocs/member-search-index.zip|   Bin 0 -> 40331 byt

[nutch-site] 02/03: Update content from Hugo build after adding Kube modified templates

2022-09-08 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/nutch-site.git

commit 45468fc2c2e83cfe1aef57f437ca02991c0256b3
Author: Sebastian Nagel 
AuthorDate: Thu Sep 8 10:54:37 2022 +0200

Update content from Hugo build after adding Kube modified templates
---
 content/apache/index.html  |  11 ++--
 content/categories/index.html  |  16 ++---
 content/categories/news/index.html |  16 ++---
 content/categories/releases/index.html |  16 ++---
 content/community/board-reporting/index.html   |  11 ++--
 content/community/bot/index.html   |  11 ++--
 content/community/contributing/index.html  |  11 ++--
 content/community/index.html   |  14 ++---
 content/community/index.xml|   4 +-
 content/community/mailing-lists/index.html |  11 ++--
 content/community/merchandise/index.html   |  11 ++--
 content/community/people-credits/index.html|  11 ++--
 content/development/index.html |  14 ++---
 content/development/index.xml  |   4 +-
 content/development/issue-tracker/index.html   |  11 ++--
 content/development/nightly-builds/index.html  |  11 ++--
 .../development/source-code-management/index.html  |  11 ++--
 content/doap.rdf   |   2 +-
 content/documentation/about/index.html |  11 ++--
 content/documentation/faqs/index.html  |  11 ++--
 content/documentation/index.html   |  14 ++---
 content/documentation/index.xml|   4 +-
 content/documentation/javadoc/index.html   |  15 ++---
 content/documentation/tutorials/index.html |  11 ++--
 content/documentation/wiki/index.html  |  11 ++--
 content/download/index.html|  27 
 content/favicon.ico| Bin 0 -> 894 bytes
 content/img/{kube => }/plug.svg|   0
 content/img/{kube => }/plus-square.svg |   0
 content/index.html |  70 ++---
 content/index.xml  |   3 +-
 content/news/index.html|  16 ++---
 content/news/index.xml |   4 +-
 content/news/legacy-nutch-news/index.html  |  11 ++--
 content/news/nutch-1.18-release/index.html |  11 ++--
 content/tags/1.18/index.html   |  16 ++---
 content/tags/index.html|  16 ++---
 content/tags/legacy/index.html |  16 ++---
 content/tags/news/index.html   |  16 ++---
 content/tags/release/index.html|  16 ++---
 40 files changed, 257 insertions(+), 238 deletions(-)

diff --git a/content/apache/index.html b/content/apache/index.html
index adaf256..3d84f47 100644
--- a/content/apache/index.html
+++ b/content/apache/index.html
@@ -2,11 +2,11 @@
 
 
 
-  
+  
   
   
   
-   Apache 
+   Apache Nutch™ – Apache 
 
   

@@ -38,9 +38,11 @@
   
   
   
+  
+
 
 
-  
 
 
   
@@ -117,8 +119,7 @@
 

 
-
- 2004-2021 The Apache Software Foundation. Built using the https://github.com/jeblister/kube; target="_blank" rel="noopener 
noreferrer">kube Theme for Hugo. Apache Nutch, Nutch, Apache, the Apache 
feather logo, and the Apache Nutch project logo are trademarks of The Apache 
Software Foundation.
+ 2004-2022 The Apache Software Foundation. Built using the https://github.com/jeblister/kube; target="_blank" rel="noopener 
noreferrer">kube Theme for Hugo. Apache Nutch, Nutch, Apache, the Apache 
feather logo, and the Apache Nutch project logo are trademarks of The Apache 
Software Foundation.

 
 
diff --git a/content/categories/index.html b/content/categories/index.html
index fb07b86..ecec8b1 100644
--- a/content/categories/index.html
+++ b/content/categories/index.html
@@ -2,11 +2,11 @@
 
 
 
-  
+  
   
   
   
-   Categories 
+  Apache Nutch™ – Categories
 
   
   
@@ -37,10 +37,11 @@
   
   
   
-  
+
+
 
 
-  
 
 
   
@@ -99,8 +100,8 @@
   
 
 
-  Project News
-News, activity, ideas, and whatever feels important. https://twitter.com/@ApacheNutch;>Follow us on Twitter
+  Categories
+  
 
 
 
@@ -135,8 +136,7 @@
 

 
-
- 2004-2021 The Apache Software Foundation. Built using the https://github.com/jeblister/kube; target="_blank" rel="noopener 
noreferrer">kube Theme for Hugo. Apache Nutch, Nutch, Apache, the Apache 
feather logo, and the Apache Nutch project logo are trademarks of The Apache 
Software Foundation.
+ 2004-2022 The Apache Software Foundation. Built using the https://github.com/jeblister/kube; target=&qu

[nutch-site] 01/03: - add README for branch asf-site - modify .asf.yaml to contain only instructions required in branch asf-site

2022-09-08 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/nutch-site.git

commit 1e7bf4e9e7c2f5450444623847476d1b73d7b773
Author: Sebastian Nagel 
AuthorDate: Thu Sep 8 14:59:33 2022 +0200

- add README for branch asf-site
- modify .asf.yaml to contain only instructions required in branch
  asf-site
---
 .asf.yaml | 17 ++---
 README.md |  7 +++
 2 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/.asf.yaml b/.asf.yaml
index 0cc84e6..2ae5ca7 100644
--- a/.asf.yaml
+++ b/.asf.yaml
@@ -15,20 +15,7 @@
 # specific language governing permissions and limitations
 # under the License.
 
-# https://cwiki.apache.org/confluence/display/INFRA/.asf.yaml+features+for+git+repositories
-
-github:
-  description: "Apache Nutch Website"
-  homepage: https://nutch.apache.org/
-  labels:
-- apache
-- nutch
-- hugo
-
-  enabled_merge_buttons:
-squash: true
-merge:  false
-rebase: false
+# https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features
 
 publish:
-  whoami: asf-site
\ No newline at end of file
+  whoami: asf-site
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..dd2eb49
--- /dev/null
+++ b/README.md
@@ -0,0 +1,7 @@
+Apache Nutch Website
+
+
+The `asf-site` branch is only used for storing the generated static website.
+From this branch, the Nutch website is being served.
+
+Please submit patch and pull requests on the `main` branch instead of the `asf-site` branch.
\ No newline at end of file



[nutch-site] branch asf-staging updated (3e9e725 -> 2cfe00d)

2022-09-08 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


 discard 3e9e725  Announce release of Nutch 1.19
 new 2cfe00d  Announce release of Nutch 1.19

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (3e9e725)
            \
             N -- N -- N   refs/heads/asf-staging (2cfe00d)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.
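
The force-push situation described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: both the old and the new branch are modelled as linear lists of commit ids (oldest first), and the function name `classify_force_push` and the sample ids are hypothetical, not part of git or of the ASF notification tooling.

```python
def classify_force_push(old_history, new_history):
    """Split two linear histories into (base, discarded, new).

    old_history / new_history: lists of commit ids, oldest first,
    for the branch before and after the forced update.
    """
    # Length of the longest common prefix of the two histories.
    shared = 0
    for old_id, new_id in zip(old_history, new_history):
        if old_id != new_id:
            break
        shared += 1

    # "B" in the diagram: last commit present in both histories.
    base = old_history[shared - 1] if shared else None
    discarded = old_history[shared:]   # the "O" revisions
    added = new_history[shared:]       # the "N" revisions
    return base, discarded, added


# Hypothetical ids mirroring the diagram: * -- * -- B -- O.. / N..
old = ["c1", "c2", "B", "O1", "O2", "O3"]
new = ["c1", "c2", "B", "N1", "N2", "N3"]
print(classify_force_push(old, new))
```

Real git histories form a graph rather than a list, so this only mirrors the simple case drawn in the diagram.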

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 content/categories/index.xml   | 4 +-
 content/categories/releases/index.html | 9 +
 content/categories/releases/index.xml  |13 +-
 content/doap.rdf   | 7 +
 .../javadoc/apidocs/allclasses-frame.html  |   549 -
 .../javadoc/apidocs/allclasses-index.html  |  2845 +
 .../{allclasses-noframe.html => allclasses.html}   |83 +-
 .../javadoc/apidocs/allpackages-index.html |   826 ++
 .../javadoc/apidocs/constant-values.html   |  2415 ++--
 .../javadoc/apidocs/deprecated-list.html   |   155 +-
 .../javadoc/apidocs/{package-list => element-list} |18 +-
 .../documentation/javadoc/apidocs/help-doc.html|   169 +-
 .../documentation/javadoc/apidocs/index-all.html   |  7425 ++--
 content/documentation/javadoc/apidocs/index.html   |   885 +-
 .../apidocs/jquery/external/jquery/jquery.js   | 10872 ++
 .../jquery/images/ui-bg_glass_55_fbf9ee_1x400.png  |   Bin 0 -> 335 bytes
 .../jquery/images/ui-bg_glass_65_dadada_1x400.png  |   Bin 0 -> 262 bytes
 .../jquery/images/ui-bg_glass_75_dadada_1x400.png  |   Bin 0 -> 262 bytes
 .../jquery/images/ui-bg_glass_75_e6e6e6_1x400.png  |   Bin 0 -> 262 bytes
 .../jquery/images/ui-bg_glass_95_fef1ec_1x400.png  |   Bin 0 -> 332 bytes
 .../ui-bg_highlight-soft_75_cccccc_1x100.png   |   Bin 0 -> 280 bytes
 .../jquery/images/ui-icons_222222_256x240.png  |   Bin 0 -> 6922 bytes
 .../jquery/images/ui-icons_2e83ff_256x240.png  |   Bin 0 -> 4549 bytes
 .../jquery/images/ui-icons_454545_256x240.png  |   Bin 0 -> 6992 bytes
 .../jquery/images/ui-icons_888888_256x240.png  |   Bin 0 -> 6999 bytes
 .../jquery/images/ui-icons_cd0a0a_256x240.png  |   Bin 0 -> 4549 bytes
 .../javadoc/apidocs/jquery/jquery-3.5.1.js | 10872 ++
 .../javadoc/apidocs/jquery/jquery-ui.css   |   582 +
 .../javadoc/apidocs/jquery/jquery-ui.js|  2659 +
 .../javadoc/apidocs/jquery/jquery-ui.min.css   | 7 +
 .../javadoc/apidocs/jquery/jquery-ui.min.js| 6 +
 .../javadoc/apidocs/jquery/jquery-ui.structure.css |   156 +
 .../apidocs/jquery/jquery-ui.structure.min.css | 5 +
 .../jquery/jszip-utils/dist/jszip-utils-ie.js  |56 +
 .../jquery/jszip-utils/dist/jszip-utils-ie.min.js  |10 +
 .../apidocs/jquery/jszip-utils/dist/jszip-utils.js |   118 +
 .../jquery/jszip-utils/dist/jszip-utils.min.js |10 +
 .../javadoc/apidocs/jquery/jszip/dist/jszip.js | 11370 +++
 .../javadoc/apidocs/jquery/jszip/dist/jszip.min.js |13 +
 .../javadoc/apidocs/member-search-index.js | 1 +
 .../javadoc/apidocs/member-search-index.zip|   Bin 0 -> 40331 bytes
 .../nutch/analysis/lang/HTMLLanguageParser.html|   208 +-
 .../analysis/lang/LanguageIndexingFilter.html  |   211 +-
 .../lang/class-use/HTMLLanguageParser.html |94 +-
 .../lang/class-use/LanguageIndexingFilter.html |94 +-
 .../apache/nutch/analysis/lang/package-frame.html  |21 -
 .../nutch/analysis/lang/package-summary.html   |   116 +-
 .../apache/nutch/analysis/lang/package-tree.html   |98 +-
 .../apache/nutch/analysis/lang/package-use.html|90 +-
 .../apache/nutch/any23/Any23IndexingFilter.html|   257 +-
 .../org/apache/nutch/any23/Any23ParseFilter.html   |   266 +-
 .../nutch/any23/class-use/Any23IndexingFilter.html |94 +-
 .../nutch/any23/class-use/Any23ParseFilter.html|94 +-
 .../org/apache/nutch/any23/package-frame.html  |21 -
 .../org/apache/nutch/any23/package-summary.html|   124 +-
 .../org/apache/

[nutch-site] branch asf-staging updated: Announce release of Nutch 1.19

2022-09-08 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


The following commit(s) were added to refs/heads/asf-staging by this push:
 new 3e9e725  Announce release of Nutch 1.19
3e9e725 is described below

commit 3e9e72539a4fbf9ea5dd3b5a6fafb03a6d0229ca
Author: Sebastian Nagel 
AuthorDate: Thu Sep 8 14:58:43 2022 +0200

Announce release of Nutch 1.19
---
 favicon.ico | Bin 894 -> 0 bytes
 1 file changed, 0 insertions(+), 0 deletions(-)

diff --git a/favicon.ico b/favicon.ico
deleted file mode 100644
index e07ca51..0000000
Binary files a/favicon.ico and /dev/null differ



svn commit: r56738 [1/3] - /release/nutch/1.19/CHANGES.txt

2022-09-08 Thread snagel
Author: snagel
Date: Thu Sep  8 12:44:33 2022
New Revision: 56738

Log:
Release Apache Nutch 1.19 - add change log

Added:
release/nutch/1.19/CHANGES.txt   (with props)



svn commit: r56738 [3/3] - /release/nutch/1.19/CHANGES.txt

2022-09-08 Thread snagel
Propchange: release/nutch/1.19/CHANGES.txt
--
svn:eol-style = native




svn commit: r56738 [2/3] - /release/nutch/1.19/CHANGES.txt

2022-09-08 Thread snagel
in REST workflow must be ingested into HDFS
+[NUTCH-2329] - Update Slf4j logging for Java 8 and upgrade miredot plugin version
+[NUTCH-2336] - SegmentReader to implement Tool
+[NUTCH-2352] - Log with Generic Class Name at Nutch 1.x
+[NUTCH-2355] - Protocol plugins to set cookie if Cookie metadata field is present
+[NUTCH-2367] - Get single record from HostDB
+
+New Feature
+
+[NUTCH-2132] - Publisher/Subscriber model for Nutch to emit events
+
+Task
+
+[NUTCH-2171] - Upgrade Nutch Trunk to Java 1.8
+
+ 
+Nutch 1.12 Release 28/05/2016 (dd/mm/yyyy)
+Release Report: https://s.apache.org/nutch1.12
+
+Comments
+
+Fellow committers, Nutch 1.12 contains a breaking change NUTCH-2220. Please 
use the note below and
+in the release announcement and keep it on top in this CHANGES.txt for the 
Nutch 1.12 release.
+
+* replace your old conf/nutch-default.xml with the conf/nutch-default.xml from 
Nutch 1.12 release
+* if you use LinkDB (e.g. invertlinks) and modified parameters db.max.inlinks 
and/or db.max.anchor.length
+  and/or db.ignore.internal.links, rename those parameters to 
linkdb.max.inlinks and
+  linkdb.max.anchor.length and linkdb.ignore.internal.links
+* db.ignore.internal.links and db.ignore.external.links now operate on the 
CrawlDB only
+* linkdb.ignore.internal.links and linkdb.ignore.external.links now operate on 
the LinkDB only
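The renaming described in the NUTCH-2220 note above can be sketched in Python. This is only an illustration, assuming the site configuration uses the usual Hadoop-style layout of `<property><name>…</name><value>…</value></property>` entries; the function name `rename_linkdb_properties` is hypothetical and not part of Nutch. Because `db.ignore.internal.links` still exists in 1.12 and now applies to the CrawlDB, renaming it is only appropriate when the old setting was meant for the LinkDB.

```python
import xml.etree.ElementTree as ET

# Old property names and their LinkDB replacements, as listed in the 1.12 note.
RENAMES = {
    "db.max.inlinks": "linkdb.max.inlinks",
    "db.max.anchor.length": "linkdb.max.anchor.length",
    "db.ignore.internal.links": "linkdb.ignore.internal.links",
}


def rename_linkdb_properties(xml_text):
    """Return (new_xml, renamed) where renamed lists (old, new) name pairs.

    Hypothetical helper: rewrites <name> elements whose text matches an old
    LinkDB-related property name and leaves every other property untouched.
    """
    root = ET.fromstring(xml_text)
    renamed = []
    for name in root.iter("name"):
        old = (name.text or "").strip()
        new = RENAMES.get(old)
        if new is not None:
            name.text = new
            renamed.append((old, new))
    return ET.tostring(root, encoding="unicode"), renamed


if __name__ == "__main__":
    sample = (
        "<configuration>"
        "<property><name>db.max.inlinks</name><value>100</value></property>"
        "</configuration>"
    )
    new_xml, renamed = rename_linkdb_properties(sample)
    print(renamed)
    print(new_xml)
```

Review the reported pairs before writing the result back, since the old `db.ignore.*` names remain valid for the CrawlDB.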
+
+Sub-task
+
+[NUTCH-2250] - CommonCrawlDumper : Invalid format + skipped parts
+
+Bug
+
+[NUTCH-2042] - parse-html increase chunk size used to detect charset
+[NUTCH-2180] - FileDumper dumps data, but breaks midway on corrupt segments
+[NUTCH-2189] - Domain filter must deactivate if no rules are present
+[NUTCH-2203] - Suffix URL filter can't handle trailing/leading whitespaces
+[NUTCH-2206] - Provide example scoring.similarity.stopword.file
+[NUTCH-2213] - CommonCrawlDataDumper saves gzipped body in extracted form
+[NUTCH-2223] - Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika 
mimetype detection
+[NUTCH-2224] - Average bytes/second calculated incorrectly in fetcher
+[NUTCH-2225] - Parsed time calculated incorrectly
+[NUTCH-2228] - Plugin index-replace unit test broken on Java 8
+[NUTCH-2232] - DeduplicationJob should decode URL's before length is 
compared
+[NUTCH-2241] - Unstable Selenium plugin in Nutch. Fixed bugs and enhanced 
configuration
+[NUTCH-2256] - Inconsistent log level practice
+
+Improvement
+
+[NUTCH-1233] - Rely on Tika for outlink extraction
+[NUTCH-1712] - Use MultipleInputs in Injector to make it a single 
mapreduce job
+[NUTCH-2172] - index-more: document format of contenttype-mapping.txt
+[NUTCH-2178] - DeduplicationJob to optionally group on host or domain
+[NUTCH-2182] - Make reverseUrlDirs file dumper option hash the URL for 
consistency
+[NUTCH-2183] - Improvement to SegmentChecker for skipping non-segments 
present in segments directory
+[NUTCH-2187] - Change FileDumper SHAs to all uppercase
+[NUTCH-2195] - IndexingFilterChecker to optionally follow N redirects
+[NUTCH-2196] - IndexingFilterChecker to optionally normalize
+[NUTCH-2197] - Add solr5 solrcloud indexer support
+[NUTCH-2204] - Remove junit lib from runtime
+[NUTCH-2218] - Switch CrawlCompletion arg parsing to Commons CLI
+[NUTCH-2221] - Introduce db.ignore.internal.links to FetcherThread
+[NUTCH-2229] - Allow Jexl expressions on CrawlDatum's fixed attributes
+[NUTCH-2231] - Jexl support in generator job
+[NUTCH-2252] - Allow phantomjs as a browser for selenium options
+[NUTCH-2263] - Support for mingram and maxgram at Unigram Cosine 
Similarity Model
+
+New Feature
+
+[NUTCH-961] - Expose Tika's boilerpipe support
+[NUTCH-1325] - HostDB for Nutch
+[NUTCH-2144] - Plugin to override db.ignore.external to exempt interesting 
external domain URLs
+[NUTCH-2190] - Protocol normalizer
+[NUTCH-2191] - Add protocol-htmlunit
+[NUTCH-2194] - Run IndexingFilterChecker as simple Telnet server
+[NUTCH-2219] - Criteria order to be configurable in DeduplicationJob
+[NUTCH-2227] - RegexParseFilter
+[NUTCH-2245] - Developed the NGram Model on the existing Unigram Cosine 
Similarity Model
+
+Task
+
+[NUTCH-2201] - Remove loops program from webgraph package
+[NUTCH-2211] - Filter and normalizer checkers missing in bin/nutch
+[NUTCH-2220] - Rename db.* options used only by the linkdb to linkdb.*
+
+Nutch 1.11 Release 03/12/2015 (dd/mm/yyyy)
+Release Report: http://s.apache.org/nutch11
+
+* NUTCH-2176 Clean up of log4j.properties (markus)
+
+* NUTCH-2107 plugin.xml to validate against plugin.dtd (snagel)
+
+* NUTCH-2177 Generator produces only one partition even in distributed mode 
(jnioche, snagel)
+
+* NUTCH-2158 Upgrade to Tika 1.11 (jnioche, snagel)
+
+* NUTCH-2175 Typos in property descriptions in nutch-default.xml (Roannel 
Fernández Hernández via snagel)
+
+* NUTCH-2069 Ignore external links base

[nutch-site] branch main updated: NUTCH-1999 Add /robots.txt to Nutch site (#1)

2022-09-08 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


The following commit(s) were added to refs/heads/main by this push:
 new 4efc5a9  NUTCH-1999 Add /robots.txt to Nutch site (#1)
4efc5a9 is described below

commit 4efc5a9aca57430549b44a30191de041224ab865
Author: Sebastian Nagel 
AuthorDate: Thu Sep 8 14:19:10 2022 +0200

NUTCH-1999 Add /robots.txt to Nutch site (#1)

- add robots.txt
- add template to generate sitemap
- include sitemap in robots.txt
---
 config.toml  |  2 ++
 content/robots.txt   |  4 
 layouts/_default/sitemap.xml | 10 ++
 3 files changed, 16 insertions(+)

diff --git a/config.toml b/config.toml
index cc8832a..a78ef2d 100644
--- a/config.toml
+++ b/config.toml
@@ -11,6 +11,7 @@ Paginate = 4
 unsafe = true # allow raw HTML in markdown content
 
 [Params]
+  siteBaseURL = "https://nutch.apache.org"
   RSSLink = "/index.xml"
   author = "Apache Nutch Project Management Committee"
   github = "https://github.com/apache/nutch"
@@ -41,3 +42,4 @@ unsafe = true # allow raw HTML in markdown content
 name = "Apache"
 weight = -100
 url = "/apache/"
+
diff --git a/content/robots.txt b/content/robots.txt
new file mode 100644
index 000..086e6ad
--- /dev/null
+++ b/content/robots.txt
@@ -0,0 +1,4 @@
+User-agent: *
+Allow: /
+
+Sitemap: https://nutch.apache.org/sitemap.xml
\ No newline at end of file
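As a quick check of what the robots.txt added above means for a crawler, the following sketch parses the same four lines with Python's standard `urllib.robotparser`. The helper name `check_robots` and the sample URL are assumptions made for illustration; `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Content of content/robots.txt as added in the diff above.
ROBOTS_TXT = """\
User-agent: *
Allow: /

Sitemap: https://nutch.apache.org/sitemap.xml
"""


def check_robots(robots_txt, url, agent="*"):
    """Return (allowed, sitemaps) for the given robots.txt text.

    Hypothetical helper: allowed is True if `agent` may fetch `url`,
    sitemaps is the list of Sitemap URLs (or None if there are none).
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url), parser.site_maps()


if __name__ == "__main__":
    allowed, sitemaps = check_robots(
        ROBOTS_TXT, "https://nutch.apache.org/download/"
    )
    print(allowed, sitemaps)
```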
diff --git a/layouts/_default/sitemap.xml b/layouts/_default/sitemap.xml
new file mode 100644
index 000..006e6ba
--- /dev/null
+++ b/layouts/_default/sitemap.xml
@@ -0,0 +1,10 @@
+{{ printf "" | 
safeHTML }}
+http://www.sitemaps.org/schemas/sitemap/0.9;
+  xmlns:xhtml="http://www.w3.org/1999/xhtml">
+  
+  {{ range .Data.Pages }}{{ if ne .Params.sitemapExclude true }}
+{{ $url := urls.Parse .Permalink }}
+  {{ .Site.Params.SiteBaseURL }}{{ $url.Path }}{{ if not 
.Lastmod.IsZero }}
+  {{ safeHTML ( .Lastmod.Format "2006-01-02T15:04:05-07:00" ) 
}}{{ end }}
+{{ end }}{{ end }}
+



[nutch-site] branch asf-staging updated: - add README for branch asf-staging - modify .asf.yaml to contain only instructions required in branch asf-staging

2022-09-08 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


The following commit(s) were added to refs/heads/asf-staging by this push:
 new b649bf6  - add README for branch asf-staging - modify .asf.yaml to 
contain only instructions required in branch   asf-staging
b649bf6 is described below

commit b649bf67d76f263402bdb05af69d38b2fc2d61cc
Author: Sebastian Nagel 
AuthorDate: Thu Sep 8 13:52:12 2022 +0200

- add README for branch asf-staging
- modify .asf.yaml to contain only instructions required in branch
  asf-staging
---
 .asf.yaml | 18 +-
 README.md |  8 
 2 files changed, 9 insertions(+), 17 deletions(-)

diff --git a/.asf.yaml b/.asf.yaml
index d845c31..3ef9b83 100644
--- a/.asf.yaml
+++ b/.asf.yaml
@@ -17,22 +17,6 @@
 
 # https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features
 
-github:
-  description: "Apache Nutch Website"
-  homepage: https://nutch.apache.org/
-  labels:
-- apache
-- nutch
-- hugo
-
-  enabled_merge_buttons:
-squash: true
-merge:  false
-rebase: false
-
 staging:
   profile: ~
-  whoami:  asf-staging
-
-publish:
-  whoami: asf-site
\ No newline at end of file
+  whoami: asf-staging
diff --git a/README.md b/README.md
new file mode 100644
index 000..e2c7c89
--- /dev/null
+++ b/README.md
@@ -0,0 +1,8 @@
+Apache Nutch Website
+
+
+The `asf-staging` branch is only used for storing the generated static website 
to preview proposed changes to the website.
+
+The preview site is at https://nutch.staged.apache.org
+
+Please submit patch and pull requests on the `main` branch instead of the 
`asf-staging` branch.



[nutch-site] branch NUTCH-1999-nutch-site-robots-txt updated (142489f -> f863c1f)

2022-09-08 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch NUTCH-1999-nutch-site-robots-txt
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


omit 142489f  NUTCH-1999 Add /robots.txt to Nutch site
 add 6e318e6  Add modified Kube theme - taken from commit a00af40 of 
https://github.com/jeblister/kube - with the following modifications   - 
modified layouts/index.html to include body from _index.md   - added section 
landing pages layouts/section/{community,development,documentation}.html   
- replace favicon by Nutch one (fix NUTCH-2928)   - added custom.css - update 
README
 add 6d78162  NUTCH-1999 Add /robots.txt to Nutch site
 add f863c1f  NUTCH-1999 Add /robots.txt to Nutch site - add template to 
generate sitemap - include sitemap in robots.txt

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (142489f)
\
 N -- N -- N   refs/heads/NUTCH-1999-nutch-site-robots-txt (f863c1f)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.
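The O and N revisions in the diagram above can be derived from the commit graph alone. The sketch below is an illustration in Python, not the notification script itself; the commit labels and the `parents` mapping are made up to mirror the diagram. It cannot tell "omit" from "discard", since that depends on whether other references still reach the old commits.

```python
def ancestors(tip, parents):
    """Return the set of commits reachable from tip, including tip itself."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return seen


def classify_forced_update(old_tip, new_tip, parents):
    """Split commits into those only on the old tip (O) and only on the new tip (N)."""
    old = ancestors(old_tip, parents)
    new = ancestors(new_tip, parents)
    return {"old_only": old - new, "new_only": new - old, "common": old & new}


if __name__ == "__main__":
    # Hypothetical graph shaped like the diagram: A -- B -- O1 -- O2 -- O3
    #                                                  \
    #                                                   N1 -- N2 -- N3
    parents = {
        "B": ["A"],
        "O1": ["B"], "O2": ["O1"], "O3": ["O2"],
        "N1": ["B"], "N2": ["N1"], "N3": ["N2"],
    }
    result = classify_forced_update("O3", "N3", parents)
    print(sorted(result["old_only"]))
    print(sorted(result["new_only"]))
```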

No new revisions were added by this update.

Summary of changes:
 .asf.yaml  |6 +-
 README.md  |   16 +-
 config.toml|8 +-
 content/_index.md  |   34 +-
 content/community/_index.md|3 +
 content/development/_index.md  |3 +
 content/doap.rdf   |2 +-
 content/documentation/_index.md|3 +
 content/download.md|8 +-
 content/favicon.ico|  Bin 0 -> 894 bytes
 content/news/_index.md |4 +
 content/robots.txt |2 +
 layouts/_default/sitemap.xml   |   10 +
 static/img/plug.svg|3 +
 static/img/plus-square.svg |1 +
 themes/kube|1 -
 themes/kube/LICENSE.md |   20 +
 themes/kube/README.md  |  169 ++
 {archetypes => themes/kube/archetypes}/blog.md |0
 {archetypes => themes/kube/archetypes}/docs.md |0
 themes/kube/images/docs.png|  Bin 0 -> 87844 bytes
 themes/kube/images/faq.png |  Bin 0 -> 130413 bytes
 themes/kube/images/list-docs.png   |  Bin 0 -> 69093 bytes
 themes/kube/images/post.png|  Bin 0 -> 73445 bytes
 themes/kube/images/screenshot.png  |  Bin 0 -> 71302 bytes
 themes/kube/images/signin.png  |  Bin 0 -> 39120 bytes
 themes/kube/images/tn.png  |  Bin 0 -> 37334 bytes
 .gitignore => themes/kube/layouts/404.html |0
 themes/kube/layouts/_default/baseof.html   |   66 +
 themes/kube/layouts/_default/list.html |   19 +
 themes/kube/layouts/_default/single.html   |   18 +
 themes/kube/layouts/blog/single.html   |   63 +
 themes/kube/layouts/docs/single.html   |   23 +
 themes/kube/layouts/index.html |   15 +
 themes/kube/layouts/partials/favicon.html  |1 +
 themes/kube/layouts/partials/footer.html   |3 +
 themes/kube/layouts/partials/header.html   |   29 +
 themes/kube/layouts/partials/meta/name-author.html |6 +
 themes/kube/layouts/partials/meta/ogimage.html |8 +
 themes/kube/layouts/partials/page-summary.html |9 +
 themes/kube/layouts/partials/pagination.html   |   15 +
 themes/kube/layouts/partials/post/byauthor.html|   20 +
 .../kube/layouts/partials/post/category-link.html  |1 +
 themes/kube/layouts/partials/post/meta.html|   14 +
 .../layouts/partials/post/related-content.html |   16 +
 themes/kube/layouts/partials/post/tag-link.html|1 +
 .../kube/layouts/partials/scripts/animation.html   |  127 ++
 .../kube/layouts/partials/site-verification.html   |   12 +
 themes/kube/layouts/partials/toc.html  |   21 +
 themes/kube/layouts/section/community.html |   22 +
 themes/kube/layouts/section/development.html   |   22 +
 themes/kube/layouts/section/documentation.html |   22 +
 themes/kube/layouts/section

[nutch-site] branch asf-staging updated: Sync .asf.yaml file with main branch

2022-09-08 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


The following commit(s) were added to refs/heads/asf-staging by this push:
 new ee7f0b2  Sync .asf.yaml file with main branch
ee7f0b2 is described below

commit ee7f0b22b8562b8550c8172b7817eb5c089bc221
Author: Sebastian Nagel 
AuthorDate: Thu Sep 8 12:07:34 2022 +0200

Sync .asf.yaml file with main branch
---
 .asf.yaml | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/.asf.yaml b/.asf.yaml
index 0cc84e6..d845c31 100644
--- a/.asf.yaml
+++ b/.asf.yaml
@@ -15,7 +15,7 @@
 # specific language governing permissions and limitations
 # under the License.
 
-# 
https://cwiki.apache.org/confluence/display/INFRA/.asf.yaml+features+for+git+repositories
+# https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features
 
 github:
   description: "Apache Nutch Website"
@@ -30,5 +30,9 @@ github:
 merge:  false
 rebase: false
 
+staging:
+  profile: ~
+  whoami:  asf-staging
+
 publish:
   whoami: asf-site
\ No newline at end of file



[nutch-site] 01/01: Update content from Hugo build after adding Kube modified templates

2022-09-08 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/nutch-site.git

commit d77dbb51645aa6a0249564730aba051ac5585a2e
Author: Sebastian Nagel 
AuthorDate: Thu Sep 8 10:54:37 2022 +0200

Update content from Hugo build after adding Kube modified templates
---
 content/apache/index.html  |  11 ++--
 content/categories/index.html  |  16 ++---
 content/categories/news/index.html |  16 ++---
 content/categories/releases/index.html |  16 ++---
 content/community/board-reporting/index.html   |  11 ++--
 content/community/bot/index.html   |  11 ++--
 content/community/contributing/index.html  |  11 ++--
 content/community/index.html   |  14 ++---
 content/community/index.xml|   4 +-
 content/community/mailing-lists/index.html |  11 ++--
 content/community/merchandise/index.html   |  11 ++--
 content/community/people-credits/index.html|  11 ++--
 content/development/index.html |  14 ++---
 content/development/index.xml  |   4 +-
 content/development/issue-tracker/index.html   |  11 ++--
 content/development/nightly-builds/index.html  |  11 ++--
 .../development/source-code-management/index.html  |  11 ++--
 content/doap.rdf   |   2 +-
 content/documentation/about/index.html |  11 ++--
 content/documentation/faqs/index.html  |  11 ++--
 content/documentation/index.html   |  14 ++---
 content/documentation/index.xml|   4 +-
 content/documentation/javadoc/index.html   |  15 ++---
 content/documentation/tutorials/index.html |  11 ++--
 content/documentation/wiki/index.html  |  11 ++--
 content/download/index.html|  27 
 content/favicon.ico| Bin 0 -> 894 bytes
 content/img/{kube => }/plug.svg|   0
 content/img/{kube => }/plus-square.svg |   0
 content/index.html |  70 ++---
 content/index.xml  |   3 +-
 content/news/index.html|  16 ++---
 content/news/index.xml |   4 +-
 content/news/legacy-nutch-news/index.html  |  11 ++--
 content/news/nutch-1.18-release/index.html |  11 ++--
 content/tags/1.18/index.html   |  16 ++---
 content/tags/index.html|  16 ++---
 content/tags/legacy/index.html |  16 ++---
 content/tags/news/index.html   |  16 ++---
 content/tags/release/index.html|  16 ++---
 40 files changed, 257 insertions(+), 238 deletions(-)

diff --git a/content/apache/index.html b/content/apache/index.html
index adaf256..3d84f47 100644
--- a/content/apache/index.html
+++ b/content/apache/index.html
@@ -2,11 +2,11 @@
 
 
 
-  
+  
   
   
   
-   Apache 
+   Apache Nutch™ – Apache 
 
   

@@ -38,9 +38,11 @@
   
   
   
+  
+
 
 
-  
 
 
   
@@ -117,8 +119,7 @@
 

 
-
- 2004-2021 The Apache Software Foundation. Built using the https://github.com/jeblister/kube; target="_blank" rel="noopener 
noreferrer">kube Theme for Hugo. Apache Nutch, Nutch, Apache, the Apache 
feather logo, and the Apache Nutch project logo are trademarks of The Apache 
Software Foundation.
+ 2004-2022 The Apache Software Foundation. Built using the https://github.com/jeblister/kube; target="_blank" rel="noopener 
noreferrer">kube Theme for Hugo. Apache Nutch, Nutch, Apache, the Apache 
feather logo, and the Apache Nutch project logo are trademarks of The Apache 
Software Foundation.

 
 
diff --git a/content/categories/index.html b/content/categories/index.html
index fb07b86..ecec8b1 100644
--- a/content/categories/index.html
+++ b/content/categories/index.html
@@ -2,11 +2,11 @@
 
 
 
-  
+  
   
   
   
-   Categories 
+  Apache Nutch™ – Categories
 
   
   
@@ -37,10 +37,11 @@
   
   
   
-  
+
+
 
 
-  
 
 
   
@@ -99,8 +100,8 @@
   
 
 
-  Project News
-News, activity, ideas, and whatever feels important. https://twitter.com/@ApacheNutch;>Follow us on Twitter
+  Categories
+  
 
 
 
@@ -135,8 +136,7 @@
 

 
-
- 2004-2021 The Apache Software Foundation. Built using the https://github.com/jeblister/kube; target="_blank" rel="noopener 
noreferrer">kube Theme for Hugo. Apache Nutch, Nutch, Apache, the Apache 
feather logo, and the Apache Nutch project logo are trademarks of The Apache 
Software Foundation.
+ 2004-2022 The Apache Software Foundation. Built using the https://github.com/jeblister/kube; target=&qu

[nutch-site] branch asf-staging created (now d77dbb5)

2022-09-08 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


  at d77dbb5  Update content from Hugo build after adding Kube modified 
templates

This branch includes the following new commits:

 new d77dbb5  Update content from Hugo build after adding Kube modified 
templates

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.




svn commit: r56686 - /dev/nutch/1.19/ /release/nutch/1.19/

2022-09-06 Thread snagel
Author: snagel
Date: Tue Sep  6 08:51:59 2022
New Revision: 56686

Log:
Release Apache Nutch 1.19

Added:
release/nutch/1.19/
  - copied from r56685, dev/nutch/1.19/
Removed:
dev/nutch/1.19/



svn commit: r56398 - /dev/nutch/1.19/

2022-08-22 Thread snagel
Author: snagel
Date: Mon Aug 22 15:15:43 2022
New Revision: 56398

Log:
Stage Apache Nutch 1.19  RC#1

Added:
dev/nutch/1.19/
dev/nutch/1.19/apache-nutch-1.19-bin.tar.gz   (with props)
dev/nutch/1.19/apache-nutch-1.19-bin.tar.gz.asc
dev/nutch/1.19/apache-nutch-1.19-bin.tar.gz.sha512
dev/nutch/1.19/apache-nutch-1.19-bin.zip   (with props)
dev/nutch/1.19/apache-nutch-1.19-bin.zip.asc
dev/nutch/1.19/apache-nutch-1.19-bin.zip.sha512
dev/nutch/1.19/apache-nutch-1.19-src.tar.gz   (with props)
dev/nutch/1.19/apache-nutch-1.19-src.tar.gz.asc
dev/nutch/1.19/apache-nutch-1.19-src.tar.gz.sha512
dev/nutch/1.19/apache-nutch-1.19-src.zip   (with props)
dev/nutch/1.19/apache-nutch-1.19-src.zip.asc
dev/nutch/1.19/apache-nutch-1.19-src.zip.sha512

Added: dev/nutch/1.19/apache-nutch-1.19-bin.tar.gz
==
Binary file - no diff available.

Propchange: dev/nutch/1.19/apache-nutch-1.19-bin.tar.gz
--
svn:mime-type = application/x-gzip

Added: dev/nutch/1.19/apache-nutch-1.19-bin.tar.gz.asc
==
--- dev/nutch/1.19/apache-nutch-1.19-bin.tar.gz.asc (added)
+++ dev/nutch/1.19/apache-nutch-1.19-bin.tar.gz.asc Mon Aug 22 15:15:43 2022
@@ -0,0 +1,11 @@
+-----BEGIN PGP SIGNATURE-----
+
+iQEzBAABCgAdFiEE/4Kkh/ktcOUv934Kxm6nt9sKnG0FAmMDm+wACgkQxm6nt9sK
+nG0wqwf/XJTnVZ67AYZvkBorERVEvjnurC9L9FY4/7QHwh2z9q0Viftt6ODIIEAD
+IjkHB8xN9cWvFiyFhG/4NFWFnQNiTUlrZ6Ppu1eYXXvI312Z++vVMMOkVVmdn+5K
+S3YhoejqkO0GeMqV4PcXAiLF0/DtxaSPp+q0O29+XilSw5XB8mlHBn7VWALT5Y6s
+tD5WQfBNbyOCnF4dp2eDcIZjuPof/TbIhyDU3GBNXRe772cXQIl4JrdxyftiMwz4
+m4eMdiY/lNPVEb93X6eCkyzApirZBmtfnJOKcBZsuqOFPFB9TumLWaTA4IBu8d/K
+KvDDeAwmF/3t95wPVMGkOFgBLpsdTw==
+=REOb
+-----END PGP SIGNATURE-----

Added: dev/nutch/1.19/apache-nutch-1.19-bin.tar.gz.sha512
==
--- dev/nutch/1.19/apache-nutch-1.19-bin.tar.gz.sha512 (added)
+++ dev/nutch/1.19/apache-nutch-1.19-bin.tar.gz.sha512 Mon Aug 22 15:15:43 2022
@@ -0,0 +1 @@
+SHA512(apache-nutch-1.19-bin.tar.gz)= 
ba4bfeb92fc5c95b71ab87df46baebcd904f8ea99f33824b7016a9b6a7f4d2488598c94d44954f8d5a7937d8f38164af197efdcfcf3d4f6ef0beea3ff32f0d7a
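A release candidate's `.sha512` file in the `SHA512(name)= digest` form shown above can be checked against a downloaded archive with a few lines of Python. This is a minimal sketch using only the standard library; the helper names `parse_sha512_line` and `verify_sha512` are hypothetical, and the parser tolerates the digest being wrapped onto a second line as it is in this email.

```python
import hashlib
import re

# Matches "SHA512(<file name>)= <128 hex digits>".
LINE_RE = re.compile(
    r"^SHA512\((?P<name>[^)]+)\)=\s*(?P<digest>[0-9a-fA-F]{128})$"
)


def parse_sha512_line(text):
    """Return (file_name, lowercase_hex_digest) from a .sha512 file's content."""
    # Collapse line wraps and surrounding whitespace before matching.
    match = LINE_RE.match(" ".join(text.split()))
    if match is None:
        raise ValueError("not a SHA512(name)= digest line")
    return match.group("name"), match.group("digest").lower()


def verify_sha512(path, expected_hex, chunk_size=1 << 20):
    """Return True if the file at path hashes to expected_hex."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Read in chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()
```

Checking the digest only detects corruption; the accompanying `.asc` signature is what ties the archive to the release manager's key.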

Added: dev/nutch/1.19/apache-nutch-1.19-bin.zip
==
Binary file - no diff available.

Propchange: dev/nutch/1.19/apache-nutch-1.19-bin.zip
--
svn:mime-type = application/octet-stream

Added: dev/nutch/1.19/apache-nutch-1.19-bin.zip.asc
==
--- dev/nutch/1.19/apache-nutch-1.19-bin.zip.asc (added)
+++ dev/nutch/1.19/apache-nutch-1.19-bin.zip.asc Mon Aug 22 15:15:43 2022
@@ -0,0 +1,11 @@
+-----BEGIN PGP SIGNATURE-----
+
+iQEzBAABCgAdFiEE/4Kkh/ktcOUv934Kxm6nt9sKnG0FAmMDm+4ACgkQxm6nt9sK
+nG0LVwgAx9rHYfP2eKvyTI8IFYr0uToH4kqaMqdlJyUilkBi3ZetnkqaNrz3Lt+J
+BWp6VojbizExOMABGe8CM52+bIcxA5PSyU8IEKCS1KVSCBsiLlghv3Y3jEQX366p
+SbSYBhUR8F4owxsHDOI6qlN1tqL72t4kDOOR/LcESQ1IkvGBXPTvU4a0XzWvLphM
+R0GPYMgLaK6TEt04SVEWGjz4bDOwKHpxsOJQjhEzmaY0JPbOwCe4kIb4oqgqNnfH
+4Yu+VSGQVruEr4u6qFfAV2EJdne2Yayc/KY5d7cZ+cvWto5/QMNhUZ3hZEoipott
+ok637G1CU363BNsmmLHsM7lJMO+FTA==
+=mYsK
+-----END PGP SIGNATURE-----

Added: dev/nutch/1.19/apache-nutch-1.19-bin.zip.sha512
==
--- dev/nutch/1.19/apache-nutch-1.19-bin.zip.sha512 (added)
+++ dev/nutch/1.19/apache-nutch-1.19-bin.zip.sha512 Mon Aug 22 15:15:43 2022
@@ -0,0 +1 @@
+SHA512(apache-nutch-1.19-bin.zip)= 
4ef0b4969836d10e85851e30cffaf73bfa269bf1e974408329cc824bfae94c9caf3575d937e0b4532a9d3fc9dfe84df78180e28ff0d602b6fff4bcde0faa884a

Added: dev/nutch/1.19/apache-nutch-1.19-src.tar.gz
==
Binary file - no diff available.

Propchange: dev/nutch/1.19/apache-nutch-1.19-src.tar.gz
--
svn:mime-type = application/x-gzip

Added: dev/nutch/1.19/apache-nutch-1.19-src.tar.gz.asc
==
--- dev/nutch/1.19/apache-nutch-1.19-src.tar.gz.asc (added)
+++ dev/nutch/1.19/apache-nutch-1.19-src.tar.gz.asc Mon Aug 22 15:15:43 2022
@@ -0,0 +1,11 @@
+-----BEGIN PGP SIGNATURE-----
+
+iQEzBAABCgAdFiEE/4Kkh/ktcOUv934Kxm6nt9sKnG0FAmMDm+8ACgkQxm6nt9sK
+nG2lagf7BAd8rl2yGL2sZGle9PUzIwBw3jLby/ZIl88aLs7FV1oIHxVlS3lnNJPM
+LKIGIkZGYUiQ4xF8v9aLl1NQ6p49Gn9nKoUboVpVkzWcknIdlTxnt2qjgHoH6THb
+sIinMlI3IFKXSey76F38JToiU5ycrQPs+nnJZZjFsl/Hg+5jozoJwO/YHJID8yEs
+p32n4H4Ll+El6zsgJvGlE4M1hB3tfv4QpKAcW4swNIjlD12gdiY44oahNbQkd/7v
+15M10wgmW1LOIFfxtbTU

[nutch] branch branch-1.19 created (now 63d4f11c0)

2022-08-22 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch branch-1.19
in repository https://gitbox.apache.org/repos/asf/nutch.git


  at 63d4f11c0 Nutch 1.19 release - update current year in API docs etc. - 
update version number - add changes / release notes - update links to Hadoop 
API docs

No new revisions were added by this update.



[nutch] annotated tag release-1.19 updated (63d4f11c0 -> 5d7660ceb)

2022-08-22 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to annotated tag release-1.19
in repository https://gitbox.apache.org/repos/asf/nutch.git


*** WARNING: tag release-1.19 was modified! ***

from 63d4f11c0 (commit)
  to 5d7660ceb (tag)
 tagging 63d4f11c08aa7f5a3f5e3dded3c880649fd6e1a2 (commit)
 replaces release-1.13
  by Sebastian Nagel
  on Mon Aug 22 16:59:01 2022 +0200

- Log -
Apache Nutch 1.19 RC#1 Tag
---


No new revisions were added by this update.

Summary of changes:



[nutch] branch master updated: NUTCH-2969 Javadoc: Javascript search is not working when built on JDK 11 - pass --no-module-directories to javadoc target when building on JDK 11 - remove obsolete cond

2022-08-22 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new ffe059892 NUTCH-2969 Javadoc: Javascript search is not working when 
built on JDK 11 - pass --no-module-directories to javadoc target when building 
on JDK 11 - remove obsolete condition to fail javadoc builds on JDK 7u25 and 
earlier
ffe059892 is described below

commit ffe0598925fcb27ac253f61d86106b33a260a979
Author: Sebastian Nagel 
AuthorDate: Mon Aug 22 15:18:50 2022 +0200

NUTCH-2969 Javadoc: Javascript search is not working when built on JDK 11
- pass --no-module-directories to javadoc target when building on JDK 11
- remove obsolete condition to fail javadoc builds on JDK 7u25 and earlier
---
 build.xml | 40 ++--
 1 file changed, 18 insertions(+), 22 deletions(-)

diff --git a/build.xml b/build.xml
index 0e1e42f7c..d7377ab25 100644
--- a/build.xml
+++ b/build.xml
@@ -16,6 +16,7 @@
  limitations under the License.
 -->
 
   
 
+  
+
+  
+
   
   
 
@@ -164,17 +169,6 @@
   
   
 
-https://issues.apache.org/jira/browse/NUTCH-1590;>
-  
-
-  
-  
-  
-  
-
-  
-
-
 
 
   
   
+  
+  
 
   
-  
+  
   
   
   
@@ -684,16 +684,6 @@
   
   
   
-https://issues.apache.org/jira/browse/NUTCH-1590;>
-  
-
-  
-  
-  
-  
-
-  
-
 
 
 
   
   
+  
+  
 
   
   



[nutch] branch master updated (bca5fc0d0 -> 635ef2f3b)

2022-08-21 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


from bca5fc0d0 NUTCH-2795 CrawlDbReader: compress CrawlDb dumps if 
configured - configure CSV and JSON LineRecordWriters to compress the output   
files according to the configuration
 new 3199dee64 NUTCH-2963 Upgrade dependencies before release of 1.19 - 
upgrade dependency-check ant plugin
 new cdc67c9ed NUTCH-2963 Upgrade dependencies before release of 1.19 - 
upgrade urlfilter-automaton to depend on dk.brics automaton 1.12-4
 new 0c283980d NUTCH-2963 Upgrade dependencies before release of 1.19 - 
upgrade indexer-solr dependencies:   - Solr 8.5.1 -> 8.11.2   - httpmime 4.5.10 
-> 4.5.13   - httpcore 4.4.12 -> 4.4.15
 new ef7c102eb NUTCH-2963 Upgrade dependencies before release of 1.19 - 
upgrade Hadoop 3.3.3 -> 3.3.4 - adapt ivy retrieve pattern to optionally 
include the `classifier`   (used in Hadoop deps to differentiate between 
architecture:x86_64 vs. aarch_64)
 new 148c8f8a0 NUTCH-2963 Upgrade dependencies before release of 1.19 - 
update / complete LICENSE-binary
 new 59f7865e9 NUTCH-2843 Duplicate declaration of dependencies in ivy.xml 
- remove duplicated dependencies: commons-collections4 and httpclient - move 
Maven POM creation into separate target to reproduce issue
 new 0442562ed NUTCH-2963 Upgrade dependencies before release of 1.19 - 
upgrade Nutch core dependencieshttpcore-nio 4.4.9 -> 4.4.14cxf 2.9.0 -> 
2.9.1commons-jexl3 3.2.1 -> 3.3log4j 2.17.2 -> 2.18.0t-digest 3.2 
-> 3.3 - update / complete LICENSE-binary
 new 635ef2f3b Merge pull request #747 from 
sebastian-nagel/NUTCH-2963-upgrade-dependencies

The 3331 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 LICENSE-binary|  30 -
 NOTICE-binary |  56 +++-
 build.xml | 108 +-
 ivy/ivy.xml   |  35 +-
 src/plugin/indexer-solr/ivy.xml   |  17 +++--
 src/plugin/indexer-solr/plugin.xml|  62 -
 src/plugin/urlfilter-automaton/ivy.xml|   2 +-
 src/plugin/urlfilter-automaton/plugin.xml |   2 +-
 8 files changed, 162 insertions(+), 150 deletions(-)



[nutch] branch master updated (bec577d50 -> bca5fc0d0)

2022-08-21 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


from bec577d50 NUTCH-2863 Injector to parse command-line flags 
case-insensitive
 add bca5fc0d0 NUTCH-2795 CrawlDbReader: compress CrawlDb dumps if 
configured - configure CSV and JSON LineRecordWriters to compress the output   
files according to the configuration

No new revisions were added by this update.

Summary of changes:
 src/java/org/apache/nutch/crawl/CrawlDbReader.java | 53 ++
 1 file changed, 43 insertions(+), 10 deletions(-)



[nutch] branch master updated: NUTCH-2962 Update and complete package info of protocol plugins

2022-08-19 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 6f4c80b7f NUTCH-2962 Update and complete package info of protocol 
plugins
6f4c80b7f is described below

commit 6f4c80b7fca0f0236e079587212c2adcadba3a69
Author: Sebastian Nagel 
AuthorDate: Mon Aug 15 16:29:34 2022 +0200

NUTCH-2962 Update and complete package info of protocol plugins
---
 .../org/apache/nutch/protocol/http/api/package-info.java   |  2 +-
 .../org/apache/nutch/protocol/htmlunit/package-info.java   |  8 +++-
 .../org/apache/nutch/protocol/httpclient/package-info.java | 14 +++---
 .../interactiveselenium/handlers}/package-info.java|  6 --
 .../nutch/protocol/interactiveselenium/package-info.java   |  5 -
 .../org/apache/nutch/protocol/okhttp/package-info.java |  4 +++-
 .../org/apache/nutch/protocol/selenium/package-info.java   |  5 -
 7 files changed, 30 insertions(+), 14 deletions(-)

diff --git 
a/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/package-info.java
 
b/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/package-info.java
index a99b4bac7..8cacc3eef 100644
--- 
a/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/package-info.java
+++ 
b/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/package-info.java
@@ -17,6 +17,6 @@
 
 /** 
  * Common API used by HTTP plugins ({@link org.apache.nutch.protocol.http 
http},
- * {@link org.apache.nutch.protocol.httpclient httpclient})
+ * {@link org.apache.nutch.protocol.httpclient httpclient}, etc.)
  */
 package org.apache.nutch.protocol.http.api;
diff --git 
a/src/plugin/protocol-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/package-info.java
 
b/src/plugin/protocol-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/package-info.java
index bf4902c25..80fabce3b 100644
--- 
a/src/plugin/protocol-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/package-info.java
+++ 
b/src/plugin/protocol-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/package-info.java
@@ -15,5 +15,11 @@
  * limitations under the License.
  */
 
-/** Protocol plugin which supports retrieving documents via the http 
protocol.*/
+/**
+ * Protocol plugin which supports retrieving documents via HTTP/HTTPS using
+ * <a href="https://www.selenium.dev/">Selenium</a> and the
+ * <a href="https://github.com/SeleniumHQ/htmlunit-driver">HtmlUnitDriver</a> web
+ * driver for the for the
+ * <a href="https://htmlunit.sourceforge.io/">HtmlUnit</a> headless browser.
+ */
 package org.apache.nutch.protocol.htmlunit;
diff --git 
a/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/package-info.java
 
b/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/package-info.java
index 251204485..e3c390355 100644
--- 
a/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/package-info.java
+++ 
b/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/package-info.java
@@ -15,12 +15,12 @@
  * limitations under the License.
  */
 
-/** 
- * Protocol plugin which supports retrieving documents via the 
- * HTTP andHTTPS protocols, optionally with Basic, Digest and 
- * NTLM authentication schemes for web server as well as 
- * proxy server. It handles cookies within a single fetch 
- * operation. This plugin is based on Jakarta Commons 
- * HttpClient library.
+/**
+ * Protocol plugin which supports retrieving documents via the HTTP andHTTPS
+ * protocols, optionally with Basic, Digest and NTLM authentication schemes for
+ * web server as well as proxy server. It handles cookies within a single fetch
+ * operation and offers support for POST authentication via HTML forms. This
+ * plugin is based on the <a href="https://hc.apache.org/">Apache HttpClient</a>
+ * library.
  */
 package org.apache.nutch.protocol.httpclient;
diff --git 
a/src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/package-info.java
 
b/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/package-info.java
similarity index 78%
copy from 
src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/package-info.java
copy to 
src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/package-info.java
index 7bdf14a75..407cb7fc8 100644
--- 
a/src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/package-info.java
+++ 
b/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/package-info.java
@@ -16,6 +16,8 @@
  */
 
 /**
- * Protocol plugin based on <a href="https://github.com/square/okhttp">okhttp</a>, supports http, https, http/2.
+ * Handler implementations to interact with
+ * https://www.selenium.dev/;>Selen

[nutch] branch master updated: NUTCH-2930 Protocol-okhttp: implement IP filter (#736)

2022-08-19 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 7e969eaec NUTCH-2930 Protocol-okhttp: implement IP filter (#736)
7e969eaec is described below

commit 7e969eaec1ab8e9e21667faf6cf1881fb10cfb31
Author: Sebastian Nagel 
AuthorDate: Fri Aug 19 15:26:07 2022 +0200

NUTCH-2930 Protocol-okhttp: implement IP filter (#736)

- add include/exclude rules as list of IP address, CIDR notation
  or predefined IP ranges (localhost, loopback, sitelocal)
---
 conf/nutch-default.xml |  25 +++
 .../org/apache/nutch/protocol/okhttp/CIDR.java |  79 
 .../nutch/protocol/okhttp/IPFilterRules.java   | 129 +
 .../org/apache/nutch/protocol/okhttp/OkHttp.java   |  35 
 .../protocol/okhttp/TestBadServerResponses.java|   2 +-
 .../protocol/okhttp/TestIPAddressFiltering.java| 207 +
 .../nutch/protocol/okhttp/TestProtocolOkHttp.java  |   2 +-
 .../protocol/AbstractHttpProtocolPluginTest.java   |  22 ++-
 8 files changed, 494 insertions(+), 7 deletions(-)

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 1ad02a021..2a6325884 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -449,6 +449,31 @@
   
 
 
+<property>
+  <name>http.filter.ipaddress.include</name>
+  <value></value>
+  <description>
+    If not empty: only fetch content from these IP addresses defined
+    as a comma-separated list of a single IP address, a CIDR notation,
+    or one of the following pre-defined IP address types: localhost,
+    loopback, sitelocal. The property http.filter.ipaddress.exclude
+    can be used to block subranges in the included list of ranges.
+    Note: supported only by protocol-okhttp.
+  </description>
+</property>
+
+<property>
+  <name>http.filter.ipaddress.exclude</name>
+  <value></value>
+  <description>
+    If not empty: do not fetch content from these IP addresses defined
+    as a comma-separated list of a single IP address, a CIDR notation,
+    or one of the following pre-defined IP address types: localhost,
+    loopback, sitelocal. Note: supported only by protocol-okhttp.
+  </description>
+</property>
+
+
 
 
 
diff --git 
a/src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/CIDR.java
 
b/src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/CIDR.java
new file mode 100644
index 0..3add082a8
--- /dev/null
+++ 
b/src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/CIDR.java
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.protocol.okhttp;
+
+import java.net.InetAddress;
+
+import com.google.common.net.InetAddresses;
+
+/**
+ * Parse a https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing;>CIDR 
block
+ * notation and test whether an IP address is contained in the subnet range
+ * defined by the CIDR.
+ */
+public class CIDR {
+  InetAddress addr;
+  int mask;
+
+  public CIDR(InetAddress address, int mask) {
+this.addr = address;
+this.mask = mask;
+  }
+
+  public CIDR(String cidr) throws IllegalArgumentException {
+String ipStr = cidr;
+int sep = cidr.indexOf('/');
+if (sep > -1) {
+  ipStr = cidr.substring(0, sep);
+}
+addr = InetAddresses.forString(ipStr);
+if (sep > -1) {
+  mask = Integer.parseInt(cidr.substring(sep + 1));
+} else {
+  mask = addr.getAddress().length * 8;
+}
+if (cidr.indexOf(':') > -1 && addr.getAddress().length == 4) {
+  // IPv4-mapped IPv6 addresses are automatically converted to IPv4,
+  // need to shift the mask
+  mask = Math.max(0, mask - 96);
+}
+  }
+
+  public boolean contains(InetAddress address) {
+byte[] addr0 = addr.getAddress();
+byte[] addr1 = address.getAddress();
+if (addr0.length != addr1.length) {
+  // not comparing IPv4 and IPv6 addresses
+  return false;
+}
+for (int i = 0; i < addr0.length; i++) {
+  int remainingMaskBits = mask - (i * 8);
+  if (remainingMaskBits <= 0)
+return true;
+  int m = ~(0xff >> remainingMaskBits); // mask for byte under cursor
+  if ((ad
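
The diff above is cut off in the middle of `contains`. As a rough, self-contained illustration of the prefix test it describes (compare the leading mask bits of two addresses byte by byte), here is a minimal sketch that uses only `java.net.InetAddress`. The class name `CidrSketch` is hypothetical and this is not the code from the commit: it parses literals with `InetAddress.getByName` instead of Guava's `InetAddresses.forString`, and it does not implement the mask shift for IPv4-mapped IPv6 addresses.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Illustrative sketch only; not the CIDR class added by NUTCH-2930. */
final class CidrSketch {
  private final byte[] network;   // raw bytes of the network address
  private final int prefixLength; // number of leading bits that must match

  CidrSketch(String cidr) throws UnknownHostException {
    int sep = cidr.indexOf('/');
    String ip = sep > -1 ? cidr.substring(0, sep) : cidr;
    // For a literal IP address getByName only validates the format (no DNS lookup).
    this.network = InetAddress.getByName(ip).getAddress();
    int maxBits = network.length * 8;
    // Without "/n" the block consists of a single address.
    this.prefixLength = sep > -1 ? Integer.parseInt(cidr.substring(sep + 1)) : maxBits;
    if (prefixLength < 0 || prefixLength > maxBits) {
      throw new IllegalArgumentException("Invalid prefix length in " + cidr);
    }
  }

  boolean contains(InetAddress address) {
    byte[] candidate = address.getAddress();
    if (candidate.length != network.length) {
      return false; // never compare IPv4 with IPv6
    }
    for (int i = 0; i < network.length; i++) {
      int remaining = prefixLength - i * 8;
      if (remaining <= 0) {
        return true; // all prefix bits matched
      }
      // Keep the top `remaining` bits of this byte (all 8 if more remain).
      int mask = remaining >= 8 ? 0xff : (0xff << (8 - remaining)) & 0xff;
      if ((network[i] & mask) != (candidate[i] & mask)) {
        return false;
      }
    }
    return true;
  }
}
```

With this sketch, `new CidrSketch("192.168.0.0/16").contains(InetAddress.getByName("192.168.42.7"))` evaluates to true, while an address such as `192.169.0.1` or any IPv6 address is rejected.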

[nutch] branch master updated (c0f723e99 -> 05afebd03)

2022-08-19 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


from c0f723e99 NUTCH-2957 indexer-solr / Solr schema.xml - add fall-back 
field definitions for unknown index fields - update comments and descriptions - 
fix indentation
 new dfe430b5a NUTCH-2861 Remove parse-swf
 new 1aec06f41 Upgrade to Apache Rat 0.14 (download of Rat 0.13 failed)
 new ddca1c252 NUTCH-2822 Split the LICENSE.txt file into two files for 
source resp. binary releases
 new eba8f3842 NUTCH-2290 Update licenses of bundled libraries - update 
year in NOTICE files: follow schema used by Hadoop and Spark   projects 
(" and onwards") - change "developed by The ASF" -> 
"developed at The ASF"   following 
https://infra.apache.org/licensing-howto.html#bundle-asf-product
 new a10713114 NUTCH-2290 Update licenses of bundled libraries - move 
"export control notice" from README to NOTICE files   (following the schema 
used by Hadoop and Spark) - update "export control notice" following the scheme 
  used by Apache Tika
 new 1d1eb6360 NUTCH-2290 Update licenses of bundled libraries - ivy 
license report: add homepage URL of dependencies
 new 2fbd30976 NUTCH-2290 Update licenses of bundled libraries NUTCH-2821 
Deduplicate licenses in LICENSE.txt file - LICENSE-binary: list dependencies by 
license   (this also deduplicates licenses)
 new 78f6f4058 NUTCH-2290 Update licenses of bundled libraries - 
NOTICE-binary: add Apache projects and links to   the projects' NOTICE files - 
NOTICE-binary: add other software projects   with links to the project homepage 
and   the used license - add all licenses (different from the Apache 2.0 
license)   used by dependencies shipped in the binary package
 new 9a59ec9f0 NUTCH-2290 Update licenses of bundled libraries UTCH-2822 
Split the LICENSE.txt file into two files for source resp. binary releases - 
ensure the binary license and notice files are shipped with the source   and 
binary packages
 new 957d460c8 NUTCH-2290 Update licenses of bundled libraries - update the 
pull-request template and add updating licenses as a potential to-do
 new 05afebd03 Merge pull request #743 from 
sebastian-nagel/NUTCH-2290-update-licenses

The 3319 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .github/pull_request_template.md   |3 +
 CHANGES.txt|6 +
 LICENSE-binary |  786 +++
 LICENSE.txt| 5534 
 NOTICE-binary  | 1170 +
 NOTICE.txt |   37 +-
 README.md  |   23 -
 build.xml  |   18 +-
 conf/parse-plugins.xml.template|6 -
 default.properties |1 -
 ivy/ivy-report-license.xsl |4 +-
 licenses-binary/LICENSE-bouncy-castle-licence.txt  |   17 +
 licenses-binary/LICENSE-bsd-2-clause.txt   |   18 +
 licenses-binary/LICENSE-bsd-3-clause.txt   |   41 +
 licenses-binary/LICENSE-bsd.txt|  206 +
 licenses-binary/LICENSE-cddl-1.0.txt   |  175 +
 licenses-binary/LICENSE-cddl-1.1.txt   |  756 +++
 licenses-binary/LICENSE-cddl-gplv2-ce.txt  | 3176 +++
 licenses-binary/LICENSE-cddl-license.txt   |  175 +
 licenses-binary/LICENSE-common-public-license.txt  |  220 +
 licenses-binary/LICENSE-cpl.txt|  217 +
 .../LICENSE-eclipse-distribution-license-v1.0.txt  |   30 +
 licenses-binary/LICENSE-epl-2.0.txt|   90 +
 ...version-2-gpl2-with-the-classpath-exception.txt |   15 +
 ...y-extreme-lab-software-license-vesion-1.1.1.txt |0
 licenses-binary/LICENSE-mit-license.txt|   10 +
 .../LICENSE-mozilla-public-license-1.1-mpl-1.1.txt |  379 ++
 .../LICENSE-mozilla-public-license-version-2.0.txt |  375 ++
 ...ENSE-public-domain-per-creative-commons-cc0.txt |   32 +
 licenses-binary/LICENSE-public-domain.txt  |   18 +
 licenses-binary/LICENSE-the-go-license.txt |   29 +
 licenses-binary/LICENSE-unicode-icu-license.txt|  521 ++
 licenses-binary/LICENSE-unrar-license.txt  |   43 +
 src/plugin/build.xml   |3 -
 src/plugin/parse-swf/build.xml |   38 -
 src/plugin/parse-swf/ivy.xml   |   41 -
 src/plugin/parse-swf/lib/javaswf-LICENSE.txt   |   33 -
 src/plugin/parse-swf/lib/javaswf.ja

[nutch] branch master updated (edebfe49f -> c0f723e99)

2022-08-17 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


from edebfe49f NUTCH-2955 indexer-solr: replace deprecated/removed field 
type solr.LatLonType
 add c0f723e99 NUTCH-2957 indexer-solr / Solr schema.xml - add fall-back 
field definitions for unknown index fields - update comments and descriptions - 
fix indentation

No new revisions were added by this update.

Summary of changes:
 src/plugin/indexer-solr/schema.xml | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)



[nutch] branch master updated (a5a630055 -> edebfe49f)

2022-08-17 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


from a5a630055 Merge pull request #729 from 
sebastian-nagel/NUTCH-2947-keep-stateful-fetch-queues
 add edebfe49f NUTCH-2955 indexer-solr: replace deprecated/removed field 
type solr.LatLonType

No new revisions were added by this update.

Summary of changes:
 src/plugin/indexer-solr/schema.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



[nutch] branch master updated (82f9530dc -> a5a630055)

2022-08-15 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


from 82f9530dc Merge pull request #697 from 
sebastian-nagel/NUTCH-2896-okhttp-connection-pool
 new c862d2409 NUTCH-2947 Fetcher: keep state of empty but stateful fetch 
queues unless queue feeder is finished in order to ensure politeness - next 
fetch time not yet reached - non-zero exception counter and queue feeder still  
 adding new fetch items to queues Only if the the queue feeder is finished and 
no more new fetch items are added, these queues can finally removed.
 new 8cfa53f7d NUTCH-2947 Fetcher: keep state of empty but stateful fetch 
queues - also keep state if `fetcher.exceptions.per.queue.delay` > 0.0
 new a5a630055 Merge pull request #729 from 
sebastian-nagel/NUTCH-2947-keep-stateful-fetch-queues

The 3306 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../org/apache/nutch/fetcher/FetchItemQueues.java | 19 +--
 src/java/org/apache/nutch/fetcher/QueueFeeder.java|  4 +++-
 2 files changed, 20 insertions(+), 3 deletions(-)



[nutch] branch master updated (b7b834501 -> 82f9530dc)

2022-08-15 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


from b7b834501 NUTCH-2958 Upgrade to crawler-commons 1.3 (#740)
 new af44bcb6f NUTCH-2896 Protocol-okhttp: make connection pool 
configurable - add configuration property `http.connection.pool.okhttp` to 
configure   the number of connection pools, their size and the keep-alive time  
 of the pooled connections - create as many clients as pools are configured, 
each client holding   one pool. Distribute connections by target host name over 
clients
 new 467e59105 NUTCH-2896 Protocol-okhttp: make connection pool 
configurable - fix javadoc error
 new 82f9530dc Merge pull request #697 from 
sebastian-nagel/NUTCH-2896-okhttp-connection-pool

The 3303 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 conf/nutch-default.xml | 21 
 .../org/apache/nutch/protocol/okhttp/OkHttp.java   | 59 --
 .../nutch/protocol/okhttp/OkHttpResponse.java  |  2 +-
 3 files changed, 77 insertions(+), 5 deletions(-)
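
The commit message for NUTCH-2896 says that connections are distributed "by target host name over clients", with one client per configured pool. The diff itself is not included in this email, so the following is only a sketch of one plausible way to do that: map each host name to a stable client index by hashing. The class and method names are hypothetical, and the actual selection logic in `OkHttp.java` may differ.

```java
import java.util.Locale;

/** Illustrative sketch only; names and hashing scheme are assumptions. */
final class PoolSelectorSketch {

  /**
   * Map a host name to an index in [0, numClients), so that all requests
   * to the same host reuse the same client and therefore the same pool.
   */
  static int clientIndexFor(String host, int numClients) {
    if (numClients < 1) {
      throw new IllegalArgumentException("numClients must be >= 1");
    }
    // Host names are case-insensitive; floorMod keeps the result
    // non-negative even when hashCode() is negative.
    return Math.floorMod(host.toLowerCase(Locale.ROOT).hashCode(), numClients);
  }
}
```

Keeping one host on one client means its keep-alive connections stay in a single pool, which is the benefit the commit message points to; a plain round-robin assignment would spread them over all pools.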



[nutch] branch master updated (8fc4f17ac -> b7b834501)

2022-08-12 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


from 8fc4f17ac NUTCH-2956 index-geoip: dependency upgrades and improvements 
- upgrade to geoip2 3.0.1 - exclude transitive dependencies (Jackson) provided 
as Nutch core deps - read also GeoLite2-*.mmdb files - review index field names 
in plugin and Nutch Solr schema:   - fix typos in field names   - remove unused 
fields from schema
 add b7b834501 NUTCH-2958 Upgrade to crawler-commons 1.3 (#740)

No new revisions were added by this update.

Summary of changes:
 ivy/ivy.xml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)



[nutch] branch master updated: NUTCH-2956 index-geoip: dependency upgrades and improvements - upgrade to geoip2 3.0.1 - exclude transitive dependencies (Jackson) provided as Nutch core deps - read als

2022-08-09 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 8fc4f17ac NUTCH-2956 index-geoip: dependency upgrades and improvements 
- upgrade to geoip2 3.0.1 - exclude transitive dependencies (Jackson) provided 
as Nutch core deps - read also GeoLite2-*.mmdb files - review index field names 
in plugin and Nutch Solr schema:   - fix typos in field names   - remove unused 
fields from schema
8fc4f17ac is described below

commit 8fc4f17acc5da28c22ef4e77c2316e20e5976f02
Author: Sebastian Nagel 
AuthorDate: Sat Aug 6 15:04:10 2022 +0200

NUTCH-2956 index-geoip: dependency upgrades and improvements
- upgrade to geoip2 3.0.1
- exclude transitive dependencies (Jackson) provided as Nutch core deps
- read also GeoLite2-*.mmdb files
- review index field names in plugin and Nutch Solr schema:
  - fix typos in field names
  - remove unused fields from schema
---
 conf/nutch-default.xml |  3 +-
 src/plugin/index-geoip/ivy.xml | 11 +++--
 src/plugin/index-geoip/plugin.xml  |  7 +---
 .../nutch/indexer/geoip/GeoIPDocumentCreator.java  | 49 --
 .../nutch/indexer/geoip/GeoIPIndexingFilter.java   | 34 ---
 src/plugin/indexer-solr/schema.xml |  3 +-
 6 files changed, 57 insertions(+), 50 deletions(-)

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 7faa6fdcd..bb9aae1b3 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -2112,7 +2112,8 @@ Add scoring-metadata to the list of active plugins
   'domainDatabase', 'ispDatabase' or 'insightsService'. If you wish to use any 
one of the 
   Database options, you should make one of GeoIP2-City.mmdb, 
GeoIP2-Connection-Type.mmdb, 
   GeoIP2-Domain.mmdb or GeoIP2-ISP.mmdb files respectively available on the 
classpath and
-  available at runtime.
+  available at runtime. Alternatively, also the GeoLite2 IP databases 
(GeoLite2-*.mmdb)
+  can be used.
   
 
 
diff --git a/src/plugin/index-geoip/ivy.xml b/src/plugin/index-geoip/ivy.xml
index 4fa6f71a7..2eda5a63f 100644
--- a/src/plugin/index-geoip/ivy.xml
+++ b/src/plugin/index-geoip/ivy.xml
@@ -36,12 +36,11 @@
   
 
   
-
-  
-  
-  
-  
-  
+
+  
+  
+  
+  
 
   
   
diff --git a/src/plugin/index-geoip/plugin.xml 
b/src/plugin/index-geoip/plugin.xml
index 6148f59e5..c4efadf94 100644
--- a/src/plugin/index-geoip/plugin.xml
+++ b/src/plugin/index-geoip/plugin.xml
@@ -25,11 +25,8 @@
   
  
   
-  
-  
-  
-  
-  
+  
+  

 

diff --git 
a/src/plugin/index-geoip/src/java/org/apache/nutch/indexer/geoip/GeoIPDocumentCreator.java
 
b/src/plugin/index-geoip/src/java/org/apache/nutch/indexer/geoip/GeoIPDocumentCreator.java
index 1c697a205..64b3862be 100644
--- 
a/src/plugin/index-geoip/src/java/org/apache/nutch/indexer/geoip/GeoIPDocumentCreator.java
+++ 
b/src/plugin/index-geoip/src/java/org/apache/nutch/indexer/geoip/GeoIPDocumentCreator.java
@@ -17,13 +17,17 @@
 package org.apache.nutch.indexer.geoip;
 
 import java.io.IOException;
+import java.lang.invoke.MethodHandles;
 import java.net.InetAddress;
 import java.net.UnknownHostException;
 
 import org.apache.nutch.indexer.NutchDocument;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 
 import com.maxmind.geoip2.DatabaseReader;
 import com.maxmind.geoip2.WebServiceClient;
+import com.maxmind.geoip2.exception.AddressNotFoundException;
 import com.maxmind.geoip2.exception.GeoIp2Exception;
 import com.maxmind.geoip2.model.InsightsResponse;
 import com.maxmind.geoip2.model.CityResponse;
@@ -54,28 +58,17 @@ import com.maxmind.geoip2.record.Traits;
  */
 public class GeoIPDocumentCreator {
 
-  /**
-   * Add field to document but only if value isn't null
-   * @param doc the {@link NutchDocument} to augment
-   * @param name the name of the target field
-   * @param value the String value to associate with the target field
-   */
-  public static void addIfNotNull(NutchDocument doc, String name,
-  String value) {
-if (value != null) {
-  doc.add(name, value);
-}
-  }
+  private static final Logger LOG = LoggerFactory
+  .getLogger(MethodHandles.lookup().lookupClass());
 
   /**
* Add field to document but only if value isn't null
* @param doc the {@link NutchDocument} to augment
* @param name the name of the target field
-   * @param value the {@link java.lang.Integer} value to 
-   * associate with the target field
+   * @param value the String value to associate with the target field
*/
   public static void addIfNotNull(NutchDocument doc, String name,
-  Integer value) {
+  Object value) {
 if (value != null) {
   doc.add(name, value);
 }
@@ -87,7 +80,6 @@ public class

[nutch] branch master updated: NUTCH-2953 Indexer Elastic to ignore SSL issues - apply patch contributed by Markus Jelsma - fix class imports

2022-08-09 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 01ab00b6c NUTCH-2953 Indexer Elastic to ignore SSL issues - apply 
patch contributed by Markus Jelsma - fix class imports
01ab00b6c is described below

commit 01ab00b6cd8dbba8abbf1d3840a09bab929c6af0
Author: Sebastian Nagel 
AuthorDate: Mon Aug 8 16:19:24 2022 +0200

NUTCH-2953 Indexer Elastic to ignore SSL issues
- apply patch contributed by Markus Jelsma
- fix class imports
---
 .../indexwriter/elastic/ElasticIndexWriter.java| 31 +-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git 
a/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
 
b/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
index 7885a5210..053bfd68a 100644
--- 
a/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
+++ 
b/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
@@ -25,14 +25,20 @@ import java.util.List;
 import java.util.Map;
 import java.util.concurrent.TimeUnit;
 
+import javax.net.ssl.SSLContext;
+
 import org.apache.commons.lang.StringUtils;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.http.HttpHost;
 import org.apache.http.auth.AuthScope;
 import org.apache.http.auth.UsernamePasswordCredentials;
 import org.apache.http.client.CredentialsProvider;
+import org.apache.http.conn.ssl.NoopHostnameVerifier;
+import org.apache.http.conn.ssl.TrustSelfSignedStrategy;
 import org.apache.http.impl.client.BasicCredentialsProvider;
 import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
+import org.apache.http.ssl.SSLContextBuilder;
+import org.apache.http.ssl.SSLContexts;
 import org.apache.nutch.indexer.IndexWriter;
 import org.apache.nutch.indexer.IndexWriterParams;
 import org.apache.nutch.indexer.NutchDocument;
@@ -181,6 +187,7 @@ public class ElasticIndexWriter implements IndexWriter {
 hostsList[i++] = new HttpHost(host, port, scheme);
   }
   RestClientBuilder restClientBuilder = RestClient.builder(hostsList);
+
   if (auth) {
 restClientBuilder
 .setHttpClientConfigCallback(new HttpClientConfigCallback() {
@@ -191,6 +198,28 @@ public class ElasticIndexWriter implements IndexWriter {
   }
 });
   }
+
+  // In case of HTTPS, set the client up for ignoring problems with 
self-signed
+  // certificates and stuff
+  if ("https".equals(scheme)) {
+try {
+  SSLContextBuilder sslBuilder = SSLContexts.custom();
+  sslBuilder.loadTrustMaterial(null, new TrustSelfSignedStrategy());
+  final SSLContext sslContext = sslBuilder.build();
+
+  restClientBuilder.setHttpClientConfigCallback(new 
HttpClientConfigCallback() {
+@Override
+public HttpAsyncClientBuilder 
customizeHttpClient(HttpAsyncClientBuilder httpClientBuilder) {
+  // ignore issues with self-signed certificates
+  
httpClientBuilder.setSSLHostnameVerifier(NoopHostnameVerifier.INSTANCE);
+  return httpClientBuilder.setSSLContext(sslContext);
+}
+  });
+} catch (Exception e) {
+  LOG.error("Error setting up SSLContext because: " + e.getMessage(), 
e);
+}
+  }
+
   client = new RestHighLevelClient(restClientBuilder);
 } else {
   throw new IOException(
@@ -344,4 +373,4 @@ public class ElasticIndexWriter implements IndexWriter {
   public Configuration getConf() {
 return config;
   }
-}
\ No newline at end of file
+}



[nutch] branch master updated: NUTCH-2952 Upgrade core dependencies - Hadoop 3.1.3 -> 3.3.3 - log4j 2.17.0 -> 2.17.2 - and some more

2022-08-09 Thread snagel
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new e71841fd0 NUTCH-2952 Upgrade core dependencies - Hadoop 3.1.3 -> 3.3.3 
- log4j 2.17.0 -> 2.17.2 - and some more
e71841fd0 is described below

commit e71841fd0f1777ece6dde2115ea7c5b036bb13f1
Author: Sebastian Nagel 
AuthorDate: Wed Jun 15 17:07:07 2022 +0200

NUTCH-2952 Upgrade core dependencies
- Hadoop 3.1.3 -> 3.3.3
- log4j 2.17.0 -> 2.17.2
- and some more
---
 ivy/ivy.xml | 40 +
 src/plugin/publish-rabbitmq/ivy.xml |  2 +-
 2 files changed, 19 insertions(+), 23 deletions(-)

diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index a03bce45f..12fa6d94c 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -36,10 +36,10 @@

 

-   
-   
-   
-   
+   
+   
+   
+   
 


@@ -50,7 +50,7 @@

 

-   
+   



@@ -58,23 +58,23 @@



-   
-   
-   
+   
+   
+   

 

 

-   
+   
 
-   
+   
 
-   
+   
 

 
-   
+   



@@ -84,10 +84,10 @@



-   
-   
-   
-   
+   
+   
+   
+   
 


@@ -111,16 +111,12 @@



-   
 
-   
-   
+   
+   
 

 
-   
-   
-



diff --git a/src/plugin/publish-rabbitmq/ivy.xml 
b/src/plugin/publish-rabbitmq/ivy.xml
index dd450cf7f..7b5e3dd3c 100644
--- a/src/plugin/publish-rabbitmq/ivy.xml
+++ b/src/plugin/publish-rabbitmq/ivy.xml
@@ -34,5 +34,5 @@
 
 
   
-  
+
 


