svn commit: r68410 [1/3] - /dev/nutch/1.20/

2024-04-09 Thread lewismc
Author: lewismc
Date: Tue Apr  9 20:44:40 2024
New Revision: 68410

Log:
Stage Apache Nutch 1.20  RC#1

Added:
dev/nutch/1.20/
dev/nutch/1.20/CHANGES.md
dev/nutch/1.20/apache-nutch-1.20-bin.tar.gz   (with props)
dev/nutch/1.20/apache-nutch-1.20-bin.tar.gz.asc
dev/nutch/1.20/apache-nutch-1.20-bin.zip   (with props)
dev/nutch/1.20/apache-nutch-1.20-bin.zip.asc
dev/nutch/1.20/apache-nutch-1.20-src.tar.gz   (with props)
dev/nutch/1.20/apache-nutch-1.20-src.tar.gz.asc
dev/nutch/1.20/apache-nutch-1.20-src.zip   (with props)
dev/nutch/1.20/apache-nutch-1.20-src.zip.asc



svn commit: r68410 [2/3] - /dev/nutch/1.20/

2024-04-09 Thread lewismc


Added: dev/nutch/1.20/CHANGES.md
==
--- dev/nutch/1.20/CHANGES.md (added)
+++ dev/nutch/1.20/CHANGES.md Tue Apr  9 20:44:40 2024
@@ -0,0 +1,3437 @@
+# Nutch Change Log
+
+
+Nutch 1.20 Release 09/04/2024 (dd/mm/yyyy)
+Release Report: https://s.apache.org/ovjf3
+
+Sub-task
+
+
+[NUTCH-2596] -   
  Upgrade from org.mortbay.jetty to org.eclipse.jetty
+
+[NUTCH-2852] -   
  Method invokes System.exit(...) 9 bugs
+
+[NUTCH-2972] -   
  Javadoc build fails using JDK 17
+
+[NUTCH-3007] -   
  Fix impossible casts
+
+
+
+Bug
+
+
+[NUTCH-2634] -   
  Some links marked as nofollow are followed anyway.
+
+[NUTCH-2820] -   
  Review sample files used in any23 unit tests
+
+[NUTCH-2924] -   
  Generate maxCount expr evaluated only once
+
+[NUTCH-2937] -   
  parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode
+
+[NUTCH-2973] -   
  Single domain names (eg https://localnet) can't be crawled - filtering fails
+
+[NUTCH-2974] -   
  Ant build fails with Unparseable date on certain platforms
+
+[NUTCH-2979] -   
  Upgrade Commons Text to 1.10.0
+
+[NUTCH-2982] -   
  Generator: parameter for URL normalization not passed forward
+
+[NUTCH-2985] -   
  Disable plugin urlfilter-validator by default
+
+[NUTCH-2992] -   
  Fetcher: always block fetch queues when exceptions threshold is reached
+
+[NUTCH-3000] -   
  protocol-selenium returns only the body, strips off the <head/> element
+
+[NUTCH-3001] -   
  protocol-selenium requires Content-Type header 
+
+[NUTCH-3002] -   
  Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive
+
+[NUTCH-3008] -   
  indexer-elastic: downgrade to ES 7.10.2 to address licensing issues
+
+[NUTCH-3012] -   
  SegmentReader when dumping with option -recode: NPE on unparsed documents
+
+[NUTCH-3027] -   
  Trivial resource leak patch in DomainSuffixes.java
+
+[NUTCH-3035] -   
  Update license and notice file for release of 1.20 
+
+
+
+New Feature
+
+
+[NUTCH-2832] -   
  Create tutorial on sending Nutch logs to Elasticsearch
+
+[NUTCH-2888] -   
  Selenium Protocol: Support for Selenium 4
+
+[NUTCH-2920] -   
  Implement an indexer-opensearch plugin
+
+[NUTCH-2991] -   
  Support HTTP/S Header Authorization for Solr connections
+
+[NUTCH-3029] -   
  Host specific max. and min. intervals in adaptive scheduler
+
+
+
+Improvement
+
+
+[NUTCH-2853] -   
  bin/nutch: remove deprecated commands solrindex, solrdedup, solrclean
+
+[NUTCH-2883] -   
  Provide means to run server as a persistent service in Docker container
+
+[NUTCH-2897] -   
  Do not suppress deprecated API warnings
+
+[NUTCH-2961] -   
  Upgrade dependencies of parsefilter-naivebayes
+
+[NUTCH-2980] -   
  Upgrade Selenium Java to 4.7.2
+
+[NUTCH-2983] -   
  nutch-default.xml improvements
+
+[NUTCH-2990] -   
  HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
+
+[NUTCH-2993] -   
  ScoringDepth plugin to skip depth check based on URL Pattern
+
+[NUTCH-2995] -   
  Upgrade to crawler-commons 1.4
+
+[NUTCH-2996] -   
  Use new SimpleRobotRulesParser API entry point (crawler-commons 1.4)
+
+[NUTCH-2997] -   
  Add Override annotations where applicable
+
+[NUTCH-3004] -   
  Avoid NPE in HttpResponse
+
+[NUTCH-3005] -   
  Upgrade selenium as needed
+
+[NUTCH-3009] -   
  Upgrade to Hadoop 3.3.6
+
+[NUTCH-3010] -   
  Injector: count unique number of injected URLs
+
+[NUTCH-3011] -   
  HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)
+
+[NUTCH-3013] -   
  Employ commons-lang3's StopWatch to simplify timing logic
+
+[NUTCH-3014] -   
  Standardize Job names
+
+[NUTCH-3015] -   
  Add more CI steps to GitHub master-build.yml
+
+[NUTCH-3017] -   
  Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
+
+[NUTCH-3025] -   
  urlfilter-fast to filter based on the length of the URL
+
+[NUTCH-3031] -   
  ProtocolFactory host mapper to support domains
+
+[NUTCH-3032] -   
  Indexing plugin as an adapter for end user's own POJO instances
+
+[NUTCH-3036] -   
  Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium
+
+
+
+Task
+
+
+[NUTCH-2959] -   
  Upgrade to Apache Tika 2.9.0
+
+[NUTCH-2977] -   
  Support for showing dependency tree
+
+[NUTCH-2978] -   
  Move to slf4j2 and remove log4j1 and reload4j
+
+[NUTCH-2984] -   
  Drop test proxy server and benchmark tool
+
+[NUTCH-2989] -   
  Can't have username/pw AND https on elastic-indexer?!
+
+[NUTCH-2998] -   
  Remove the Any23 plugin
+
+[NUTCH-2999] -   
  Update Lucene version to latest 8.x
+
+[NUTCH-3016] -   
  Upgrade Apache Ivy to 2.5.2
+

svn commit: r68410 [3/3] - /dev/nutch/1.20/

2024-04-09 Thread lewismc
Added: dev/nutch/1.20/apache-nutch-1.20-bin.tar.gz
==
Binary file - no diff available.

Propchange: dev/nutch/1.20/apache-nutch-1.20-bin.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/nutch/1.20/apache-nutch-1.20-bin.tar.gz.asc
==
--- dev/nutch/1.20/apache-nutch-1.20-bin.tar.gz.asc (added)
+++ dev/nutch/1.20/apache-nutch-1.20-bin.tar.gz.asc Tue Apr  9 20:44:40 2024
@@ -0,0 +1,16 @@
+-----BEGIN PGP SIGNATURE-----
+
+iQIzBAABCAAdFiEE23tRmRIcCKXI9AUrOkcX8Ei66/YFAmYVpOQACgkQOkcX8Ei6
+6/buExAAwPh4uHBMGPvVUBLztSm5Ze+ZeRjHsxARVmiglyFUCKo9n1ZySHTaoqlW
+3f1I7c79dqrVZyqKMY9O5BjdA5K0w7scz3klHNOdrUc5Zal8GSY52sbOXq+CLka0
+fEYz3H3BMfB1eDn8F+dtFcYgfKqatVf+sFbvLdzfeorLzURZha/07WsGiXAtc629
+dOuNb9mweE5+BlEaeIm3ypYww294KZEvtQstouuvdal86Gm94KCenVb989CofQLb
+RHamuxjmVDOtb22G+PqCEFfPWZ3HSz9eOqzqn133glR88soWwG468MxzLAJZXpDU
+uB05ENvozkcIngj/emSZFy7Y1sY81VH0ErLxbxZDCIssxpVnOwI6N+5Un00T/nMz
+VbUeXv1Zq9XY2SHDZr9AP8wiWre4ae5wp2NAMVD2zlcTVo66jbDEiNSCzKmK/pPe
+gdexcS47lXQjCCYYe6rnUO8T5wEAeVn2Ctp+1mdjfDamN7liNExzvPtoUg07uDyx
+TM48F+5Es1c9wYC3nVyUvqadfKWFnCqfPIPogEeNTH5mwWTAtaXCcPcib+GxoCd+
+k5x5BEmB6wyQbmTKLjSVdDI6DL+suO4MtlIw1/2yHnj4uMPnAvABnG8uBKp2sCMc
+3GlQWJ5FiadkXASf6bbCv5+2iQof1BhRGJAu5PvYjRGEASG3IhM=
+=dpeR
+-----END PGP SIGNATURE-----

Added: dev/nutch/1.20/apache-nutch-1.20-bin.zip
==
Binary file - no diff available.

Propchange: dev/nutch/1.20/apache-nutch-1.20-bin.zip
--
svn:mime-type = application/octet-stream

Added: dev/nutch/1.20/apache-nutch-1.20-bin.zip.asc
==
--- dev/nutch/1.20/apache-nutch-1.20-bin.zip.asc (added)
+++ dev/nutch/1.20/apache-nutch-1.20-bin.zip.asc Tue Apr  9 20:44:40 2024
@@ -0,0 +1,16 @@
+-----BEGIN PGP SIGNATURE-----
+
+iQIzBAABCAAdFiEE23tRmRIcCKXI9AUrOkcX8Ei66/YFAmYVpUwACgkQOkcX8Ei6
+6/aj4RAAqeXW9QsddsFuxVu2el37aZhV4HOsGsCX66G/wxz5nj5s34O41IKxTPrv
+SJ0XRoekQ304uGYziAzDtDQUyXfAFo7gpF3w5TgK+5f8Mz8piPiW80uIMZYaUgXV
+kAr6dYlbLPtcbyzspxCBHFZlHPf0MC6YtnaHPFq5B9LBjLl3nE+u1HkCUlHjWm84
+dQqijPyaiFyYGhsuU4/xaAJcgluUNcQlmAcY6125vOtMGKJqHdTVU/rZvJ30Ym0V
+/k92t6+CgU4y8a/JyOToNFRD0f+3aGGNQUXKZIvAenzNIugv5wlubxF/CRht+J5L
+0bU48GcZjboNknKBc8tMewBwhHpAGAL5O5AS92j8naWUrZ1Wkur1y3EL7wiS39xJ
+fI0BRrTNcVapOoUnoQuXtxpoqRjiBmC2sEP9nH9T5dHNZaDljOielB4gi+1SGYYR
+DXiIpe6i/bMjMEO14At3ACwIoXknLo/gPQKUaIGQUTb+rlrFbZWVByZvcO826Az6
+0eEllycEzdvLpn0wv03zJhz9KwzJJCFJ4jgip/LIN5UXFHhUjzWykdJ2HUxHXq3v
+1zjee9o3/K0UqUn07d/rIG3pNdteja4PDo0AmLt2l/B8Pfi0pnZj9LjbL5DIWcNp
+oe41Ew6RFL7hjRZV2HwwBSmYCHNUSoL5HCR9dk10PcQFrH6phW0=
+=nfnj
+-----END PGP SIGNATURE-----

Added: dev/nutch/1.20/apache-nutch-1.20-src.tar.gz
==
Binary file - no diff available.

Propchange: dev/nutch/1.20/apache-nutch-1.20-src.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/nutch/1.20/apache-nutch-1.20-src.tar.gz.asc
==
--- dev/nutch/1.20/apache-nutch-1.20-src.tar.gz.asc (added)
+++ dev/nutch/1.20/apache-nutch-1.20-src.tar.gz.asc Tue Apr  9 20:44:40 2024
@@ -0,0 +1,16 @@
+-----BEGIN PGP SIGNATURE-----
+
+iQIzBAABCAAdFiEE23tRmRIcCKXI9AUrOkcX8Ei66/YFAmYVpPwACgkQOkcX8Ei6
+6/ZUGhAAjocHBJYQynpMuU+Geai8TC2sVBGUt33VuDPG5fHVnq5Y/QiwK3B/AL0u
+DtQdcajwnym3QMYBq0ZzzjOqXtE0B0Awwsz14KQYt+43AMpakLsVXBysZDXOTTcm
+yrSc3IJEYvxlDQg0DA9uU4qpw5AHcEP3gzQ5tqA8X9V0EWejf82+KRjpJmKwJi1j
+hS1rIdY0cCd15Ibo+jCf7PMSWZqYcEUdivy9+h1Zm+hV5mv49TMm4Js+fsNQrFyh
+2dS5EZSvommodgP4hjKCpW7EkNRcl20ZmlVntLNhULTEXDd8CCpweg/7iSNo0hD/
+MWS2YMtY2zf2lnid217YNhSG1a2LprZ3sqmMtEcM0/F8PsOrA1p1klsuTz6+S2FO
+ei89JdVQvOJbh6PdeaNkQqBTnc06seNQLTF+6iLtCPVQ3mojFJhqgnaMWP3W20A+
+ZElNLRe0Jw//5aX19YZilRoxAwA3aAxXSXIeNk9TukiRPOqvevxORDoXy3INosYj
+/8HrSESOXsZyCIyOQzHExYNDQA/SkH8BisxY9aVDDmJyaKTXgWAaraLVn1+/6thX
+zGhT3M349+bSrfR4BiMO7Cg3r0VcMgUkcfIUPfZtpLtOIV9bs+rGrxWlujor1vC6
+eS3hfSjMbQHLR3UuLMFRhWIAiunXAMHqnrRwWK20vOy5LiJo70I=
+=lrhO
+-----END PGP SIGNATURE-----

Added: dev/nutch/1.20/apache-nutch-1.20-src.zip
==
Binary file - no diff available.

Propchange: dev/nutch/1.20/apache-nutch-1.20-src.zip
--
svn:mime-type = application/octet-stream

Added: dev/nutch/1.20/apache-nutch-1.20-src.zip.asc
==
--- dev/nutch/1.20/apache-nutch-1.20-src.zip.asc (added)
+++ 

(nutch) annotated tag release-1.20 updated (a2cb6aa5d -> 6510cb241)

2024-04-09 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to annotated tag release-1.20
in repository https://gitbox.apache.org/repos/asf/nutch.git


*** WARNING: tag release-1.20 was modified! ***

from a2cb6aa5d (commit)
  to 6510cb241 (tag)
 tagging a2cb6aa5d3e90b7249e47323f2fa4cbf2aa9fa27 (commit)
 replaces release-1.13
  by Lewis John McGibbney
  on Tue Apr 9 09:44:29 2024 -0700

- Log -
Apache Nutch 1.20 RC#1 Tag
---


No new revisions were added by this update.

Summary of changes:



(nutch) branch branch-1.20 updated: Prepare Nutch 1.20 release candidate

2024-04-09 Thread lewismc
lewismc pushed a commit to branch branch-1.20
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/branch-1.20 by this push:
 new a2cb6aa5d Prepare Nutch 1.20 release candidate
a2cb6aa5d is described below

commit a2cb6aa5d3e90b7249e47323f2fa4cbf2aa9fa27
Author: Lewis John McGibbney 
AuthorDate: Tue Apr 9 09:23:24 2024 -0700

Prepare Nutch 1.20 release candidate
---
 ivy/mvn.template | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ivy/mvn.template b/ivy/mvn.template
index fafc79f83..43ecfbd6a 100644
--- a/ivy/mvn.template
+++ b/ivy/mvn.template
@@ -45,7 +45,7 @@
 https://github.com/apache/nutch.git
   
 
-  2
+  
  
   maven2 
   https://repo.maven.apache.org/maven2/ 



(nutch) branch branch-1.20 created (now f141a398c)

2024-04-09 Thread lewismc
lewismc pushed a change to branch branch-1.20
in repository https://gitbox.apache.org/repos/asf/nutch.git


  at f141a398c Prepare Nutch 1.20 release candidate

This branch includes the following new commits:

 new f141a398c Prepare Nutch 1.20 release candidate

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.




(nutch) 01/01: Prepare Nutch 1.20 release candidate

2024-04-09 Thread lewismc
lewismc pushed a commit to branch branch-1.20
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit f141a398c1c0c4e2a1861cd2928fff6a58f53b1f
Author: Lewis John McGibbney 
AuthorDate: Tue Apr 9 09:16:40 2024 -0700

Prepare Nutch 1.20 release candidate
---
 .gitignore |   2 +
 CHANGES.md | 157 +
 conf/nutch-default.xml |   2 +-
 default.properties |   4 +-
 src/bin/nutch  |   2 +-
 5 files changed, 163 insertions(+), 4 deletions(-)

diff --git a/.gitignore b/.gitignore
index 8c521aa68..972a7cfcb 100644
--- a/.gitignore
+++ b/.gitignore
@@ -26,3 +26,5 @@ lib/spotbugs-*
 ivy/dependency-check-ant/*
 .gradle*
 ivy/apache-rat-*
+ivy/maven-ant-tasks-*
+pom.xml
diff --git a/CHANGES.md b/CHANGES.md
index adea4478f..0e9a0cf45 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -1,5 +1,162 @@
 # Nutch Change Log
 
+
+Nutch 1.20 Release 09/04/2024 (dd/mm/yyyy)
+Release Report: https://s.apache.org/ovjf3
+
+Sub-task
+
+
+[NUTCH-2596] -   
  Upgrade from org.mortbay.jetty to org.eclipse.jetty
+
+[NUTCH-2852] -   
  Method invokes System.exit(...) 9 bugs
+
+[NUTCH-2972] -   
  Javadoc build fails using JDK 17
+
+[NUTCH-3007] -   
  Fix impossible casts
+
+
+
+Bug
+
+
+[NUTCH-2634] -   
  Some links marked as nofollow are followed anyway.
+
+[NUTCH-2820] -   
  Review sample files used in any23 unit tests
+
+[NUTCH-2924] -   
  Generate maxCount expr evaluated only once
+
+[NUTCH-2937] -   
  parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode
+
+[NUTCH-2973] -   
  Single domain names (eg https://localnet) can't be crawled - filtering fails
+
+[NUTCH-2974] -   
  Ant build fails with Unparseable date on certain platforms
+
+[NUTCH-2979] -   
  Upgrade Commons Text to 1.10.0
+
+[NUTCH-2982] -   
  Generator: parameter for URL normalization not passed forward
+
+[NUTCH-2985] -   
  Disable plugin urlfilter-validator by default
+
+[NUTCH-2992] -   
  Fetcher: always block fetch queues when exceptions threshold is reached
+
+[NUTCH-3000] -   
  protocol-selenium returns only the body, strips off the <head/> element
+
+[NUTCH-3001] -   
  protocol-selenium requires Content-Type header 
+
+[NUTCH-3002] -   
  Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive
+
+[NUTCH-3008] -   
  indexer-elastic: downgrade to ES 7.10.2 to address licensing issues
+
+[NUTCH-3012] -   
  SegmentReader when dumping with option -recode: NPE on unparsed documents
+
+[NUTCH-3027] -   
  Trivial resource leak patch in DomainSuffixes.java
+
+[NUTCH-3035] -   
  Update license and notice file for release of 1.20 
+
+
+
+New Feature
+
+
+[NUTCH-2832] -   
  Create tutorial on sending Nutch logs to Elasticsearch
+
+[NUTCH-2888] -   
  Selenium Protocol: Support for Selenium 4
+
+[NUTCH-2920] -   
  Implement an indexer-opensearch plugin
+
+[NUTCH-2991] -   
  Support HTTP/S Header Authorization for Solr connections
+
+[NUTCH-3029] -   
  Host specific max. and min. intervals in adaptive scheduler
+
+
+
+Improvement
+
+
+[NUTCH-2853] -   
  bin/nutch: remove deprecated commands solrindex, solrdedup, solrclean
+
+[NUTCH-2883] -   
  Provide means to run server as a persistent service in Docker container
+
+[NUTCH-2897] -   
  Do not suppress deprecated API warnings
+
+[NUTCH-2961] -   
  Upgrade dependencies of parsefilter-naivebayes
+
+[NUTCH-2980] -   
  Upgrade Selenium Java to 4.7.2
+
+[NUTCH-2983] -   
  nutch-default.xml improvements
+
+[NUTCH-2990] -   
  HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
+
+[NUTCH-2993] -   
  ScoringDepth plugin to skip depth check based on URL Pattern
+
+[NUTCH-2995] -   
  Upgrade to crawler-commons 1.4
+
+[NUTCH-2996] -   
  Use new SimpleRobotRulesParser API entry point (crawler-commons 1.4)
+
+[NUTCH-2997] -   
  Add Override annotations where applicable
+
+[NUTCH-3004] -   
  Avoid NPE in HttpResponse
+
+[NUTCH-3005] -   
  Upgrade selenium as needed
+
+[NUTCH-3009] -   
  Upgrade to Hadoop 3.3.6
+
+[NUTCH-3010] -   
  Injector: count unique number of injected URLs
+
+[NUTCH-3011] -   
  HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)
+
+[NUTCH-3013] -   
  Employ commons-lang3's StopWatch to simplify timing logic
+
+[NUTCH-3014] -   
  Standardize Job names
+
+[NUTCH-3015] -   
  Add more CI steps to GitHub master-build.yml
+
+[NUTCH-3017] -   
  Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
+
+[NUTCH-3025] -   
  urlfilter-fast to filter based on the length of the URL
+
+[NUTCH-3031] -   
  ProtocolFactory host mapper to support domains
+