This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch branch-1.21
in repository https://gitbox.apache.org/repos/asf/nutch.git
commit 65eb8857d0b7f2b8864c689dbad24bcb80d6d496
Author: Sebastian Nagel <sna...@apache.org>
AuthorDate: Tue Jul 15 17:39:43 2025 +0200

    Nutch 1.21 release

    - update current year in API docs etc.
    - update version number
    - update changes / release notes
---
 CHANGES.md             | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++
 conf/nutch-default.xml |  2 +-
 default.properties     |  4 +--
 src/bin/nutch          |  2 +-
 4 files changed, 76 insertions(+), 4 deletions(-)

diff --git a/CHANGES.md b/CHANGES.md
index ab839cc95..40cfc6093 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -1,5 +1,77 @@
 # Nutch Change Log
 
+## Nutch 1.21 Release 15/07/2025 (dd/mm/yyyy)
+Release Report: https://s.apache.org/bs58y
+
+### Breaking Changes
+
+- LinkDB records now can hold metadata, see [NUTCH-3101](https://issues.apache.org/jira/browse/NUTCH-3101).
+  This requires that existing LinkDBs are created anew starting from the segments. Nutch 1.21 cannot process LinkDBs written with prior Nutch versions.
+
+
+### New Feature
+
+- [NUTCH-2856](https://issues.apache.org/jira/browse/NUTCH-2856) - Implement a protocol-smb plugin based on hierynomus/smbj
+- [NUTCH-3063](https://issues.apache.org/jira/browse/NUTCH-3063) - Support for "addBinaryContent" from REST API
+
+### Sub-task
+
+- [NUTCH-2812](https://issues.apache.org/jira/browse/NUTCH-2812) - Methods returning array may expose internal representation
+
+### Bug
+
+- [NUTCH-3039](https://issues.apache.org/jira/browse/NUTCH-3039) - Failure to handle ftp:// URLs
+- [NUTCH-3044](https://issues.apache.org/jira/browse/NUTCH-3044) - Generator: NPE when extracting the host part of a URL fails
+- [NUTCH-3055](https://issues.apache.org/jira/browse/NUTCH-3055) - README: fix Github "hub" commands
+- [NUTCH-3057](https://issues.apache.org/jira/browse/NUTCH-3057) - Arbitrary indexer "leaks" previous value into a field processed after an exception
+- [NUTCH-3060](https://issues.apache.org/jira/browse/NUTCH-3060) - Javadoc link broken on website
+- [NUTCH-3066](https://issues.apache.org/jira/browse/NUTCH-3066) - Protocol plugin unit tests fail randomly
+- [NUTCH-3067](https://issues.apache.org/jira/browse/NUTCH-3067) - Improve performance of FetchItemQueues if error state is preserved
+- [NUTCH-3072](https://issues.apache.org/jira/browse/NUTCH-3072) - Fetcher to stop QueueFeeder if aborting with "hung threads"
+- [NUTCH-3075](https://issues.apache.org/jira/browse/NUTCH-3075) - tld plugin makes injector crash
+- [NUTCH-3078](https://issues.apache.org/jira/browse/NUTCH-3078) - Database is not unlocked when injector fails
+- [NUTCH-3079](https://issues.apache.org/jira/browse/NUTCH-3079) - Dumping a segment fails unless it has been fetched and parsed
+- [NUTCH-3087](https://issues.apache.org/jira/browse/NUTCH-3087) - Nutch crawling inconsistent on URLs with userinfo
+- [NUTCH-3092](https://issues.apache.org/jira/browse/NUTCH-3092) - Replace all imports of commons-lang by commons-lang3
+- [NUTCH-3093](https://issues.apache.org/jira/browse/NUTCH-3093) - Ant target test-plugins to depend on compile-core-test
+- [NUTCH-3094](https://issues.apache.org/jira/browse/NUTCH-3094) - Github tests to run if build configuration changes
+- [NUTCH-3096](https://issues.apache.org/jira/browse/NUTCH-3096) - HostDB ResolverThread can create too many job counters
+- [NUTCH-3097](https://issues.apache.org/jira/browse/NUTCH-3097) - Plugin indexer-elastic throws ClassNotFoundException due to invalid dependencies
+- [NUTCH-3103](https://issues.apache.org/jira/browse/NUTCH-3103) - Improper fetch interval given as example
+- [NUTCH-3106](https://issues.apache.org/jira/browse/NUTCH-3106) - Issue with SSLHandshakeException in v1.20 using protocol-http plugin and proxy
+- [NUTCH-3108](https://issues.apache.org/jira/browse/NUTCH-3108) - Fix SLF4J Class Loader Conflict in language-identifier
+- [NUTCH-3114](https://issues.apache.org/jira/browse/NUTCH-3114) - Avoid stale fetching when only URLs from queues blocked by the exponential backoff remain
+
+### Improvement
+
+- [NUTCH-1806](https://issues.apache.org/jira/browse/NUTCH-1806) - Delegate processing of URL domains to crawler commons
+- [NUTCH-2157](https://issues.apache.org/jira/browse/NUTCH-2157) - Parent Issue for Addressing Miredot REST API Warnings
+- [NUTCH-2771](https://issues.apache.org/jira/browse/NUTCH-2771) - Tests in nightly builds: speed up long runners
+- [NUTCH-2976](https://issues.apache.org/jira/browse/NUTCH-2976) - SitemapProcessor: verify sitemap values added from sitemap to CrawlDB (priority, modification time and change frequency)
+- [NUTCH-3043](https://issues.apache.org/jira/browse/NUTCH-3043) - Generator: count URLs rejected by URL filters
+- [NUTCH-3058](https://issues.apache.org/jira/browse/NUTCH-3058) - Fetcher: counter for hung threads
+- [NUTCH-3061](https://issues.apache.org/jira/browse/NUTCH-3061) - URL filters to log name of the rule file rules are read from
+- [NUTCH-3062](https://issues.apache.org/jira/browse/NUTCH-3062) - protocol-okhttp: optionally record HTTP and SSL/TLS versions
+- [NUTCH-3065](https://issues.apache.org/jira/browse/NUTCH-3065) - Format changelog as Markdown
+- [NUTCH-3073](https://issues.apache.org/jira/browse/NUTCH-3073) - Address Java compiler warnings
+- [NUTCH-3083](https://issues.apache.org/jira/browse/NUTCH-3083) - Add RobotRulesParser to bin/nutch
+- [NUTCH-3086](https://issues.apache.org/jira/browse/NUTCH-3086) - Consolidate plugin extension names and IDs
+- [NUTCH-3095](https://issues.apache.org/jira/browse/NUTCH-3095) - Update .gitignore to ignore Hadoop native libraries
+- [NUTCH-3100](https://issues.apache.org/jira/browse/NUTCH-3100) - HostDB to support minimum records per host
+- [NUTCH-3101](https://issues.apache.org/jira/browse/NUTCH-3101) - LinkDb's Inlink class to support metadata
+- [NUTCH-3112](https://issues.apache.org/jira/browse/NUTCH-3112) - Utilize parameterized logging
+- [NUTCH-3113](https://issues.apache.org/jira/browse/NUTCH-3113) - Group commands in bin/nutch command-line help
+- [NUTCH-3115](https://issues.apache.org/jira/browse/NUTCH-3115) - Allow POJO in Arbitrary Indexer to access indexing objects in filter constrctor
+- [NUTCH-3116](https://issues.apache.org/jira/browse/NUTCH-3116) - Minor dependency upgrades and update of license list and notice file
+
+### Task
+
+- [NUTCH-1942](https://issues.apache.org/jira/browse/NUTCH-1942) - Remove TopLevelDomain
+- [NUTCH-3041](https://issues.apache.org/jira/browse/NUTCH-3041) - Address confusing logging in o.a.n.net.URLExemptionFilters
+- [NUTCH-3054](https://issues.apache.org/jira/browse/NUTCH-3054) - Address deprecation of Node16 for all GitHub Actions
+- [NUTCH-3084](https://issues.apache.org/jira/browse/NUTCH-3084) - Improve CI by filtering and separating plugin and core test execution
+
+
 ## Nutch 1.20 Release 09/04/2024 (dd/mm/yyyy)
 Release Report: https://s.apache.org/ovjf3
 
diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 1fddade83..f21dff492 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -203,7 +203,7 @@
 
 <property>
   <name>http.agent.version</name>
-  <value>Nutch-1.21-SNAPSHOT</value>
+  <value>Nutch-1.21</value>
   <description>A version string to advertise in the User-Agent header.</description>
 </property>
 
diff --git a/default.properties b/default.properties
index a7036786a..cd178d237 100644
--- a/default.properties
+++ b/default.properties
@@ -14,9 +14,9 @@
 # limitations under the License.
 
 name=apache-nutch
-version=1.21-SNAPSHOT
+version=1.21
 final.name=${name}-${version}
-year=2024
+year=2025
 
 basedir = ./
 src.dir = ./src/java
diff --git a/src/bin/nutch b/src/bin/nutch
index bc602a45b..8570afc3a 100755
--- a/src/bin/nutch
+++ b/src/bin/nutch
@@ -61,7 +61,7 @@ done
 
 # if no args specified, show usage
 if [ $# = 0 ]; then
-  echo "nutch 1.21-SNAPSHOT"
+  echo "nutch 1.21"
   echo "Usage: nutch COMMAND [-Dproperty=value]... [command-specific args]..."
   echo "where COMMAND is one of:"
   echo " (Crawl commands)"