(nutch) branch master updated: NUTCH-3055 README: fix Github "hub" commands - replace "git" with "hub" were necessary - improve formatting of "contributing" steps
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new ca03d9b76 NUTCH-3055 README: fix Github "hub" commands - replace "git" with "hub" were necessary - improve formatting of "contributing" steps ca03d9b76 is described below commit ca03d9b76485b7c9d50dff2c3946bb8189daf5e1 Author: Sebastian Nagel AuthorDate: Tue Apr 30 11:01:45 2024 +0200 NUTCH-3055 README: fix Github "hub" commands - replace "git" with "hub" were necessary - improve formatting of "contributing" steps --- README.md | 23 +++ 1 file changed, 11 insertions(+), 12 deletions(-) diff --git a/README.md b/README.md index 28acfe8c7..f1322aa5e 100644 --- a/README.md +++ b/README.md @@ -22,22 +22,21 @@ Contributing To contribute a patch, follow these instructions (note that installing [Hub](https://hub.github.com/) is not strictly required, but is recommended). -``` 0. Download and install hub.github.com 1. File JIRA issue for your fix at https://issues.apache.org/jira/projects/NUTCH/issues -- you will get issue id NUTCH-xxx where xxx is the issue ID. -2. git clone https://github.com/apache/nutch.git -3. cd nutch -4. git checkout -b NUTCH-xxx + - you will get issue id NUTCH- where is the issue ID. +2. `git clone https://github.com/apache/nutch.git` +3. `cd nutch` +4. `git checkout -b NUTCH-` 5. edit files (please try and include a test case if possible) -6. git status (make sure it shows what files you expected to edit) +6. `git status` (make sure it shows what files you expected to edit) 7. Make sure that your code complies with the [Nutch codeformatting template](https://raw.githubusercontent.com/apache/nutch/master/eclipse-codeformat.xml), which is basially two space indents -8. git add -9. git commit -m “fix for NUTCH-xxx contributed by ” -10. git fork -11. git push -u NUTCH-xxx -12. git pull-request -``` +8. 
`git add ` +9. `git commit -m "fix for NUTCH-xxx contributed by "` +10. `hub fork` (if hub is not installed, you can fork the project using the "fork" button on the [Nutch Github project page](https://github.com/apache/nutch)) +11. `git push -u NUTCH-` +12. `hub pull-request` (if hub is not installed, please follow the instructions how to [create a pull-request from a fork](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request-from-a-fork)) + IDE setup =
(nutch) branch master updated (8abc78a65 -> bfa07df29)
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

    from 8abc78a65 NUTCH-3041 Address confusing logging in o.a.n.net.URLExemptionFilters (#813)
     add 4b263533a NUTCH-3044 Generator: NPE when extracting the host part of a URL fails
     add 4729786e4 NUTCH-3044 Generator: NPE when extracting the host part of a URL fails - add unit test to proof that URLs without a host part do not cause errors
     add b153279ad NUTCH-3044 Generator: NPE when extracting the host part of a URL fails - replace deprecated method call - improve and format Javadoc
     new bfa07df29 Merge pull request #815 from sebastian-nagel/NUTCH-3044-generator-npe

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.

Summary of changes:
 src/java/org/apache/nutch/crawl/Generator.java | 140 ++---
 src/test/org/apache/nutch/crawl/TestGenerator.java | 55 +++-
 2 files changed, 150 insertions(+), 45 deletions(-)
(nutch) 01/01: Merge pull request #815 from sebastian-nagel/NUTCH-3044-generator-npe
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit bfa07df29f7b810365620abff06680eac9bcddf9
Merge: 8abc78a65 b153279ad
Author: Sebastian Nagel
AuthorDate: Tue May 28 13:55:23 2024 +0200

    Merge pull request #815 from sebastian-nagel/NUTCH-3044-generator-npe

    NUTCH-3044 Generator: NPE when extracting the host part of a URL fails

 src/java/org/apache/nutch/crawl/Generator.java | 140 ++---
 src/test/org/apache/nutch/crawl/TestGenerator.java | 55 +++-
 2 files changed, 150 insertions(+), 45 deletions(-)
(nutch) branch master updated: NUTCH-3043 Generator: count URLs rejected by URL filters (#814)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 5f1330a03 NUTCH-3043 Generator: count URLs rejected by URL filters (#814) 5f1330a03 is described below commit 5f1330a03d136440a167a85da6cfe8ac4b3f61b9 Author: Sebastian Nagel AuthorDate: Tue May 14 17:38:25 2024 +0200 NUTCH-3043 Generator: count URLs rejected by URL filters (#814) - add counters URL_FILTERS_REJECTED and URL_FILTER_EXCEPTION - simplify logging statement - remove unnecessary cast - use parameterized logging --- src/java/org/apache/nutch/crawl/Generator.java | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/src/java/org/apache/nutch/crawl/Generator.java b/src/java/org/apache/nutch/crawl/Generator.java index 33f743a37..f57642a65 100644 --- a/src/java/org/apache/nutch/crawl/Generator.java +++ b/src/java/org/apache/nutch/crawl/Generator.java @@ -224,9 +224,12 @@ public class Generator extends NutchTool implements Tool { // If filtering is on don't generate URLs that don't pass // URLFilters try { - if (filters.filter(url.toString()) == null) + if (filters.filter(url.toString()) == null) { +context.getCounter("Generator", "URL_FILTERS_REJECTED").increment(1); return; + } } catch (URLFilterException e) { + context.getCounter("Generator", "URL_FILTER_EXCEPTION").increment(1); LOG.warn("Couldn't filter url: {} ({})", url, e.getMessage()); } } @@ -253,10 +256,7 @@ public class Generator extends NutchTool implements Tool { try { sort = scfilters.generatorSortValue(key, crawlDatum, sort); } catch (ScoringFilterException sfe) { -if (LOG.isWarnEnabled()) { - LOG.warn( - "Couldn't filter generatorSortValue for " + key + ": " + sfe); -} +LOG.warn("Couldn't filter generatorSortValue for {}: {}", key, sfe); } // check expr @@ -625,7 +625,7 @@ public class Generator extends NutchTool implements Tool { // make 
later bytes more significant in hash code, so that sorting // by hashcode correlates less with by-host ordering. for (int i = length - 1; i >= 0; i--) -hash = (31 * hash) + (int) bytes[start + i]; +hash = (31 * hash) + bytes[start + i]; return hash; } }
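The last hunk of the diff above removes an `(int)` cast from the hash loop. A minimal Java sketch of why the cast is redundant, and of how iterating from the last byte to the first makes later bytes more significant, follows. The class name `ReverseHashSketch` and the initial value `1` are assumptions for illustration; the diff does not show how `hash` is initialized in `Generator`.

```java
final class ReverseHashSketch {

  /**
   * Hashes bytes[start .. start+length-1], visiting the last byte first.
   * The byte visited first is multiplied by 31 most often, so later bytes
   * of the input end up more significant in the result.
   */
  static int hash(byte[] bytes, int start, int length) {
    int hash = 1; // assumed initial value, not taken from the diff
    for (int i = length - 1; i >= 0; i--) {
      // a byte operand is promoted to int automatically,
      // so an explicit (int) cast does not change the result
      hash = (31 * hash) + bytes[start + i];
    }
    return hash;
  }

  public static void main(String[] args) {
    System.out.println(hash(new byte[] { 1, 2 }, 0, 2));
  }
}
```

With the assumed initial value, the two-byte input `{1, 2}` is processed as `31 * 1 + 2 = 33`, then `31 * 33 + 1 = 1024`.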
(nutch) branch master updated: NUTCH-3039 Failure to handle ftp:// URLs
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new ea9c7ee5d NUTCH-3039 Failure to handle ftp:// URLs ea9c7ee5d is described below commit ea9c7ee5d6635405b31b4a1d462cca746478b040 Author: Sebastian Nagel AuthorDate: Thu Apr 11 13:28:37 2024 +0200 NUTCH-3039 Failure to handle ftp:// URLs Pass ftp:// URLs to the standard JVM URLStreamHandler --- src/java/org/apache/nutch/plugin/URLStreamHandlerFactory.java | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/src/java/org/apache/nutch/plugin/URLStreamHandlerFactory.java b/src/java/org/apache/nutch/plugin/URLStreamHandlerFactory.java index bd7e377d0..0916f4c9d 100644 --- a/src/java/org/apache/nutch/plugin/URLStreamHandlerFactory.java +++ b/src/java/org/apache/nutch/plugin/URLStreamHandlerFactory.java @@ -72,9 +72,13 @@ public class URLStreamHandlerFactory * Protocols covered by standard JVM URL handlers. These protocols must not be * handled by Nutch plugins, in order to avoid that basic actions (eg. loading * of classes and configuration files) break. + * + * Also the "ftp" protocol is included: it's usually supported by the standard + * JVM URL handler and Nutch does not yet provide a dedicated URL stream + * handler. */ public static final String[] SYSTEM_PROTOCOLS = { // - "http", "https", "file", "jar" }; + "http", "https", "file", "jar", "ftp" }; static { instance = new URLStreamHandlerFactory();
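The idea behind `SYSTEM_PROTOCOLS` can be sketched with the standard `java.net.URLStreamHandlerFactory` interface: when the factory returns `null` for a protocol, the JVM falls back to its built-in handler. This is only an illustration, not the Nutch implementation; the class name `SystemProtocolSketch` and the `pluginHandlers` map are hypothetical stand-ins for the plugin lookup, which the diff does not show.

```java
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class SystemProtocolSketch implements URLStreamHandlerFactory {

  /** Protocols left to the standard JVM handlers, as listed in the diff. */
  static final Set<String> SYSTEM_PROTOCOLS =
      Set.of("http", "https", "file", "jar", "ftp");

  /** Hypothetical registry standing in for handlers provided by plugins. */
  final Map<String, URLStreamHandler> pluginHandlers = new HashMap<>();

  static boolean isSystemProtocol(String protocol) {
    return SYSTEM_PROTOCOLS.contains(protocol);
  }

  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    if (isSystemProtocol(protocol)) {
      // null tells the JVM to use its own built-in handler
      return null;
    }
    // may also be null if no plugin handles the protocol
    return pluginHandlers.get(protocol);
  }
}
```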
(nutch-site) branch asf-site updated: Revert incorrect change in doap.rdf (see #2)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/nutch-site.git The following commit(s) were added to refs/heads/asf-site by this push: new 8456fb5 Revert incorrect change in doap.rdf (see #2) 8456fb5 is described below commit 8456fb597e2dc3147312032298ac24d25a8a5632 Author: Sebastian Nagel AuthorDate: Sat May 11 20:30:51 2024 +0200 Revert incorrect change in doap.rdf (see #2) --- content/doap.rdf | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/doap.rdf b/content/doap.rdf index 186fe9a..0799f8f 100644 --- a/content/doap.rdf +++ b/content/doap.rdf @@ -33,7 +33,7 @@ https://nutch.apache.org/community/mailing-lists/; /> https://www.apache.org/dyn/closer.cgi/nutch/; /> Java -https://projects.apache.org/projects.html?category#web-framework; /> +http://projects.apache.org/category/web-framework; /> Apache Nutch 1.20
(nutch-site) branch asf-staging updated: Revert incorrect change in doap.rdf (see #2)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch asf-staging in repository https://gitbox.apache.org/repos/asf/nutch-site.git The following commit(s) were added to refs/heads/asf-staging by this push: new d7ac03a Revert incorrect change in doap.rdf (see #2) d7ac03a is described below commit d7ac03a033e1db8f161e7dea236d482a2c2460ce Author: Sebastian Nagel AuthorDate: Sat May 11 20:27:43 2024 +0200 Revert incorrect change in doap.rdf (see #2) --- content/doap.rdf | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/doap.rdf b/content/doap.rdf index 186fe9a..0799f8f 100644 --- a/content/doap.rdf +++ b/content/doap.rdf @@ -33,7 +33,7 @@ https://nutch.apache.org/community/mailing-lists/; /> https://www.apache.org/dyn/closer.cgi/nutch/; /> Java -https://projects.apache.org/projects.html?category#web-framework; /> +http://projects.apache.org/category/web-framework; /> Apache Nutch 1.20
(nutch-site) branch main updated: Revert incorrect change (#2)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/nutch-site.git The following commit(s) were added to refs/heads/main by this push: new c011a7e Revert incorrect change (#2) c011a7e is described below commit c011a7eec90ad4ded0ea3a028419f63666da3aa8 Author: Sebb AuthorDate: Sat May 11 19:24:51 2024 +0100 Revert incorrect change (#2) --- content/doap.rdf | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/doap.rdf b/content/doap.rdf index 186fe9a..0799f8f 100644 --- a/content/doap.rdf +++ b/content/doap.rdf @@ -33,7 +33,7 @@ https://nutch.apache.org/community/mailing-lists/; /> https://www.apache.org/dyn/closer.cgi/nutch/; /> Java -https://projects.apache.org/projects.html?category#web-framework; /> +http://projects.apache.org/category/web-framework; /> Apache Nutch 1.20
(nutch) branch master updated: NUTCH-3008 indexer-elastic: downgrade to ES 7.10.2 to address licensing issues
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 367988dfd NUTCH-3008 indexer-elastic: downgrade to ES 7.10.2 to address licensing issues 367988dfd is described below commit 367988dfd63751e05e10c93c4c32bd9f7c47b634 Author: Sebastian Nagel AuthorDate: Wed Mar 13 15:55:55 2024 +0100 NUTCH-3008 indexer-elastic: downgrade to ES 7.10.2 to address licensing issues --- src/plugin/indexer-elastic/howto_upgrade_es.md | 4 +-- src/plugin/indexer-elastic/ivy.xml | 2 +- src/plugin/indexer-elastic/plugin.xml | 37 +++--- .../indexwriter/elastic/ElasticIndexWriter.java| 2 +- 4 files changed, 22 insertions(+), 23 deletions(-) diff --git a/src/plugin/indexer-elastic/howto_upgrade_es.md b/src/plugin/indexer-elastic/howto_upgrade_es.md index b57e0c02f..ca58639d1 100644 --- a/src/plugin/indexer-elastic/howto_upgrade_es.md +++ b/src/plugin/indexer-elastic/howto_upgrade_es.md @@ -37,7 +37,7 @@ (eventually with different versions) - duplicated libs can be added to the exclusions of transitive dependencies in build/plugins/indexer-elastic/ivy.xml - - but it should be made sure that the library versions in ivy/ivy.xml correspend to + - but it should be made sure that the library versions in ivy/ivy.xml correspond to those required by Tika 5. Remove the locally "installed" dependencies in src/plugin/indexer-elastic/lib/: @@ -47,4 +47,4 @@ 6. 
Build Nutch and run all unit tests: $ cd ../../../ -$ ant clean runtime test \ No newline at end of file +$ ant clean runtime test diff --git a/src/plugin/indexer-elastic/ivy.xml b/src/plugin/indexer-elastic/ivy.xml index de59711a2..2a52fc62b 100644 --- a/src/plugin/indexer-elastic/ivy.xml +++ b/src/plugin/indexer-elastic/ivy.xml @@ -36,7 +36,7 @@ - + diff --git a/src/plugin/indexer-elastic/plugin.xml b/src/plugin/indexer-elastic/plugin.xml index fc3723a60..b4f872375 100644 --- a/src/plugin/indexer-elastic/plugin.xml +++ b/src/plugin/indexer-elastic/plugin.xml @@ -22,18 +22,17 @@ - - + - - - - - - - - - + + + + + + + + + @@ -43,10 +42,10 @@ - - + + - + @@ -58,12 +57,12 @@ - + - - - + + + @@ -74,4 +73,4 @@ - \ No newline at end of file + diff --git a/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java b/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java index 290d9dfca..0cb267463 100644 --- a/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java +++ b/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java @@ -149,7 +149,7 @@ public class ElasticIndexWriter implements IndexWriter { .builder( (request, bulkListener) -> client.bulkAsync(request, RequestOptions.DEFAULT, bulkListener), -bulkProcessorListener(), "nutch-indexer-elastic") +bulkProcessorListener()) .setBulkActions(maxBulkDocs) .setBulkSize(new ByteSizeValue(maxBulkLength, ByteSizeUnit.BYTES)) .setConcurrentRequests(1)
(nutch) branch master updated: Update crawl documentation
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 83acd501e Update crawl documentation 83acd501e is described below commit 83acd501e0a873c906fdb542e2c5ee86787a15a2 Author: Jakob Berlin AuthorDate: Thu Dec 14 16:23:11 2023 +0100 Update crawl documentation Show --dedup-group instead of -dedup-group which have lead to misunderstanding output --- src/bin/crawl | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/bin/crawl b/src/bin/crawl index db4221868..409f72799 100755 --- a/src/bin/crawl +++ b/src/bin/crawl @@ -48,7 +48,7 @@ # --time-limit-fetch Number of minutes allocated to the fetching [default: 180] # --num-threadsNumber of threads for fetching / sitemap processing [default: 50] # -# -dedup-groupDeduplication group method [default: none] +# --dedup-groupDeduplication group method [default: none] # function __to_seconds() { @@ -109,7 +109,7 @@ function __print_usage { echo -e " \t\t\t\t\t - never [default]" echo -e " \t\t\t\t\t - always (processing takes place in every iteration)" echo -e " \t\t\t\t\t - once (processing only takes place in the first iteration)" - echo -e " -dedup-group \tDeduplication group method [default: none]" + echo -e " --dedup-group \tDeduplication group method [default: none]" exit 1 }
(nutch) branch master updated (adadc43fb -> 7ad382d95)
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

    from adadc43fb Merge branch 'NUTCH-3017', closes #793
     new d8e66ce87 [NUTCH-3025] urlfilter-fast to filter based on the length of the URL
     new d764e4c16 Added filtering on whole string + documented config in nutch-default + fixed tests
     new 49d85eac7 Merged changes from master; improved Javadoc and exception handling
     new 7ad382d95 Merge pull request #796 from DigitalPebble/NUTCH-3025

The 4 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.

Summary of changes:
 conf/nutch-default.xml | 24
 src/plugin/urlfilter-fast/README.md | 6 ++
 .../apache/nutch/urlfilter/fast/FastURLFilter.java | 65 +-
 .../nutch/urlfilter/fast/TestFastURLFilter.java | 38 -
 4 files changed, 129 insertions(+), 4 deletions(-)
(nutch) 02/02: Merge branch 'NUTCH-3017', closes #793
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit adadc43fb169793c47ab25a0eba99a5f20eda763
Merge: 90849124d ac383fc51
Author: Sebastian Nagel
AuthorDate: Wed Nov 8 13:35:43 2023 +0100

    Merge branch 'NUTCH-3017', closes #793

 conf/nutch-default.xml | 10 ++--
 .../apache/nutch/urlfilter/fast/FastURLFilter.java | 27 +++---
 2 files changed, 32 insertions(+), 5 deletions(-)
(nutch) 01/02: [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 and support gzipped input - use Hadoop-provided compression codecs - update description of property urlfilter.fast.file
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git commit ac383fc5125b6c114a23ef996558ead57e873970 Author: Sebastian Nagel AuthorDate: Wed Nov 8 12:24:24 2023 +0100 [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 and support gzipped input - use Hadoop-provided compression codecs - update description of property urlfilter.fast.file --- conf/nutch-default.xml | 10 -- .../org/apache/nutch/urlfilter/fast/FastURLFilter.java | 14 -- 2 files changed, 16 insertions(+), 8 deletions(-) diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml index d8bf76486..b20afdfe3 100644 --- a/conf/nutch-default.xml +++ b/conf/nutch-default.xml @@ -1872,8 +1872,14 @@ CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this urlfilter.fast.file fast-urlfilter.txt - Name of file on CLASSPATH containing regular expressions - used by urlfilter-fast (FastURLFilter) plugin. + Name of file containing rules and regular expressions + used by urlfilter-fast (FastURLFilter) plugin. If the filename + includes a scheme (for example, hdfs://) it is loaded using the + Hadoop FileSystem implementation supporting that scheme. If the + filename does not contain a scheme, the file is loaded from + CLASSPATH. If indicated by file extension (.gz, .bzip2, .zst), + the file is decompressed while reading using Hadoop-provided + compression codecs. 
diff --git a/src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java b/src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java index 79ad7b6ca..bb4a11b7c 100644 --- a/src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java +++ b/src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java @@ -21,6 +21,8 @@ import com.google.common.collect.Multimap; import org.apache.commons.lang.StringUtils; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; +import org.apache.hadoop.io.compress.CompressionCodec; +import org.apache.hadoop.io.compress.CompressionCodecFactory; import org.apache.hadoop.fs.FileSystem; import org.apache.nutch.net.URLFilter; import org.slf4j.Logger; @@ -35,7 +37,6 @@ import java.io.Reader; import java.net.URL; import java.util.regex.Pattern; import java.util.regex.PatternSyntaxException; -import java.util.zip.GZIPInputStream; /** * Filters URLs based on a file of regular expressions using host/domains @@ -120,7 +121,7 @@ public class FastURLFilter implements URLFilter { try { reloadRules(); } catch (Exception e) { - LOG.error(e.getMessage()); + LOG.error("Failed to load rules: {}", e.getMessage() ); throw new RuntimeException(e.getMessage(), e); } } @@ -193,13 +194,14 @@ public class FastURLFilter implements URLFilter { if (fileRulesPath.toUri().getScheme() != null) { FileSystem fs = fileRulesPath.getFileSystem(conf); is = fs.open(fileRulesPath); -} -else { +} else { is = conf.getConfResourceAsInputStream(fileRules); } -if (fileRules.endsWith(".gz")) { - is = new GZIPInputStream(is); +CompressionCodec codec = new CompressionCodecFactory(conf) +.getCodec(fileRulesPath); +if (codec != null) { + is = codec.createInputStream(is); } reloadRules(new InputStreamReader(is));
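The change above replaces a hard-coded `.gz` check with a codec looked up from the file name, and only wraps the stream when a codec is found. A stdlib-only Java sketch of that pattern follows. It does not use Hadoop: the `Decoder` interface, the `DECODERS` table and the class name `CodecBySuffixSketch` are hypothetical stand-ins for `CompressionCodec` and `CompressionCodecFactory`, and only the formats available in `java.util.zip` are registered.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

final class CodecBySuffixSketch {

  /** Hypothetical stand-in for a codec: wraps a raw stream. */
  interface Decoder {
    InputStream wrap(InputStream in) throws IOException;
  }

  /** File suffix -> decoder; only java.util.zip formats are registered. */
  static final Map<String, Decoder> DECODERS = new LinkedHashMap<>();
  static {
    DECODERS.put(".gz", GZIPInputStream::new);
    DECODERS.put(".deflate", InflaterInputStream::new);
  }

  /** Returns the decoder matching the file suffix, or null if none. */
  static Decoder getDecoder(String fileName) {
    for (Map.Entry<String, Decoder> e : DECODERS.entrySet()) {
      if (fileName.endsWith(e.getKey())) {
        return e.getValue();
      }
    }
    return null;
  }

  /** Reads the data as UTF-8 text, decompressing if the suffix asks for it. */
  static String readAll(String fileName, byte[] data) {
    try {
      InputStream is = new ByteArrayInputStream(data);
      Decoder decoder = getDecoder(fileName);
      if (decoder != null) {
        is = decoder.wrap(is); // same null check as "codec != null"
      }
      return new String(is.readAllBytes(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Helper for the demo: gzip-compresses a string. */
  static byte[] gzip(String text) {
    try {
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
        out.write(text.getBytes(StandardCharsets.UTF_8));
      }
      return bos.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Keeping the suffix table separate from the reading code means a new format only needs one more table entry, which is roughly the benefit the commit gets from delegating to the Hadoop-provided codecs.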
(nutch) branch master updated (90849124d -> adadc43fb)
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

    from 90849124d NUTCH-3020 -- ParseSegment should check for okhttp's truncation flag (#794)
     add d1025fd63 [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
     new ac383fc51 [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 and support gzipped input - use Hadoop-provided compression codecs - update description of property urlfilter.fast.file
     new adadc43fb Merge branch 'NUTCH-3017', closes #793

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.

Summary of changes:
 conf/nutch-default.xml | 10 ++--
 .../apache/nutch/urlfilter/fast/FastURLFilter.java | 27 +++---
 2 files changed, 32 insertions(+), 5 deletions(-)
[nutch] branch master updated: NUTCH-3012 SegmentReader when dumping with option -recode: NPE on unparsed documents - fall back to UTF-8 when stringifying the content of unparsed documents
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new d2c3e96d8 NUTCH-3012 SegmentReader when dumping with option -recode: NPE on unparsed documents - fall back to UTF-8 when stringifying the content of unparsed documents d2c3e96d8 is described below commit d2c3e96d88818d8107f320c49e007329b020e090 Author: Sebastian Nagel AuthorDate: Mon Oct 9 10:21:01 2023 +0200 NUTCH-3012 SegmentReader when dumping with option -recode: NPE on unparsed documents - fall back to UTF-8 when stringifying the content of unparsed documents --- src/java/org/apache/nutch/segment/SegmentReader.java | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/src/java/org/apache/nutch/segment/SegmentReader.java b/src/java/org/apache/nutch/segment/SegmentReader.java index 14546af54..ee5c266fd 100644 --- a/src/java/org/apache/nutch/segment/SegmentReader.java +++ b/src/java/org/apache/nutch/segment/SegmentReader.java @@ -163,13 +163,16 @@ public class SegmentReader extends Configured implements Tool { dump.append("\nRecno:: ").append(recNo++).append("\n"); dump.append("URL:: " + key.toString() + "\n"); Content content = null; - Charset charset = null; + // fall-back encoding for content of unparsed documents + Charset charset = StandardCharsets.UTF_8; for (NutchWritable val : values) { Writable value = val.get(); // unwrap if (value instanceof CrawlDatum) { dump.append("\nCrawlDatum::\n").append(((CrawlDatum) value).toString()); } else if (value instanceof Content) { if (recodeContent) { +// output recoded content later when charset is extracted from HTML +// metadata hold in ParseData content = (Content) value; } else { dump.append("\nContent::\n").append(((Content) value).toString());
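The fix above initializes the charset with UTF-8 instead of `null`, so stringifying content no longer fails when no charset could be extracted. A minimal Java sketch of the same fall-back idea follows. The class name `RecodeFallbackSketch` and the `stringify` method are hypothetical and only illustrate the pattern; they are not part of `SegmentReader`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RecodeFallbackSketch {

  /**
   * Decodes raw content using the detected charset name if there is one,
   * otherwise (or if the name is invalid or unsupported) using UTF-8.
   */
  static String stringify(byte[] content, String detectedCharsetName) {
    // fall-back encoding, e.g. for documents that were never parsed
    Charset charset = StandardCharsets.UTF_8;
    if (detectedCharsetName != null) {
      try {
        charset = Charset.forName(detectedCharsetName);
      } catch (IllegalArgumentException e) {
        // illegal or unsupported charset name: keep the UTF-8 fall-back
      }
    }
    return new String(content, charset);
  }

  public static void main(String[] args) {
    byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
    System.out.println(stringify(utf8, null));
  }
}
```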
[nutch] branch master updated: NUTCH-3011 HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new b081c75d8 NUTCH-3011 HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx) b081c75d8 is described below commit b081c75d87be61e42297c952298b72eb7ff2a6dc Author: Sebastian Nagel AuthorDate: Sun Oct 1 14:08:39 2023 +0200 NUTCH-3011 HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx) --- conf/nutch-default.xml| 11 ++- .../apache/nutch/protocol/http/api/HttpRobotRulesParser.java | 3 ++- 2 files changed, 8 insertions(+), 6 deletions(-) diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml index 18ed56b03..d8bf76486 100644 --- a/conf/nutch-default.xml +++ b/conf/nutch-default.xml @@ -141,8 +141,9 @@ http.robots.503.defer.visits true Temporarily suspend fetching from a host if the - robots.txt response is HTTP 503 or any other 5xx server error. See - also http.robots.503.defer.visits.delay and + robots.txt response is HTTP 503 or any other 5xx server error + and HTTP 429 Too Many Requests. See also + http.robots.503.defer.visits.delay and http.robots.503.defer.visits.retries @@ -150,7 +151,7 @@ http.robots.503.defer.visits.delay 30 Time in milliseconds to suspend crawling a host if the - robots.txt response is HTTP 5xx - see + robots.txt response is HTTP 5xx or 429 Too Many Requests - see http.robots.503.defer.visits. @@ -158,8 +159,8 @@ http.robots.503.defer.visits.retries 3 Number of retries crawling a host if the robots.txt - response is HTTP 5xx - see http.robots.503.defer.visits. After n - retries the host queue is dropped for this segment/cycle. + response is HTTP 5xx or 429 - see http.robots.503.defer.visits. + After n retries the host queue is dropped for this segment/cycle. 
diff --git a/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java b/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java index 8d7263e3e..ec5e77e43 100644 --- a/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java +++ b/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java @@ -229,7 +229,8 @@ public class HttpRobotRulesParser extends RobotRulesParser { else if ((code == 403) && (!allowForbidden)) robotRules = FORBID_ALL_RULES; // use forbid all -else if (code >= 500) { +else if (code >= 500 || code == 429) { + // 5xx server errors or 429 Too Many Requests cacheRule = false; // try again later to fetch robots.txt if (deferVisits503) { // signal fetcher to suspend crawling for this host
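The two branches visible in the hunk above can be sketched as a small classification function. This is a simplified illustration covering only the status codes shown in the diff; the enum `Action`, the class name `RobotsStatusSketch` and the catch-all `OTHER` value are assumptions, and the remaining branches of `HttpRobotRulesParser` are not reproduced.

```java
final class RobotsStatusSketch {

  /** Hypothetical outcome labels for a robots.txt response. */
  enum Action { FORBID_ALL, RETRY_LATER, OTHER }

  static Action classify(int code, boolean allowForbidden) {
    if ((code == 403) && (!allowForbidden)) {
      return Action.FORBID_ALL; // use forbid-all rules
    } else if (code >= 500 || code == 429) {
      // 5xx server errors or 429 Too Many Requests:
      // do not cache the rules, try again later
      return Action.RETRY_LATER;
    }
    return Action.OTHER; // branches not shown in the diff
  }

  public static void main(String[] args) {
    System.out.println(classify(429, false));
  }
}
```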
[nutch] branch master updated: NUTCH-2990 HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309 (#779)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new ecdd19dbd NUTCH-2990 HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309 (#779) ecdd19dbd is described below commit ecdd19dbdd4424bf9b9bce206f23992140ee43fe Author: Sebastian Nagel AuthorDate: Sat Oct 21 15:53:25 2023 +0200 NUTCH-2990 HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309 (#779) - follow multiple redirects when fetching robots.txt - number of followed redirects is configurable by the property http.robots.redirect.max (default: 5) Improvements to RobotRulesParser's robots.txt test utility - bug fix: the passed agent names need to be transferred to the property http.robots.agents earlier, before the protocol plugins are configured - more verbose debug logging --- conf/nutch-default.xml | 10 ++ .../apache/nutch/protocol/RobotRulesParser.java| 32 +++-- .../protocol/http/api/HttpRobotRulesParser.java| 141 - 3 files changed, 143 insertions(+), 40 deletions(-) diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml index 58455b338..18ed56b03 100644 --- a/conf/nutch-default.xml +++ b/conf/nutch-default.xml @@ -163,6 +163,16 @@ + + http.robots.redirect.max + 5 + Maximum number of redirects followed when fetching + a robots.txt file. RFC 9309 specifies that crawlers SHOULD + follow at least five consecutive redirects, even across authorities + (for example, hosts in the case of HTTP). 
+ + + http.agent.description diff --git a/src/java/org/apache/nutch/protocol/RobotRulesParser.java b/src/java/org/apache/nutch/protocol/RobotRulesParser.java index 562c2c694..d73c07506 100644 --- a/src/java/org/apache/nutch/protocol/RobotRulesParser.java +++ b/src/java/org/apache/nutch/protocol/RobotRulesParser.java @@ -98,6 +98,7 @@ public abstract class RobotRulesParser implements Tool { protected Configuration conf; protected Set agentNames; + protected int maxNumRedirects = 5; /** set of host names or IPs to be explicitly excluded from robots.txt checking */ protected Set allowList = new HashSet<>(); @@ -149,6 +150,10 @@ public abstract class RobotRulesParser implements Tool { } } } +LOG.info("Checking robots.txt for the following agent names: {}", agentNames); + +maxNumRedirects = conf.getInt("http.robots.redirect.max", 5); +LOG.info("Following max. {} robots.txt redirects", maxNumRedirects); String[] confAllowList = conf.getStrings("http.robot.rules.allowlist"); if (confAllowList == null) { @@ -294,8 +299,11 @@ public abstract class RobotRulesParser implements Tool { "", "\tlocal file or URL parsed as robots.txt file", "\tIf starts with a protocol specification", - "\t(`http', `https', `ftp' or `file'), robots.txt it is fetched", - "\tusing the specified protocol. 
Otherwise, a local file is assumed.", + "\t(`http', `https', `ftp' or `file'), the URL is parsed, URL path", + "\tand query are removed and the path \"/robots.txt\" is appended.", + "\tThe resulting URL (the canonical robots.txt location) is then", + "\tfetched using the specified protocol.", + "\tIf the URL does not include a protocol, a local file is assumed.", "", "\tlocal file with URLs (one per line), for every URL", "\tthe path part (including the query) is checked whether", @@ -323,6 +331,16 @@ public abstract class RobotRulesParser implements Tool { return -1; } +if (args.length > 2) { + // set agent name from command-line in configuration + // Note: when fetching via protocol this must be done + // before the protocol is configured + String agents = args[2]; + conf.set("http.robots.agents", agents); + conf.set("http.agent.name", agents.split(",")[0]); + setConf(conf); +} + Protocol protocol = null; URL robotsTxtUrl = null; if (args[0].matches("^(?:https?|ftp|file)://?.*")) { @@ -334,6 +352,7 @@ public abstract class RobotRulesParser implements Tool { ProtocolFactory factory = new ProtocolFactory(conf); try { protocol = factory.getProtocol(robotsTxtUrl); +LOG.debug("Using protocol {} to fetch robots.txt", protocol.getClass()); } catch (ProtocolNotFound e) { LOG.error("No protocol found for {}: {}", args[0], StringUtils.stringifyException(e)); @@ -357
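The bounded redirect following described in this commit (property `http.robots.redirect.max`, default 5) can be sketched without any network access. In the hypothetical example below, a `Map` simulates the redirect targets a server would return, and returning `null` once the limit is exceeded is an assumption: the visible part of the diff does not show what `HttpRobotRulesParser` does in that case.

```java
import java.util.Map;

final class RobotsRedirectSketch {

  /**
   * Follows simulated redirects starting at url.
   * Returns the final location, or null if more than
   * maxNumRedirects redirects would have to be followed.
   */
  static String resolve(String url, Map<String, String> redirects,
      int maxNumRedirects) {
    String current = url;
    int followed = 0;
    while (redirects.containsKey(current)) {
      if (followed >= maxNumRedirects) {
        return null; // limit reached (also stops redirect loops)
      }
      current = redirects.get(current);
      followed++;
    }
    return current;
  }

  public static void main(String[] args) {
    Map<String, String> redirects = Map.of(
        "http://example.org/robots.txt", "https://example.org/robots.txt",
        "https://example.org/robots.txt", "https://www.example.org/robots.txt");
    System.out.println(resolve("http://example.org/robots.txt", redirects, 5));
  }
}
```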
[nutch] branch master updated: NUTCH-3009 Upgrade to Hadoop 3.3.6
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new bb68385f9 NUTCH-3009 Upgrade to Hadoop 3.3.6 bb68385f9 is described below commit bb68385f9601b37c61ef5a2baac58740c975bddb Author: Sebastian Nagel AuthorDate: Thu Sep 28 14:53:02 2023 +0200 NUTCH-3009 Upgrade to Hadoop 3.3.6 --- default.properties | 2 +- ivy/ivy.xml| 8 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/default.properties b/default.properties index 17e0bffbb..06f2ed009 100644 --- a/default.properties +++ b/default.properties @@ -44,7 +44,7 @@ test.junit.output.format = plain javadoc.proxy.host=-J-DproxyHost= javadoc.proxy.port=-J-DproxyPort= javadoc.link.java=https://docs.oracle.com/en/java/javase/11/docs/api/ -javadoc.link.hadoop=https://hadoop.apache.org/docs/r3.3.4/api/ +javadoc.link.hadoop=https://hadoop.apache.org/docs/r3.3.6/api/ javadoc.packages=org.apache.nutch.* dist.dir=./dist diff --git a/ivy/ivy.xml b/ivy/ivy.xml index 6f3926244..e5ae3882f 100644 --- a/ivy/ivy.xml +++ b/ivy/ivy.xml @@ -53,19 +53,19 @@ - + - + - + - +
[nutch] branch master updated: NUTCH-3002 Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive - implement class CaseInsensitiveMetadata providing case-insensitive me
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new e96cfc56e NUTCH-3002 Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive - implement class CaseInsensitiveMetadata providing case-insensitive metadata look-ups (but no spell-checking) - use CaseInsensitiveMetadata to hold HTTP header metadata in in the class OkHttpResponse of protocol-okhttp - add unit tests to prove the fix (and also case-insensitive look-ups and spell-checking in protocol-http) e96cfc56e is described below commit e96cfc56ee04c8e7e07e11d4eef521b4674a9ec6 Author: Sebastian Nagel AuthorDate: Tue Sep 19 08:10:14 2023 +0200 NUTCH-3002 Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive - implement class CaseInsensitiveMetadata providing case-insensitive metadata look-ups (but no spell-checking) - use CaseInsensitiveMetadata to hold HTTP header metadata in in the class OkHttpResponse of protocol-okhttp - add unit tests to prove the fix (and also case-insensitive look-ups and spell-checking in protocol-http) --- .../nutch/metadata/CaseInsensitiveMetadata.java| 33 + src/java/org/apache/nutch/metadata/Metadata.java | 4 +- .../nutch/metadata/SpellCheckedMetadata.java | 8 +- .../org/apache/nutch/net/protocols/Response.java | 2 +- .../apache/nutch/protocol/http/TestResponse.java | 152 .../nutch/protocol/okhttp/OkHttpResponse.java | 3 +- .../apache/nutch/protocol/okhttp/TestResponse.java | 154 + 7 files changed, 348 insertions(+), 8 deletions(-) diff --git a/src/java/org/apache/nutch/metadata/CaseInsensitiveMetadata.java b/src/java/org/apache/nutch/metadata/CaseInsensitiveMetadata.java new file mode 100644 index 0..92e848ca2 --- /dev/null +++ b/src/java/org/apache/nutch/metadata/CaseInsensitiveMetadata.java @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache 
Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.nutch.metadata; + +import java.util.TreeMap; + +/** + * A decorator to Metadata that adds for case-insensitive lookup of keys. + */ +public class CaseInsensitiveMetadata extends Metadata { + + /** + * Constructs a new, empty metadata. + */ + public CaseInsensitiveMetadata() { +metadata = new TreeMap<>(String.CASE_INSENSITIVE_ORDER); + } + +} diff --git a/src/java/org/apache/nutch/metadata/Metadata.java b/src/java/org/apache/nutch/metadata/Metadata.java index 5c37911fb..7fa0bb12c 100644 --- a/src/java/org/apache/nutch/metadata/Metadata.java +++ b/src/java/org/apache/nutch/metadata/Metadata.java @@ -36,7 +36,7 @@ public class Metadata implements Writable, CreativeCommons, DublinCore, /** * A map of all metadata attributes. */ - private Map metadata = null; + protected Map metadata = null; /** * Constructs a new, empty metadata. @@ -66,7 +66,7 @@ public class Metadata implements Writable, CreativeCommons, DublinCore, } /** - * Get the value associated to a metadata name. If many values are assiociated + * Get the value associated to a metadata name. If many values are associated * to the specified name, then the first one is returned. 
* * @param name diff --git a/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java b/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java index fdbf1b62c..be161440e 100644 --- a/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java +++ b/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java @@ -25,7 +25,7 @@ import org.apache.commons.lang.StringUtils; /** * A decorator to Metadata that adds spellchecking capabilities to property - * names. Currently used spelling vocabulary contains just the httpheaders from + * names. Currently used spelling vocabulary contains just the HTTP headers from * {@link HttpHeaders} class. * */ @@ -94,7 +94,7 @@ public class SpellCheckedMetadata extends Metadata { /** * Get the normalized name of metadata attribu
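The effect of backing the metadata map with `new TreeMap<>(String.CASE_INSENSITIVE_ORDER)`, as the new CaseInsensitiveMetadata constructor does, can be tried in isolation. The following is a minimal sketch using only the JDK; the class `CaseInsensitiveLookupDemo` is hypothetical, and the value type `String[]` is an assumption made for the illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class, not part of Nutch.
class CaseInsensitiveLookupDemo {

  // A sorted map whose comparator ignores case, so "Content-Type" and
  // "content-type" address the same entry.
  static Map<String, String[]> newHeaderMap() {
    return new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
  }

  public static void main(String[] args) {
    Map<String, String[]> headers = newHeaderMap();
    headers.put("Content-Type", new String[] { "text/html" });

    // Look-ups succeed regardless of the casing used by the caller.
    System.out.println(headers.get("content-type")[0]);
    System.out.println(headers.containsKey("CONTENT-TYPE"));

    // A put under a different casing replaces the value of the existing
    // entry instead of adding a second one.
    headers.put("CONTENT-TYPE", new String[] { "text/plain" });
    System.out.println(headers.size());
  }
}
```

Unlike the spell-checking approach, a comparator-based map does no fuzzy matching: a misspelled header name is still a miss.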
[nutch] branch master updated (a1ab4333e -> a74b57b90)
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git

from a1ab4333e NUTCH-2897 Do not supress deprecated API warnings - deprecate constructor of NutchJob - remove deprocated call to Object.finalize() from Plugin.finalize()
add a74b57b90 NUTCH-2853 bin/nutch: remove deprecated commands solrindex, solrdedup, solrclean

No new revisions were added by this update.

Summary of changes:
src/bin/nutch | 16
1 file changed, 4 insertions(+), 12 deletions(-)
[nutch] branch master updated: NUTCH-2897 Do not supress deprecated API warnings - deprecate constructor of NutchJob - remove deprocated call to Object.finalize() from Plugin.finalize()
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new a1ab4333e NUTCH-2897 Do not supress deprecated API warnings - deprecate constructor of NutchJob - remove deprocated call to Object.finalize() from Plugin.finalize() a1ab4333e is described below commit a1ab4333e0a1a28ac2e0f9c75871f7feeb5f2f81 Author: Sebastian Nagel AuthorDate: Sat Sep 30 11:12:07 2023 +0200 NUTCH-2897 Do not supress deprecated API warnings - deprecate constructor of NutchJob - remove deprocated call to Object.finalize() from Plugin.finalize() --- src/java/org/apache/nutch/plugin/Plugin.java | 2 -- src/java/org/apache/nutch/util/NutchJob.java | 13 - 2 files changed, 12 insertions(+), 3 deletions(-) diff --git a/src/java/org/apache/nutch/plugin/Plugin.java b/src/java/org/apache/nutch/plugin/Plugin.java index b2e717d20..3a0fb2e91 100644 --- a/src/java/org/apache/nutch/plugin/Plugin.java +++ b/src/java/org/apache/nutch/plugin/Plugin.java @@ -90,9 +90,7 @@ public class Plugin { } @Override - @SuppressWarnings("deprecation") protected void finalize() throws Throwable { -super.finalize(); shutDown(); } } diff --git a/src/java/org/apache/nutch/util/NutchJob.java b/src/java/org/apache/nutch/util/NutchJob.java index 478b24f89..068c64fef 100644 --- a/src/java/org/apache/nutch/util/NutchJob.java +++ b/src/java/org/apache/nutch/util/NutchJob.java @@ -35,7 +35,18 @@ public class NutchJob extends Job { private static final String JOB_FAILURE_LOG_FORMAT = "%s job did not succeed, job id: %s, job status: %s, reason: %s"; - @SuppressWarnings("deprecation") + /** + * @deprecated, use instead {@link #getInstance(Configuration)} or + * {@link Job#getInstance(Configuration, String)}. 
+ * + * @param conf + * configuration for the job + * @param jobName + * name of the job + * @throws IOException + * see {@link Job#Job(Configuration, String)} + */ + @Deprecated public NutchJob(Configuration conf, String jobName) throws IOException { super(conf, jobName); if (conf != null) {
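The pattern applied to NutchJob here (keep the public constructor for compatibility, mark it `@Deprecated`, and point callers to a static factory) looks roughly as follows when reduced to a standalone class. This is a sketch under stated assumptions: `ExampleJob` and its members are hypothetical and do not depend on Hadoop or Nutch.

```java
// Hypothetical stand-in for a job class, not part of Nutch or Hadoop.
class ExampleJob {

  private final String name;

  /**
   * @deprecated use {@link #getInstance(String)} instead
   *
   * @param name
   *          name of the job
   */
  @Deprecated
  public ExampleJob(String name) {
    this.name = name;
  }

  /** Preferred entry point: callers no longer invoke the constructor. */
  public static ExampleJob getInstance(String name) {
    return new ExampleJob(name);
  }

  public String getName() {
    return name;
  }

  public static void main(String[] args) {
    System.out.println(ExampleJob.getInstance("inject").getName());
  }
}
```

Declaring the deprecation on the constructor itself lets the compiler warn callers, which is the alternative to suppressing the warning inside the class.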
[nutch] branch master updated: NUTCH-3010 Injector: count unique number of injected URLs - add counter urls_injected_unique - improve log messages reporting the counts of injected/merged URLs
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 810b1d6ad NUTCH-3010 Injector: count unique number of injected URLs - add counter urls_injected_unique - improve log messages reporting the counts of injected/merged URLs 810b1d6ad is described below commit 810b1d6ad50fa9021469b4ca5e1db9050a3263c5 Author: Sebastian Nagel AuthorDate: Sat Sep 30 08:09:18 2023 +0200 NUTCH-3010 Injector: count unique number of injected URLs - add counter urls_injected_unique - improve log messages reporting the counts of injected/merged URLs --- src/java/org/apache/nutch/crawl/Injector.java | 31 --- 1 file changed, 18 insertions(+), 13 deletions(-) diff --git a/src/java/org/apache/nutch/crawl/Injector.java b/src/java/org/apache/nutch/crawl/Injector.java index b93e8ca76..9fca719f6 100644 --- a/src/java/org/apache/nutch/crawl/Injector.java +++ b/src/java/org/apache/nutch/crawl/Injector.java @@ -341,8 +341,11 @@ public class Injector extends NutchTool implements Tool { ? 
injected.getFetchInterval() : old.getFetchInterval()); } } - if (injectedSet && oldSet) { -context.getCounter("injector", "urls_merged").increment(1); + if (injectedSet) { +context.getCounter("injector", "urls_injected_unique").increment(1); +if (oldSet) { + context.getCounter("injector", "urls_merged").increment(1); +} } context.write(key, result); } @@ -448,22 +451,24 @@ public class Injector extends NutchTool implements Tool { if (LOG.isInfoEnabled()) { long urlsInjected = job.getCounters() .findCounter("injector", "urls_injected").getValue(); +long urlsInjectedUniq = job.getCounters() +.findCounter("injector", "urls_injected_unique").getValue(); long urlsFiltered = job.getCounters() .findCounter("injector", "urls_filtered").getValue(); long urlsMerged = job.getCounters() .findCounter("injector", "urls_merged").getValue(); -long urlsPurged404= job.getCounters() +long urlsPurged404 = job.getCounters() .findCounter("injector", "urls_purged_404").getValue(); -long urlsPurgedFilter= job.getCounters() +long urlsPurgedFilter = job.getCounters() .findCounter("injector", "urls_purged_filter").getValue(); -LOG.info("Injector: Total urls rejected by filters: " + urlsFiltered); +LOG.info("Injector: Total urls rejected by filters: {}", urlsFiltered); LOG.info( -"Injector: Total urls injected after normalization and filtering: " -+ urlsInjected); -LOG.info("Injector: Total urls injected but already in CrawlDb: " -+ urlsMerged); -LOG.info("Injector: Total new urls injected: " -+ (urlsInjected - urlsMerged)); +"Injector: Total urls injected after normalization and filtering: {} (unique URLs: {})", +urlsInjected, urlsInjectedUniq); +LOG.info("Injector: Total urls injected but already in CrawlDb: {}", +urlsMerged); +LOG.info("Injector: Total new urls injected: {}", +(urlsInjectedUniq - urlsMerged)); if (filterNormalizeAll) { LOG.info("Injector: Total urls removed from CrawlDb by filters: {}", urlsPurgedFilter); @@ -475,8 +480,8 @@ public class Injector extends NutchTool 
implements Tool { } long end = System.currentTimeMillis(); -LOG.info("Injector: finished at " + sdf.format(end) + ", elapsed: " -+ TimingUtil.elapsedTime(start, end)); +LOG.info("Injector: finished at {}, elapsed: {}", sdf.format(end), +TimingUtil.elapsedTime(start, end)); } } catch (IOException | InterruptedException | ClassNotFoundException | NullPointerException e) { LOG.error("Injector job failed: {}", e.getMessage());
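Why the number of new URLs is now computed from `urls_injected_unique` rather than `urls_injected` can be seen with a small simulation. The sketch below is not the Injector code: `InjectorCounterSketch` is a hypothetical class that mimics the counters on an in-memory seed list, assuming `urls_injected` counts every seed record and the other two counters are incremented once per unique URL.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical simulation of the Injector counters, not part of Nutch.
class InjectorCounterSketch {

  static Map<String, Long> count(List<String> seeds, Set<String> crawlDb) {
    Map<String, Long> counters = new LinkedHashMap<>();
    // every seed record counts, duplicates included
    counters.put("urls_injected", (long) seeds.size());
    long unique = 0;
    long merged = 0;
    // one iteration per distinct URL, like one reduce call per key
    for (String url : new HashSet<>(seeds)) {
      unique++;
      if (crawlDb.contains(url)) {
        merged++; // injected but already in the CrawlDb
      }
    }
    counters.put("urls_injected_unique", unique);
    counters.put("urls_merged", merged);
    return counters;
  }

  public static void main(String[] args) {
    List<String> seeds = Arrays.asList("u1", "u1", "u2", "u3");
    Set<String> crawlDb = new HashSet<>(Arrays.asList("u3", "u4"));
    Map<String, Long> c = count(seeds, crawlDb);
    long newUrls = c.get("urls_injected_unique") - c.get("urls_merged");
    long oldFormula = c.get("urls_injected") - c.get("urls_merged");
    System.out.println(c);
    System.out.println("new URLs: " + newUrls + " (old formula: " + oldFormula + ")");
  }
}
```

In this example the seed list contains one duplicate, so the old formula reports three new URLs where only two (u1 and u2) are actually new.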
[nutch] branch master updated (417b87732 -> a72a53a32)
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git

from 417b87732 NUTCH-2852 SpotBugs: Method invokes System.exit(...) - remove all calls of System.exit(...) in methods except main(args) of various "checker" tools
add a72a53a32 NUTCH-3007 Fix impossible casts - remove code blocks (else clauses) unneeded and containing impossible casts

No new revisions were added by this update.

Summary of changes:
src/java/org/apache/nutch/fetcher/Fetcher.java | 13 ++---
src/java/org/apache/nutch/parse/ParseSegment.java | 13 ++---
2 files changed, 4 insertions(+), 22 deletions(-)
[nutch] branch master updated: NUTCH-2852 SpotBugs: Method invokes System.exit(...) - remove all calls of System.exit(...) in methods except main(args) of various "checker" tools
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 417b87732 NUTCH-2852 SpotBugs: Method invokes System.exit(...) - remove all calls of System.exit(...) in methods except main(args) of various "checker" tools 417b87732 is described below commit 417b8773231136eb48957f743c2bc3c21f624d4e Author: Sebastian Nagel AuthorDate: Thu Sep 28 12:05:50 2023 +0200 NUTCH-2852 SpotBugs: Method invokes System.exit(...) - remove all calls of System.exit(...) in methods except main(args) of various "checker" tools --- src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java | 4 ++-- src/java/org/apache/nutch/net/URLFilterChecker.java | 4 ++-- src/java/org/apache/nutch/net/URLNormalizerChecker.java | 4 ++-- src/java/org/apache/nutch/parse/ParserChecker.java| 4 ++-- src/java/org/apache/nutch/util/AbstractChecker.java | 9 - 5 files changed, 12 insertions(+), 13 deletions(-) diff --git a/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java b/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java index 3aa7a05cb..1931c360d 100644 --- a/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java +++ b/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java @@ -93,7 +93,7 @@ public class IndexingFiltersChecker extends AbstractChecker { // Print help when no args given if (args.length < 1) { System.err.println(usage); - System.exit(-1); + return -1; } // read property "doIndex" for back-ward compatibility @@ -126,7 +126,7 @@ public class IndexingFiltersChecker extends AbstractChecker { } else if (i != args.length - 1) { System.err.println("ERR: Not a recognized argument: " + args[i]); System.err.println(usage); -System.exit(-1); +return -1; } else { url = args[i]; } diff --git a/src/java/org/apache/nutch/net/URLFilterChecker.java 
b/src/java/org/apache/nutch/net/URLFilterChecker.java index 7916cc579..821f2e926 100644 --- a/src/java/org/apache/nutch/net/URLFilterChecker.java +++ b/src/java/org/apache/nutch/net/URLFilterChecker.java @@ -41,7 +41,7 @@ public class URLFilterChecker extends AbstractChecker { // Print help when no args given if (args.length < 1) { System.err.println(usage); - System.exit(-1); + return -1; } int numConsumed; @@ -53,7 +53,7 @@ public class URLFilterChecker extends AbstractChecker { } else { System.err.println("ERROR: Not a recognized argument: " + args[i]); System.err.println(usage); -System.exit(-1); +return -1; } } diff --git a/src/java/org/apache/nutch/net/URLNormalizerChecker.java b/src/java/org/apache/nutch/net/URLNormalizerChecker.java index 586c7b246..46fdd38cf 100644 --- a/src/java/org/apache/nutch/net/URLNormalizerChecker.java +++ b/src/java/org/apache/nutch/net/URLNormalizerChecker.java @@ -44,7 +44,7 @@ public class URLNormalizerChecker extends AbstractChecker { // Print help when no args given if (args.length < 1) { System.err.println(usage); - System.exit(-1); + return -1; } int numConsumed; @@ -58,7 +58,7 @@ public class URLNormalizerChecker extends AbstractChecker { } else { System.err.println("ERROR: Not a recognized argument: " + args[i]); System.err.println(usage); -System.exit(-1); +return -1; } } diff --git a/src/java/org/apache/nutch/parse/ParserChecker.java b/src/java/org/apache/nutch/parse/ParserChecker.java index 1533ab57c..10eec4b24 100644 --- a/src/java/org/apache/nutch/parse/ParserChecker.java +++ b/src/java/org/apache/nutch/parse/ParserChecker.java @@ -104,7 +104,7 @@ public class ParserChecker extends AbstractChecker { // Print help when no args given if (args.length < 1) { System.err.println(usage); - System.exit(-1); + return -1; } // initialize plugins early to register URL stream handlers to support @@ -138,7 +138,7 @@ public class ParserChecker extends AbstractChecker { } else if (i != args.length - 1) { System.err.println("ERR: Not 
a recognized argument: " + args[i]); System.err.println(usage); -System.exit(-1); +return -1; } else { url = args[i]; } diff --git a/src/java/org/apache/nutch/util/AbstractChecker.java b/src/java/org/apache/nutch/util/AbstractChecker.java index 3116ede14..137481225 100644 --- a/src/java/org/apache/nutch/util/AbstractChecker.java +++ b/src/java/org/apache/nutch/util/AbstractChecker.java @@ -72,8 +72,7 @@ public abstract class AbstractChecker extends Configured imp
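The change in the checker tools follows a common convention: `run(args)` reports failure through its return value, and only `main(args)` turns that value into a process exit code. A reduced sketch, assuming nothing from Hadoop or Nutch (the class `CheckerSketch` and its usage string are hypothetical):

```java
// Hypothetical checker, not part of Nutch: shows run() returning a status
// instead of terminating the JVM.
class CheckerSketch {

  static final String USAGE = "Usage: CheckerSketch [-verbose] <url>";

  int run(String[] args) {
    // Print help when no args given
    if (args.length < 1) {
      System.err.println(USAGE);
      return -1; // previously: System.exit(-1)
    }
    for (int i = 0; i < args.length - 1; i++) {
      if (!args[i].equals("-verbose")) {
        System.err.println("ERROR: Not a recognized argument: " + args[i]);
        System.err.println(USAGE);
        return -1;
      }
    }
    System.out.println("checking " + args[args.length - 1]);
    return 0;
  }

  // Only an entry point such as
  //   public static void main(String[] args) {
  //     System.exit(new CheckerSketch().run(args));
  //   }
  // should translate the status into a JVM exit.
}
```

Returning a status keeps the method usable from unit tests and from other tools running in the same JVM, which a call to `System.exit(...)` would terminate.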
[nutch] branch master updated: NUTCH-2997 Add Override annotations
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 0fae6b59f NUTCH-2997 Add Override annotations 0fae6b59f is described below commit 0fae6b59fd85f2ec894a28089c1d086b2604660a Author: Sebastian Nagel AuthorDate: Mon Aug 14 16:08:58 2023 +0200 NUTCH-2997 Add Override annotations --- src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java | 8 src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java | 1 + src/java/org/apache/nutch/crawl/CrawlDatum.java | 8 src/java/org/apache/nutch/crawl/CrawlDbReducer.java | 1 + src/java/org/apache/nutch/crawl/Generator.java | 5 + src/java/org/apache/nutch/crawl/Inlink.java | 5 + src/java/org/apache/nutch/crawl/Inlinks.java | 3 +++ src/java/org/apache/nutch/crawl/LinkDbReader.java| 1 + src/java/org/apache/nutch/crawl/MD5Signature.java| 1 + src/java/org/apache/nutch/crawl/MimeAdaptiveFetchSchedule.java | 1 + src/java/org/apache/nutch/crawl/Signature.java | 2 ++ src/java/org/apache/nutch/crawl/SignatureComparator.java | 1 + src/java/org/apache/nutch/crawl/TextMD5Signature.java| 1 + src/java/org/apache/nutch/crawl/TextProfileSignature.java| 3 +++ src/java/org/apache/nutch/crawl/URLPartitioner.java | 1 + src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java | 2 ++ src/java/org/apache/nutch/fetcher/FetcherThread.java | 1 + src/java/org/apache/nutch/fetcher/QueueFeeder.java | 1 + src/java/org/apache/nutch/hostdb/ResolverThread.java | 1 + src/java/org/apache/nutch/indexer/IndexerOutputFormat.java | 2 ++ src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java| 1 + src/java/org/apache/nutch/indexer/NutchDocument.java | 4 src/java/org/apache/nutch/indexer/NutchIndexAction.java | 2 ++ src/java/org/apache/nutch/metadata/MetaWrapper.java | 2 ++ src/java/org/apache/nutch/metadata/Metadata.java | 3 +++ 
src/java/org/apache/nutch/net/URLFilterChecker.java | 1 + src/java/org/apache/nutch/net/URLNormalizerChecker.java | 1 + src/java/org/apache/nutch/parse/HTMLMetaTags.java| 1 + src/java/org/apache/nutch/parse/Outlink.java | 4 src/java/org/apache/nutch/parse/ParseData.java | 4 src/java/org/apache/nutch/parse/ParseImpl.java | 5 + src/java/org/apache/nutch/parse/ParseOutputFormat.java | 3 +++ src/java/org/apache/nutch/parse/ParseResult.java | 1 + src/java/org/apache/nutch/parse/ParseStatus.java | 7 +++ src/java/org/apache/nutch/parse/ParseText.java | 2 ++ src/java/org/apache/nutch/parse/ParserChecker.java | 1 + src/java/org/apache/nutch/plugin/Extension.java | 1 + src/java/org/apache/nutch/plugin/Plugin.java | 1 + src/java/org/apache/nutch/plugin/PluginClassLoader.java | 3 +++ src/java/org/apache/nutch/plugin/PluginRepository.java | 2 ++ src/java/org/apache/nutch/protocol/Content.java | 4 src/java/org/apache/nutch/protocol/ProtocolStatus.java | 4 src/java/org/apache/nutch/scoring/ScoringFilters.java| 9 + src/java/org/apache/nutch/scoring/webgraph/LinkDatum.java| 3 +++ src/java/org/apache/nutch/scoring/webgraph/LinkDumper.java | 4 src/java/org/apache/nutch/scoring/webgraph/Node.java | 3 +++ src/java/org/apache/nutch/segment/ContentAsTextInputFormat.java | 6 ++ src/java/org/apache/nutch/segment/SegmentMerger.java | 1 + src/java/org/apache/nutch/segment/SegmentPart.java | 1 + src/java/org/apache/nutch/segment/SegmentReader.java | 6 ++ src/java/org/apache/nutch/service/impl/ConfManagerImpl.java | 6 ++ src/java/org/apache/nutch/service/impl/SeedManagerImpl.java | 4 src/java/org/apache/nutch/service/resources/AdminResource.java | 1 + src/java/org/apache/nutch/tools/AbstractCommonCrawlFormat.java | 4 src/java/org/apache/nutch/tools/CommonCrawlFormat.java | 1 + src/java/org/apache/nutch/tools/CommonCrawlFormatSimple.java | 9 ++--- src/java/org/apache/nutch/tools/CommonCrawlFormatWARC.java | 1 + src/java/org/apache/nutch/tools/DmozParser.java | 2 ++ 
src/java/org/apache/nutch/tools/ResolveUrls.java | 1 + src/java/org/apache/nutch/tools/arc/ArcInputFormat.java | 1 + src/java/org/apache/nutch/tools/arc/ArcRecordReader.java | 6 ++ src
[nutch] branch master updated: NUTCH-2996 Use new SimpleRobotRulesParser API entry point crawler-commons 1.4
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 070c115cf NUTCH-2996 Use new SimpleRobotRulesParser API entry point crawler-commons 1.4 070c115cf is described below commit 070c115cfadbc937a8ad0add6447461983e92028 Author: Sebastian Nagel AuthorDate: Tue Aug 22 11:39:22 2023 +0200 NUTCH-2996 Use new SimpleRobotRulesParser API entry point crawler-commons 1.4 - split and lowercase agent names (if multiple) at configuration time and pass as collection to SimpleRobotRulesParser - update RobotRulesParser command-line help - update unit tests to use new API - update description of Nutch properties to reflect the changes due to the usage of the new API entry point and the upgrade to crawler-commons 1.4 --- conf/nutch-default.xml | 34 + .../apache/nutch/protocol/RobotRulesParser.java| 71 +- .../protocol/http/api/TestRobotRulesParser.java| 87 -- 3 files changed, 135 insertions(+), 57 deletions(-) diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml index 379b5ef5d..e98bd5570 100644 --- a/conf/nutch-default.xml +++ b/conf/nutch-default.xml @@ -72,9 +72,18 @@ http.agent.name - HTTP 'User-Agent' request header. MUST NOT be empty - + 'User-Agent' name: a single word uniquely identifying your crawler. + + The value is used to select the group of robots.txt rules addressing your + crawler. It is also sent as part of the HTTP 'User-Agent' request header. + + This property MUST NOT be empty - please set this to a single word uniquely related to your organization. + Following RFC 9309 the 'User-Agent' name (aka. 'product token') + MUST contain only uppercase and lowercase letters ('a-z' and + 'A-Z'), underscores ('_'), and hyphens ('-'). + NOTE: You should also check other related properties: http.robots.agents @@ -84,7 +93,6 @@ http.agent.version and set their values appropriately. 
- @@ -95,13 +103,13 @@ parser would look for in robots.txt. Multiple agents can be provided using comma as a delimiter. eg. mybot,foo-spider,bar-crawler - The ordering of agents does NOT matter and the robots parser would make - decision based on the agent which matches first to the robots rules. - Also, there is NO need to add a wildcard (ie. "*") to this string as the - robots parser would smartly take care of a no-match situation. + The ordering of agents does NOT matter and the robots.txt parser combines + all rules to any of the agent names. Also, there is NO need to add + a wildcard (ie. "*") to this string as the robots parser would smartly + take care of a no-match situation. If no value is specified, by default HTTP agent (ie. 'http.agent.name') - would be used for user agent matching by the robots parser. + is used for user-agent matching by the robots parser. @@ -166,9 +174,9 @@ http.agent.url - A URL to advertise in the User-Agent header. This will + A URL to advertise in the User-Agent header. This will appear in parenthesis after the agent name. Custom dictates that this - should be a URL of a page explaining the purpose and behavior of this + should be a URL to a page that explains the purpose and behavior of this crawler. @@ -176,9 +184,9 @@ http.agent.email - An email address to advertise in the HTTP 'From' request - header and User-Agent header. A good practice is to mangle this - address (e.g. 'info at example dot com') to avoid spamming. + An email address to advertise in the HTTP 'User-Agent' (and + 'From') request headers. A good practice is to mangle this address + (e.g. 'info at example dot com') to avoid spamming. @@ -202,7 +210,7 @@ http.agent.rotate.file agents.txt -File containing alternative user agent names to be used instead of +File containing alternative user-agent names to be used instead of http.agent.name on a rotating basis if http.agent.rotate is true. 
Each line of the file should contain exactly one agent specification including name, version, description, URL, etc. diff --git a/src/java/org/apache/nutch/protocol/RobotRulesParser.java b/src/java/org/apache/nutch/protocol/RobotRulesParser.java index 1493bc292..562c2c694 100644 --- a/src/java/org/apache/nutch/protocol/RobotRulesParser.java +++ b/src/java/org/apache/nutch/protocol/RobotRulesParser.java @@ -24,12 +24,13 @@ import java.io.LineNumberReader; import java.lang.invoke.MethodHandles; import java.net.MalformedURLException; import java.net.URL; +import java.util.Collection; import java.util.HashSet; import java.util.Hashtable; +import java.util.LinkedHashSet; import java.util.LinkedList; import java.util.List; import ja
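The commit message mentions splitting and lowercasing the configured agent names once, at configuration time, and passing them on as a collection. A minimal sketch of that preprocessing, together with the RFC 9309 character restriction quoted in the property description, might look like this. It is an assumption-laden illustration, not the RobotRulesParser code: the class `AgentNames` and both method names are hypothetical.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
class AgentNames {

  // Letters, underscores and hyphens only, as stated for the product token.
  private static final Pattern PRODUCT_TOKEN = Pattern.compile("[A-Za-z_-]+");

  /** Splits a comma-separated list, trims, lowercases and deduplicates. */
  static Collection<String> parse(String configured) {
    Collection<String> names = new LinkedHashSet<>();
    if (configured == null) {
      return names;
    }
    for (String name : configured.split(",")) {
      String normalized = name.trim().toLowerCase(Locale.ROOT);
      if (!normalized.isEmpty()) {
        names.add(normalized);
      }
    }
    return names;
  }

  static boolean isProductToken(String name) {
    return PRODUCT_TOKEN.matcher(name).matches();
  }

  public static void main(String[] args) {
    System.out.println(parse("mybot, Foo-Spider,bar-crawler,MyBot"));
    System.out.println(isProductToken("foo-spider"));
    System.out.println(isProductToken("my bot/1.0"));
  }
}
```

Doing the normalization once avoids repeating the split and lowercase steps for every robots.txt file that is parsed.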
[nutch] branch master updated: NUTCH-2995 Upgrade to crawler-commons 1.4
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new a24ec5c5b NUTCH-2995 Upgrade to crawler-commons 1.4 a24ec5c5b is described below commit a24ec5c5b761476897c7fff0bfd3d5107995fedc Author: Sebastian Nagel AuthorDate: Tue Aug 22 10:36:45 2023 +0200 NUTCH-2995 Upgrade to crawler-commons 1.4 - upgrade to crawler-commons from 1.3 to 1.4 - update Javadoc and improve code formatting of robots.txt unit tests - fix robots.txt unit tests to reflect changes in crawler-commons due to RFC 9309 compliance and merging of rule groups (see https://www.rfc-editor.org/rfc/rfc9309.html#section-2.2.1) - mark unit tests for deprecated API endpoints as deprecated --- ivy/ivy.xml| 2 +- .../protocol/http/api/TestRobotRulesParser.java| 102 +++-- 2 files changed, 74 insertions(+), 30 deletions(-) diff --git a/ivy/ivy.xml b/ivy/ivy.xml index 269f521c8..18a6df230 100644 --- a/ivy/ivy.xml +++ b/ivy/ivy.xml @@ -65,7 +65,7 @@ - + diff --git a/src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java b/src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java index 93bb51b22..265abf934 100644 --- a/src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java +++ b/src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java @@ -22,32 +22,37 @@ import org.junit.Test; import crawlercommons.robots.BaseRobotRules; /** - * JUnit test case which tests 1. that robots filtering is performed correctly - * as per the agent name 2. 
that crawl delay is extracted correctly from the - * robots file - * + * JUnit test case which tests + * + * that robots filtering is performed correctly as per the agent name + * that crawl delay is extracted correctly from the robots.txt file + * */ public class TestRobotRulesParser { private static final String CONTENT_TYPE = "text/plain"; - private static final String SINGLE_AGENT = "Agent1"; - private static final String MULTIPLE_AGENTS = "Agent2, Agent1"; + private static final String SINGLE_AGENT1 = "Agent1"; + private static final String SINGLE_AGENT2 = "Agent2"; + private static final String MULTIPLE_AGENTS = "Agent2, Agent1"; // rules are merged for both agents private static final String UNKNOWN_AGENT = "AgentABC"; private static final String CR = "\r"; - private static final String ROBOTS_STRING = "User-Agent: Agent1 #foo" + CR - + "Disallow: /a" + CR + "Disallow: /b/a" + CR + "#Disallow: /c" - + CR - + "Crawl-delay: 10" - + CR // set crawl delay for Agent1 as 10 sec - + "" + CR + "" + CR + "User-Agent: Agent2" + CR + "Disallow: /a/bloh" - + CR + "Disallow: /c" + CR + "Disallow: /foo" + CR + "Crawl-delay: 20" - + CR + "" + CR + "User-Agent: *" + CR + "Disallow: /foo/bar/" + CR; // no - // crawl - // delay - // for - // other - // agents + private static final String ROBOTS_STRING = // + "User-Agent: Agent1 #foo" + CR // + + "Disallow: /a" + CR // + + "Disallow: /b/a" + CR // + + "#Disallow: /c" + CR // + + "Crawl-delay: 10" + CR // set crawl delay for Agent1 as 10 seconds + + "" + CR // + + "" + CR // + + "User-Agent: Agent2" + CR // + + "Disallow: /a/bloh" + CR // + + "Disallow: /c" + CR // + + "Disallow: /foo" + CR // + + "Crawl-delay: 20" + CR // Agent2: 20 seconds + + "" + CR // + + "User-Agent: *" + CR // + + "Disallow: /foo/bar/" + CR; // no crawl delay for other agents private static final String[] TEST_PATHS = new String[] { "http://example.com/a", "http://example.com/a/bloh/foo.html", @@ -55,7 +60,8 @@ public class TestRobotRulesParser {
"http://example.com/b/a/index.html", "http://example.com/foo/bar/baz.html
[nutch] branch master updated: NUTCH-2993 ScoringDepth plugin to skip depth check based on URL Pattern - apply patch contributed by Markus Jelsma
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new eae3c52a8 NUTCH-2993 ScoringDepth plugin to skip depth check based on URL Pattern - apply patch contributed by Markus Jelsma eae3c52a8 is described below commit eae3c52a8140344dff46c448664a2467d631cefc Author: Sebastian Nagel AuthorDate: Thu Jul 20 13:44:26 2023 +0200 NUTCH-2993 ScoringDepth plugin to skip depth check based on URL Pattern - apply patch contributed by Markus Jelsma --- conf/nutch-default.xml | 16 ++ .../nutch/scoring/depth/DepthScoringFilter.java| 25 ++ 2 files changed, 41 insertions(+) diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml index 273cfccc5..379b5ef5d 100644 --- a/conf/nutch-default.xml +++ b/conf/nutch-default.xml @@ -1918,6 +1918,22 @@ CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this + + scoring.depth.override.pattern + + URLs matching this pattern pass a different max depth value + to their outlinks configured in scoring.depth.max.override. + + + + + scoring.depth.max.override + + This max depth value is passed to outlinks matching the pattern + configured in scoring.depth.override.pattern. + + +
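How the two new properties interact can be sketched as follows: when the URL of a page matches `scoring.depth.override.pattern`, its outlinks receive the value of `scoring.depth.max.override` as their maximum depth instead of the regular one. The code below is a hedged illustration of that description only. The class `DepthOverrideSketch` is hypothetical, and whether the plugin matches with `find()` or requires a full match is an assumption not confirmed by the commit.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the DepthScoringFilter code.
class DepthOverrideSketch {

  /**
   * Chooses the max depth value passed to the outlinks of a page.
   *
   * @param url             URL of the page the outlinks were found on
   * @param defaultMaxDepth regular max depth
   * @param overridePattern pattern from scoring.depth.override.pattern,
   *                        or null if not configured
   * @param overrideDepth   value from scoring.depth.max.override
   */
  static int maxDepthForOutlinks(String url, int defaultMaxDepth,
      Pattern overridePattern, int overrideDepth) {
    // Assumption: a partial match (find) is sufficient.
    if (overridePattern != null && overridePattern.matcher(url).find()) {
      return overrideDepth;
    }
    return defaultMaxDepth;
  }

  public static void main(String[] args) {
    Pattern p = Pattern.compile("/sitemap/");
    System.out.println(maxDepthForOutlinks("https://example.com/sitemap/a.html", 3, p, 10));
    System.out.println(maxDepthForOutlinks("https://example.com/news/a.html", 3, p, 10));
  }
}
```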
[nutch-site] branch asf-staging updated: Add logo on URL path where requested README.md in source code repository
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch asf-staging in repository https://gitbox.apache.org/repos/asf/nutch-site.git The following commit(s) were added to refs/heads/asf-staging by this push: new e1f939c Add logo on URL path where requested README.md in source code repository e1f939c is described below commit e1f939cb5820423eb00331d783f6934656d2e37c Author: Sebastian Nagel AuthorDate: Fri Aug 4 20:07:37 2023 +0200 Add logo on URL path where requested README.md in source code repository --- content/assets/img/nutch_logo_tm.png | Bin 0 -> 9984 bytes 1 file changed, 0 insertions(+), 0 deletions(-) diff --git a/content/assets/img/nutch_logo_tm.png b/content/assets/img/nutch_logo_tm.png new file mode 100644 index 000..67b0eba Binary files /dev/null and b/content/assets/img/nutch_logo_tm.png differ
[nutch-site] branch main updated: Add logo on URL path where requested README.md in source code repository
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/nutch-site.git The following commit(s) were added to refs/heads/main by this push: new c80dcca Add logo on URL path where requested README.md in source code repository c80dcca is described below commit c80dccaaab9e5084d0229a9916b51d93e9590b3a Author: Sebastian Nagel AuthorDate: Fri Aug 4 20:07:04 2023 +0200 Add logo on URL path where requested README.md in source code repository --- content/assets/img/nutch_logo_tm.png | Bin 0 -> 9984 bytes 1 file changed, 0 insertions(+), 0 deletions(-) diff --git a/content/assets/img/nutch_logo_tm.png b/content/assets/img/nutch_logo_tm.png new file mode 100644 index 000..67b0eba Binary files /dev/null and b/content/assets/img/nutch_logo_tm.png differ
[nutch-site] branch asf-site updated: Add logo on URL path where requested README.md in source code repository
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/nutch-site.git The following commit(s) were added to refs/heads/asf-site by this push: new 3962502 Add logo on URL path where requested README.md in source code repository 3962502 is described below commit 3962502176832a616931fa9bff41f3e119071928 Author: Sebastian Nagel AuthorDate: Fri Aug 4 20:05:34 2023 +0200 Add logo on URL path where requested README.md in source code repository --- content/assets/img/nutch_logo_tm.png | Bin 0 -> 9984 bytes 1 file changed, 0 insertions(+), 0 deletions(-) diff --git a/content/assets/img/nutch_logo_tm.png b/content/assets/img/nutch_logo_tm.png new file mode 100644 index 000..67b0eba Binary files /dev/null and b/content/assets/img/nutch_logo_tm.png differ
[nutch-site] branch asf-site updated: Add link to ASF privacy policies
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/nutch-site.git The following commit(s) were added to refs/heads/asf-site by this push: new c252fd7 Add link to ASF privacy policies c252fd7 is described below commit c252fd76668ec9d30c3a1b8ede341ed83e9fb203 Author: Sebastian Nagel AuthorDate: Thu Jul 20 10:56:16 2023 +0200 Add link to ASF privacy policies --- content/apache/index.html | 1 + content/index.xml | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/content/apache/index.html b/content/apache/index.html index 01641dc..6e832f5 100644 --- a/content/apache/index.html +++ b/content/apache/index.html @@ -112,6 +112,7 @@ The https://www.apache.org/security/;>Apache Security Team The https://www.apache.org/foundation/sponsorship.html;>Apache Software Foundation Sponsorship Program https://www.apache.org/foundation/thanks.html;>Sponsors and Thanks +https://privacy.apache.org/policies/privacy-policy-public.html;>ASF Privacy Policies diff --git a/content/index.xml b/content/index.xml index d4296bc..dee6d72 100644 --- a/content/index.xml +++ b/content/index.xml @@ -55,7 +55,7 @@ As usual in the 1.X series, release artifacts are made available as both source Mon, 01 Jan 0001 00:00:00 + /apache/ - Visit the Apache Software Foundation Homepage Information about the Apache Licenses The Apache Security Team The Apache Software Foundation Sponsorship Program Sponsors and Thanks + Visit the Apache Software Foundation Homepage Information about the Apache Licenses The Apache Security Team The Apache Software Foundation Sponsorship Program Sponsors and Thanks ASF Privacy Policies
[nutch-site] branch main updated: Add link to ASF privacy policies
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/nutch-site.git The following commit(s) were added to refs/heads/main by this push: new d0832c1 Add link to ASF privacy policies d0832c1 is described below commit d0832c177981842bc7c67e019bfa1a6eb07ff39d Author: Sebastian Nagel AuthorDate: Thu Jul 20 10:55:31 2023 +0200 Add link to ASF privacy policies --- content/apache.md | 1 + 1 file changed, 1 insertion(+) diff --git a/content/apache.md b/content/apache.md index b27f09c..717e854 100644 --- a/content/apache.md +++ b/content/apache.md @@ -12,5 +12,6 @@ bref = "" * The [Apache Security Team](https://www.apache.org/security/) * The [Apache Software Foundation Sponsorship Program](https://www.apache.org/foundation/sponsorship.html) * [Sponsors and Thanks](https://www.apache.org/foundation/thanks.html) +* [ASF Privacy Policies](https://privacy.apache.org/policies/privacy-policy-public.html) *
[nutch-site] branch asf-staging updated: Add link to ASF privacy policies
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch asf-staging in repository https://gitbox.apache.org/repos/asf/nutch-site.git The following commit(s) were added to refs/heads/asf-staging by this push: new 3ff0ddb Add link to ASF privacy policies 3ff0ddb is described below commit 3ff0ddb731690693fba8db14465173ea578d61d2 Author: Sebastian Nagel AuthorDate: Thu Jul 20 10:56:16 2023 +0200 Add link to ASF privacy policies --- content/apache/index.html | 1 + content/index.xml | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/content/apache/index.html b/content/apache/index.html index 01641dc..6e832f5 100644 --- a/content/apache/index.html +++ b/content/apache/index.html @@ -112,6 +112,7 @@ The https://www.apache.org/security/;>Apache Security Team The https://www.apache.org/foundation/sponsorship.html;>Apache Software Foundation Sponsorship Program https://www.apache.org/foundation/thanks.html;>Sponsors and Thanks +https://privacy.apache.org/policies/privacy-policy-public.html;>ASF Privacy Policies diff --git a/content/index.xml b/content/index.xml index d4296bc..dee6d72 100644 --- a/content/index.xml +++ b/content/index.xml @@ -55,7 +55,7 @@ As usual in the 1.X series, release artifacts are made available as both source Mon, 01 Jan 0001 00:00:00 + /apache/ - Visit the Apache Software Foundation Homepage Information about the Apache Licenses The Apache Security Team The Apache Software Foundation Sponsorship Program Sponsors and Thanks + Visit the Apache Software Foundation Homepage Information about the Apache Licenses The Apache Security Team The Apache Software Foundation Sponsorship Program Sponsors and Thanks ASF Privacy Policies
[nutch-site] 01/03: - add link / banner of Apache conferences or events - rename and move link to ASF
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/nutch-site.git commit 7cd1d1cce957346615a0cb1efbfd875932764d70 Author: Sebastian Nagel AuthorDate: Thu Jul 20 10:32:50 2023 +0200 - add link / banner of Apache conferences or events - rename and move link to ASF --- config.toml | 2 +- content/apache.md| 4 +++- themes/kube/layouts/_default/baseof.html | 1 + 3 files changed, 5 insertions(+), 2 deletions(-) diff --git a/config.toml b/config.toml index a78ef2d..290476d 100644 --- a/config.toml +++ b/config.toml @@ -39,7 +39,7 @@ unsafe = true # allow raw HTML in markdown content weight = -100 url = "/news/" [[menu.main]] -name = "Apache" +name = "The Apache Software Foundation" weight = -100 url = "/apache/" diff --git a/content/apache.md b/content/apache.md index e4aef9c..b27f09c 100644 --- a/content/apache.md +++ b/content/apache.md @@ -11,4 +11,6 @@ bref = "" * Information about the [Apache Licenses](https://www.apache.org/licenses/) * The [Apache Security Team](https://www.apache.org/security/) * The [Apache Software Foundation Sponsorship Program](https://www.apache.org/foundation/sponsorship.html) -* [Sponsors and Thanks](https://www.apache.org/foundation/thanks.html) \ No newline at end of file +* [Sponsors and Thanks](https://www.apache.org/foundation/thanks.html) +* + diff --git a/themes/kube/layouts/_default/baseof.html b/themes/kube/layouts/_default/baseof.html index 3f7ec06..fec7378 100644 --- a/themes/kube/layouts/_default/baseof.html +++ b/themes/kube/layouts/_default/baseof.html @@ -46,6 +46,7 @@ + https://www.apachecon.com/event-images/snippet.js"</a>;>
[nutch-site] 03/03: Add new committer / PMC
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/nutch-site.git commit db7208f4333d1208516db09b3ac4309d9402881c Author: Sebastian Nagel AuthorDate: Thu Jul 20 10:36:26 2023 +0200 Add new committer / PMC --- content/community/people-credits.md | 7 +++ 1 file changed, 7 insertions(+) diff --git a/content/community/people-credits.md b/content/community/people-credits.md index 9d66c7f..b17d5ec 100644 --- a/content/community/people-credits.md +++ b/content/community/people-credits.md @@ -169,6 +169,13 @@ bref = "" Committer, PMC Member Microsoft + + tallison + https://www.linkedin.com/in/tim-allison-5a6722/;>Tim Allison + tallison[at]apache[dot]org + Committer, PMC Member + NASA JPL +
[nutch-site] 02/03: Update copyright year 2022 -> 2023
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/nutch-site.git commit 44463bd9e75c654d775d9337989e46e75359ed1a Author: Sebastian Nagel AuthorDate: Thu Jul 20 10:35:46 2023 +0200 Update copyright year 2022 -> 2023 --- themes/kube/layouts/partials/footer.html | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/themes/kube/layouts/partials/footer.html b/themes/kube/layouts/partials/footer.html index 2081d5f..59fe554 100644 --- a/themes/kube/layouts/partials/footer.html +++ b/themes/kube/layouts/partials/footer.html @@ -1,3 +1,3 @@ - 2004-2022 The Apache Software Foundation. Built using the https://github.com/jeblister/kube; target="_blank" rel="noopener noreferrer">kube Theme for Hugo. Apache Nutch, Nutch, Apache, the Apache feather logo, and the Apache Nutch project logo are trademarks of The Apache Software Foundation. + 2004-2023 The Apache Software Foundation. Built using the https://github.com/jeblister/kube; target="_blank" rel="noopener noreferrer">kube Theme for Hugo. Apache Nutch, Nutch, Apache, the Apache feather logo, and the Apache Nutch project logo are trademarks of The Apache Software Foundation. \ No newline at end of file
[nutch-site] branch asf-site updated: - add new committer / PMC - update copyright year 2022 -> 2023 - add link / banner of Apache conferences or events - rename and move link to ASF
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/nutch-site.git The following commit(s) were added to refs/heads/asf-site by this push: new 773089d - add new committer / PMC - update copyright year 2022 -> 2023 - add link / banner of Apache conferences or events - rename and move link to ASF 773089d is described below commit 773089d37c2f5a8a112275a71a8698f941562391 Author: Sebastian Nagel AuthorDate: Thu Jul 20 10:43:11 2023 +0200 - add new committer / PMC - update copyright year 2022 -> 2023 - add link / banner of Apache conferences or events - rename and move link to ASF --- content/apache/index.html | 11 ++- content/categories/index.html | 10 +- content/categories/news/index.html| 10 +- content/categories/news/page/1/index.html | 11 ++- content/categories/page/1/index.html | 11 ++- content/categories/releases/index.html| 10 +- content/categories/releases/page/1/index.html | 11 ++- content/community/board-reporting/index.html | 10 +- content/community/bot/index.html | 12 ++-- content/community/contributing/index.html | 10 +- content/community/index.html | 10 +- content/community/index.xml | 8 content/community/mailing-lists/index.html| 10 +- content/community/merchandise/index.html | 10 +- content/community/people-credits/index.html | 17 - content/development/index.html| 10 +- content/development/issue-tracker/index.html | 10 +- content/development/nightly-builds/index.html | 10 +- content/development/source-code-management/index.html | 10 +- content/documentation/about/index.html| 10 +- content/documentation/faqs/index.html | 10 +- content/documentation/index.html | 10 +- content/documentation/javadoc/index.html | 10 +- content/documentation/tutorials/index.html| 10 +- content/documentation/wiki/index.html | 10 +- content/download/index.html | 10 +- content/index.html| 10 +- content/index.xml | 12 ++-- content/news/index.html | 10 +- 
content/news/legacy-nutch-news/index.html | 10 +- content/news/nutch-1.18-release/index.html| 10 +- content/news/nutch-1.19-release/index.html| 10 +- content/news/page/1/index.html| 11 ++- content/tags/1.18/index.html | 10 +- content/tags/1.18/page/1/index.html | 11 ++- content/tags/1.19/index.html | 10 +- content/tags/1.19/page/1/index.html | 11 ++- content/tags/index.html | 10 +- content/tags/legacy/index.html| 10 +- content/tags/legacy/page/1/index.html | 11 ++- content/tags/news/index.html | 10 +- content/tags/news/page/1/index.html | 11 ++- content/tags/page/1/index.html| 11 ++- content/tags/page/2/index.html| 10 +- content/tags/release/index.html | 10 +- content/tags/release/page/1/index.html| 11 ++- 46 files changed, 289 insertions(+), 191 deletions(-) diff --git a/content/apache/index.html b/content/apache/index.html index 3d84f47..01641dc 100644 --- a/content/apache/index.html +++ b/content/apache/index.html @@ -2,7 +2,7 @@ - + @@ -28,7 +28,6 @@ - @@ -58,6 +57,7 @@ + https://www.apachecon.com/event-images/snippet.js"</a>;> @@ -77,8 +77,6 @@ -Apache - Community Development @@ -89,6 +87,8 @@ News +The Apache Software Foundation + @@ -112,6 +112,7 @@ The https://www.apache.org/security/;>Apache Security Team The https://www.apache.org/foundation/sponsorship.html;>Apache Software Foundation Sponsorship Program https://www.apache.org/foundation/thanks.html;>Sponsors and Thanks + @@ -119,7 +120,7 @@ - 2004-2022 The Apache Software Foundation. Built using the
[nutch-site] branch main updated (aa45c17 -> db7208f)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a change to branch main in repository https://gitbox.apache.org/repos/asf/nutch-site.git from aa45c17 Announce release of Nutch 1.19 - fix release data in announcement new 7cd1d1c - add link / banner of Apache conferences or events - rename and move link to ASF new 44463bd Update copyright year 2022 -> 2023 new db7208f Add new committer / PMC The 3 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: config.toml | 2 +- content/apache.md| 4 +++- content/community/people-credits.md | 7 +++ themes/kube/layouts/_default/baseof.html | 1 + themes/kube/layouts/partials/footer.html | 2 +- 5 files changed, 13 insertions(+), 3 deletions(-)
[nutch-site] branch asf-staging updated: - add new committer / PMC - update copyright year 2022 -> 2023 - add link / banner of Apache conferences or events - rename and move link to ASF
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch asf-staging in repository https://gitbox.apache.org/repos/asf/nutch-site.git The following commit(s) were added to refs/heads/asf-staging by this push: new a864887 - add new committer / PMC - update copyright year 2022 -> 2023 - add link / banner of Apache conferences or events - rename and move link to ASF a864887 is described below commit a8648878452d3e14d000f3b33558c21fa7ee766c Author: Sebastian Nagel AuthorDate: Thu Jul 20 10:43:11 2023 +0200 - add new committer / PMC - update copyright year 2022 -> 2023 - add link / banner of Apache conferences or events - rename and move link to ASF --- content/apache/index.html | 11 ++- content/categories/index.html | 10 +- content/categories/news/index.html| 10 +- content/categories/news/page/1/index.html | 11 ++- content/categories/page/1/index.html | 11 ++- content/categories/releases/index.html| 10 +- content/categories/releases/page/1/index.html | 11 ++- content/community/board-reporting/index.html | 10 +- content/community/bot/index.html | 12 ++-- content/community/contributing/index.html | 10 +- content/community/index.html | 10 +- content/community/index.xml | 8 content/community/mailing-lists/index.html| 10 +- content/community/merchandise/index.html | 10 +- content/community/people-credits/index.html | 17 - content/development/index.html| 10 +- content/development/issue-tracker/index.html | 10 +- content/development/nightly-builds/index.html | 10 +- content/development/source-code-management/index.html | 10 +- content/documentation/about/index.html| 10 +- content/documentation/faqs/index.html | 10 +- content/documentation/index.html | 10 +- content/documentation/javadoc/index.html | 10 +- content/documentation/tutorials/index.html| 10 +- content/documentation/wiki/index.html | 10 +- content/download/index.html | 10 +- content/index.html| 10 +- content/index.xml | 12 ++-- content/news/index.html | 10 +- 
content/news/legacy-nutch-news/index.html | 10 +- content/news/nutch-1.18-release/index.html| 10 +- content/news/nutch-1.19-release/index.html| 10 +- content/news/page/1/index.html| 11 ++- content/tags/1.18/index.html | 10 +- content/tags/1.18/page/1/index.html | 11 ++- content/tags/1.19/index.html | 10 +- content/tags/1.19/page/1/index.html | 11 ++- content/tags/index.html | 10 +- content/tags/legacy/index.html| 10 +- content/tags/legacy/page/1/index.html | 11 ++- content/tags/news/index.html | 10 +- content/tags/news/page/1/index.html | 11 ++- content/tags/page/1/index.html| 11 ++- content/tags/page/2/index.html| 10 +- content/tags/release/index.html | 10 +- content/tags/release/page/1/index.html| 11 ++- 46 files changed, 289 insertions(+), 191 deletions(-) diff --git a/content/apache/index.html b/content/apache/index.html index 3d84f47..01641dc 100644 --- a/content/apache/index.html +++ b/content/apache/index.html @@ -2,7 +2,7 @@ - + @@ -28,7 +28,6 @@ - @@ -58,6 +57,7 @@ + https://www.apachecon.com/event-images/snippet.js"</a>;> @@ -77,8 +77,6 @@ -Apache - Community Development @@ -89,6 +87,8 @@ News +The Apache Software Foundation + @@ -112,6 +112,7 @@ The https://www.apache.org/security/;>Apache Security Team The https://www.apache.org/foundation/sponsorship.html;>Apache Software Foundation Sponsorship Program https://www.apache.org/foundation/thanks.html;>Sponsors and Thanks + @@ -119,7 +120,7 @@ - 2004-2022 The Apache Software Foundation. Built
[nutch] branch master updated: NUTCH-2991 Support HTTP/S Header Authorization for Solr connections (#763)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 9109bdd74 NUTCH-2991 Support HTTP/S Header Authorization for Solr connections (#763) 9109bdd74 is described below commit 9109bdd740ba578fc17745ebc9f53f464667 Author: Sebastian Nagel AuthorDate: Tue Jun 6 14:51:20 2023 +0200 NUTCH-2991 Support HTTP/S Header Authorization for Solr connections (#763) NUTCH-2991 Support HTTP/S Header Authorization for Solr connections (patch contributed by Marcos Gomez) - adds params auth.header.name and auth.header.value for JWT Authentication with Bearer Tokens sent via the HTTP Authorization header connections - also document basic authentication and improve error message when reading the configuration fails --- conf/index-writers.xml.template| 19 - .../org/apache/nutch/indexer/IndexWriters.java | 2 +- .../nutch/indexwriter/solr/SolrConstants.java | 4 + .../nutch/indexwriter/solr/SolrIndexWriter.java| 47 --- .../apache/nutch/indexwriter/solr/SolrUtils.java | 94 +- 5 files changed, 153 insertions(+), 13 deletions(-) diff --git a/conf/index-writers.xml.template b/conf/index-writers.xml.template index 549ebd4c9..6ed341cb7 100644 --- a/conf/index-writers.xml.template +++ b/conf/index-writers.xml.template @@ -26,9 +26,24 @@ + - - + + + + + + + diff --git a/src/java/org/apache/nutch/indexer/IndexWriters.java b/src/java/org/apache/nutch/indexer/IndexWriters.java index a8ab0ec9c..f8ae8ee86 100644 --- a/src/java/org/apache/nutch/indexer/IndexWriters.java +++ b/src/java/org/apache/nutch/indexer/IndexWriters.java @@ -137,7 +137,7 @@ public class IndexWriters { return indexWriterConfigs; } catch (SAXException | IOException | ParserConfigurationException e) { - LOG.error(e.toString()); + LOG.error("Failed to read index writers configuration: {}", e.getMessage()); return new IndexWriterConfig[0]; } } diff 
--git a/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java b/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java index 302ed75ed..ee6d5d623 100644 --- a/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java +++ b/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java @@ -34,4 +34,8 @@ public interface SolrConstants { String PASSWORD = "password"; + String AUTH_HEADER_NAME = "auth.header.name"; + + String AUTH_HEADER_VALUE = "auth.header.value"; + } diff --git a/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java b/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java index 12d3ff6b7..ec2ab46d2 100644 --- a/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java +++ b/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java @@ -16,8 +16,8 @@ */ package org.apache.nutch.indexwriter.solr; -import java.lang.invoke.MethodHandles; import java.io.IOException; +import java.lang.invoke.MethodHandles; import java.time.format.DateTimeFormatter; import java.util.AbstractMap; import java.util.ArrayList; @@ -72,6 +72,8 @@ public class SolrIndexWriter implements IndexWriter { private boolean auth; private String username; private String password; + private String authHeaderName; + private String authHeaderValue; @Override public void open(Configuration conf, String name) { @@ -99,20 +101,40 @@ public class SolrIndexWriter implements IndexWriter { this.auth = parameters.getBoolean(SolrConstants.USE_AUTH, false); this.username = parameters.get(SolrConstants.USERNAME); this.password = parameters.get(SolrConstants.PASSWORD); +this.authHeaderName = parameters.get(SolrConstants.AUTH_HEADER_NAME, ""); +this.authHeaderValue = parameters.get(SolrConstants.AUTH_HEADER_VALUE, ""); this.solrClients = new ArrayList<>(); switch 
(type) { case "http": for (String url : urls) { -solrClients.add(SolrUtils.getHttpSolrClient(url)); +if (this.auth && !StringUtil.isEmpty(this.authHeaderName) +&& !StringUtil.isEmpty(this.authHeaderValue)) { + solrClients.add(SolrUtils.getHttpSolrClientHeaderAuthorization(url, + this.authHeaderName, this.authHeaderValue)); +} else if (this.auth && !StringUtil.isEmpty(this.username) +&& !StringUtil.isEmpty(this.password)) { + solr
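The visible part of the `SolrIndexWriter` change picks the Solr client per URL in a fixed order: header authorization first, then username and password. The sketch below models only that decision in plain Java. `SolrAuthChoiceSketch`, `AuthMode`, `choose` and the local `isEmpty` helper are hypothetical stand-ins (the patch uses Nutch's `StringUtil.isEmpty`), and the final fallback to an unauthenticated client is an assumption, because the diff is cut off before that branch.

```java
/** Hypothetical model of the authentication choice; names are illustrative, not Nutch API. */
class SolrAuthChoiceSketch {

  enum AuthMode { HEADER, BASIC, NONE }

  /** Stand-in for the emptiness check used in the patch. */
  static boolean isEmpty(String s) {
    return s == null || s.isEmpty();
  }

  static AuthMode choose(boolean useAuth, String headerName, String headerValue,
      String username, String password) {
    // Header authorization wins when both auth.header.name and auth.header.value are set.
    if (useAuth && !isEmpty(headerName) && !isEmpty(headerValue)) {
      return AuthMode.HEADER;
    }
    // Otherwise fall back to basic authentication if credentials are present.
    if (useAuth && !isEmpty(username) && !isEmpty(password)) {
      return AuthMode.BASIC;
    }
    // Assumption: anything else results in a plain, unauthenticated client.
    return AuthMode.NONE;
  }

  public static void main(String[] args) {
    System.out.println(choose(true, "Authorization", "Bearer <token>", "", ""));
    System.out.println(choose(true, "", "", "solr", "secret"));
    System.out.println(choose(false, "Authorization", "Bearer <token>", "solr", "secret"));
  }
}
```

Note that in this ordering a configured header takes precedence even when a username and password are also present, and nothing is sent at all unless authentication is switched on.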
[nutch] branch master updated: NUTCH-2992 Fetcher: always block fetch queues when exceptions threshold is reached - if QueueFeeder is still alive, also block queues which are empty right now
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

The following commit(s) were added to refs/heads/master by this push:
     new 98d02e70f NUTCH-2992 Fetcher: always block fetch queues when exceptions threshold is reached - if QueueFeeder is still alive, also block queues which are empty right now
98d02e70f is described below

commit 98d02e70f6d83f4fb99abf89a990a3e13a933076
Author: Sebastian Nagel
AuthorDate: Tue May 16 17:30:49 2023 +0200

    NUTCH-2992 Fetcher: always block fetch queues when exceptions threshold is reached
    - if QueueFeeder is still alive, also block queues which are empty right now
---
 .../org/apache/nutch/fetcher/FetchItemQueues.java | 25 --
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/src/java/org/apache/nutch/fetcher/FetchItemQueues.java b/src/java/org/apache/nutch/fetcher/FetchItemQueues.java
index 9dfbeb277..cec272b45 100644
--- a/src/java/org/apache/nutch/fetcher/FetchItemQueues.java
+++ b/src/java/org/apache/nutch/fetcher/FetchItemQueues.java
@@ -303,19 +303,22 @@ public class FetchItemQueues {
           "* queue: {} >> delayed next fetch by {} ms after {} exceptions in queue",
           queueid, exceptionDelay, excCount);
     }
-    if (fiq.getQueueSize() == 0) {
-      return 0;
-    }
-    if (maxExceptions!= -1 && excCount >= maxExceptions) {
+    if (maxExceptions != -1 && excCount >= maxExceptions) {
       // too many exceptions for items in this queue - purge it
       int deleted = fiq.emptyQueue();
-      LOG.info(
-          "* queue: {} >> removed {} URLs from queue because {} exceptions occurred",
-          queueid, deleted, excCount);
-      totalSize.getAndAdd(-deleted);
-      // keep queue IDs to ensure that these queues aren't created and filled
-      // again, see addFetchItem(FetchItem)
-      queuesMaxExceptions.add(queueid);
+      if (deleted > 0) {
+        LOG.info(
+            "* queue: {} >> removed {} URLs from queue because {} exceptions occurred",
+            queueid, deleted, excCount);
+        totalSize.getAndAdd(-deleted);
+      }
+      if (feederAlive) {
+        LOG.info("* queue: {} >> blocked after {} exceptions", queueid,
+            excCount);
+        // keep queue IDs to ensure that these queues aren't created and filled
+        // again, see addFetchItem(FetchItem)
+        queuesMaxExceptions.add(queueid);
+      }
       return deleted;
     }
     return 0;
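The effect of this change is easier to see in a reduced model: once the exception threshold is reached, the queue is purged, and while the feeder is still running the queue ID is remembered so that later URLs for it are rejected, even if the queue happened to be empty at that moment. The following Java sketch assumes a much simplified queue structure. `QueueBlockingSketch` and all of its members are hypothetical and only mirror the control flow of the patched method, not the real `FetchItemQueues` class.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Simplified, hypothetical model of per-queue exception handling; not the Nutch class. */
class QueueBlockingSketch {
  private final Map<String, Deque<String>> queues = new HashMap<>();
  private final Set<String> blocked = new HashSet<>();
  private final int maxExceptions; // -1 disables the threshold
  private int totalSize = 0;

  QueueBlockingSketch(int maxExceptions) {
    this.maxExceptions = maxExceptions;
  }

  /** Adds a URL unless its queue was blocked earlier; returns whether it was accepted. */
  boolean add(String queueId, String url) {
    if (blocked.contains(queueId)) {
      return false;
    }
    queues.computeIfAbsent(queueId, k -> new ArrayDeque<>()).add(url);
    totalSize++;
    return true;
  }

  /** Purges and, if the feeder is alive, blocks the queue; returns the number of purged URLs. */
  int checkExceptionThreshold(String queueId, int excCount, boolean feederAlive) {
    if (maxExceptions == -1 || excCount < maxExceptions) {
      return 0;
    }
    Deque<String> queue = queues.get(queueId);
    int deleted = (queue == null) ? 0 : queue.size();
    if (deleted > 0) {
      queue.clear();
      totalSize -= deleted;
    }
    if (feederAlive) {
      // Block even when nothing was purged: the feeder may still add URLs later.
      blocked.add(queueId);
    }
    return deleted;
  }

  boolean isBlocked(String queueId) {
    return blocked.contains(queueId);
  }

  int totalSize() {
    return totalSize;
  }
}
```

Before the patch, the early `return 0` for an empty queue meant such a queue was never recorded as blocked, so a still-running feeder could refill it.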
[nutch] branch master updated: NUTCH-2596 Upgrade from org.mortbay.jetty to org.eclipse.jetty - upgrade from org.mortbay.jetty 6.1.26 to org.eclipse.jetty 9.4.50 (Hadoop depends on 9.4.43) - remove
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

The following commit(s) were added to refs/heads/master by this push:
     new 215993bc6 NUTCH-2596 Upgrade from org.mortbay.jetty to org.eclipse.jetty - upgrade from org.mortbay.jetty 6.1.26 to org.eclipse.jetty 9.4.50 (Hadoop depends on 9.4.43) - remove obsolete dependency exclusions of hadoop-common - upgrade Fetcher unit tests to use org.eclipse.jetty
215993bc6 is described below

commit 215993bc6fbc58c050251410d5a7b02e601d99b3
Author: Sebastian Nagel
AuthorDate: Thu Feb 23 15:46:28 2023 +0100

    NUTCH-2596 Upgrade from org.mortbay.jetty to org.eclipse.jetty
    - upgrade from org.mortbay.jetty 6.1.26 to org.eclipse.jetty 9.4.50
      (Hadoop depends on 9.4.43)
    - remove obsolete dependency exclusions of hadoop-common
    - upgrade Fetcher unit tests to use org.eclipse.jetty
---
 ivy/ivy.xml                                          | 12 +++-
 src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java | 17 +
 src/test/org/apache/nutch/fetcher/TestFetcher.java   |  5 ++---
 3 files changed, 14 insertions(+), 20 deletions(-)

diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index 36a32a809..269f521c8 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -50,14 +50,7 @@
 - - - - - - - - +
@@ -112,7 +105,8 @@
 - + +
diff --git a/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java b/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java
index e271e88cf..87da8faf2 100644
--- a/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java
+++ b/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java
@@ -48,10 +48,10 @@ import org.apache.hadoop.mapreduce.Partitioner;
 import org.apache.hadoop.mapreduce.TaskAttemptID;
 import org.apache.hadoop.mapreduce.Reducer.Context;
 import org.apache.hadoop.security.Credentials;
-import org.mortbay.jetty.Server;
-import org.mortbay.jetty.bio.SocketConnector;
-import org.mortbay.jetty.handler.ContextHandler;
-import org.mortbay.jetty.handler.ResourceHandler;
+import org.eclipse.jetty.server.Server;
+import org.eclipse.jetty.server.ServerConnector;
+import org.eclipse.jetty.server.handler.ContextHandler;
+import org.eclipse.jetty.server.handler.ResourceHandler;

 public class CrawlDBTestUtil {

@@ -435,16 +435,17 @@ public class CrawlDBTestUtil {
    */
   public static Server getServer(int port, String staticContent)
       throws UnknownHostException {
-    Server webServer = new org.mortbay.jetty.Server();
-    SocketConnector listener = new SocketConnector();
+    Server webServer = new Server();
+
+    ServerConnector listener = new ServerConnector(webServer);
     listener.setPort(port);
     listener.setHost("127.0.0.1");
     webServer.addConnector(listener);
     ContextHandler staticContext = new ContextHandler();
     staticContext.setContextPath("/");
     staticContext.setResourceBase(staticContent);
-    staticContext.addHandler(new ResourceHandler());
-    webServer.addHandler(staticContext);
+    staticContext.insertHandler(new ResourceHandler());
+    webServer.insertHandler(staticContext);
     return webServer;
   }
 }
diff --git a/src/test/org/apache/nutch/fetcher/TestFetcher.java b/src/test/org/apache/nutch/fetcher/TestFetcher.java
index 245353fad..ecc135c52 100644
--- a/src/test/org/apache/nutch/fetcher/TestFetcher.java
+++ b/src/test/org/apache/nutch/fetcher/TestFetcher.java
@@ -36,7 +36,7 @@ import org.junit.After;
 import org.junit.Assert;
 import org.junit.Before;
 import org.junit.Test;
-import org.mortbay.jetty.Server;
+import org.eclipse.jetty.server.Server;

 /**
  * Basic fetcher test 1. generate seedlist 2. inject 3. generate 3. fetch 4.
@@ -180,8 +180,7 @@ public class TestFetcher {
   }

   private void addUrl(ArrayList urls, String page) {
-    urls.add("http://127.0.0.1:" + server.getConnectors()[0].getPort() + "/"
-        + page);
+    urls.add("http://127.0.0.1:" + server.getURI().getPort() + "/" + page);
   }

   @Test
[nutch] branch master updated: NUTCH-2984 Drop test proxy server and benchmark tool
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new b4cb5c1e3 NUTCH-2984 Drop test proxy server and benchmark tool b4cb5c1e3 is described below commit b4cb5c1e30a37b7eceed477fe2d71011bde042ed Author: Sebastian Nagel AuthorDate: Fri Feb 24 15:27:35 2023 +0100 NUTCH-2984 Drop test proxy server and benchmark tool --- build.xml | 33 --- ivy/ivy.xml| 1 - src/java/org/apache/nutch/tools/Benchmark.java | 289 - .../nutch/tools/proxy/AbstractTestbedHandler.java | 49 .../org/apache/nutch/tools/proxy/DelayHandler.java | 55 .../org/apache/nutch/tools/proxy/FakeHandler.java | 101 --- .../apache/nutch/tools/proxy/LogDebugHandler.java | 64 - .../apache/nutch/tools/proxy/NotFoundHandler.java | 39 --- .../org/apache/nutch/tools/proxy/ProxyTestbed.java | 157 --- .../apache/nutch/tools/proxy/SegmentHandler.java | 255 -- .../org/apache/nutch/tools/proxy/package-info.java | 22 -- 11 files changed, 1065 deletions(-) diff --git a/build.xml b/build.xml index cc88493f3..9326a8ba2 100644 --- a/build.xml +++ b/build.xml @@ -468,39 +468,6 @@ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - diff --git a/ivy/ivy.xml b/ivy/ivy.xml index 0e7e25160..36a32a809 100644 --- a/ivy/ivy.xml +++ b/ivy/ivy.xml @@ -112,7 +112,6 @@ - diff --git a/src/java/org/apache/nutch/tools/Benchmark.java b/src/java/org/apache/nutch/tools/Benchmark.java deleted file mode 100644 index d7c3b74ae..0 --- a/src/java/org/apache/nutch/tools/Benchmark.java +++ /dev/null @@ -1,289 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. 
- * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.nutch.tools; - -import java.io.OutputStream; -import java.lang.invoke.MethodHandles; -import java.text.SimpleDateFormat; -import java.util.ArrayList; -import java.util.Date; -import java.util.HashMap; -import java.util.List; -import java.util.Map; - -import org.apache.hadoop.conf.Configuration; -import org.apache.hadoop.conf.Configured; -import org.apache.hadoop.fs.FileSystem; -import org.apache.hadoop.fs.Path; -import org.apache.hadoop.util.Tool; -import org.apache.hadoop.mapreduce.Job; -import org.apache.hadoop.util.ToolRunner; -import org.apache.nutch.crawl.CrawlDb; -import org.apache.nutch.crawl.CrawlDbReader; -import org.apache.nutch.crawl.Generator; -import org.apache.nutch.crawl.Injector; -import org.apache.nutch.crawl.LinkDb; -import org.apache.nutch.fetcher.Fetcher; -import org.apache.nutch.parse.ParseSegment; -import org.apache.nutch.util.NutchConfiguration; -import org.apache.nutch.util.NutchJob; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -public class Benchmark extends Configured implements Tool { - - private static final Logger LOG = LoggerFactory - .getLogger(MethodHandles.lookup().lookupClass()); - - public static void main(String[] args) throws Exception { -Configuration conf = NutchConfiguration.create(); -int res = ToolRunner.run(conf, new Benchmark(), args); -System.exit(res); - } - - @SuppressWarnings("unused") - private static 
String getDate() { -return new SimpleDateFormat("MMddHHmmss").format(new Date(System -.currentTimeMillis())); - } - - private void createSeeds(FileSystem fs, Path seedsDir, int count) - throws Exception { -OutputStream os = fs.create(new Path(seedsDir, "seeds")); -for (int i = 0; i < count; i++) { - String url = "http://www.test-" + i + ".com/\r\n"; - os.write(url.getBytes()); -} -os.flush(); -os.close(); - } - - public static final class BenchmarkResults { -Map> timings = new HashMap<>(); -List runs = new ArrayList<>();
[nutch] branch master updated: NUTCH-2985 Disable plugin urlfilter-validator by default
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 1999b1e11 NUTCH-2985 Disable plugin urlfilter-validator by default 1999b1e11 is described below commit 1999b1e1199b773c8d08e4765cfa1824e99a9287 Author: Sebastian Nagel AuthorDate: Fri Feb 24 16:24:21 2023 +0100 NUTCH-2985 Disable plugin urlfilter-validator by default --- conf/nutch-default.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml index 69351c843..273cfccc5 100644 --- a/conf/nutch-default.xml +++ b/conf/nutch-default.xml @@ -1590,7 +1590,7 @@ plugin.includes - protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic) + protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic) Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. By default Nutch includes plugins to crawl HTML and various other
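Note on the change above: the description of `plugin.includes` says the value is a regular expression over plugin directory names, so dropping `validator` from the `urlfilter-(regex|validator)` group is what disables the plugin. A minimal sketch of the effect, assuming the whole directory name must match the expression (the class `PluginIncludesSketch` and the method `isIncluded` are hypothetical names for illustration, not Nutch API):

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Nutch.
class PluginIncludesSketch {

  // Value of plugin.includes before the commit (copied from the diff).
  static final String OLD_VALUE =
      "protocol-http|urlfilter-(regex|validator)|parse-(html|tika)"
      + "|index-(basic|anchor)|indexer-solr|scoring-opic"
      + "|urlnormalizer-(pass|regex|basic)";

  // Value of plugin.includes after the commit (copied from the diff).
  static final String NEW_VALUE =
      "protocol-http|urlfilter-regex|parse-(html|tika)"
      + "|index-(basic|anchor)|indexer-solr|scoring-opic"
      + "|urlnormalizer-(pass|regex|basic)";

  // Assumption: a plugin is included when its directory name matches
  // the expression as a whole (full match, not a substring search).
  static boolean isIncluded(String pluginIncludes, String pluginDir) {
    return Pattern.compile(pluginIncludes).matcher(pluginDir).matches();
  }

  public static void main(String[] args) {
    System.out.println(isIncluded(OLD_VALUE, "urlfilter-validator"));
    System.out.println(isIncluded(NEW_VALUE, "urlfilter-validator"));
    System.out.println(isIncluded(NEW_VALUE, "urlfilter-regex"));
  }
}
```

Sites that still want the validator can presumably restore the old alternation in their own `nutch-site.xml`; the sketch only illustrates the pattern semantics.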
[nutch] branch master updated: NUTCH-2983 nutch-default.xml improvements - remove property "hadoop.job.history.user.location", obsolete since Hadoop 0.21.0 - normalize spelling (case) of URL and Crawl
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new c8aecfa5d NUTCH-2983 nutch-default.xml improvements - remove property "hadoop.job.history.user.location", obsolete since Hadoop 0.21.0 - normalize spelling (case) of URL and CrawlDb - trim trailing space - fix typos - improve description of properties {db,linkdb}.ignore.{ex,in}ternal.links c8aecfa5d is described below commit c8aecfa5de609f8d7f0744bc1a1dea525e09ebe9 Author: Sebastian Nagel AuthorDate: Fri Feb 17 17:18:32 2023 +0100 NUTCH-2983 nutch-default.xml improvements - remove property "hadoop.job.history.user.location", obsolete since Hadoop 0.21.0 - normalize spelling (case) of URL and CrawlDb - trim trailing space - fix typos - improve description of properties {db,linkdb}.ignore.{ex,in}ternal.links --- conf/nutch-default.xml | 278 - 1 file changed, 137 insertions(+), 141 deletions(-) diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml index d05503d23..69351c843 100644 --- a/conf/nutch-default.xml +++ b/conf/nutch-default.xml @@ -33,7 +33,7 @@ confuse this setting with the http.content.limit setting. - + file.crawl.parent true @@ -72,7 +72,7 @@ http.agent.name - HTTP 'User-Agent' request header. MUST NOT be empty - + HTTP 'User-Agent' request header. MUST NOT be empty - please set this to a single word uniquely related to your organization. NOTE: You should also check other related properties: @@ -92,23 +92,23 @@ http.robots.agents Any other agents, apart from 'http.agent.name', that the robots - parser would look for in robots.txt. Multiple agents can be provided using + parser would look for in robots.txt. Multiple agents can be provided using comma as a delimiter. eg. 
mybot,foo-spider,bar-crawler - - The ordering of agents does NOT matter and the robots parser would make - decision based on the agent which matches first to the robots rules. - Also, there is NO need to add a wildcard (ie. "*") to this string as the - robots parser would smartly take care of a no-match situation. - - If no value is specified, by default HTTP agent (ie. 'http.agent.name') - would be used for user agent matching by the robots parser. + + The ordering of agents does NOT matter and the robots parser would make + decision based on the agent which matches first to the robots rules. + Also, there is NO need to add a wildcard (ie. "*") to this string as the + robots parser would smartly take care of a no-match situation. + + If no value is specified, by default HTTP agent (ie. 'http.agent.name') + would be used for user agent matching by the robots parser. http.robot.rules.allowlist - Comma separated list of hostnames or IP addresses to ignore + Comma separated list of hostnames or IP addresses to ignore robot rules parsing for. Use with care and only if you are explicitly allowed by the site owner to ignore the site's robots.txt! Also keep in mind: ignoring the robots.txt rules means that no robots.txt @@ -166,7 +166,7 @@ http.agent.url - A URL to advertise in the User-Agent header. This will + A URL to advertise in the User-Agent header. This will appear in parenthesis after the agent name. Custom dictates that this should be a URL of a page explaining the purpose and behavior of this crawler. @@ -185,7 +185,7 @@ http.agent.version Nutch-1.20-SNAPSHOT - A version string to advertise in the User-Agent + A version string to advertise in the User-Agent header. @@ -346,7 +346,7 @@ http.proxy.exception.list - A comma separated list of hosts that don't use the proxy + A comma separated list of hosts that don't use the proxy (e.g. intranets). Example: www.apache.org @@ -377,7 +377,7 @@ Value of the "Accept-Language" request header field. 
This allows selecting non-English language as default one to retrieve. It is a useful setting for search engines build for certain national group. - To send requests without "Accept-Language" header field, thi property must + To send requests without "Accept-Language" header field, this property must be configured to contain a space character because an empty property does not overwrite the default. @@ -402,8 +402,8 @@ http.store.responsetime true - Enables us to record the response time of the - host which is the time period between start connection to end + Enables us to record the response time of the + host which is the time period between start connection to end connection of a pages host. The response time in milliseconds is stored in CrawlDb in CrawlDatum's meta data
[nutch] branch master updated: NUTCH-2972 Javadoc build fails using JDK 17 - fix Javadoc issues when building with JDK 17
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new a92878df1 NUTCH-2972 Javadoc build fails using JDK 17 - fix Javadoc issues when building with JDK 17 a92878df1 is described below commit a92878df1ea586057dc8bc7e9ade376a9b8edc20 Author: Sebastian Nagel AuthorDate: Fri Feb 24 17:16:27 2023 +0100 NUTCH-2972 Javadoc build fails using JDK 17 - fix Javadoc issues when building with JDK 17 --- src/java/org/apache/nutch/segment/SegmentMerger.java | 14 -- src/java/org/apache/nutch/tools/arc/ArcRecordReader.java | 16 +++- .../apache/nutch/urlfilter/suffix/SuffixURLFilter.java | 8 +--- 3 files changed, 20 insertions(+), 18 deletions(-) diff --git a/src/java/org/apache/nutch/segment/SegmentMerger.java b/src/java/org/apache/nutch/segment/SegmentMerger.java index 056df3c88..6bb90e472 100644 --- a/src/java/org/apache/nutch/segment/SegmentMerger.java +++ b/src/java/org/apache/nutch/segment/SegmentMerger.java @@ -76,7 +76,9 @@ import org.apache.nutch.util.NutchJob; * * Also, it's possible to slice the resulting segment into chunks of fixed size. * - * Important Notes Which parts are merged? + * + * Important Notes + * Which parts are merged? * * It doesn't make sense to merge data from segments, which are at different * stages of processing (e.g. one unfetched segment, one fetched but not parsed, @@ -87,14 +89,14 @@ import org.apache.nutch.util.NutchJob; * fall back to just merging fetchlists, and it will skip all other data from * all segments. * - * Merging fetchlists + * Merging fetchlists * * Merging segments, which contain just fetchlists (i.e. prior to fetching) is * not recommended, because this tool (unlike the * {@link org.apache.nutch.crawl.Generator} doesn't ensure that fetchlist parts * for each map task are disjoint. 
* - * Duplicate content + * Duplicate content * Merging segments removes older content whenever possible (see below). * However, this is NOT the same as de-duplication, which in addition removes * identical content found at different URL-s. In other words, running @@ -108,15 +110,15 @@ import org.apache.nutch.util.NutchJob; * segments be named in an increasing lexicographic order as their creation time * increases. * - * Merging and indexes + * Merging and indexes * * Merged segment gets a different name. Since Indexer embeds segment names in * indexes, any indexes originally created for the input segments will NOT work * with the merged segment. Newly created merged segment(s) need to be indexed * afresh. This tool doesn't use existing indexes in any way, so if you plan to * merge segments you don't have to index them prior to merging. - * - * @author Andrzej Bialecki + * + * */ public class SegmentMerger extends Configured implements Tool{ private static final Logger LOG = LoggerFactory diff --git a/src/java/org/apache/nutch/tools/arc/ArcRecordReader.java b/src/java/org/apache/nutch/tools/arc/ArcRecordReader.java index 0a93947e4..b514a63fc 100644 --- a/src/java/org/apache/nutch/tools/arc/ArcRecordReader.java +++ b/src/java/org/apache/nutch/tools/arc/ArcRecordReader.java @@ -38,19 +38,17 @@ import org.apache.hadoop.util.ReflectionUtils; /** * The ArchRecordReader class provides a record reader which reads * records from arc files. - * + * * Arc files are essentially tars of gzips. Each record in an arc file is a * compressed gzip. Multiple records are concatenated together to form a - * complete arc. - * For more information on the arc file format - * @see ArcFileFormat. - * + * complete arc. * - * - * Arc files are used by the internet archive and grub projects. - * + * For more information on the arc file format + * @see ArcFileFormat. + + * Arc files are used by the Internet Archive and grub projects. 
* - * @see archive.org + * @see archive.org * @see grub.org */ public class ArcRecordReader extends RecordReader { diff --git a/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java b/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java index dd8605f79..5edf5fc38 100644 --- a/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java +++ b/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java @@ -78,6 +78,9 @@ import java.net.MalformedURLException; * expressions, it only accepts literal suffixes. I.e. a suffix "+*.jpg" is most * probably wrong, you should use "+.jpg" instead. * + * + * + * Examples * Example 1 * * The configuration shown below will accept all URLs with '.html' or '.htm' @@ -96,7 +99,7 @@ import java.net.MalformedURLException; * .htm *
[nutch] branch master updated: NUTCH-2982 Generator: parameter for URL normalization not passed forward - pass forward params `norm` and `maxNumSegments` - fix typos in Javadoc
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new ef2949691 NUTCH-2982 Generator: parameter for URL normalization not passed forward - pass forward params `norm` and `maxNumSegments` - fix typos in Javadoc ef2949691 is described below commit ef29496915d2c230466412d99ac4236a8e647932 Author: Sebastian Nagel AuthorDate: Fri Feb 17 16:12:26 2023 +0100 NUTCH-2982 Generator: parameter for URL normalization not passed forward - pass forward params `norm` and `maxNumSegments` - fix typos in Javadoc --- src/java/org/apache/nutch/crawl/Generator.java | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/src/java/org/apache/nutch/crawl/Generator.java b/src/java/org/apache/nutch/crawl/Generator.java index 8a2f87ba4..8e085428d 100644 --- a/src/java/org/apache/nutch/crawl/Generator.java +++ b/src/java/org/apache/nutch/crawl/Generator.java @@ -750,7 +750,7 @@ public class Generator extends NutchTool implements Tool { * @param curTime * Current time in milliseconds * @param filter whether to apply filtering operation - * @param norm whether to apply normilization operation + * @param norm whether to apply normalization operation * @param force if true, and the target lockfile exists, consider it valid. If false * and the target file exists, throw an IOException. 
* @param maxNumSegments maximum number of segments to generate @@ -768,8 +768,8 @@ public class Generator extends NutchTool implements Tool { long curTime, boolean filter, boolean norm, boolean force, int maxNumSegments, String expr) throws IOException, InterruptedException, ClassNotFoundException { -return generate(dbDir, segments, numLists, topN, curTime, filter, true, -force, 1, expr, null); +return generate(dbDir, segments, numLists, topN, curTime, filter, norm, +force, maxNumSegments, expr, null); } /** @@ -789,7 +789,7 @@ public class Generator extends NutchTool implements Tool { * @param curTime * Current time in milliseconds * @param filter whether to apply filtering operation - * @param norm whether to apply normilization operation + * @param norm whether to apply normalization operation * @param force if true, and the target lockfile exists, consider it valid. If false * and the target file exists, throw an IOException. * @param maxNumSegments maximum number of segments to generate
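Note on the fix above: the shorter `generate` overload accepted `norm` and `maxNumSegments` but forwarded the hard-coded values `true` and `1` to the full overload. A reduced sketch of that pattern, with hypothetical names and a trimmed parameter list (the real method takes more arguments and returns segment paths):

```java
// Hypothetical, reduced illustration of the delegation bug; not the Nutch code.
class GeneratorDelegationSketch {

  // Stand-in for the full overload: reports the values it actually received.
  static String generate(boolean filter, boolean norm, int maxNumSegments,
      String expr, String hostdb) {
    return "norm=" + norm + ", maxNumSegments=" + maxNumSegments;
  }

  // Before the fix: caller-supplied norm and maxNumSegments are ignored.
  static String generateBefore(boolean filter, boolean norm,
      int maxNumSegments, String expr) {
    return generate(filter, true, 1, expr, null);
  }

  // After the fix: both parameters are passed forward unchanged.
  static String generateAfter(boolean filter, boolean norm,
      int maxNumSegments, String expr) {
    return generate(filter, norm, maxNumSegments, expr, null);
  }

  public static void main(String[] args) {
    System.out.println(generateBefore(true, false, 4, null));
    System.out.println(generateAfter(true, false, 4, null));
  }
}
```

With the old delegation, a caller asking for no normalization and four segments would silently get normalization and a single segment.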
[nutch] 01/07: NUTCH-2920 -- first working attempt at migrating ElasticsearchIndexWriter to OpenSearch
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git commit ca3824fd98290dd7806752decfab6eb9e3b3b569 Author: tallison AuthorDate: Fri Feb 24 14:48:55 2023 -0500 NUTCH-2920 -- first working attempt at migrating ElasticsearchIndexWriter to OpenSearch --- LICENSE-binary | 1 + NOTICE-binary | 4 + conf/index-writers.xml.template| 27 ++ src/plugin/build.xml | 1 + src/plugin/indexer-opensearch-1x/README.md | 44 +++ src/plugin/indexer-opensearch-1x/build-ivy.xml | 47 +++ src/plugin/indexer-opensearch-1x/build.xml | 32 ++ src/plugin/indexer-opensearch-1x/ivy.xml | 46 +++ src/plugin/indexer-opensearch-1x/plugin.xml| 76 .../opensearch1x/OpenSearch1xConstants.java| 38 ++ .../opensearch1x/OpenSearch1xIndexWriter.java | 419 + .../indexwriter/opensearch1x/package-info.java | 22 ++ 12 files changed, 757 insertions(+) diff --git a/LICENSE-binary b/LICENSE-binary index d07d0a6a3..8e24a728e 100644 --- a/LICENSE-binary +++ b/LICENSE-binary @@ -505,6 +505,7 @@ org.jetbrains.kotlin:kotlin-stdlib-jdk8 org.lz4:lz4-java org.mapdb:mapdb org.netpreserve.commons:webarchive-commons +org.opensearch.client:opensearch-rest-high-level-client org.seleniumhq.selenium:htmlunit-driver org.seleniumhq.selenium:selenium-api org.seleniumhq.selenium:selenium-chrome-driver diff --git a/NOTICE-binary b/NOTICE-binary index 83d65ffaf..1aab2cb41 100644 --- a/NOTICE-binary +++ b/NOTICE-binary @@ -1021,6 +1021,10 @@ mapdb (http://www.mapdb.org) webarchive-commons (https://github.com/iipc/webarchive-commons) - license: The Apache Software License, Version 2.0 +# org.opensearch.client:opensearch-rest-high-level-client +opensearch-rest-high-level-client (https://opensearch.org/) +- license: The Apache Software License, Version 2.0 + # org.ow2.asm:asm asm (http://asm.ow2.io/) - license: BSD-3-Clause diff --git a/conf/index-writers.xml.template b/conf/index-writers.xml.template index 9f5d7916c..221f5affe 
100644 --- a/conf/index-writers.xml.template +++ b/conf/index-writers.xml.template @@ -128,6 +128,33 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/src/plugin/build.xml b/src/plugin/build.xml index db7d4d560..4d900c390 100755 --- a/src/plugin/build.xml +++ b/src/plugin/build.xml @@ -54,6 +54,7 @@ + diff --git a/src/plugin/indexer-opensearch-1x/README.md b/src/plugin/indexer-opensearch-1x/README.md new file mode 100644 index 0..b68557fae --- /dev/null +++ b/src/plugin/indexer-opensearch-1x/README.md @@ -0,0 +1,44 @@ +indexer-opensearch1x plugin for Nutch + + +**indexer-opensearch1x plugin** is used for sending documents from one or more segments to an OpenSearch server. The configuration for the index writers is on **conf/index-writers.xml** file, included in the official Nutch distribution and it's as follow: + +```xml + + +... + + +... + + +``` + +Each `` element has two mandatory attributes: + +* `` is a unique identification for each configuration. This feature allows Nutch to distinguish each configuration, even when they are for the same index writer. In addition, it allows to have multiple instances for the same index writer, but with different configurations. + +* `org.apache.nutch.indexwriter.opensearch1x.OpenSearch1x.IndexWriter` corresponds to the canonical name of the class that implements the IndexWriter extension point. This value should not be modified for the **indexer-opensearch1x plugin**. + +## Mapping + +The mapping section is explained [here](https://cwiki.apache.org/confluence/display/NUTCH/IndexWriters#IndexWriters-Mappingsection). The structure of this section is general for all index writers. 
+ +## Parameters + +Each parameter has the form `` and the parameters for this index writer are: + +Parameter Name | Description | Default value +--|--|-- +host | Comma-separated list of hostnames to send documents to using [TransportClient](https://static.javadoc.io/org.opensearch/opensearch/1.3.8/org/opensearch/client/transport/TransportClient.html). Either host and port must be defined. | +port | The port to connect to using [TransportClient](https://static.javadoc.io/org.opensearch/opensearch/1.3.8/org/opensearch/client/transport/TransportClient.html). | 9300 +scheme | The scheme (http or https) to connect to OpenSearch server. | https +index | Default index to send documents to. | nutch +username | Username for auth credentials | admin +password | Password
[nutch] 06/07: fix template to include new key store info. remove unused auth
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git commit e03cad3f42b9be16f45b2012fc738106894ac332 Author: tallison AuthorDate: Wed Mar 1 15:34:08 2023 -0500 fix template to include new key store info. remove unused auth --- conf/index-writers.xml.template | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/conf/index-writers.xml.template b/conf/index-writers.xml.template index 221f5affe..549ebd4c9 100644 --- a/conf/index-writers.xml.template +++ b/conf/index-writers.xml.template @@ -136,10 +136,12 @@ - + + +
[nutch] 05/07: NUTCH-2920 -- improve username/pw logic and update README.md
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git commit 71fabb2a87ff81b78997133ab7c790afa4ea6157 Author: tallison AuthorDate: Wed Mar 1 13:48:57 2023 -0500 NUTCH-2920 -- improve username/pw logic and update README.md --- src/plugin/indexer-opensearch-1x/README.md | 24 +- .../opensearch1x/OpenSearch1xIndexWriter.java | 10 ++--- 2 files changed, 30 insertions(+), 4 deletions(-) diff --git a/src/plugin/indexer-opensearch-1x/README.md b/src/plugin/indexer-opensearch-1x/README.md index b68557fae..52e5844af 100644 --- a/src/plugin/indexer-opensearch-1x/README.md +++ b/src/plugin/indexer-opensearch-1x/README.md @@ -36,9 +36,31 @@ scheme | The scheme (http or https) to connect to OpenSearch server. | https index | Default index to send documents to. | nutch username | Username for auth credentials | admin password | Password for auth credentials | admin -auth | Whether to enable HTTP basic authentication with OpenSearch. Use `username` and `password` properties to configure your credentials. | false +trust.store.path | Path to the trust store | +trust.store.password | Password for trust store | +trust.store.type | Type of trust store | JKS +key.store.path | Path to the key store | +key.store.password | Password for the key and the key store | +key.store.type | Type of key store | JKS max.bulk.docs | Maximum size of the bulk in number of documents. | 250 max.bulk.size | Maximum size of the bulk in bytes. | 2500500 exponential.backoff.millis | Initial delay for the [BulkProcessor](https://static.javadoc.io/org.opensearch/opensearch/1.3.8/org/opensearch/action/bulk/BulkProcessor.html) exponential backoff policy. | 100 exponential.backoff.retries | Number of times the [BulkProcessor](https://static.javadoc.io/org.opensearch/opensearch/1.3.8/org/opensearch/action/bulk/BulkProcessor.html) exponential backoff policy should retry bulk operations. 
| 10 bulk.close.timeout | Number of seconds allowed for the [BulkProcessor](https://static.javadoc.io/org.opensearch/opensearch/1.3.8/org/opensearch/action/bulk/BulkProcessor.html) to complete its last operation. | 600 + +## Authentication and SSL/TLS + +It is highly recommended that users use at least basic authentication (modify the `username` and `password`!!!) and that they set up at least the trust store (1-way TLS). +For a "getting started" level introduction to setting up a trust store, see: [Connecting java-high-level-rest-client](https://opensearch.org/blog/connecting-java-high-level-rest-client-with-opensearch-over-https/). +For a more in depth treatment, see: [Configuring TLS certificates](https://opensearch.org/docs/latest/security/configuration/tls/). + +Users may opt for 2-way TLS and skip basic authentication (`username` and `password`). +To do this, specify both the `trust.store.*` parameters and the `key.store.*` parameters. + +If users do not specify at least 1-way TLS (trust-store), this indexer logs a warning that this is a bad idea(TM), and it will proceed by completely ignoring all the SSL security. + +## Design +This index writer was built to be as close as possible to Nutch's existing indexer-elastic code. We +therefore chose to use the to-be-deprecated-in-3.x `opensearch-rest-high-level-client`. +We should plan to migrate to the `java client` for 2.x, whenever the BulkProcessor has been added. +See the discussion on [NUTCH-2920](https://issues.apache.org/jira/projects/NUTCH/issues/NUTCH-2920). 
\ No newline at end of file diff --git a/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java b/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java index ec516e250..878c55a09 100644 --- a/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java +++ b/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java @@ -194,6 +194,10 @@ public class OpenSearch1xIndexWriter implements IndexWriter { keyStorePath = parameters.get(OpenSearch1xConstants.KEY_STORE_PATH); keyStorePassword = parameters.get(OpenSearch1xConstants.KEY_STORE_PASSWORD); keyStoreType = parameters.get(OpenSearch1xConstants.KEY_STORE_TYPE, "JKS"); + +if (! StringUtils.isAllBlank(user) && password == null) { + throw new IllegalArgumentException("Must specify a password, even if empty, if a 'user' is specified."); +} boolean basicAuth = user != null && password != null; final CredentialsProvider credentialsProvider = new BasicCredentialsProvider(); @@ -262,9 +266,9 @@ public class OpenSearch1xIndexWriter implements IndexWriter { sslBuilder.loadTrustMaterial(trustStore.get(), null);
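Note on the username/password check added above: a configured user without any password is rejected, while an empty password is accepted. A minimal sketch of just the lines visible in the diff, assuming nothing about the surrounding method (the class name, `useBasicAuth`, and the local `isBlank` helper are hypothetical; the helper stands in for the `StringUtils.isAllBlank` call used by the commit):

```java
// Hypothetical illustration of the check shown in the diff; not the plugin code.
class BasicAuthCheckSketch {

  // Local stand-in for a blank check: null, empty, or whitespace only.
  static boolean isBlank(String s) {
    return s == null || s.chars().allMatch(Character::isWhitespace);
  }

  // Mirrors the two statements in the diff: validate, then decide
  // whether basic authentication should be configured.
  static boolean useBasicAuth(String user, String password) {
    if (!isBlank(user) && password == null) {
      throw new IllegalArgumentException(
          "Must specify a password, even if empty, if a 'user' is specified.");
    }
    return user != null && password != null;
  }

  public static void main(String[] args) {
    System.out.println(useBasicAuth("admin", "admin"));
    System.out.println(useBasicAuth("admin", ""));
    System.out.println(useBasicAuth(null, null));
  }
}
```

This matches the README text in the same commit: 2-way TLS users can leave both `username` and `password` unset and skip basic authentication.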
[nutch] 07/07: Add indexer-opensearch-1x to 4 more targets...feedback from sebastian-nagel
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git commit e8fd21090c0a1e387ee3b5796b7a3be11cf91293 Author: tballison AuthorDate: Fri Mar 3 14:48:20 2023 -0500 Add indexer-opensearch-1x to 4 more targets...feedback from sebastian-nagel --- build.xml| 3 +++ src/plugin/build.xml | 1 + 2 files changed, 4 insertions(+) diff --git a/build.xml b/build.xml index 594fabc24..cc88493f3 100644 --- a/build.xml +++ b/build.xml @@ -221,6 +221,7 @@ + @@ -738,6 +739,7 @@ + @@ -1242,6 +1244,7 @@ + diff --git a/src/plugin/build.xml b/src/plugin/build.xml index 4d900c390..e83f25273 100755 --- a/src/plugin/build.xml +++ b/src/plugin/build.xml @@ -195,6 +195,7 @@ +
[nutch] branch master updated (383aeca5d -> e8fd21090)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git from 383aeca5d NUTCH-2980: Upgraded Selenium to 4.7.2 + HTMLUnit new ca3824fd9 NUTCH-2920 -- first working attempt at migrating ElasticsearchIndexWriter to OpenSearch new 6e149f495 NUTCH-2920 -- fix imports new f6b17177a NUTCH-2920 -- add keystore for 2-way tls; add back in no-tls option with a stern warning and possibly helpful links. new 5fc2839c4 NUTCH-2920 -- improve handling for missing trust.store.path in the index-writers.xml new 71fabb2a8 NUTCH-2920 -- improve username/pw logic and update README.md new e03cad3f4 fix template to include new key store info. remove unused auth new e8fd21090 Add indexer-opensearch-1x to 4 more targets...feedback from sebastian-nagel The 7 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. 
Summary of changes: LICENSE-binary | 1 + NOTICE-binary | 4 + build.xml | 3 + conf/index-writers.xml.template| 29 ++ src/plugin/build.xml | 2 + src/plugin/indexer-opensearch-1x/README.md | 66 +++ .../{any23 => indexer-opensearch-1x}/build-ivy.xml | 2 +- .../build.xml | 2 +- .../ivy.xml| 2 +- src/plugin/indexer-opensearch-1x/plugin.xml| 76 .../opensearch1x/OpenSearch1xConstants.java} | 12 +- .../opensearch1x/OpenSearch1xIndexWriter.java | 472 + .../indexwriter/opensearch1x}/package-info.java| 4 +- 13 files changed, 666 insertions(+), 9 deletions(-) create mode 100644 src/plugin/indexer-opensearch-1x/README.md copy src/plugin/{any23 => indexer-opensearch-1x}/build-ivy.xml (95%) copy src/plugin/{indexer-elastic => indexer-opensearch-1x}/build.xml (94%) copy src/plugin/{indexer-elastic => indexer-opensearch-1x}/ivy.xml (94%) create mode 100644 src/plugin/indexer-opensearch-1x/plugin.xml copy src/plugin/{indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticConstants.java => indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xConstants.java} (76%) create mode 100644 src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java copy src/{java/org/apache/nutch/parse => plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x}/package-info.java (86%)
[nutch] 03/07: NUTCH-2920 -- add keystore for 2-way tls; add back in no-tls option with a stern warning and possibly helpful links.
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git commit f6b17177ad6049b5642d9510cb60fe0a1d3b5f1c Author: tallison AuthorDate: Wed Mar 1 12:16:17 2023 -0500 NUTCH-2920 -- add keystore for 2-way tls; add back in no-tls option with a stern warning and possibly helpful links. --- .../opensearch1x/OpenSearch1xConstants.java| 6 +- .../opensearch1x/OpenSearch1xIndexWriter.java | 137 +++-- 2 files changed, 99 insertions(+), 44 deletions(-) diff --git a/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xConstants.java b/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xConstants.java index 8ca5038dd..cb172bda2 100644 --- a/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xConstants.java +++ b/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xConstants.java @@ -20,14 +20,14 @@ public interface OpenSearch1xConstants { String HOSTS = "host"; String PORT = "port"; String SCHEME = "scheme"; - String USER = "username"; String PASSWORD = "password"; - String USE_AUTH = "auth"; - String TRUST_STORE_PATH = "trust.store.path"; String TRUST_STORE_PASSWORD = "trust.store.password"; String TRUST_STORE_TYPE = "trust.store.type"; + String KEY_STORE_PATH = "key.store.path"; + String KEY_STORE_PASSWORD = "key.store.password"; + String KEY_STORE_TYPE = "key.store.type"; String INDEX = "index"; String MAX_BULK_DOCS = "max.bulk.docs"; String MAX_BULK_LENGTH = "max.bulk.size"; diff --git a/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java b/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java index e796a69e4..a121f15a2 100644 --- 
a/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java +++ b/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java @@ -22,6 +22,7 @@ import org.apache.http.HttpHost; import org.apache.http.auth.AuthScope; import org.apache.http.auth.UsernamePasswordCredentials; import org.apache.http.client.CredentialsProvider; +import org.apache.http.conn.ssl.TrustSelfSignedStrategy; import org.apache.http.impl.client.BasicCredentialsProvider; import org.apache.http.impl.nio.client.HttpAsyncClientBuilder; import org.apache.http.ssl.SSLContextBuilder; @@ -52,18 +53,20 @@ import org.slf4j.Logger; import org.slf4j.LoggerFactory; import javax.net.ssl.SSLContext; -import java.io.File; import java.io.IOException; import java.io.InputStream; import java.lang.invoke.MethodHandles; import java.nio.file.Files; import java.nio.file.Paths; +import java.security.GeneralSecurityException; import java.security.KeyStore; import java.time.format.DateTimeFormatter; + import java.util.AbstractMap; import java.util.LinkedHashMap; import java.util.List; import java.util.Map; +import java.util.Optional; import java.util.concurrent.TimeUnit; /** @@ -80,18 +83,19 @@ public class OpenSearch1xIndexWriter implements IndexWriter { private static final int DEFAULT_EXP_BACKOFF_RETRIES = 10; private static final int DEFAULT_BULK_CLOSE_TIMEOUT = 600; private static final String DEFAULT_INDEX = "nutch"; - private static final String DEFAULT_USER = "elastic"; - + private static final String DEFAULT_USER = "admin"; + private static final String DEFAULT_PASSWORD = "admin"; private String[] hosts; private int port; - private String scheme = HttpHost.DEFAULT_SCHEME_NAME; - private String user = null; - private String password = null; - private boolean auth; - + private String scheme = "https"; + private String user; + private String password; private String trustStorePath; private String 
trustStorePassword; private String trustStoreType; + private String keyStorePath; + private String keyStorePassword; + private String keyStoreType; private int maxBulkDocs; private int maxBulkLength; private int expBackoffMillis; @@ -105,6 +109,7 @@ public class OpenSearch1xIndexWriter implements IndexWriter { private Configuration config; + @Override public void open(Configuration conf, String name) throws IOException { // Implementation not required @@ -125,7 +130,7 @@ public class OpenSearch1xIndexWriter implements IndexWriter { String hosts = parameters.get(OpenSearch1xConstants.HOSTS); if (StringUtils.isBlank(hosts)) { - String message = &
[nutch] 04/07: NUTCH-2920 -- improve handling for missing trust.store.path in the index-writers.xml
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git commit 5fc2839c447a1b3695b4bcb507d428d32ff27281 Author: tallison AuthorDate: Wed Mar 1 13:28:07 2023 -0500 NUTCH-2920 -- improve handling for missing trust.store.path in the index-writers.xml --- .../nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java| 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java b/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java index a121f15a2..ec516e250 100644 --- a/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java +++ b/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java @@ -253,9 +253,6 @@ public class OpenSearch1xIndexWriter implements IndexWriter { } private SSLContext createSSLContext() throws GeneralSecurityException, IOException { -if (trustStorePath == null && keyStorePath == null) { - return SSLContexts.createDefault(); -} SSLContextBuilder sslBuilder = SSLContexts.custom(); Optional trustStore = loadStore(trustStorePath, trustStorePassword, trustStoreType); @@ -283,8 +280,8 @@ public class OpenSearch1xIndexWriter implements IndexWriter { if (StringUtils.isAllBlank(storePath)) { return Optional.empty(); } -if (StringUtils.isAllBlank(storePassword)) { - throw new IllegalArgumentException("must include a password for store: " + storePath); +if (storePassword == null) { + throw new IllegalArgumentException("must include a non-null password for store: " + storePath); } KeyStore store = KeyStore.getInstance(storeType);
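Note on the password check changed above: replacing `StringUtils.isAllBlank(storePassword)` with `storePassword == null` means an empty-string password is now accepted, while a missing one is still rejected. The following is a minimal, JDK-only sketch of that behavior and not the plugin's actual code. The class name `StoreLoadingSketch` is hypothetical, `String.isBlank()` stands in for the commons-lang `StringUtils.isAllBlank` call, and the stream handling after `KeyStore.getInstance` is an assumption, since the patch does not show it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.util.Optional;

// Hypothetical helper class, for illustration only.
final class StoreLoadingSketch {

  /**
   * Returns an empty Optional when no store path is configured,
   * rejects a null password, and otherwise loads the store from disk.
   */
  static Optional<KeyStore> loadStore(String storePath, String storePassword,
      String storeType) throws GeneralSecurityException, IOException {
    // No path configured: the caller simply skips this store.
    if (storePath == null || storePath.isBlank()) {
      return Optional.empty();
    }
    // Only null is rejected; "" is a legal (if weak) store password.
    if (storePassword == null) {
      throw new IllegalArgumentException(
          "must include a non-null password for store: " + storePath);
    }
    KeyStore store = KeyStore.getInstance(storeType);
    // Assumed loading step: read the store file with the given password.
    try (InputStream in = Files.newInputStream(Paths.get(storePath))) {
      store.load(in, storePassword.toCharArray());
    }
    return Optional.of(store);
  }
}
```

With this shape, a caller can add trust or key material to the SSL context only when the returned `Optional` is present, so the early `SSLContexts.createDefault()` return removed in the first hunk is no longer needed for the "no stores configured" case.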
[nutch] 02/07: NUTCH-2920 -- fix imports
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 6e149f4954a0b7b21120b8e1467a07a82c60e66e
Author: tallison
AuthorDate: Fri Feb 24 15:22:16 2023 -0500

    NUTCH-2920 -- fix imports
---
 .../apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java b/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java
index c31fbf17d..e796a69e4 100644
--- a/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java
+++ b/src/plugin/indexer-opensearch-1x/src/java/org/apache/nutch/indexwriter/opensearch1x/OpenSearch1xIndexWriter.java
@@ -22,8 +22,6 @@ import org.apache.http.HttpHost;
 import org.apache.http.auth.AuthScope;
 import org.apache.http.auth.UsernamePasswordCredentials;
 import org.apache.http.client.CredentialsProvider;
-import org.apache.http.conn.ssl.NoopHostnameVerifier;
-import org.apache.http.conn.ssl.TrustSelfSignedStrategy;
 import org.apache.http.impl.client.BasicCredentialsProvider;
 import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
 import org.apache.http.ssl.SSLContextBuilder;
@@ -33,7 +31,6 @@ import org.apache.nutch.indexer.IndexWriterParams;
 import org.apache.nutch.indexer.NutchDocument;
 import org.apache.nutch.indexer.NutchField;
 import org.apache.nutch.util.StringUtil;
-import org.checkerframework.checker.units.qual.K;
 import org.opensearch.action.DocWriteRequest;
 import org.opensearch.action.bulk.BackoffPolicy;
 import org.opensearch.action.bulk.BulkProcessor;
[nutch] branch master updated: NUTCH-2980: Upgraded Selenium to 4.7.2 + HTMLUnit
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 383aeca5d NUTCH-2980: Upgraded Selenium to 4.7.2 + HTMLUnit 383aeca5d is described below commit 383aeca5d30342b29b6ee6e05f8f3052c62d7303 Author: Kamil Mroczek AuthorDate: Thu Jan 19 23:05:05 2023 -0500 NUTCH-2980: Upgraded Selenium to 4.7.2 + HTMLUnit - Removed phantomJS dependency as it wasn't being used and the project has been archived since 2018 - it was causing problems casting TakeScreenshot to HtmlUnitWebDriver - Improved README setup instructions for IntelliJ --- README.md | 44 - src/plugin/lib-htmlunit/ivy.xml| 12 +- src/plugin/lib-htmlunit/plugin.xml | 214 - src/plugin/lib-selenium/ivy.xml| 7 +- src/plugin/lib-selenium/plugin.xml | 170 ++-- .../nutch/protocol/selenium/HttpWebClient.java | 28 --- .../handlers/DefaultClickAllAjaxLinksHandler.java | 7 +- 7 files changed, 361 insertions(+), 121 deletions(-) diff --git a/README.md b/README.md index a0ab67bd1..ffd04ae22 100644 --- a/README.md +++ b/README.md @@ -40,6 +40,8 @@ To contribute a patch, follow these instructions (note that installing IDE setup = +### Eclipse + Generate Eclipse project files ``` @@ -48,13 +50,45 @@ ant eclipse and follow the instructions in [Importing existing projects](https://help.eclipse.org/2019-06/topic/org.eclipse.platform.doc.user/tasks/tasks-importproject.htm). -For Intellij IDEA, first install the [IvyIDEA Plugin](https://plugins.jetbrains.com/plugin/3612-ivyidea). then run ```ant eclipse```. +You must [configure the nutch-site.xml](https://cwiki.apache.org/confluence/display/NUTCH/RunNutchInEclipse) before running. Make sure, you've added ```http.agent.name``` and ```plugin.folders``` properties. The plugin.folders normally points to ```/build/plugins```. 
+ +Now create a Java Application Configuration, choose org.apache.nutch.crawl.Injector, add two paths as arguments. First one is the crawldb directory, second one is the URL directory where, the injector can read urls. Now run your configuration. -Then open the project in IntelliJ. You may see popups like "Ant build scripts found", "Frameworks detected - IvyIDEA Framework detected". Just follow the simple steps in these dialogs. +If we still see the ```No plugins found on paths of property plugin.folders="plugins"```, update the plugin.folders in the nutch-default.xml, this is a quick fix, but should not be used. -You must [configure the nutch-site.xml](https://cwiki.apache.org/confluence/display/NUTCH/RunNutchInEclipse) before running. Make sure, you've added ```http.agent.name``` and ```plugin.folders``` properties. The plugin.folders normally points to ```/build/plugins```. -Now create a Java Application Configuration, choose org.apache.nutch.crawl.Injector, add two paths as arguments. First one is the crawldb directory, second one is the URL directory where, the injector can read urls. Now run your configuration. +### Intellij IDEA -If we still see the ```No plugins found on paths of property plugin.folders="plugins"```, update the plugin.folders in the nutch-default.xml, this is a quick fix, but should not be used. +First install the [IvyIDEA Plugin](https://plugins.jetbrains.com/plugin/3612-ivyidea). then run ```ant eclipse```. This will create the necessary +.classpath and .project files so that Intellij can import the project in the next step. + +In Intellij IDEA, select File > New > Project from Existing Sources. Select the nutch home directory and click "Open". + +On the "Import Project" screen select the "Import project from external model" radio button and select "Eclipse". +Click "Create". On the next screen the "Eclipse projects directory" should be already set to the nutch folder. 
+Leave the "Create module files near .classpath files" radio button selected. +Click "Next" on the next screens. On the project SDK screen select Java 11 and click "Create". + +Once the project is imported, you will see a popup saying "Ant build scripts found", "Frameworks detected - IvyIDEA Framework detected". Click "Import". +If you don't get the pop-up, I'd suggest going through the steps again as this happens from time to time. There is another +Ant popup that asks you to configure the project. Do NOT click "Configure". + +To import the code-style, Go to Intellij IDEA > Preferences > Editor > Code Style > Java. + +For the Scheme dropdown select "Project". Click
[nutch] branch master updated: NUTCH-2974 Ant build fails with "Unparseable date" on certain platforms
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 541e6936d NUTCH-2974 Ant build fails with "Unparseable date" on certain platforms new 19dbe7866 Merge pull request #752 from sebastian-nagel/NUTCH-2974 541e6936d is described below commit 541e6936dfb1a07fe4c915b8b95c6b5cfdf2aeb0 Author: Sebastian Nagel AuthorDate: Mon Jan 16 14:22:20 2023 +0100 NUTCH-2974 Ant build fails with "Unparseable date" on certain platforms --- build.xml | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/build.xml b/build.xml index 004a12191..594fabc24 100644 --- a/build.xml +++ b/build.xml @@ -102,7 +102,12 @@ - + +
[nutch] branch master updated: NUTCH-2634 Some links marked as "nofollow" are followed anyway - fix detection of nofollow in multi-valued rel attributes
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new dfdd00f31 NUTCH-2634 Some links marked as "nofollow" are followed anyway - fix detection of nofollow in multi-valued rel attributes new 9a1ed4015 Merge pull request #751 from sebastian-nagel/NUTCH-2634 dfdd00f31 is described below commit dfdd00f3189839b6ed7d60651e5daa33f0038265 Author: Sebastian Nagel AuthorDate: Thu Jan 5 22:53:00 2023 +0100 NUTCH-2634 Some links marked as "nofollow" are followed anyway - fix detection of nofollow in multi-valued rel attributes --- .../org/apache/nutch/parse/html/DOMContentUtils.java | 9 +++-- .../apache/nutch/parse/html/TestDOMContentUtils.java | 17 - .../org/apache/nutch/parse/tika/DOMContentUtils.java | 6 +- .../apache/nutch/parse/tika/TestDOMContentUtils.java | 18 -- 4 files changed, 36 insertions(+), 14 deletions(-) diff --git a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java b/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java index 2415e8568..76685675b 100644 --- a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java +++ b/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java @@ -23,6 +23,7 @@ import java.util.ArrayList; import java.util.HashMap; import java.util.HashSet; import java.util.Set; +import java.util.regex.Pattern; import org.apache.nutch.parse.Outlink; import org.apache.nutch.util.NodeWalker; @@ -30,6 +31,7 @@ import org.apache.nutch.util.URLUtil; import org.w3c.dom.NamedNodeMap; import org.w3c.dom.Node; import org.w3c.dom.NodeList; + import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.io.MapWritable; import org.apache.hadoop.io.Text; @@ -42,7 +44,10 @@ import org.apache.hadoop.io.Text; * */ public class DOMContentUtils { - + + private static 
Pattern NOFOLLOW_PATTERN = Pattern.compile("\\bnofollow\\b", + Pattern.CASE_INSENSITIVE); + private String srcTagMetaName; private boolean keepNodenames; private Set blockNodes; @@ -451,7 +456,7 @@ public class DOMContentUtils { if (params.attrName.equalsIgnoreCase(attrName)) { target = attr.getNodeValue(); } else if ("rel".equalsIgnoreCase(attrName) - && "nofollow".equalsIgnoreCase(attr.getNodeValue())) { + && NOFOLLOW_PATTERN.matcher(attr.getNodeValue()).find()) { noFollow = true; } else if ("method".equalsIgnoreCase(attrName) && "post".equalsIgnoreCase(attr.getNodeValue())) { diff --git a/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java b/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java index 0c1212a50..d50e9052d 100644 --- a/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java +++ b/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java @@ -103,6 +103,11 @@ public class TestDOMContentUtils { + "http://www.nutch.org\; rel=\"nofollow\"> ignore " + "http://www.nutch.org\;> ignore " + ""), + // multiple space-separated rel values (NUTCH-2634) + new String("" + + "http://www.nutch.org\; rel=\"noreferrer nofollow\"> ignore " + + "http://www.nutch.org\;> ignore " + + ""), // test that POST form actions are skipped new String("" + "" @@ -132,13 +137,13 @@ public class TestDOMContentUtils { + "" + "" + ""), }; - private static int SKIP = 9; + private static int SKIP = 10; private static String[] testBaseHrefs = { "http://www.nutch.org;, "http://www.nutch.org/docs/foo.html;, "http://www.nutch.org/docs/;, "http://www.nutch.org/docs/;, "http://www.nutch.org/frames/;, "http://www.nutch.org/maps/;, "http://www.nutch.org/whitespace/;, - "http://www.nutch.org//;, "http://www.nutch.org/;, + "http://www.nutch.org//;, "http://www.nutch.org//;, "http://www.nutch.org/;, "http://www.nutch.org/;, "http://www.nutch.org/;, "http://www.nutch.org/;something;, 
"http://www.nutch.org/;, "http://www.nutch.org/; }; @@ -159,12 +164,13 @@ public class TestDOMContentUtils { + "Tabs are spaces too. This is a break -> and the line after break . " + "
[nutch] branch master updated (85f7bcb63 -> ed7b6615b)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git from 85f7bcb63 Prepare for new development after release of 1.19 - bump version number (-> 1.20-NAPSHOT) new 989c2ca8d NUTCH-2883 Provide means to run server and webapp as persistent services in Docker container new 0bda1bded NUTCH-2883 Provide means to run server and webapp as persistent services in Docker container - move ARG instructions into FROM block they're used in (duplicate if necessary) new 7c1a48cfa NUTCH-2883 Provide means to run server and webapp as persistent services in Docker container - install Nutch WebApp from separate repository (see NUTCH-2886) and run it via `mvn jetty:run -Djetty.port= - sync log paths in supervisord config files new ed7b6615b Merge pull request #748 from sebastian-nagel/NUTCH-2883-docker The 3338 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: .../.dockerfilelintrc | 4 +- docker/Dockerfile | 88 -- docker/README.md | 71 ++--- docker/config/supervisord_startserver.conf | 47 docker/config/supervisord_startserver_webapp.conf | 69 + 5 files changed, 263 insertions(+), 16 deletions(-) copy conf/domain-urlfilter.txt.template => docker/.dockerfilelintrc (94%) create mode 100644 docker/config/supervisord_startserver.conf create mode 100644 docker/config/supervisord_startserver_webapp.conf
svn commit: r56776 - /release/nutch/1.18/
Author: snagel
Date: Sat Sep 10 13:19:52 2022
New Revision: 56776

Log:
Remove 1.18 after release of 1.19

Removed:
    release/nutch/1.18/
[nutch] 02/02: Prepare for new development after release of 1.19 - bump version number (-> 1.20-NAPSHOT)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git commit 85f7bcb63ee801bdfb0b41ca2555160583105ea2 Author: Sebastian Nagel AuthorDate: Thu Sep 8 16:28:27 2022 +0200 Prepare for new development after release of 1.19 - bump version number (-> 1.20-NAPSHOT) --- conf/nutch-default.xml | 2 +- default.properties | 2 +- src/bin/nutch | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml index a908bdb16..d05503d23 100644 --- a/conf/nutch-default.xml +++ b/conf/nutch-default.xml @@ -184,7 +184,7 @@ http.agent.version - Nutch-1.19 + Nutch-1.20-SNAPSHOT A version string to advertise in the User-Agent header. diff --git a/default.properties b/default.properties index 38a070e26..df96199c1 100644 --- a/default.properties +++ b/default.properties @@ -14,7 +14,7 @@ # limitations under the License. name=apache-nutch -version=1.19 +version=1.20-SNAPSHOT final.name=${name}-${version} year=2022 diff --git a/src/bin/nutch b/src/bin/nutch index 3359c7be1..5b999fa6f 100755 --- a/src/bin/nutch +++ b/src/bin/nutch @@ -61,7 +61,7 @@ done # if no args specified, show usage if [ $# = 0 ]; then - echo "nutch 1.19" + echo "nutch 1.20-SNAPSHOT" echo "Usage: nutch COMMAND [-Dproperty=value]... [command-specific args]..." echo "where COMMAND is one of:" echo " readdbread / dump crawl db"
[nutch] branch master updated (ffe059892 -> 85f7bcb63)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git from ffe059892 NUTCH-2969 Javadoc: Javascript search is not working when built on JDK 11 - pass --no-module-directories to javadoc target when building on JDK 11 - remove obsolete condition to fail javadoc builds on JDK 7u25 and earlier new 27cf929b8 Nutch 1.19 release - update current year in API docs etc. - update version number - add changes / release notes - update links to Hadoop API docs new 85f7bcb63 Prepare for new development after release of 1.19 - bump version number (-> 1.20-NAPSHOT) The 2 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: CHANGES.txt| 110 - conf/nutch-default.xml | 2 +- default.properties | 9 ++-- src/bin/nutch | 2 +- 4 files changed, 114 insertions(+), 9 deletions(-)
[nutch] 01/02: Nutch 1.19 release - update current year in API docs etc. - update version number - add changes / release notes - update links to Hadoop API docs
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git commit 27cf929b83ba86b896762dd4970e445069e514ae Author: Sebastian Nagel AuthorDate: Mon Aug 22 15:57:41 2022 +0200 Nutch 1.19 release - update current year in API docs etc. - update version number - add changes / release notes - update links to Hadoop API docs --- CHANGES.txt| 110 - conf/nutch-default.xml | 2 +- default.properties | 9 ++-- src/bin/nutch | 2 +- 4 files changed, 114 insertions(+), 9 deletions(-) diff --git a/CHANGES.txt b/CHANGES.txt index 822bd4acf..adea4478f 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -1,9 +1,117 @@ # Nutch Change Log +Nutch 1.19 Release 22/08/2022 (dd/mm/) +Release Report: https://s.apache.org/lf6li + Breaking Changes -- the plugin parse-swf for parsing Shockwave/Adobe Flash conent was removed (NUTCH-2861) +- Nutch is built on JDK 11 (NUTCH-2857) +- the Nutch WebApp was moved to a separate repository (NUTCH-2886) + see https://github.com/apache/nutch-webapp + https://gitbox.apache.org/repos/asf?p=nutch-webapp.git +- the plugin parse-swf for parsing Shockwave/Adobe Flash content was removed (NUTCH-2861) + +Sub-task + +[NUTCH-2819] - Move spotbugs "installation" directory to avoid that spotbugs is shipped in Nutch runtime +[NUTCH-2846] - Fix various bugs spotted by NUTCH-2815 +[NUTCH-2850] - Method ignores exceptional return value +[NUTCH-2851] - Random object created and used only once +[NUTCH-2855] - Update org.elasticsearch.client + +Bug + +[NUTCH-2290] - Update licenses of bundled libraries +[NUTCH-2512] - Nutch does not build under JDK9 +[NUTCH-2821] - Deduplicate licenses in LICENSE.txt file +[NUTCH-2822] - Split the LICENSE.txt file into two files for source resp. 
binary releases +[NUTCH-2831] - Elastic indexer does not support SSL +[NUTCH-2843] - Duplicate declaration of dependencies in ivy.xml +[NUTCH-2858] - urlnormalizer-protocol: URL port is lost during normalization +[NUTCH-2862] - Do not include Ivy jar in source release package +[NUTCH-2863] - Injector to parse command-line flags case-insensitive +[NUTCH-2866] - MetaData.toString() should return "key=value ..." +[NUTCH-2868] - urlnormalizer-protocol fails with StringIndexOutOfBoundsException when reading invalid line in configuration file +[NUTCH-2881] - bug in 'nutch' symlink in docker container +[NUTCH-2889] - nutch indexer-elasticsearch plugin, doesn't work with https protocol +[NUTCH-2890] - Protocol-okhttp: upgrade okhttp to 4.9.1 to address infinite connection retries +[NUTCH-2894] - Java plugin compilation classpath: priorize plugin dependencies +[NUTCH-2899] - Remove needless warning about missing o/a/rat/anttasks/antlib.xml +[NUTCH-2902] - Jexl parsing error on statements +[NUTCH-2905] - Mask sensitive strings in log output of index writers +[NUTCH-2910] - FetchItemQueues overloaded constructor also interprets fetcher timeout as -1 e.g. no-timeout. 
+[NUTCH-2915] - Upgrade to log4j 2.15.0 +[NUTCH-2916] - Fix log file rotation / rename default log file +[NUTCH-2917] - Remove transitive dependency to log4j 1.x +[NUTCH-2922] - Upgrade to log4j 2.17.0 +[NUTCH-2935] - DeduplicationJob: failure on URLs with invalid percent encoding +[NUTCH-2936] - Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used +[NUTCH-2945] - Solr Index Writer pluging schema.xml missing a copyToField +[NUTCH-2947] - Fetcher: keep state of empty fetch queues unless queue feeder is finished +[NUTCH-2949] - Tasks of a multi-threaded map runner may fail because of slow creation of URL stream handlers +[NUTCH-2951] - Crawl datum with metadata WRITABLE_GENERATE_TIME_KEY awaits fetching forever +[NUTCH-2955] - indexer-solr: replace deprecated/removed field type solr.LatLonType +[NUTCH-2969] - Javadoc: Javascript search is not working when built on JDK 11 + +New Feature + +[NUTCH-2901] - migrate to maven or gradle + +Improvement + +[NUTCH-1403] - Add default ScoringFilter for manipulating metadata +[NUTCH-2429] - Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers +[NUTCH-2449] - Usage of Tika LanguageIdentifier in language-identifier plugin +[NUTCH-2573] - Suspend crawling if robots.txt fails to fetch with 5xx status +[NUTCH-2795] - CrawlDbReader: compress CrawlDb dumps if configured +[NUTCH-2807] - SitemapProcessor to warn that ignoring robots.txt affects detection of sitemaps +[NUTCH-2808] - Document side effects of ignoring robots.txt +[NUTCH-2840] - Fix 'report-vulnerabilities' ant target in b
[nutch-site] 02/02: Announce release of Nutch 1.19 - fix release data in announcement
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/nutch-site.git

commit aa45c17bf678c601f4f691dfbdca77380aea5edd
Author: Sebastian Nagel
AuthorDate: Thu Sep 8 15:25:32 2022 +0200

    Announce release of Nutch 1.19
    - fix release data in announcement
---
 content/news/nutch-1.19-release.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/news/nutch-1.19-release.md b/content/news/nutch-1.19-release.md
index 8a1d135..774345b 100644
--- a/content/news/nutch-1.19-release.md
+++ b/content/news/nutch-1.19-release.md
@@ -1,5 +1,5 @@
 +++
-date = "2021-09-08"
+date = "2022-08-22"
 title = "Nutch 1.19 Release"
 tags = ["1.19","release"]
 categories = ["releases"]
[nutch-site] branch main updated (4efc5a9 -> aa45c17)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a change to branch main in repository https://gitbox.apache.org/repos/asf/nutch-site.git from 4efc5a9 NUTCH-1999 Add /robots.txt to Nutch site (#1) new 73e90d4 Announce release of Nutch 1.19 new aa45c17 Announce release of Nutch 1.19 - fix release data in announcement The 2 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: content/doap.rdf | 7 + .../javadoc/apidocs/allclasses-frame.html | 549 - .../javadoc/apidocs/allclasses-index.html | 2845 + .../{allclasses-noframe.html => allclasses.html} |83 +- .../javadoc/apidocs/allpackages-index.html | 826 ++ .../javadoc/apidocs/constant-values.html | 2415 ++-- .../javadoc/apidocs/deprecated-list.html | 155 +- .../javadoc/apidocs/{package-list => element-list} |18 +- .../documentation/javadoc/apidocs/help-doc.html| 169 +- .../documentation/javadoc/apidocs/index-all.html | 7425 ++-- content/documentation/javadoc/apidocs/index.html | 885 +- .../apidocs/jquery/external/jquery/jquery.js | 10872 ++ .../jquery/images/ui-bg_glass_55_fbf9ee_1x400.png | Bin 0 -> 335 bytes .../jquery/images/ui-bg_glass_65_dadada_1x400.png | Bin 0 -> 262 bytes .../jquery/images/ui-bg_glass_75_dadada_1x400.png | Bin 0 -> 262 bytes .../jquery/images/ui-bg_glass_75_e6e6e6_1x400.png | Bin 0 -> 262 bytes .../jquery/images/ui-bg_glass_95_fef1ec_1x400.png | Bin 0 -> 332 bytes .../ui-bg_highlight-soft_75_cc_1x100.png | Bin 0 -> 280 bytes .../jquery/images/ui-icons_22_256x240.png | Bin 0 -> 6922 bytes .../jquery/images/ui-icons_2e83ff_256x240.png | Bin 0 -> 4549 bytes .../jquery/images/ui-icons_454545_256x240.png | Bin 0 -> 6992 bytes .../jquery/images/ui-icons_88_256x240.png | Bin 0 -> 6999 bytes .../jquery/images/ui-icons_cd0a0a_256x240.png | Bin 0 -> 4549 bytes 
.../javadoc/apidocs/jquery/jquery-3.5.1.js | 10872 ++ .../javadoc/apidocs/jquery/jquery-ui.css | 582 + .../javadoc/apidocs/jquery/jquery-ui.js| 2659 + .../javadoc/apidocs/jquery/jquery-ui.min.css | 7 + .../javadoc/apidocs/jquery/jquery-ui.min.js| 6 + .../javadoc/apidocs/jquery/jquery-ui.structure.css | 156 + .../apidocs/jquery/jquery-ui.structure.min.css | 5 + .../jquery/jszip-utils/dist/jszip-utils-ie.js |56 + .../jquery/jszip-utils/dist/jszip-utils-ie.min.js |10 + .../apidocs/jquery/jszip-utils/dist/jszip-utils.js | 118 + .../jquery/jszip-utils/dist/jszip-utils.min.js |10 + .../javadoc/apidocs/jquery/jszip/dist/jszip.js | 11370 +++ .../javadoc/apidocs/jquery/jszip/dist/jszip.min.js |13 + .../javadoc/apidocs/member-search-index.js | 1 + .../javadoc/apidocs/member-search-index.zip| Bin 0 -> 40331 bytes .../nutch/analysis/lang/HTMLLanguageParser.html| 208 +- .../analysis/lang/LanguageIndexingFilter.html | 211 +- .../lang/class-use/HTMLLanguageParser.html |94 +- .../lang/class-use/LanguageIndexingFilter.html |94 +- .../apache/nutch/analysis/lang/package-frame.html |21 - .../nutch/analysis/lang/package-summary.html | 116 +- .../apache/nutch/analysis/lang/package-tree.html |98 +- .../apache/nutch/analysis/lang/package-use.html|90 +- .../apache/nutch/any23/Any23IndexingFilter.html| 257 +- .../org/apache/nutch/any23/Any23ParseFilter.html | 266 +- .../nutch/any23/class-use/Any23IndexingFilter.html |94 +- .../nutch/any23/class-use/Any23ParseFilter.html|94 +- .../org/apache/nutch/any23/package-frame.html |21 - .../org/apache/nutch/any23/package-summary.html| 124 +- .../org/apache/nutch/any23/package-tree.html |98 +- .../org/apache/nutch/any23/package-use.html|90 +- .../apache/nutch/collection/CollectionManager.html | 272 +- .../org/apache/nutch/collection/Subcollection.html | 417 +- .../collection/class-use/CollectionManager.html| 121 +- .../nutch/collection/class-use/Subcollection.html | 142 +- .../org/apache/nutch/collection/package-frame.html |21 - 
.../apache/nutch/collection/package-summary.html | 170 +- .../org/apache/nutch/collection/package-tree.html | 100 +- .../org/apache/nutch/collection/package-use.html | 114 +- .../apache/nutch/crawl/AbstractFetchSchedule.html | 321 +- .../apache/nutch/crawl/AdaptiveFetchSchedule.html | 248 +- .../apache/nutch/crawl/CrawlDatum.Comparator.html | 167 +- .../apidocs/
[nutch-site] branch asf-site updated: Announce release of Nutch 1.19 - fix release data in announcement
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/nutch-site.git The following commit(s) were added to refs/heads/asf-site by this push: new 956a142 Announce release of Nutch 1.19 - fix release data in announcement 956a142 is described below commit 956a1425b97c07e2e7469296d810d28f70667a50 Author: Sebastian Nagel AuthorDate: Thu Sep 8 15:26:59 2022 +0200 Announce release of Nutch 1.19 - fix release data in announcement --- content/categories/index.xml | 4 ++-- content/categories/releases/index.xml | 4 ++-- content/index.xml | 4 ++-- content/news/index.xml | 4 ++-- content/news/nutch-1.19-release/index.html | 4 ++-- content/sitemap.xml| 16 content/tags/1.19/index.xml| 4 ++-- content/tags/index.xml | 6 +++--- content/tags/release/index.xml | 4 ++-- 9 files changed, 25 insertions(+), 25 deletions(-) diff --git a/content/categories/index.xml b/content/categories/index.xml index cb64e1f..9f537a6 100644 --- a/content/categories/index.xml +++ b/content/categories/index.xml @@ -6,11 +6,11 @@ Recent content in Categories on Apache Nutch™ Hugo -- gohugo.io en-us -Wed, 08 Sep 2021 00:00:00 + +Mon, 22 Aug 2022 00:00:00 + releases /categories/releases/ - Wed, 08 Sep 2021 00:00:00 + + Mon, 22 Aug 2022 00:00:00 + /categories/releases/ diff --git a/content/categories/releases/index.xml b/content/categories/releases/index.xml index 5ea4121..80e9797 100644 --- a/content/categories/releases/index.xml +++ b/content/categories/releases/index.xml @@ -6,11 +6,11 @@ Recent content in releases on Apache Nutch™ Hugo -- gohugo.io en-us -Wed, 08 Sep 2021 00:00:00 + +Mon, 22 Aug 2022 00:00:00 + Nutch 1.19 Release /news/nutch-1.19-release/ - Wed, 08 Sep 2021 00:00:00 + + Mon, 22 Aug 2022 00:00:00 + /news/nutch-1.19-release/ The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.19, we advise all current users and developers of the 1.X series to upgrade to 
this release. diff --git a/content/index.xml b/content/index.xml index 8a0e2c4..f2fd700 100644 --- a/content/index.xml +++ b/content/index.xml @@ -6,11 +6,11 @@ Recent content on Apache Nutch™ Hugo -- gohugo.io en-us -Wed, 08 Sep 2021 00:00:00 + +Mon, 22 Aug 2022 00:00:00 + Nutch 1.19 Release /news/nutch-1.19-release/ - Wed, 08 Sep 2021 00:00:00 + + Mon, 22 Aug 2022 00:00:00 + /news/nutch-1.19-release/ The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.19, we advise all current users and developers of the 1.X series to upgrade to this release. diff --git a/content/news/index.xml b/content/news/index.xml index ff055e8..d129c73 100644 --- a/content/news/index.xml +++ b/content/news/index.xml @@ -6,11 +6,11 @@ Recent content in Project News on Apache Nutch™ Hugo -- gohugo.io en-us -Wed, 08 Sep 2021 00:00:00 + +Mon, 22 Aug 2022 00:00:00 + Nutch 1.19 Release /news/nutch-1.19-release/ - Wed, 08 Sep 2021 00:00:00 + + Mon, 22 Aug 2022 00:00:00 + /news/nutch-1.19-release/ The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.19, we advise all current users and developers of the 1.X series to upgrade to this release. 
diff --git a/content/news/nutch-1.19-release/index.html b/content/news/nutch-1.19-release/index.html index 10c420b..fb55985 100644 --- a/content/news/nutch-1.19-release/index.html +++ b/content/news/nutch-1.19-release/index.html @@ -25,8 +25,8 @@ - - + + diff --git a/content/sitemap.xml b/content/sitemap.xml index e3fa139..5570667 100644 --- a/content/sitemap.xml +++ b/content/sitemap.xml @@ -5,7 +5,7 @@ https://nutch.apache.org/news/nutch-1.19-release/ - 2021-09-08T00:00:00+00:00 + 2022-08-22T00:00:00+00:00 https://nutch.apache.org/news/nutch-1.18-release/ @@ -65,31 +65,31 @@ https://nutch.apache.org/tags/1.19/ - 2021-09-08T00:00:00+00:00 + 2022-08-22T00:00:00+00:00 https://nutch.apache.org/ - 2021-09-08T00:00:00+00:00 + 2022-08-22T00:00:00+00:00 https://nutch.apache.org/categories/ - 2021-09-08T00:00:00+00:00 + 2022-08-22T00:00:00+00:00 https://nutch.apache.org/news/ - 2021-09-08T00:00:00+00:00 + 2022-08-22T00:00:00+00:00 https://nutch.apache.org/tags/release/ - 2021-09-08T00:00:00+00:00 + 2022-08-22T00:00:00+00:00
[nutch-site] branch asf-staging updated: Announce release of Nutch 1.19 - fix release data in announcement
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch asf-staging in repository https://gitbox.apache.org/repos/asf/nutch-site.git The following commit(s) were added to refs/heads/asf-staging by this push: new 87176ac Announce release of Nutch 1.19 - fix release data in announcement 87176ac is described below commit 87176ac53fa8fc604abebf23fabf4ad5a77bfd6b Author: Sebastian Nagel AuthorDate: Thu Sep 8 15:26:59 2022 +0200 Announce release of Nutch 1.19 - fix release data in announcement --- content/categories/index.xml | 4 ++-- content/categories/releases/index.xml | 4 ++-- content/index.xml | 4 ++-- content/news/index.xml | 4 ++-- content/news/nutch-1.19-release/index.html | 4 ++-- content/sitemap.xml| 16 content/tags/1.19/index.xml| 4 ++-- content/tags/index.xml | 6 +++--- content/tags/release/index.xml | 4 ++-- 9 files changed, 25 insertions(+), 25 deletions(-) diff --git a/content/categories/index.xml b/content/categories/index.xml index cb64e1f..9f537a6 100644 --- a/content/categories/index.xml +++ b/content/categories/index.xml @@ -6,11 +6,11 @@ Recent content in Categories on Apache Nutch™ Hugo -- gohugo.io en-us -Wed, 08 Sep 2021 00:00:00 + +Mon, 22 Aug 2022 00:00:00 + releases /categories/releases/ - Wed, 08 Sep 2021 00:00:00 + + Mon, 22 Aug 2022 00:00:00 + /categories/releases/ diff --git a/content/categories/releases/index.xml b/content/categories/releases/index.xml index 5ea4121..80e9797 100644 --- a/content/categories/releases/index.xml +++ b/content/categories/releases/index.xml @@ -6,11 +6,11 @@ Recent content in releases on Apache Nutch™ Hugo -- gohugo.io en-us -Wed, 08 Sep 2021 00:00:00 + +Mon, 22 Aug 2022 00:00:00 + Nutch 1.19 Release /news/nutch-1.19-release/ - Wed, 08 Sep 2021 00:00:00 + + Mon, 22 Aug 2022 00:00:00 + /news/nutch-1.19-release/ The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.19, we advise all current users and developers of the 1.X series to 
upgrade to this release. diff --git a/content/index.xml b/content/index.xml index 8a0e2c4..f2fd700 100644 --- a/content/index.xml +++ b/content/index.xml @@ -6,11 +6,11 @@ Recent content on Apache Nutch™ Hugo -- gohugo.io en-us -Wed, 08 Sep 2021 00:00:00 + +Mon, 22 Aug 2022 00:00:00 + Nutch 1.19 Release /news/nutch-1.19-release/ - Wed, 08 Sep 2021 00:00:00 + + Mon, 22 Aug 2022 00:00:00 + /news/nutch-1.19-release/ The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.19, we advise all current users and developers of the 1.X series to upgrade to this release. diff --git a/content/news/index.xml b/content/news/index.xml index ff055e8..d129c73 100644 --- a/content/news/index.xml +++ b/content/news/index.xml @@ -6,11 +6,11 @@ Recent content in Project News on Apache Nutch™ Hugo -- gohugo.io en-us -Wed, 08 Sep 2021 00:00:00 + +Mon, 22 Aug 2022 00:00:00 + Nutch 1.19 Release /news/nutch-1.19-release/ - Wed, 08 Sep 2021 00:00:00 + + Mon, 22 Aug 2022 00:00:00 + /news/nutch-1.19-release/ The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.19, we advise all current users and developers of the 1.X series to upgrade to this release. 
diff --git a/content/news/nutch-1.19-release/index.html b/content/news/nutch-1.19-release/index.html index 10c420b..fb55985 100644 --- a/content/news/nutch-1.19-release/index.html +++ b/content/news/nutch-1.19-release/index.html @@ -25,8 +25,8 @@ - - + + diff --git a/content/sitemap.xml b/content/sitemap.xml index e3fa139..5570667 100644 --- a/content/sitemap.xml +++ b/content/sitemap.xml @@ -5,7 +5,7 @@ https://nutch.apache.org/news/nutch-1.19-release/ - 2021-09-08T00:00:00+00:00 + 2022-08-22T00:00:00+00:00 https://nutch.apache.org/news/nutch-1.18-release/ @@ -65,31 +65,31 @@ https://nutch.apache.org/tags/1.19/ - 2021-09-08T00:00:00+00:00 + 2022-08-22T00:00:00+00:00 https://nutch.apache.org/ - 2021-09-08T00:00:00+00:00 + 2022-08-22T00:00:00+00:00 https://nutch.apache.org/categories/ - 2021-09-08T00:00:00+00:00 + 2022-08-22T00:00:00+00:00 https://nutch.apache.org/news/ - 2021-09-08T00:00:00+00:00 + 2022-08-22T00:00:00+00:00 https://nutch.apache.org/tags/release/ - 2021-09-08T00:00:00+00:00 + 2022-08-22T00:00:00+00
[nutch-site] branch asf-site updated (a41c7ef -> 314b1b2)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a change to branch asf-site in repository https://gitbox.apache.org/repos/asf/nutch-site.git from a41c7ef Add doap.rdf (lost during CMS migration) new 1e7bf4e - add README for branch asf-site - modify .asf.yaml to contain only instructions required in branch asf-site new 45468fc Update content from Hugo build after adding Kube modified templates new 314b1b2 Announce release of Nutch 1.19 The 3 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: .asf.yaml |17 +- README.md | 7 + content/apache/index.html |11 +- content/categories/index.html |16 +- content/categories/index.xml | 4 +- content/categories/news/index.html |16 +- content/categories/releases/index.html |25 +- content/categories/releases/index.xml |13 +- content/community/board-reporting/index.html |11 +- content/community/bot/index.html |11 +- content/community/contributing/index.html |11 +- content/community/index.html |14 +- content/community/index.xml| 4 +- content/community/mailing-lists/index.html |11 +- content/community/merchandise/index.html |11 +- content/community/people-credits/index.html|11 +- content/development/index.html |14 +- content/development/index.xml | 4 +- content/development/issue-tracker/index.html |11 +- content/development/nightly-builds/index.html |11 +- .../development/source-code-management/index.html |11 +- content/doap.rdf | 9 +- content/documentation/about/index.html |11 +- content/documentation/faqs/index.html |11 +- content/documentation/index.html |14 +- content/documentation/index.xml| 4 +- .../javadoc/apidocs/allclasses-frame.html | 549 - .../javadoc/apidocs/allclasses-index.html | 2845 + .../{allclasses-noframe.html => allclasses.html} |83 +- .../javadoc/apidocs/allpackages-index.html | 826 ++ 
.../javadoc/apidocs/constant-values.html | 2415 ++-- .../javadoc/apidocs/deprecated-list.html | 155 +- .../javadoc/apidocs/{package-list => element-list} |18 +- .../documentation/javadoc/apidocs/help-doc.html| 169 +- .../documentation/javadoc/apidocs/index-all.html | 7425 ++-- content/documentation/javadoc/apidocs/index.html | 885 +- .../apidocs/jquery/external/jquery/jquery.js | 10872 ++ .../jquery/images/ui-bg_glass_55_fbf9ee_1x400.png | Bin 0 -> 335 bytes .../jquery/images/ui-bg_glass_65_dadada_1x400.png | Bin 0 -> 262 bytes .../jquery/images/ui-bg_glass_75_dadada_1x400.png | Bin 0 -> 262 bytes .../jquery/images/ui-bg_glass_75_e6e6e6_1x400.png | Bin 0 -> 262 bytes .../jquery/images/ui-bg_glass_95_fef1ec_1x400.png | Bin 0 -> 332 bytes .../ui-bg_highlight-soft_75_cc_1x100.png | Bin 0 -> 280 bytes .../jquery/images/ui-icons_22_256x240.png | Bin 0 -> 6922 bytes .../jquery/images/ui-icons_2e83ff_256x240.png | Bin 0 -> 4549 bytes .../jquery/images/ui-icons_454545_256x240.png | Bin 0 -> 6992 bytes .../jquery/images/ui-icons_88_256x240.png | Bin 0 -> 6999 bytes .../jquery/images/ui-icons_cd0a0a_256x240.png | Bin 0 -> 4549 bytes .../javadoc/apidocs/jquery/jquery-3.5.1.js | 10872 ++ .../javadoc/apidocs/jquery/jquery-ui.css | 582 + .../javadoc/apidocs/jquery/jquery-ui.js| 2659 + .../javadoc/apidocs/jquery/jquery-ui.min.css | 7 + .../javadoc/apidocs/jquery/jquery-ui.min.js| 6 + .../javadoc/apidocs/jquery/jquery-ui.structure.css | 156 + .../apidocs/jquery/jquery-ui.structure.min.css | 5 + .../jquery/jszip-utils/dist/jszip-utils-ie.js |56 + .../jquery/jszip-utils/dist/jszip-utils-ie.min.js |10 + .../apidocs/jquery/jszip-utils/dist/jszip-utils.js | 118 + .../jquery/jszip-utils/dist/jszip-utils.min.js |10 + .../javadoc/apidocs/jquery/jszip/dist/jszip.js | 11370 +++ .../javadoc/apidocs/jquery/jszip/dist/jszip.min.js |13 + .../javadoc/apidocs/member-search-index.js | 1 + .../javadoc/apidocs/member-search-index.zip| Bin 0 -> 40331 byt
[nutch-site] 02/03: Update content from Hugo build after adding Kube modified templates
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/nutch-site.git commit 45468fc2c2e83cfe1aef57f437ca02991c0256b3 Author: Sebastian Nagel AuthorDate: Thu Sep 8 10:54:37 2022 +0200 Update content from Hugo build after adding Kube modified templates --- content/apache/index.html | 11 ++-- content/categories/index.html | 16 ++--- content/categories/news/index.html | 16 ++--- content/categories/releases/index.html | 16 ++--- content/community/board-reporting/index.html | 11 ++-- content/community/bot/index.html | 11 ++-- content/community/contributing/index.html | 11 ++-- content/community/index.html | 14 ++--- content/community/index.xml| 4 +- content/community/mailing-lists/index.html | 11 ++-- content/community/merchandise/index.html | 11 ++-- content/community/people-credits/index.html| 11 ++-- content/development/index.html | 14 ++--- content/development/index.xml | 4 +- content/development/issue-tracker/index.html | 11 ++-- content/development/nightly-builds/index.html | 11 ++-- .../development/source-code-management/index.html | 11 ++-- content/doap.rdf | 2 +- content/documentation/about/index.html | 11 ++-- content/documentation/faqs/index.html | 11 ++-- content/documentation/index.html | 14 ++--- content/documentation/index.xml| 4 +- content/documentation/javadoc/index.html | 15 ++--- content/documentation/tutorials/index.html | 11 ++-- content/documentation/wiki/index.html | 11 ++-- content/download/index.html| 27 content/favicon.ico| Bin 0 -> 894 bytes content/img/{kube => }/plug.svg| 0 content/img/{kube => }/plus-square.svg | 0 content/index.html | 70 ++--- content/index.xml | 3 +- content/news/index.html| 16 ++--- content/news/index.xml | 4 +- content/news/legacy-nutch-news/index.html | 11 ++-- content/news/nutch-1.18-release/index.html | 11 ++-- content/tags/1.18/index.html | 16 ++--- content/tags/index.html| 16 ++--- content/tags/legacy/index.html 
| 16 ++--- content/tags/news/index.html | 16 ++--- content/tags/release/index.html| 16 ++--- 40 files changed, 257 insertions(+), 238 deletions(-) diff --git a/content/apache/index.html b/content/apache/index.html index adaf256..3d84f47 100644 --- a/content/apache/index.html +++ b/content/apache/index.html @@ -2,11 +2,11 @@ - + - Apache + Apache Nutch™ – Apache @@ -38,9 +38,11 @@ + + - @@ -117,8 +119,7 @@ - - 2004-2021 The Apache Software Foundation. Built using the https://github.com/jeblister/kube; target="_blank" rel="noopener noreferrer">kube Theme for Hugo. Apache Nutch, Nutch, Apache, the Apache feather logo, and the Apache Nutch project logo are trademarks of The Apache Software Foundation. + 2004-2022 The Apache Software Foundation. Built using the https://github.com/jeblister/kube; target="_blank" rel="noopener noreferrer">kube Theme for Hugo. Apache Nutch, Nutch, Apache, the Apache feather logo, and the Apache Nutch project logo are trademarks of The Apache Software Foundation. diff --git a/content/categories/index.html b/content/categories/index.html index fb07b86..ecec8b1 100644 --- a/content/categories/index.html +++ b/content/categories/index.html @@ -2,11 +2,11 @@ - + - Categories + Apache Nutch™ – Categories @@ -37,10 +37,11 @@ - + + - @@ -99,8 +100,8 @@ - Project News -News, activity, ideas, and whatever feels important. https://twitter.com/@ApacheNutch;>Follow us on Twitter + Categories + @@ -135,8 +136,7 @@ - - 2004-2021 The Apache Software Foundation. Built using the https://github.com/jeblister/kube; target="_blank" rel="noopener noreferrer">kube Theme for Hugo. Apache Nutch, Nutch, Apache, the Apache feather logo, and the Apache Nutch project logo are trademarks of The Apache Software Foundation. + 2004-2022 The Apache Software Foundation. Built using the https://github.com/jeblister/kube; target=&qu
[nutch-site] 01/03: - add README for branch asf-site - modify .asf.yaml to contain only instructions required in branch asf-site
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/nutch-site.git

commit 1e7bf4e9e7c2f5450444623847476d1b73d7b773
Author: Sebastian Nagel
AuthorDate: Thu Sep 8 14:59:33 2022 +0200

    - add README for branch asf-site
    - modify .asf.yaml to contain only instructions required in branch asf-site
---
 .asf.yaml | 17 ++---
 README.md | 7 +++
 2 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/.asf.yaml b/.asf.yaml
index 0cc84e6..2ae5ca7 100644
--- a/.asf.yaml
+++ b/.asf.yaml
@@ -15,20 +15,7 @@
 # specific language governing permissions and limitations
 # under the License.
 
-# https://cwiki.apache.org/confluence/display/INFRA/.asf.yaml+features+for+git+repositories
-
-github:
-  description: "Apache Nutch Website"
-  homepage: https://nutch.apache.org/
-  labels:
-    - apache
-    - nutch
-    - hugo
-
-  enabled_merge_buttons:
-    squash: true
-    merge: false
-    rebase: false
+# https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features
 
 publish:
-  whoami: asf-site
\ No newline at end of file
+  whoami: asf-site
diff --git a/README.md b/README.md
new file mode 100644
index 000..dd2eb49
--- /dev/null
+++ b/README.md
@@ -0,0 +1,7 @@
+Apache Nutch Website
+
+
+The `asf-site` branch is only used for storing the generated static website.
+From this branch, the Nutch website is being served.
+
+Please submit patch and pull requests on the `main` branch instead of the `asf-site` branch.
\ No newline at end of file
[nutch-site] branch asf-staging updated (3e9e725 -> 2cfe00d)
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/nutch-site.git

 discard 3e9e725  Announce release of Nutch 1.19
     new 2cfe00d  Announce release of Nutch 1.19

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version. This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (3e9e725)
            \
             N -- N -- N   refs/heads/asf-staging (2cfe00d)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them. Any revisions marked "discard" are gone forever.

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
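The bookkeeping described in the notice above can be illustrated with a small Python sketch. It is not how the notification mailer is implemented; the commit names, the `parents` map, and the function names are hypothetical and only mirror the B / O / N diagram. The sketch assumes a commit counts as "omit" when some other reference still reaches it and as "discard" otherwise, which is how the notice words it.

```python
def ancestors(tip, parents):
    """Return every commit reachable from tip, including tip itself."""
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return seen


def classify_force_push(old_tip, new_tip, parents, other_tips=()):
    """Split the commits affected by a forced update into three groups."""
    old = ancestors(old_tip, parents)
    new = ancestors(new_tip, parents)
    kept_elsewhere = set()
    for tip in other_tips:
        kept_elsewhere |= ancestors(tip, parents)
    undone = old - new  # the O revisions in the diagram
    return {
        "new": new - old,                    # the N revisions
        "omit": undone & kept_elsewhere,     # still reachable from another ref
        "discard": undone - kept_elsewhere,  # no longer reachable at all
    }


# Hypothetical toy history shaped like the diagram: B is the common base.
parents = {
    "a1": [], "a2": ["a1"], "B": ["a2"],
    "O1": ["B"], "O2": ["O1"], "O3": ["O2"],
    "N1": ["B"], "N2": ["N1"], "N3": ["N2"],
}

# Assume one other branch still points at O1.
result = classify_force_push("O3", "N3", parents, other_tips=["O1"])
for label in ("new", "omit", "discard"):
    print(label, sorted(result[label]))
```

In the push reported above, the single O revision (3e9e725) is listed as "discard" and the single N revision (2cfe00d) as "new".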
Summary of changes: content/categories/index.xml | 4 +- content/categories/releases/index.html | 9 + content/categories/releases/index.xml |13 +- content/doap.rdf | 7 + .../javadoc/apidocs/allclasses-frame.html | 549 - .../javadoc/apidocs/allclasses-index.html | 2845 + .../{allclasses-noframe.html => allclasses.html} |83 +- .../javadoc/apidocs/allpackages-index.html | 826 ++ .../javadoc/apidocs/constant-values.html | 2415 ++-- .../javadoc/apidocs/deprecated-list.html | 155 +- .../javadoc/apidocs/{package-list => element-list} |18 +- .../documentation/javadoc/apidocs/help-doc.html| 169 +- .../documentation/javadoc/apidocs/index-all.html | 7425 ++-- content/documentation/javadoc/apidocs/index.html | 885 +- .../apidocs/jquery/external/jquery/jquery.js | 10872 ++ .../jquery/images/ui-bg_glass_55_fbf9ee_1x400.png | Bin 0 -> 335 bytes .../jquery/images/ui-bg_glass_65_dadada_1x400.png | Bin 0 -> 262 bytes .../jquery/images/ui-bg_glass_75_dadada_1x400.png | Bin 0 -> 262 bytes .../jquery/images/ui-bg_glass_75_e6e6e6_1x400.png | Bin 0 -> 262 bytes .../jquery/images/ui-bg_glass_95_fef1ec_1x400.png | Bin 0 -> 332 bytes .../ui-bg_highlight-soft_75_cc_1x100.png | Bin 0 -> 280 bytes .../jquery/images/ui-icons_22_256x240.png | Bin 0 -> 6922 bytes .../jquery/images/ui-icons_2e83ff_256x240.png | Bin 0 -> 4549 bytes .../jquery/images/ui-icons_454545_256x240.png | Bin 0 -> 6992 bytes .../jquery/images/ui-icons_88_256x240.png | Bin 0 -> 6999 bytes .../jquery/images/ui-icons_cd0a0a_256x240.png | Bin 0 -> 4549 bytes .../javadoc/apidocs/jquery/jquery-3.5.1.js | 10872 ++ .../javadoc/apidocs/jquery/jquery-ui.css | 582 + .../javadoc/apidocs/jquery/jquery-ui.js| 2659 + .../javadoc/apidocs/jquery/jquery-ui.min.css | 7 + .../javadoc/apidocs/jquery/jquery-ui.min.js| 6 + .../javadoc/apidocs/jquery/jquery-ui.structure.css | 156 + .../apidocs/jquery/jquery-ui.structure.min.css | 5 + .../jquery/jszip-utils/dist/jszip-utils-ie.js |56 + .../jquery/jszip-utils/dist/jszip-utils-ie.min.js |10 + 
.../apidocs/jquery/jszip-utils/dist/jszip-utils.js | 118 + .../jquery/jszip-utils/dist/jszip-utils.min.js |10 + .../javadoc/apidocs/jquery/jszip/dist/jszip.js | 11370 +++ .../javadoc/apidocs/jquery/jszip/dist/jszip.min.js |13 + .../javadoc/apidocs/member-search-index.js | 1 + .../javadoc/apidocs/member-search-index.zip| Bin 0 -> 40331 bytes .../nutch/analysis/lang/HTMLLanguageParser.html| 208 +- .../analysis/lang/LanguageIndexingFilter.html | 211 +- .../lang/class-use/HTMLLanguageParser.html |94 +- .../lang/class-use/LanguageIndexingFilter.html |94 +- .../apache/nutch/analysis/lang/package-frame.html |21 - .../nutch/analysis/lang/package-summary.html | 116 +- .../apache/nutch/analysis/lang/package-tree.html |98 +- .../apache/nutch/analysis/lang/package-use.html|90 +- .../apache/nutch/any23/Any23IndexingFilter.html| 257 +- .../org/apache/nutch/any23/Any23ParseFilter.html | 266 +- .../nutch/any23/class-use/Any23IndexingFilter.html |94 +- .../nutch/any23/class-use/Any23ParseFilter.html|94 +- .../org/apache/nutch/any23/package-frame.html |21 - .../org/apache/nutch/any23/package-summary.html| 124 +- .../org/apache/
[nutch-site] branch asf-staging updated: Announce release of Nutch 1.19
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/nutch-site.git

The following commit(s) were added to refs/heads/asf-staging by this push:
     new 3e9e725  Announce release of Nutch 1.19
3e9e725 is described below

commit 3e9e72539a4fbf9ea5dd3b5a6fafb03a6d0229ca
Author: Sebastian Nagel
AuthorDate: Thu Sep 8 14:58:43 2022 +0200

    Announce release of Nutch 1.19
---
 favicon.ico | Bin 894 -> 0 bytes
 1 file changed, 0 insertions(+), 0 deletions(-)

diff --git a/favicon.ico b/favicon.ico
deleted file mode 100644
index e07ca51..000
Binary files a/favicon.ico and /dev/null differ
svn commit: r56738 [1/3] - /release/nutch/1.19/CHANGES.txt
Author: snagel
Date: Thu Sep 8 12:44:33 2022
New Revision: 56738

Log:
Release Apache Nutch 1.19
- add change log

Added:
    release/nutch/1.19/CHANGES.txt (with props)
svn commit: r56738 [3/3] - /release/nutch/1.19/CHANGES.txt
Propchange: release/nutch/1.19/CHANGES.txt
--
    svn:eol-style = native
svn commit: r56738 [2/3] - /release/nutch/1.19/CHANGES.txt
in REST workflow must be ingested into HDFS +[NUTCH-2329] - Update Slf4j logging for Java 8 and upgrade miredot plugin version +[NUTCH-2336] - SegmentReader to implement Tool +[NUTCH-2352] - Log with Generic Class Name at Nutch 1.x +[NUTCH-2355] - Protocol plugins to set cookie if Cookie metadata field is present +[NUTCH-2367] - Get single record from HostDB + +New Feature + +[NUTCH-2132] - Publisher/Subscriber model for Nutch to emit events + +Task + +[NUTCH-2171] - Upgrade Nutch Trunk to Java 1.8 + + +Nutch 1.12 Release 28/05/2016 (dd/mm/) +Release Report: https://s.apache.org/nutch1.12 + +Comments + +Fellow committers, Nutch 1.12 contains a breaking change NUTCH-2220. Please use the note below and +in the release announcement and keep it on top in this CHANGES.txt for the Nutch 1.12 release. + +* replace your old conf/nutch-default.xml with the conf/nutch-default.xml from Nutch 1.12 release +* if you use LinkDB (e.g. invertlinks) and modified parameters db.max.inlinks and/or db.max.anchor.length + and/or db.ignore.internal.links, rename those parameters to linkdb.max.inlinks and + linkdb.max.anchor.length and linkdb.ignore.internal.links +* db.ignore.internal.links and db.ignore.external.links now operate on the CrawlDB only +* linkdb.ignore.internal.links and linkdb.ignore.external.links now operate on the LinkDB only + +Sub-task + +[NUTCH-2250] - CommonCrawlDumper : Invalid format + skipped parts + +Bug + +[NUTCH-2042] - parse-html increase chunk size used to detect charset +[NUTCH-2180] - FileDumper dumps data, but breaks midway on corrupt segments +[NUTCH-2189] - Domain filter must deactivate if no rules are present +[NUTCH-2203] - Suffix URL filter can't handle trailing/leading whitespaces +[NUTCH-2206] - Provide example scoring.similarity.stopword.file +[NUTCH-2213] - CommonCrawlDataDumper saves gzipped body in extracted form +[NUTCH-2223] - Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection +[NUTCH-2224] - Average bytes/second 
calculated incorrectly in fetcher +[NUTCH-2225] - Parsed time calculated incorrectly +[NUTCH-2228] - Plugin index-replace unit test broken on Java 8 +[NUTCH-2232] - DeduplicationJob should decode URL's before length is compared +[NUTCH-2241] - Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration +[NUTCH-2256] - Inconsistent log level practice + +Improvement + +[NUTCH-1233] - Rely on Tika for outlink extraction +[NUTCH-1712] - Use MultipleInputs in Injector to make it a single mapreduce job +[NUTCH-2172] - index-more: document format of contenttype-mapping.txt +[NUTCH-2178] - DeduplicationJob to optionally group on host or domain +[NUTCH-2182] - Make reverseUrlDirs file dumper option hash the URL for consistency +[NUTCH-2183] - Improvement to SegmentChecker for skipping non-segments present in segments directory +[NUTCH-2187] - Change FileDumper SHAs to all uppercase +[NUTCH-2195] - IndexingFilterChecker to optionally follow N redirects +[NUTCH-2196] - IndexingFilterChecker to optionally normalize +[NUTCH-2197] - Add solr5 solrcloud indexer support +[NUTCH-2204] - Remove junit lib from runtime +[NUTCH-2218] - Switch CrawlCompletion arg parsing to Commons CLI +[NUTCH-2221] - Introduce db.ignore.internal.links to FetcherThread +[NUTCH-2229] - Allow Jexl expressions on CrawlDatum's fixed attributes +[NUTCH-2231] - Jexl support in generator job +[NUTCH-2252] - Allow phantomjs as a browser for selenium options +[NUTCH-2263] - Support for mingram and maxgram at Unigram Cosine Similarity Model + +New Feature + +[NUTCH-961] - Expose Tika's boilerpipe support +[NUTCH-1325] - HostDB for Nutch +[NUTCH-2144] - Plugin to override db.ignore.external to exempt interesting external domain URLs +[NUTCH-2190] - Protocol normalizer +[NUTCH-2191] - Add protocol-htmlunit +[NUTCH-2194] - Run IndexingFilterChecker as simple Telnet server +[NUTCH-2219] - Criteria order to be configurable in DeduplicationJob +[NUTCH-2227] - RegexParseFilter +[NUTCH-2245] - Developed the 
NGram Model on the existing Unigram Cosine Similarity Model + +Task + +[NUTCH-2201] - Remove loops program from webgraph package +[NUTCH-2211] - Filter and normalizer checkers missing in bin/nutch +[NUTCH-2220] - Rename db.* options used only by the linkdb to linkdb.* + +Nutch 1.11 Release 03/12/2015 (dd/mm/) +Release Report: http://s.apache.org/nutch11 + +* NUTCH-2176 Clean up of log4j.properties (markus) + +* NUTCH-2107 plugin.xml to validate against plugin.dtd (snagel) + +* NUTCH-2177 Generator produces only one partition even in distributed mode (jnioche, snagel) + +* NUTCH-2158 Upgrade to Tika 1.11 (jnioche, snagel) + +* NUTCH-2175 Typos in property descriptions in nutch-default.xml (Roannel Fernández Hernández via snagel) + +* NUTCH-2069 Ignore external links base
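The Nutch 1.12 upgrade note in the change log above asks users who overrode `db.max.inlinks`, `db.max.anchor.length`, or `db.ignore.internal.links` for the LinkDB to rename those parameters to their `linkdb.*` counterparts. A minimal Python sketch of that rename follows. It assumes the overrides live in a Hadoop-style configuration file with `<property><name>…</name><value>…</value></property>` entries; the sample XML, the value `5000`, and the function name are made up for illustration and are not part of Nutch.

```python
import xml.etree.ElementTree as ET

# Old name -> new name, as listed in the Nutch 1.12 release comments.
RENAMES = {
    "db.max.inlinks": "linkdb.max.inlinks",
    "db.max.anchor.length": "linkdb.max.anchor.length",
    "db.ignore.internal.links": "linkdb.ignore.internal.links",
}


def rename_linkdb_properties(xml_text):
    """Rename LinkDB-only overrides; return (new XML text, renamed names)."""
    root = ET.fromstring(xml_text)
    renamed = []
    for prop in root.iter("property"):
        name = prop.find("name")
        if name is not None and name.text in RENAMES:
            renamed.append(name.text)
            name.text = RENAMES[name.text]
    return ET.tostring(root, encoding="unicode"), renamed


# Hypothetical site configuration with one affected override.
SAMPLE = """<configuration>
  <property><name>db.max.inlinks</name><value>5000</value></property>
  <property><name>http.agent.name</name><value>example-crawler</value></property>
</configuration>"""

updated, renamed = rename_linkdb_properties(SAMPLE)
print(renamed)
print(updated)
```

Because `db.ignore.internal.links` still exists in 1.12 and now applies to the CrawlDB only, a blind rename of that property may not be what every setup needs; the sketch assumes the override was meant for the LinkDB.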
[nutch-site] branch main updated: NUTCH-1999 Add /robots.txt to Nutch site (#1)
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/nutch-site.git

The following commit(s) were added to refs/heads/main by this push:
     new 4efc5a9  NUTCH-1999 Add /robots.txt to Nutch site (#1)
4efc5a9 is described below

commit 4efc5a9aca57430549b44a30191de041224ab865
Author: Sebastian Nagel
AuthorDate: Thu Sep 8 14:19:10 2022 +0200

    NUTCH-1999 Add /robots.txt to Nutch site (#1)

    - add robots.txt
    - add template to generate sitemap
    - include sitemap in robots.txt
---
 config.toml | 2 ++
 content/robots.txt | 4
 layouts/_default/sitemap.xml | 10 ++
 3 files changed, 16 insertions(+)

diff --git a/config.toml b/config.toml
index cc8832a..a78ef2d 100644
--- a/config.toml
+++ b/config.toml
@@ -11,6 +11,7 @@ Paginate = 4
 unsafe = true # allow raw HTML in markdown content
 
 [Params]
+  siteBaseURL = "https://nutch.apache.org"
   RSSLink = "/index.xml"
   author = "Apache Nutch Project Management Committee"
   github = "https://github.com/apache/nutch"
@@ -41,3 +42,4 @@ unsafe = true # allow raw HTML in markdown content
   name = "Apache"
   weight = -100
   url = "/apache/"
+
diff --git a/content/robots.txt b/content/robots.txt
new file mode 100644
index 000..086e6ad
--- /dev/null
+++ b/content/robots.txt
@@ -0,0 +1,4 @@
+User-agent: *
+Allow: /
+
+Sitemap: https://nutch.apache.org/sitemap.xml
\ No newline at end of file
diff --git a/layouts/_default/sitemap.xml b/layouts/_default/sitemap.xml
new file mode 100644
index 000..006e6ba
--- /dev/null
+++ b/layouts/_default/sitemap.xml
@@ -0,0 +1,10 @@
+{{ printf "" | safeHTML }}
+http://www.sitemaps.org/schemas/sitemap/0.9"
+ xmlns:xhtml="http://www.w3.org/1999/xhtml">
+
+ {{ range .Data.Pages }}{{ if ne .Params.sitemapExclude true }}
+{{ $url := urls.Parse .Permalink }}
+ {{ .Site.Params.SiteBaseURL }}{{ $url.Path }}{{ if not .Lastmod.IsZero }}
+ {{ safeHTML ( .Lastmod.Format "2006-01-02T15:04:05-07:00" ) }}{{ end }}
+{{ end }}{{ end }}
+
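To see how a crawler would read the robots.txt added in this commit, the following Python sketch feeds the four lines from the diff to the standard library's `urllib.robotparser`. The user agent name `example-bot` and the tested URL are assumptions chosen for illustration; `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Content of content/robots.txt as added by the commit above.
ROBOTS_TXT = """\
User-agent: *
Allow: /

Sitemap: https://nutch.apache.org/sitemap.xml
"""

parser = RobotFileParser()
# parse() takes the file as a list of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())

# "example-bot" is a hypothetical crawler name; the wildcard group applies to it.
allowed = parser.can_fetch("example-bot", "https://nutch.apache.org/news/")
sitemaps = parser.site_maps()

print("allowed:", allowed)
print("sitemaps:", sitemaps)
```

The `Sitemap:` line is independent of any `User-agent` group, so a parser reports it regardless of which crawler is asking.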
[nutch-site] branch asf-staging updated: - add README for branch asf-staging - modify .asf.yaml to contain only instructions required in branch asf-staging
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/nutch-site.git

The following commit(s) were added to refs/heads/asf-staging by this push:
     new b649bf6  - add README for branch asf-staging - modify .asf.yaml to contain only instructions required in branch asf-staging
b649bf6 is described below

commit b649bf67d76f263402bdb05af69d38b2fc2d61cc
Author: Sebastian Nagel
AuthorDate: Thu Sep 8 13:52:12 2022 +0200

    - add README for branch asf-staging
    - modify .asf.yaml to contain only instructions required in branch asf-staging
---
 .asf.yaml | 18 +-
 README.md | 8
 2 files changed, 9 insertions(+), 17 deletions(-)

diff --git a/.asf.yaml b/.asf.yaml
index d845c31..3ef9b83 100644
--- a/.asf.yaml
+++ b/.asf.yaml
@@ -17,22 +17,6 @@
 
 # https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features
 
-github:
-  description: "Apache Nutch Website"
-  homepage: https://nutch.apache.org/
-  labels:
-    - apache
-    - nutch
-    - hugo
-
-  enabled_merge_buttons:
-    squash: true
-    merge: false
-    rebase: false
-
 staging:
   profile: ~
-  whoami: asf-staging
-
-publish:
-  whoami: asf-site
\ No newline at end of file
+  whoami: asf-staging
diff --git a/README.md b/README.md
new file mode 100644
index 000..e2c7c89
--- /dev/null
+++ b/README.md
@@ -0,0 +1,8 @@
+Apache Nutch Website
+
+
+The `asf-staging` branch is only used for storing the generated static website to preview proposed changes to the website.
+
+The preview site is at https://nutch.staged.apache.org
+
+Please submit patch and pull requests on the `main` branch instead of the `asf-staging` branch.
[nutch-site] branch NUTCH-1999-nutch-site-robots-txt updated (142489f -> f863c1f)
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch NUTCH-1999-nutch-site-robots-txt
in repository https://gitbox.apache.org/repos/asf/nutch-site.git

    omit 142489f  NUTCH-1999 Add /robots.txt to Nutch site
     add 6e318e6  Add modified Kube theme
                  - taken from commit a00af40 of https://github.com/jeblister/kube
                  - with the following modifications
                  - modified layouts/index.html to include body from _index.md
                  - added section landing pages layouts/section/{community,development,documentation}.html
                  - replace favicon by Nutch one (fix NUTCH-2928)
                  - added custom.css
                  - update README
     add 6d78162  NUTCH-1999 Add /robots.txt to Nutch site
     add f863c1f  NUTCH-1999 Add /robots.txt to Nutch site
                  - add template to generate sitemap
                  - include sitemap in robots.txt

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version. This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (142489f)
            \
             N -- N -- N   refs/heads/NUTCH-1999-nutch-site-robots-txt (f863c1f)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them. Any revisions marked "discard" are gone forever.

No new revisions were added by this update.
Summary of changes: .asf.yaml |6 +- README.md | 16 +- config.toml|8 +- content/_index.md | 34 +- content/community/_index.md|3 + content/development/_index.md |3 + content/doap.rdf |2 +- content/documentation/_index.md|3 + content/download.md|8 +- content/favicon.ico| Bin 0 -> 894 bytes content/news/_index.md |4 + content/robots.txt |2 + layouts/_default/sitemap.xml | 10 + static/img/plug.svg|3 + static/img/plus-square.svg |1 + themes/kube|1 - themes/kube/LICENSE.md | 20 + themes/kube/README.md | 169 ++ {archetypes => themes/kube/archetypes}/blog.md |0 {archetypes => themes/kube/archetypes}/docs.md |0 themes/kube/images/docs.png| Bin 0 -> 87844 bytes themes/kube/images/faq.png | Bin 0 -> 130413 bytes themes/kube/images/list-docs.png | Bin 0 -> 69093 bytes themes/kube/images/post.png| Bin 0 -> 73445 bytes themes/kube/images/screenshot.png | Bin 0 -> 71302 bytes themes/kube/images/signin.png | Bin 0 -> 39120 bytes themes/kube/images/tn.png | Bin 0 -> 37334 bytes .gitignore => themes/kube/layouts/404.html |0 themes/kube/layouts/_default/baseof.html | 66 + themes/kube/layouts/_default/list.html | 19 + themes/kube/layouts/_default/single.html | 18 + themes/kube/layouts/blog/single.html | 63 + themes/kube/layouts/docs/single.html | 23 + themes/kube/layouts/index.html | 15 + themes/kube/layouts/partials/favicon.html |1 + themes/kube/layouts/partials/footer.html |3 + themes/kube/layouts/partials/header.html | 29 + themes/kube/layouts/partials/meta/name-author.html |6 + themes/kube/layouts/partials/meta/ogimage.html |8 + themes/kube/layouts/partials/page-summary.html |9 + themes/kube/layouts/partials/pagination.html | 15 + themes/kube/layouts/partials/post/byauthor.html| 20 + .../kube/layouts/partials/post/category-link.html |1 + themes/kube/layouts/partials/post/meta.html| 14 + .../layouts/partials/post/related-content.html | 16 + themes/kube/layouts/partials/post/tag-link.html|1 + .../kube/layouts/partials/scripts/animation.html | 127 ++ 
.../kube/layouts/partials/site-verification.html | 12 + themes/kube/layouts/partials/toc.html | 21 + themes/kube/layouts/section/community.html | 22 + themes/kube/layouts/section/development.html | 22 + themes/kube/layouts/section/documentation.html | 22 + themes/kube/layouts/section
[nutch-site] branch asf-staging updated: Sync .asf.yaml file with main branch
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch asf-staging in repository https://gitbox.apache.org/repos/asf/nutch-site.git The following commit(s) were added to refs/heads/asf-staging by this push: new ee7f0b2 Sync .asf.yaml file with main branch ee7f0b2 is described below commit ee7f0b22b8562b8550c8172b7817eb5c089bc221 Author: Sebastian Nagel AuthorDate: Thu Sep 8 12:07:34 2022 +0200 Sync .asf.yaml file with main branch --- .asf.yaml | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/.asf.yaml b/.asf.yaml index 0cc84e6..d845c31 100644 --- a/.asf.yaml +++ b/.asf.yaml @@ -15,7 +15,7 @@ # specific language governing permissions and limitations # under the License. -# https://cwiki.apache.org/confluence/display/INFRA/.asf.yaml+features+for+git+repositories +# https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features github: description: "Apache Nutch Website" @@ -30,5 +30,9 @@ github: merge: false rebase: false +staging: + profile: ~ + whoami: asf-staging + publish: whoami: asf-site \ No newline at end of file
[nutch-site] 01/01: Update content from Hugo build after adding Kube modified templates
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch asf-staging in repository https://gitbox.apache.org/repos/asf/nutch-site.git commit d77dbb51645aa6a0249564730aba051ac5585a2e Author: Sebastian Nagel AuthorDate: Thu Sep 8 10:54:37 2022 +0200 Update content from Hugo build after adding Kube modified templates --- content/apache/index.html | 11 ++-- content/categories/index.html | 16 ++--- content/categories/news/index.html | 16 ++--- content/categories/releases/index.html | 16 ++--- content/community/board-reporting/index.html | 11 ++-- content/community/bot/index.html | 11 ++-- content/community/contributing/index.html | 11 ++-- content/community/index.html | 14 ++--- content/community/index.xml| 4 +- content/community/mailing-lists/index.html | 11 ++-- content/community/merchandise/index.html | 11 ++-- content/community/people-credits/index.html| 11 ++-- content/development/index.html | 14 ++--- content/development/index.xml | 4 +- content/development/issue-tracker/index.html | 11 ++-- content/development/nightly-builds/index.html | 11 ++-- .../development/source-code-management/index.html | 11 ++-- content/doap.rdf | 2 +- content/documentation/about/index.html | 11 ++-- content/documentation/faqs/index.html | 11 ++-- content/documentation/index.html | 14 ++--- content/documentation/index.xml| 4 +- content/documentation/javadoc/index.html | 15 ++--- content/documentation/tutorials/index.html | 11 ++-- content/documentation/wiki/index.html | 11 ++-- content/download/index.html| 27 content/favicon.ico| Bin 0 -> 894 bytes content/img/{kube => }/plug.svg| 0 content/img/{kube => }/plus-square.svg | 0 content/index.html | 70 ++--- content/index.xml | 3 +- content/news/index.html| 16 ++--- content/news/index.xml | 4 +- content/news/legacy-nutch-news/index.html | 11 ++-- content/news/nutch-1.18-release/index.html | 11 ++-- content/tags/1.18/index.html | 16 ++--- content/tags/index.html| 16 ++--- 
content/tags/legacy/index.html | 16 ++--- content/tags/news/index.html | 16 ++--- content/tags/release/index.html| 16 ++--- 40 files changed, 257 insertions(+), 238 deletions(-) diff --git a/content/apache/index.html b/content/apache/index.html index adaf256..3d84f47 100644 --- a/content/apache/index.html +++ b/content/apache/index.html @@ -2,11 +2,11 @@ - + - Apache + Apache Nutch™ – Apache @@ -38,9 +38,11 @@ + + - @@ -117,8 +119,7 @@ - - 2004-2021 The Apache Software Foundation. Built using the https://github.com/jeblister/kube; target="_blank" rel="noopener noreferrer">kube Theme for Hugo. Apache Nutch, Nutch, Apache, the Apache feather logo, and the Apache Nutch project logo are trademarks of The Apache Software Foundation. + 2004-2022 The Apache Software Foundation. Built using the https://github.com/jeblister/kube; target="_blank" rel="noopener noreferrer">kube Theme for Hugo. Apache Nutch, Nutch, Apache, the Apache feather logo, and the Apache Nutch project logo are trademarks of The Apache Software Foundation. diff --git a/content/categories/index.html b/content/categories/index.html index fb07b86..ecec8b1 100644 --- a/content/categories/index.html +++ b/content/categories/index.html @@ -2,11 +2,11 @@ - + - Categories + Apache Nutch™ – Categories @@ -37,10 +37,11 @@ - + + - @@ -99,8 +100,8 @@ - Project News -News, activity, ideas, and whatever feels important. https://twitter.com/@ApacheNutch;>Follow us on Twitter + Categories + @@ -135,8 +136,7 @@ - - 2004-2021 The Apache Software Foundation. Built using the https://github.com/jeblister/kube; target="_blank" rel="noopener noreferrer">kube Theme for Hugo. Apache Nutch, Nutch, Apache, the Apache feather logo, and the Apache Nutch project logo are trademarks of The Apache Software Foundation. + 2004-2022 The Apache Software Foundation. Built using the https://github.com/jeblister/kube; target=&qu
[nutch-site] branch asf-staging created (now d77dbb5)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a change to branch asf-staging in repository https://gitbox.apache.org/repos/asf/nutch-site.git at d77dbb5 Update content from Hugo build after adding Kube modified templates This branch includes the following new commits: new d77dbb5 Update content from Hugo build after adding Kube modified templates The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.
svn commit: r56686 - /dev/nutch/1.19/ /release/nutch/1.19/
Author: snagel Date: Tue Sep 6 08:51:59 2022 New Revision: 56686 Log: Release Apache Nutch 1.19 Added: release/nutch/1.19/ - copied from r56685, dev/nutch/1.19/ Removed: dev/nutch/1.19/
svn commit: r56398 - /dev/nutch/1.19/
Author: snagel Date: Mon Aug 22 15:15:43 2022 New Revision: 56398 Log: Stage Apache Nutch 1.19 RC#1 Added: dev/nutch/1.19/ dev/nutch/1.19/apache-nutch-1.19-bin.tar.gz (with props) dev/nutch/1.19/apache-nutch-1.19-bin.tar.gz.asc dev/nutch/1.19/apache-nutch-1.19-bin.tar.gz.sha512 dev/nutch/1.19/apache-nutch-1.19-bin.zip (with props) dev/nutch/1.19/apache-nutch-1.19-bin.zip.asc dev/nutch/1.19/apache-nutch-1.19-bin.zip.sha512 dev/nutch/1.19/apache-nutch-1.19-src.tar.gz (with props) dev/nutch/1.19/apache-nutch-1.19-src.tar.gz.asc dev/nutch/1.19/apache-nutch-1.19-src.tar.gz.sha512 dev/nutch/1.19/apache-nutch-1.19-src.zip (with props) dev/nutch/1.19/apache-nutch-1.19-src.zip.asc dev/nutch/1.19/apache-nutch-1.19-src.zip.sha512 Added: dev/nutch/1.19/apache-nutch-1.19-bin.tar.gz == Binary file - no diff available. Propchange: dev/nutch/1.19/apache-nutch-1.19-bin.tar.gz -- svn:mime-type = application/x-gzip Added: dev/nutch/1.19/apache-nutch-1.19-bin.tar.gz.asc == --- dev/nutch/1.19/apache-nutch-1.19-bin.tar.gz.asc (added) +++ dev/nutch/1.19/apache-nutch-1.19-bin.tar.gz.asc Mon Aug 22 15:15:43 2022 @@ -0,0 +1,11 @@ +-BEGIN PGP SIGNATURE- + +iQEzBAABCgAdFiEE/4Kkh/ktcOUv934Kxm6nt9sKnG0FAmMDm+wACgkQxm6nt9sK +nG0wqwf/XJTnVZ67AYZvkBorERVEvjnurC9L9FY4/7QHwh2z9q0Viftt6ODIIEAD +IjkHB8xN9cWvFiyFhG/4NFWFnQNiTUlrZ6Ppu1eYXXvI312Z++vVMMOkVVmdn+5K +S3YhoejqkO0GeMqV4PcXAiLF0/DtxaSPp+q0O29+XilSw5XB8mlHBn7VWALT5Y6s +tD5WQfBNbyOCnF4dp2eDcIZjuPof/TbIhyDU3GBNXRe772cXQIl4JrdxyftiMwz4 +m4eMdiY/lNPVEb93X6eCkyzApirZBmtfnJOKcBZsuqOFPFB9TumLWaTA4IBu8d/K +KvDDeAwmF/3t95wPVMGkOFgBLpsdTw== +=REOb +-END PGP SIGNATURE- Added: dev/nutch/1.19/apache-nutch-1.19-bin.tar.gz.sha512 == --- dev/nutch/1.19/apache-nutch-1.19-bin.tar.gz.sha512 (added) +++ dev/nutch/1.19/apache-nutch-1.19-bin.tar.gz.sha512 Mon Aug 22 15:15:43 2022 @@ -0,0 +1 @@ +SHA512(apache-nutch-1.19-bin.tar.gz)= ba4bfeb92fc5c95b71ab87df46baebcd904f8ea99f33824b7016a9b6a7f4d2488598c94d44954f8d5a7937d8f38164af197efdcfcf3d4f6ef0beea3ff32f0d7a Added: 
dev/nutch/1.19/apache-nutch-1.19-bin.zip == Binary file - no diff available. Propchange: dev/nutch/1.19/apache-nutch-1.19-bin.zip -- svn:mime-type = application/octet-stream Added: dev/nutch/1.19/apache-nutch-1.19-bin.zip.asc == --- dev/nutch/1.19/apache-nutch-1.19-bin.zip.asc (added) +++ dev/nutch/1.19/apache-nutch-1.19-bin.zip.asc Mon Aug 22 15:15:43 2022 @@ -0,0 +1,11 @@ +-BEGIN PGP SIGNATURE- + +iQEzBAABCgAdFiEE/4Kkh/ktcOUv934Kxm6nt9sKnG0FAmMDm+4ACgkQxm6nt9sK +nG0LVwgAx9rHYfP2eKvyTI8IFYr0uToH4kqaMqdlJyUilkBi3ZetnkqaNrz3Lt+J +BWp6VojbizExOMABGe8CM52+bIcxA5PSyU8IEKCS1KVSCBsiLlghv3Y3jEQX366p +SbSYBhUR8F4owxsHDOI6qlN1tqL72t4kDOOR/LcESQ1IkvGBXPTvU4a0XzWvLphM +R0GPYMgLaK6TEt04SVEWGjz4bDOwKHpxsOJQjhEzmaY0JPbOwCe4kIb4oqgqNnfH +4Yu+VSGQVruEr4u6qFfAV2EJdne2Yayc/KY5d7cZ+cvWto5/QMNhUZ3hZEoipott +ok637G1CU363BNsmmLHsM7lJMO+FTA== +=mYsK +-END PGP SIGNATURE- Added: dev/nutch/1.19/apache-nutch-1.19-bin.zip.sha512 == --- dev/nutch/1.19/apache-nutch-1.19-bin.zip.sha512 (added) +++ dev/nutch/1.19/apache-nutch-1.19-bin.zip.sha512 Mon Aug 22 15:15:43 2022 @@ -0,0 +1 @@ +SHA512(apache-nutch-1.19-bin.zip)= 4ef0b4969836d10e85851e30cffaf73bfa269bf1e974408329cc824bfae94c9caf3575d937e0b4532a9d3fc9dfe84df78180e28ff0d602b6fff4bcde0faa884a Added: dev/nutch/1.19/apache-nutch-1.19-src.tar.gz == Binary file - no diff available. 
Propchange: dev/nutch/1.19/apache-nutch-1.19-src.tar.gz -- svn:mime-type = application/x-gzip Added: dev/nutch/1.19/apache-nutch-1.19-src.tar.gz.asc == --- dev/nutch/1.19/apache-nutch-1.19-src.tar.gz.asc (added) +++ dev/nutch/1.19/apache-nutch-1.19-src.tar.gz.asc Mon Aug 22 15:15:43 2022 @@ -0,0 +1,11 @@ +-BEGIN PGP SIGNATURE- + +iQEzBAABCgAdFiEE/4Kkh/ktcOUv934Kxm6nt9sKnG0FAmMDm+8ACgkQxm6nt9sK +nG2lagf7BAd8rl2yGL2sZGle9PUzIwBw3jLby/ZIl88aLs7FV1oIHxVlS3lnNJPM +LKIGIkZGYUiQ4xF8v9aLl1NQ6p49Gn9nKoUboVpVkzWcknIdlTxnt2qjgHoH6THb +sIinMlI3IFKXSey76F38JToiU5ycrQPs+nnJZZjFsl/Hg+5jozoJwO/YHJID8yEs +p32n4H4Ll+El6zsgJvGlE4M1hB3tfv4QpKAcW4swNIjlD12gdiY44oahNbQkd/7v +15M10wgmW1LOIFfxtbTU
[nutch] branch branch-1.19 created (now 63d4f11c0)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a change to branch branch-1.19 in repository https://gitbox.apache.org/repos/asf/nutch.git at 63d4f11c0 Nutch 1.19 release - update current year in API docs etc. - update version number - add changes / release notes - update links to Hadoop API docs No new revisions were added by this update.
[nutch] annotated tag release-1.19 updated (63d4f11c0 -> 5d7660ceb)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a change to annotated tag release-1.19 in repository https://gitbox.apache.org/repos/asf/nutch.git *** WARNING: tag release-1.19 was modified! *** from 63d4f11c0 (commit) to 5d7660ceb (tag) tagging 63d4f11c08aa7f5a3f5e3dded3c880649fd6e1a2 (commit) replaces release-1.13 by Sebastian Nagel on Mon Aug 22 16:59:01 2022 +0200 - Log - Apache Nutch 1.19 RC#1 Tag --- No new revisions were added by this update. Summary of changes:
[nutch] branch master updated: NUTCH-2969 Javadoc: Javascript search is not working when built on JDK 11 - pass --no-module-directories to javadoc target when building on JDK 11 - remove obsolete cond
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new ffe059892 NUTCH-2969 Javadoc: Javascript search is not working when built on JDK 11 - pass --no-module-directories to javadoc target when building on JDK 11 - remove obsolete condition to fail javadoc builds on JDK 7u25 and earlier ffe059892 is described below commit ffe0598925fcb27ac253f61d86106b33a260a979 Author: Sebastian Nagel AuthorDate: Mon Aug 22 15:18:50 2022 +0200 NUTCH-2969 Javadoc: Javascript search is not working when built on JDK 11 - pass --no-module-directories to javadoc target when building on JDK 11 - remove obsolete condition to fail javadoc builds on JDK 7u25 and earlier --- build.xml | 40 ++-- 1 file changed, 18 insertions(+), 22 deletions(-) diff --git a/build.xml b/build.xml index 0e1e42f7c..d7377ab25 100644 --- a/build.xml +++ b/build.xml @@ -16,6 +16,7 @@ limitations under the License. --> + + + + @@ -164,17 +169,6 @@ -https://issues.apache.org/jira/browse/NUTCH-1590;> - - - - - - - - - - + + - + @@ -684,16 +684,6 @@ -https://issues.apache.org/jira/browse/NUTCH-1590;> - - - - - - - - - + +
[nutch] branch master updated (bca5fc0d0 -> 635ef2f3b)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git from bca5fc0d0 NUTCH-2795 CrawlDbReader: compress CrawlDb dumps if configured - configure CSV and JSON LineRecordWriters to compress the output files according to the configuration new 3199dee64 NUTCH-2963 Upgrade dependencies before release of 1.19 - upgrade dependency-check ant plugin new cdc67c9ed NUTCH-2963 Upgrade dependencies before release of 1.19 - upgrade urlfilter-automaton to depend on dk.brics automaton 1.12-4 new 0c283980d NUTCH-2963 Upgrade dependencies before release of 1.19 - upgrade indexer-solr dependencies: - Solr 8.5.1 -> 8.11.2 - httpmime 4.5.10 -> 4.5.13 - httpcore 4.4.12 -> 4.4.15 new ef7c102eb NUTCH-2963 Upgrade dependencies before release of 1.19 - upgrade Hadoop 3.3.3 -> 3.3.4 - adapt ivy retrieve pattern to optionally include the `classifier` (used in Hadoop deps to differentiate between architecture:x86_64 vs. aarch_64) new 148c8f8a0 NUTCH-2963 Upgrade dependencies before release of 1.19 - update / complete LICENSE-binary new 59f7865e9 NUTCH-2843 Duplicate declaration of dependencies in ivy.xml - remove duplicated dependencies: commons-collections4 and httpclient - move Maven POM creation into separate target to reproduce issue new 0442562ed NUTCH-2963 Upgrade dependencies before release of 1.19 - upgrade Nutch core dependencies: - httpcore-nio 4.4.9 -> 4.4.14 - cxf 2.9.0 -> 2.9.1 - commons-jexl3 3.2.1 -> 3.3 - log4j 2.17.2 -> 2.18.0 - t-digest 3.2 -> 3.3 - update / complete LICENSE-binary new 635ef2f3b Merge pull request #747 from sebastian-nagel/NUTCH-2963-upgrade-dependencies The 3331 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. 
Summary of changes: LICENSE-binary| 30 - NOTICE-binary | 56 +++- build.xml | 108 +- ivy/ivy.xml | 35 +- src/plugin/indexer-solr/ivy.xml | 17 +++-- src/plugin/indexer-solr/plugin.xml| 62 - src/plugin/urlfilter-automaton/ivy.xml| 2 +- src/plugin/urlfilter-automaton/plugin.xml | 2 +- 8 files changed, 162 insertions(+), 150 deletions(-)
[nutch] branch master updated (bec577d50 -> bca5fc0d0)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git from bec577d50 NUTCH-2863 Injector to parse command-line flags case-insensitive add bca5fc0d0 NUTCH-2795 CrawlDbReader: compress CrawlDb dumps if configured - configure CSV and JSON LineRecordWriters to compress the output files according to the configuration No new revisions were added by this update. Summary of changes: src/java/org/apache/nutch/crawl/CrawlDbReader.java | 53 ++ 1 file changed, 43 insertions(+), 10 deletions(-)
[nutch] branch master updated: NUTCH-2962 Update and complete package info of protocol plugins
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 6f4c80b7f NUTCH-2962 Update and complete package info of protocol plugins 6f4c80b7f is described below commit 6f4c80b7fca0f0236e079587212c2adcadba3a69 Author: Sebastian Nagel AuthorDate: Mon Aug 15 16:29:34 2022 +0200 NUTCH-2962 Update and complete package info of protocol plugins --- .../org/apache/nutch/protocol/http/api/package-info.java | 2 +- .../org/apache/nutch/protocol/htmlunit/package-info.java | 8 +++- .../org/apache/nutch/protocol/httpclient/package-info.java | 14 +++--- .../interactiveselenium/handlers}/package-info.java| 6 -- .../nutch/protocol/interactiveselenium/package-info.java | 5 - .../org/apache/nutch/protocol/okhttp/package-info.java | 4 +++- .../org/apache/nutch/protocol/selenium/package-info.java | 5 - 7 files changed, 30 insertions(+), 14 deletions(-) diff --git a/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/package-info.java b/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/package-info.java index a99b4bac7..8cacc3eef 100644 --- a/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/package-info.java +++ b/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/package-info.java @@ -17,6 +17,6 @@ /** * Common API used by HTTP plugins ({@link org.apache.nutch.protocol.http http}, - * {@link org.apache.nutch.protocol.httpclient httpclient}) + * {@link org.apache.nutch.protocol.httpclient httpclient}, etc.) 
*/ package org.apache.nutch.protocol.http.api; diff --git a/src/plugin/protocol-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/package-info.java b/src/plugin/protocol-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/package-info.java index bf4902c25..80fabce3b 100644 --- a/src/plugin/protocol-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/package-info.java +++ b/src/plugin/protocol-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/package-info.java @@ -15,5 +15,11 @@ * limitations under the License. */ -/** Protocol plugin which supports retrieving documents via the http protocol.*/ +/** + * Protocol plugin which supports retrieving documents via HTTP/HTTPS using + * https://www.selenium.dev/;>Selenium and the + * https://github.com/SeleniumHQ/htmlunit-driver;>HtmlUnitDriver web + * driver for the for the + * https://htmlunit.sourceforge.io/;>HtmlUnit headless browser. + */ package org.apache.nutch.protocol.htmlunit; diff --git a/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/package-info.java b/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/package-info.java index 251204485..e3c390355 100644 --- a/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/package-info.java +++ b/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/package-info.java @@ -15,12 +15,12 @@ * limitations under the License. */ -/** - * Protocol plugin which supports retrieving documents via the - * HTTP andHTTPS protocols, optionally with Basic, Digest and - * NTLM authentication schemes for web server as well as - * proxy server. It handles cookies within a single fetch - * operation. This plugin is based on Jakarta Commons - * HttpClient library. +/** + * Protocol plugin which supports retrieving documents via the HTTP andHTTPS + * protocols, optionally with Basic, Digest and NTLM authentication schemes for + * web server as well as proxy server. 
It handles cookies within a single fetch + * operation and offers support for POST authentication via HTML forms. This + * plugin is based on the https://hc.apache.org/;>Apache HttpClient + * library. */ package org.apache.nutch.protocol.httpclient; diff --git a/src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/package-info.java b/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/package-info.java similarity index 78% copy from src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/package-info.java copy to src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/package-info.java index 7bdf14a75..407cb7fc8 100644 --- a/src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/package-info.java +++ b/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/package-info.java @@ -16,6 +16,8 @@ */ /** - * Protocol plugin based on https://github.com/square/okhttp;>okhttp, supports http, https, http/2. + * Handler implementations to interact with + * https://www.selenium.dev/;>Selen
[nutch] branch master updated: NUTCH-2930 Protocol-okhttp: implement IP filter (#736)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 7e969eaec NUTCH-2930 Protocol-okhttp: implement IP filter (#736) 7e969eaec is described below commit 7e969eaec1ab8e9e21667faf6cf1881fb10cfb31 Author: Sebastian Nagel AuthorDate: Fri Aug 19 15:26:07 2022 +0200 NUTCH-2930 Protocol-okhttp: implement IP filter (#736) - add include/exclude rules as list of IP address, CIDR notation or predefined IP ranges (localhost, loopback, sitelocal) --- conf/nutch-default.xml | 25 +++ .../org/apache/nutch/protocol/okhttp/CIDR.java | 79 .../nutch/protocol/okhttp/IPFilterRules.java | 129 + .../org/apache/nutch/protocol/okhttp/OkHttp.java | 35 .../protocol/okhttp/TestBadServerResponses.java| 2 +- .../protocol/okhttp/TestIPAddressFiltering.java| 207 + .../nutch/protocol/okhttp/TestProtocolOkHttp.java | 2 +- .../protocol/AbstractHttpProtocolPluginTest.java | 22 ++- 8 files changed, 494 insertions(+), 7 deletions(-) diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml index 1ad02a021..2a6325884 100644 --- a/conf/nutch-default.xml +++ b/conf/nutch-default.xml @@ -449,6 +449,31 @@ + + http.filter.ipaddress.include + + +If not empty: only fetch content from these IP addresses defined +as a comma-separated list of a single IP address, a CIDR notation, +or one of the following pre-defined IP address types: localhost, +loopback, sitelocal. The property http.filter.ipaddress.exclude +can be used to block subranges in the included list of ranges. +Note: supported only by protocol-okhttp. + + + + + http.filter.ipaddress.exclude + + +If not empty: do not fetch content from these IP addresses defined +as a comma-separated list of a single IP address, a CIDR notation, +or one of the following pre-defined IP address types: localhost, +loopback, sitelocal. Note: supported only by protocol-okhttp. 
+ + + + diff --git a/src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/CIDR.java b/src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/CIDR.java new file mode 100644 index 0..3add082a8 --- /dev/null +++ b/src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/CIDR.java @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.nutch.protocol.okhttp; + +import java.net.InetAddress; + +import com.google.common.net.InetAddresses; + +/** + * Parse a https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing;>CIDR block + * notation and test whether an IP address is contained in the subnet range + * defined by the CIDR. 
+ */ +public class CIDR { + InetAddress addr; + int mask; + + public CIDR(InetAddress address, int mask) { +this.addr = address; +this.mask = mask; + } + + public CIDR(String cidr) throws IllegalArgumentException { +String ipStr = cidr; +int sep = cidr.indexOf('/'); +if (sep > -1) { + ipStr = cidr.substring(0, sep); +} +addr = InetAddresses.forString(ipStr); +if (sep > -1) { + mask = Integer.parseInt(cidr.substring(sep + 1)); +} else { + mask = addr.getAddress().length * 8; +} +if (cidr.indexOf(':') > -1 && addr.getAddress().length == 4) { + // IPv4-mapped IPv6 addresses are automatically converted to IPv4, + // need to shift the mask + mask = Math.max(0, mask - 96); +} + } + + public boolean contains(InetAddress address) { +byte[] addr0 = addr.getAddress(); +byte[] addr1 = address.getAddress(); +if (addr0.length != addr1.length) { + // not comparing IPv4 and IPv6 addresses + return false; +} +for (int i = 0; i < addr0.length; i++) { + int remainingMaskBits = mask - (i * 8); + if (remainingMaskBits <= 0) +return true; + int m = ~(0xff >> remainingMaskBits); // mask for byte under cursor + if ((ad
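The CIDR containment test in the truncated diff above, together with the include/exclude semantics described for `http.filter.ipaddress.include` and `http.filter.ipaddress.exclude`, can be sketched as follows. This is a minimal illustration, not the code from the commit: the class `CidrSketch` and the helpers `ip` and `allowed` are hypothetical names, the sketch uses only `java.net.InetAddress` instead of Guava, it omits the IPv4-mapped IPv6 mask adjustment, and the order in which include and exclude rules are evaluated is an assumption based on the property descriptions.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;

/** Hypothetical sketch of a CIDR block check; not the Nutch CIDR class. */
final class CidrSketch {
  private final byte[] network;
  private final int prefixLength;

  CidrSketch(String cidr) {
    int sep = cidr.indexOf('/');
    String ipStr = sep > -1 ? cidr.substring(0, sep) : cidr;
    network = ip(ipStr).getAddress();
    // Without an explicit prefix the block covers exactly one address.
    prefixLength = sep > -1 ? Integer.parseInt(cidr.substring(sep + 1))
        : network.length * 8;
  }

  /** Parses an IP literal; a host name here would trigger a DNS lookup. */
  static InetAddress ip(String literal) {
    try {
      return InetAddress.getByName(literal);
    } catch (UnknownHostException e) {
      throw new IllegalArgumentException("Not an IP address: " + literal, e);
    }
  }

  boolean contains(InetAddress address) {
    byte[] candidate = address.getAddress();
    if (candidate.length != network.length) {
      return false; // never compare IPv4 with IPv6 addresses
    }
    for (int i = 0; i < network.length; i++) {
      int remaining = prefixLength - i * 8;
      if (remaining <= 0) {
        return true; // all prefix bits matched
      }
      // Keep only the leading prefix bits of the current byte.
      int mask = remaining >= 8 ? 0xff : (0xff << (8 - remaining)) & 0xff;
      if ((network[i] & mask) != (candidate[i] & mask)) {
        return false;
      }
    }
    return true;
  }

  /**
   * Assumed rule evaluation: an empty include list allows everything,
   * and exclude rules can block subranges of the included ranges.
   */
  static boolean allowed(InetAddress address, List<CidrSketch> include,
      List<CidrSketch> exclude) {
    boolean included = include.isEmpty()
        || include.stream().anyMatch(c -> c.contains(address));
    return included && exclude.stream().noneMatch(c -> c.contains(address));
  }
}
```

With an include list of `10.0.0.0/8` and an exclude list of `10.1.0.0/16`, this sketch would allow `10.2.3.4` and reject both `10.1.3.4` and `192.168.0.1`.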
[nutch] branch master updated (c0f723e99 -> 05afebd03)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git from c0f723e99 NUTCH-2957 indexer-solr / Solr schema.xml - add fall-back field definitions for unknown index fields - update comments and descriptions - fix indentation new dfe430b5a NUTCH-2861 Remove parse-swf new 1aec06f41 Upgrade to Apache Rat 0.14 (download of Rat 0.13 failed) new ddca1c252 NUTCH-2822 Split the LICENSE.txt file into two files for source resp. binary releases new eba8f3842 NUTCH-2290 Update licenses of bundled libraries - update year in NOTICE files: follow schema used by Hadoop and Spark projects (" and onwards") - change "developed by The ASF" -> "developed at The ASF" following https://infra.apache.org/licensing-howto.html#bundle-asf-product new a10713114 NUTCH-2290 Update licenses of bundled libraries - move "export control notice" from README to NOTICE files (following the schema used by Hadoop and Spark) - update "export control notice" following the scheme used by Apache Tika new 1d1eb6360 NUTCH-2290 Update licenses of bundled libraries - ivy license report: add homepage URL of dependencies new 2fbd30976 NUTCH-2290 Update licenses of bundled libraries NUTCH-2821 Deduplicate licenses in LICENSE.txt file - LICENSE-binary: list dependencies by license (this also deduplicates licenses) new 78f6f4058 NUTCH-2290 Update licenses of bundled libraries - NOTICE-binary: add Apache projects and links to the projects' NOTICE files - NOTICE-binary: add other software projects with links to the project homepage and the used license - add all licenses (different from the Apache 2.0 license) used by dependencies shipped in the binary package new 9a59ec9f0 NUTCH-2290 Update licenses of bundled libraries NUTCH-2822 Split the LICENSE.txt file into two files for source resp. 
binary releases - ensure the binary license and notice files are shipped with the source and binary packages new 957d460c8 NUTCH-2290 Update licenses of bundled libraries - update the pull-request template and add updating licenses as a potential to-do new 05afebd03 Merge pull request #743 from sebastian-nagel/NUTCH-2290-update-licenses The 3319 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: .github/pull_request_template.md |3 + CHANGES.txt|6 + LICENSE-binary | 786 +++ LICENSE.txt| 5534 NOTICE-binary | 1170 + NOTICE.txt | 37 +- README.md | 23 - build.xml | 18 +- conf/parse-plugins.xml.template|6 - default.properties |1 - ivy/ivy-report-license.xsl |4 +- licenses-binary/LICENSE-bouncy-castle-licence.txt | 17 + licenses-binary/LICENSE-bsd-2-clause.txt | 18 + licenses-binary/LICENSE-bsd-3-clause.txt | 41 + licenses-binary/LICENSE-bsd.txt| 206 + licenses-binary/LICENSE-cddl-1.0.txt | 175 + licenses-binary/LICENSE-cddl-1.1.txt | 756 +++ licenses-binary/LICENSE-cddl-gplv2-ce.txt | 3176 +++ licenses-binary/LICENSE-cddl-license.txt | 175 + licenses-binary/LICENSE-common-public-license.txt | 220 + licenses-binary/LICENSE-cpl.txt| 217 + .../LICENSE-eclipse-distribution-license-v1.0.txt | 30 + licenses-binary/LICENSE-epl-2.0.txt| 90 + ...version-2-gpl2-with-the-classpath-exception.txt | 15 + ...y-extreme-lab-software-license-vesion-1.1.1.txt |0 licenses-binary/LICENSE-mit-license.txt| 10 + .../LICENSE-mozilla-public-license-1.1-mpl-1.1.txt | 379 ++ .../LICENSE-mozilla-public-license-version-2.0.txt | 375 ++ ...ENSE-public-domain-per-creative-commons-cc0.txt | 32 + licenses-binary/LICENSE-public-domain.txt | 18 + licenses-binary/LICENSE-the-go-license.txt | 29 + licenses-binary/LICENSE-unicode-icu-license.txt| 521 ++ licenses-binary/LICENSE-unrar-license.txt | 43 + 
src/plugin/build.xml |3 - src/plugin/parse-swf/build.xml | 38 - src/plugin/parse-swf/ivy.xml | 41 - src/plugin/parse-swf/lib/javaswf-LICENSE.txt | 33 - src/plugin/parse-swf/lib/javaswf.ja
[nutch] branch master updated (edebfe49f -> c0f723e99)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git from edebfe49f NUTCH-2955 indexer-solr: replace deprecated/removed field type solr.LatLonType add c0f723e99 NUTCH-2957 indexer-solr / Solr schema.xml - add fall-back field definitions for unknown index fields - update comments and descriptions - fix indentation No new revisions were added by this update. Summary of changes: src/plugin/indexer-solr/schema.xml | 23 +++ 1 file changed, 15 insertions(+), 8 deletions(-)
[nutch] branch master updated (a5a630055 -> edebfe49f)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git from a5a630055 Merge pull request #729 from sebastian-nagel/NUTCH-2947-keep-stateful-fetch-queues add edebfe49f NUTCH-2955 indexer-solr: replace deprecated/removed field type solr.LatLonType No new revisions were added by this update. Summary of changes: src/plugin/indexer-solr/schema.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
[nutch] branch master updated (82f9530dc -> a5a630055)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git from 82f9530dc Merge pull request #697 from sebastian-nagel/NUTCH-2896-okhttp-connection-pool new c862d2409 NUTCH-2947 Fetcher: keep state of empty but stateful fetch queues unless queue feeder is finished in order to ensure politeness - next fetch time not yet reached - non-zero exception counter and queue feeder still adding new fetch items to queues Only if the queue feeder is finished and no more new fetch items are added, these queues can finally be removed. new 8cfa53f7d NUTCH-2947 Fetcher: keep state of empty but stateful fetch queues - also keep state if `fetcher.exceptions.per.queue.delay` > 0.0 new a5a630055 Merge pull request #729 from sebastian-nagel/NUTCH-2947-keep-stateful-fetch-queues The 3306 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: .../org/apache/nutch/fetcher/FetchItemQueues.java | 19 +-- src/java/org/apache/nutch/fetcher/QueueFeeder.java| 4 +++- 2 files changed, 20 insertions(+), 3 deletions(-)
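The retention rule described in the NUTCH-2947 commit message can be read as a single predicate over the state of a fetch queue. The sketch below is only an illustration of that reading: the class `QueueRetentionSketch`, the method `canRemoveQueue` and its parameters are hypothetical, and the actual change in `FetchItemQueues.java` is not shown in this email. The additional condition from the follow-up commit (`fetcher.exceptions.per.queue.delay` > 0.0) is left out.

```java
/** Hypothetical sketch of the queue retention rule; not the Nutch code. */
final class QueueRetentionSketch {

  /**
   * Decides whether an idle fetch queue may be dropped.
   *
   * @param queuedItems number of fetch items still waiting in the queue
   * @param inProgress number of fetch items currently being fetched
   * @param nextFetchTimeMillis earliest time the next fetch is permitted
   * @param nowMillis current time
   * @param exceptionCount exceptions observed so far for this queue
   * @param feederAlive whether the queue feeder may still add fetch items
   */
  static boolean canRemoveQueue(int queuedItems, int inProgress,
      long nextFetchTimeMillis, long nowMillis, int exceptionCount,
      boolean feederAlive) {
    if (queuedItems > 0 || inProgress > 0) {
      return false; // the queue is not empty
    }
    if (!feederAlive) {
      return true; // no new items can arrive, the state is no longer needed
    }
    // The feeder may still add items for this queue: keep politeness state.
    boolean waitingForDelay = nextFetchTimeMillis > nowMillis;
    boolean hasExceptions = exceptionCount > 0;
    return !(waitingForDelay || hasExceptions);
  }
}
```

The point of keeping the queue is that a newly fed item for the same host would otherwise start with a fresh queue and ignore the pending crawl delay or the accumulated exception count.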
[nutch] branch master updated (b7b834501 -> 82f9530dc)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git from b7b834501 NUTCH-2958 Upgrade to crawler-commons 1.3 (#740) new af44bcb6f NUTCH-2896 Protocol-okhttp: make connection pool configurable - add configuration property `http.connection.pool.okhttp` to configure the number of connection pools, their size and the keep-alive time of the pooled connections - create as many clients as pools are configured, each client holding one pool. Distribute connections by target host name over clients new 467e59105 NUTCH-2896 Protocol-okhttp: make connection pool configurable - fix javadoc error new 82f9530dc Merge pull request #697 from sebastian-nagel/NUTCH-2896-okhttp-connection-pool The 3303 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: conf/nutch-default.xml | 21 .../org/apache/nutch/protocol/okhttp/OkHttp.java | 59 -- .../nutch/protocol/okhttp/OkHttpResponse.java | 2 +- 3 files changed, 77 insertions(+), 5 deletions(-)
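The NUTCH-2896 commit message says connections are distributed "by target host name over clients", with one pool per client. One way such a mapping could look is sketched below. The hashing scheme (lower-cased host name, `String.hashCode`, modulo the number of clients) is an assumption made for illustration and may differ from what OkHttp.java actually does; `HostToClientSketch` and `clientIndex` are hypothetical names.

```java
import java.util.Locale;

/** Hypothetical sketch: map a target host name to one of N clients. */
final class HostToClientSketch {

  /**
   * Returns a stable index in [0, numClients) for the given host, so that
   * all requests to one host reuse the same client and connection pool.
   */
  static int clientIndex(String host, int numClients) {
    if (numClients < 1) {
      throw new IllegalArgumentException("numClients must be >= 1");
    }
    // floorMod keeps the result non-negative even for negative hash codes
    return Math.floorMod(host.toLowerCase(Locale.ROOT).hashCode(), numClients);
  }

  public static void main(String[] args) {
    for (String host : new String[] { "example.org", "EXAMPLE.org", "nutch.apache.org" }) {
      System.out.println(host + " -> client " + clientIndex(host, 4));
    }
  }
}
```

A deterministic mapping matters here because keep-alive connections are only reusable if repeated requests to a host land on the client that holds them.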
[nutch] branch master updated (8fc4f17ac -> b7b834501)
This is an automated email from the ASF dual-hosted git repository. snagel pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git from 8fc4f17ac NUTCH-2956 index-geoip: dependency upgrades and improvements - upgrade to geoip2 3.0.1 - exclude transitive dependencies (Jackson) provided as Nutch core deps - read also GeoLite2-*.mmdb files - review index field names in plugin and Nutch Solr schema: - fix typos in field names - remove unused fields from schema add b7b834501 NUTCH-2958 Upgrade to crawler-commons 1.3 (#740) No new revisions were added by this update. Summary of changes: ivy/ivy.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
[nutch] branch master updated: NUTCH-2956 index-geoip: dependency upgrades and improvements - upgrade to geoip2 3.0.1 - exclude transitive dependencies (Jackson) provided as Nutch core deps - read als
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 8fc4f17ac NUTCH-2956 index-geoip: dependency upgrades and improvements - upgrade to geoip2 3.0.1 - exclude transitive dependencies (Jackson) provided as Nutch core deps - read also GeoLite2-*.mmdb files - review index field names in plugin and Nutch Solr schema: - fix typos in field names - remove unused fields from schema 8fc4f17ac is described below commit 8fc4f17acc5da28c22ef4e77c2316e20e5976f02 Author: Sebastian Nagel AuthorDate: Sat Aug 6 15:04:10 2022 +0200 NUTCH-2956 index-geoip: dependency upgrades and improvements - upgrade to geoip2 3.0.1 - exclude transitive dependencies (Jackson) provided as Nutch core deps - read also GeoLite2-*.mmdb files - review index field names in plugin and Nutch Solr schema: - fix typos in field names - remove unused fields from schema --- conf/nutch-default.xml | 3 +- src/plugin/index-geoip/ivy.xml | 11 +++-- src/plugin/index-geoip/plugin.xml | 7 +--- .../nutch/indexer/geoip/GeoIPDocumentCreator.java | 49 -- .../nutch/indexer/geoip/GeoIPIndexingFilter.java | 34 --- src/plugin/indexer-solr/schema.xml | 3 +- 6 files changed, 57 insertions(+), 50 deletions(-) diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml index 7faa6fdcd..bb9aae1b3 100644 --- a/conf/nutch-default.xml +++ b/conf/nutch-default.xml @@ -2112,7 +2112,8 @@ Add scoring-metadata to the list of active plugins 'domainDatabase', 'ispDatabase' or 'insightsService'. If you wish to use any one of the Database options, you should make one of GeoIP2-City.mmdb, GeoIP2-Connection-Type.mmdb, GeoIP2-Domain.mmdb or GeoIP2-ISP.mmdb files respectively available on the classpath and - available at runtime. + available at runtime. Alternatively, also the GeoLite2 IP databases (GeoLite2-*.mmdb) + can be used. 
diff --git a/src/plugin/index-geoip/ivy.xml b/src/plugin/index-geoip/ivy.xml index 4fa6f71a7..2eda5a63f 100644 --- a/src/plugin/index-geoip/ivy.xml +++ b/src/plugin/index-geoip/ivy.xml @@ -36,12 +36,11 @@ - - - - - - + + + + + diff --git a/src/plugin/index-geoip/plugin.xml b/src/plugin/index-geoip/plugin.xml index 6148f59e5..c4efadf94 100644 --- a/src/plugin/index-geoip/plugin.xml +++ b/src/plugin/index-geoip/plugin.xml @@ -25,11 +25,8 @@ - - - - - + + diff --git a/src/plugin/index-geoip/src/java/org/apache/nutch/indexer/geoip/GeoIPDocumentCreator.java b/src/plugin/index-geoip/src/java/org/apache/nutch/indexer/geoip/GeoIPDocumentCreator.java index 1c697a205..64b3862be 100644 --- a/src/plugin/index-geoip/src/java/org/apache/nutch/indexer/geoip/GeoIPDocumentCreator.java +++ b/src/plugin/index-geoip/src/java/org/apache/nutch/indexer/geoip/GeoIPDocumentCreator.java @@ -17,13 +17,17 @@ package org.apache.nutch.indexer.geoip; import java.io.IOException; +import java.lang.invoke.MethodHandles; import java.net.InetAddress; import java.net.UnknownHostException; import org.apache.nutch.indexer.NutchDocument; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; import com.maxmind.geoip2.DatabaseReader; import com.maxmind.geoip2.WebServiceClient; +import com.maxmind.geoip2.exception.AddressNotFoundException; import com.maxmind.geoip2.exception.GeoIp2Exception; import com.maxmind.geoip2.model.InsightsResponse; import com.maxmind.geoip2.model.CityResponse; @@ -54,28 +58,17 @@ import com.maxmind.geoip2.record.Traits; */ public class GeoIPDocumentCreator { - /** - * Add field to document but only if value isn't null - * @param doc the {@link NutchDocument} to augment - * @param name the name of the target field - * @param value the String value to associate with the target field - */ - public static void addIfNotNull(NutchDocument doc, String name, - String value) { -if (value != null) { - doc.add(name, value); -} - } + private static final Logger LOG = LoggerFactory + 
.getLogger(MethodHandles.lookup().lookupClass()); /** * Add field to document but only if value isn't null * @param doc the {@link NutchDocument} to augment * @param name the name of the target field - * @param value the {@link java.lang.Integer} value to - * associate with the target field + * @param value the String value to associate with the target field */ public static void addIfNotNull(NutchDocument doc, String name, - Integer value) { + Object value) { if (value != null) { doc.add(name, value); } @@ -87,7 +80,6 @@ public class
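The GeoIPDocumentCreator hunk above replaces the separate String and Integer overloads of `addIfNotNull` with a single method taking `Object`. A minimal, self-contained sketch of that pattern follows. The `Doc` class is a hypothetical stand-in for `NutchDocument`, and the field names used in `main` are made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a null-safe "add field" helper with one Object overload. */
final class AddIfNotNullSketch {

  /** Stand-in for NutchDocument: field name mapped to a list of values. */
  static final class Doc {
    final Map<String, List<Object>> fields = new LinkedHashMap<>();

    void add(String name, Object value) {
      fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }
  }

  /** Adds the field only if the value is not null; works for any value type. */
  static void addIfNotNull(Doc doc, String name, Object value) {
    if (value != null) {
      doc.add(name, value);
    }
  }

  public static void main(String[] args) {
    Doc doc = new Doc();
    addIfNotNull(doc, "cityName", "Berlin"); // String value
    addIfNotNull(doc, "asNumber", 64500);    // Integer value, boxed to Object
    addIfNotNull(doc, "ispName", null);      // skipped: no field is created
    System.out.println(doc.fields);
  }
}
```

With a single `Object` parameter, values such as doubles or booleans go through the same null check without needing a new overload per type.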
[nutch] branch master updated: NUTCH-2953 Indexer Elastic to ignore SSL issues - apply patch contributed by Markus Jelsma - fix class imports
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 01ab00b6c NUTCH-2953 Indexer Elastic to ignore SSL issues - apply patch contributed by Markus Jelsma - fix class imports 01ab00b6c is described below commit 01ab00b6cd8dbba8abbf1d3840a09bab929c6af0 Author: Sebastian Nagel AuthorDate: Mon Aug 8 16:19:24 2022 +0200 NUTCH-2953 Indexer Elastic to ignore SSL issues - apply patch contributed by Markus Jelsma - fix class imports --- .../indexwriter/elastic/ElasticIndexWriter.java| 31 +- 1 file changed, 30 insertions(+), 1 deletion(-) diff --git a/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java b/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java index 7885a5210..053bfd68a 100644 --- a/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java +++ b/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java @@ -25,14 +25,20 @@ import java.util.List; import java.util.Map; import java.util.concurrent.TimeUnit; +import javax.net.ssl.SSLContext; + import org.apache.commons.lang.StringUtils; import org.apache.hadoop.conf.Configuration; import org.apache.http.HttpHost; import org.apache.http.auth.AuthScope; import org.apache.http.auth.UsernamePasswordCredentials; import org.apache.http.client.CredentialsProvider; +import org.apache.http.conn.ssl.NoopHostnameVerifier; +import org.apache.http.conn.ssl.TrustSelfSignedStrategy; import org.apache.http.impl.client.BasicCredentialsProvider; import org.apache.http.impl.nio.client.HttpAsyncClientBuilder; +import org.apache.http.ssl.SSLContextBuilder; +import org.apache.http.ssl.SSLContexts; import org.apache.nutch.indexer.IndexWriter; import 
org.apache.nutch.indexer.IndexWriterParams; import org.apache.nutch.indexer.NutchDocument; @@ -181,6 +187,7 @@ public class ElasticIndexWriter implements IndexWriter { hostsList[i++] = new HttpHost(host, port, scheme); } RestClientBuilder restClientBuilder = RestClient.builder(hostsList); + if (auth) { restClientBuilder .setHttpClientConfigCallback(new HttpClientConfigCallback() { @@ -191,6 +198,28 @@ public class ElasticIndexWriter implements IndexWriter { } }); } + + // In case of HTTPS, set the client up for ignoring problems with self-signed + // certificates and stuff + if ("https".equals(scheme)) { +try { + SSLContextBuilder sslBuilder = SSLContexts.custom(); + sslBuilder.loadTrustMaterial(null, new TrustSelfSignedStrategy()); + final SSLContext sslContext = sslBuilder.build(); + + restClientBuilder.setHttpClientConfigCallback(new HttpClientConfigCallback() { +@Override +public HttpAsyncClientBuilder customizeHttpClient(HttpAsyncClientBuilder httpClientBuilder) { + // ignore issues with self-signed certificates + httpClientBuilder.setSSLHostnameVerifier(NoopHostnameVerifier.INSTANCE); + return httpClientBuilder.setSSLContext(sslContext); +} + }); +} catch (Exception e) { + LOG.error("Error setting up SSLContext because: " + e.getMessage(), e); +} + } + client = new RestHighLevelClient(restClientBuilder); } else { throw new IOException( @@ -344,4 +373,4 @@ public class ElasticIndexWriter implements IndexWriter { public Configuration getConf() { return config; } -} \ No newline at end of file +}
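In the ElasticIndexWriter hunk above, the `auth` branch and the `https` branch each call `setHttpClientConfigCallback`. If the builder keeps only the most recently set callback (an assumption about `RestClientBuilder` that should be checked against its documentation), the credentials and the SSL settings would need to be applied from one callback to take effect together. The sketch below shows that composition using JDK types only: `FakeClientBuilder` is a hypothetical stand-in for `HttpAsyncClientBuilder`, and the setting names are plain labels, not real API calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical sketch: build one callback that applies every needed customization. */
final class CallbackComposeSketch {

  /** Stand-in for an HTTP client builder; it only records what was applied. */
  static final class FakeClientBuilder {
    final List<String> applied = new ArrayList<>();

    FakeClientBuilder with(String setting) {
      applied.add(setting);
      return this;
    }
  }

  /** Collects the customizations for the given options into a single callback. */
  static UnaryOperator<FakeClientBuilder> callbackFor(boolean auth, String scheme) {
    List<UnaryOperator<FakeClientBuilder>> steps = new ArrayList<>();
    if (auth) {
      steps.add(b -> b.with("credentials"));
    }
    if ("https".equals(scheme)) {
      steps.add(b -> b.with("sslContext"));
      steps.add(b -> b.with("noopHostnameVerifier"));
    }
    return builder -> {
      FakeClientBuilder current = builder;
      for (UnaryOperator<FakeClientBuilder> step : steps) {
        current = step.apply(current); // apply customizations in order
      }
      return current;
    };
  }

  public static void main(String[] args) {
    FakeClientBuilder b = callbackFor(true, "https").apply(new FakeClientBuilder());
    System.out.println(b.applied);
  }
}
```

In the real writer the same idea would mean registering a single `HttpClientConfigCallback` whose `customizeHttpClient` sets the credentials provider, the SSL context, and the hostname verifier as needed.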
[nutch] branch master updated: NUTCH-2952 Upgrade core dependencies - Hadoop 3.1.3 -> 3.3.3 - log4j 2.17.0 -> 2.17.2 - and some more
This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new e71841fd0 NUTCH-2952 Upgrade core dependencies - Hadoop 3.1.3 -> 3.3.3 - log4j 2.17.0 -> 2.17.2 - and some more e71841fd0 is described below commit e71841fd0f1777ece6dde2115ea7c5b036bb13f1 Author: Sebastian Nagel AuthorDate: Wed Jun 15 17:07:07 2022 +0200 NUTCH-2952 Upgrade core dependencies - Hadoop 3.1.3 -> 3.3.3 - log4j 2.17.0 -> 2.17.2 - and some more --- ivy/ivy.xml | 40 + src/plugin/publish-rabbitmq/ivy.xml | 2 +- 2 files changed, 19 insertions(+), 23 deletions(-) diff --git a/ivy/ivy.xml b/ivy/ivy.xml index a03bce45f..12fa6d94c 100644 --- a/ivy/ivy.xml +++ b/ivy/ivy.xml @@ -36,10 +36,10 @@ - - - - + + + + @@ -50,7 +50,7 @@ - + @@ -58,23 +58,23 @@ - - - + + + - + - + - + - + @@ -84,10 +84,10 @@ - - - - + + + + @@ -111,16 +111,12 @@ - - - + + - - - diff --git a/src/plugin/publish-rabbitmq/ivy.xml b/src/plugin/publish-rabbitmq/ivy.xml index dd450cf7f..7b5e3dd3c 100644 --- a/src/plugin/publish-rabbitmq/ivy.xml +++ b/src/plugin/publish-rabbitmq/ivy.xml @@ -34,5 +34,5 @@ - +