This is an automated email from the ASF dual-hosted git repository.
snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.
from 466cac5 Merge pull request #548 from
sebastian-nagel/NUTCH-2817-spotbugs-object-equality
new 7f51c25 NUTCH-2743 Add list of Nutch properties (nutch-default.xml)
to documentation - modify ant build.xml to copy nutch-default.xml into
docs/api/resources/ - adapt XSLT table layout - remove obsolete nutch-conf.xsl
- fix typos and normalize spelling in nutch-default.xml
new 1b27ab8 NUTCH-2434 Add methods to reset parameters HTMLMetaTags
(apply patch contributed by Markus)
new 814f8b9 NUTCH-1194 Generator: CrawlDB lock should be released earlier
- release CrawlDb lock after select step, in case, generated items are not
marked in CrawlDb (generate.update.crawldb is false)
new 06b2271 NUTCH-2785 FreeGenerator: command-line option to define
number of generated fetch lists - add command-line option `-numFetchers` to
FreeGenerator - in local mode: generate one single fetch list
new aed6fa7 NUTCH-2002 parse and index checkers to check robots.txt -
applied Julien's patch to recent code base - also check redirects whether they
are allowed - add command-line parameter `-checkRobotsTxt` enabling this check
new 72b941f NUTCH-2753 Add -listen option to command-line help of
CrawlDbReader and LinkDbReader
new 495f0ea NUTCH-2758 Add plugin READMEs to binary release packages
new 3759019 NUTCH-1945 Test for XLSX parser - add Tika unit test for XLSX
files - bundle instance variables and utility methods in class TikaParserTest -
clean up javadoc comments
new 79f3c0a NUTCH-2419 Some URL filters and normalizers do not respect
command-line override for rule file
new 83011a0 NUTCH-2419 Some URL filters and normalizers do not respect
command-line override for rule file
new 5087151 NUTCH-2720 ROBOTS metatag ignored when capitalized
new fa319a6 NUTCH-2720 ROBOTS metatag ignored when capitalized - move
string "robots" to constant in metadata.Nutch - make string lowercase not
depend on system locale
new ea6b2f0 NUTCH-2496 Speed up link inversion step in crawling script
new 6c65498 NUTCH-2790 indexer-csv: escape field leading quote character
new 41d3eb1 NUTCH-2787 CrawlDb JSON dump does not export metadata
primitive data types correctly - add JsonSerializer to write common Writable
types (null, boolean, numbers) - remaining "unknown" Writables are written
after calling toString()
new e8673d1 NUTCH-2788 ParseData: improve presentation of Metadata in
method toString() - switch to multi-line presentation of Metadata in
ParseData::toString - default implementation of Metadata::toString is still
single-line - replace StringBuffer by StringBuilder in modified methods
new f08c9db NUTCH-2789 Docker README: update links to point to cwiki
new 75e4e63 NUTCH-2789 Documentation: update links to point to cwiki
new 5649513 NUTCH-2791 Handle GCS URLs in stats commands
new 38f6f56 NUTCH-2794 Add additional ciphers to HTTP base's default
cipher suite
new 4b505f2 Prepare for new development after release of 1.17 - bump
version number (1.17-SNAPSHOT -> 1.18-SNAPSHOT) - add 1.17 changes / release
notes - update links to Hadoop and Solr API docs - update current year in API
docs etc.
new 6fb5ebb [NUTCH-2796] Upgrade to crawler-commons 1.1
new 7b16354 [NUTCH-2730] SitemapProcessor to treat sitemap URLs as Set
instead of List - sitemap links from robots.txt are treated as set by
crawler-commons (since crawler-commons 1.1) - sitemaps referenced in sitemap
index are deduplicated
new 50eba77 NUTCH-2782: protocol-http / lib-http: support TLSv1.3
new 4cc6048 NUTCH-2805: Rename plugin urlfilter-domainblacklist (#540)
new 669e5a1 NUTCH-2799 Add .asf.yaml file - add project description in
one sentence - add github topics - set github mailing list notifications as
configured before
new 0d6447a NUTCH-2799 Add .asf.yaml file - update pull request template
regarding Jira linking: issue id should be in square brackets (`[NUTCH-XXXX]`)
new 2c3d864 NUTCH-1190 MoreIndexingFilter: move data formats used to
parse "lastModified" to a config file
new d3d3b31 [NUTCH-2801] RobotsRulesParser command-line checker to use
http.robots.agents as fall-back - if no agent names are given as command-line
arguments use values of http.agent.name and http.robots.agents as agent names
to be checked - update command-line help
new a73bd14 [NUTCH-2801] RobotsRulesParser command-line checker to use
http.robots.agents as fall-back - clarify comment regarding bypassing the
confidence check for a non-empty http.agent.name
new a51b0f5 NUTCH-2810 FreeGenerator to actually apply configured number
of fetch lists
new b4b81f7 NUTCH-2811 : Setup Github workflows for prs (#543)
new e7a3da3 NUTCH-2816 Add Spotbugs target to ant build - called
on-demand as ant target "spotbugs" - creates spotbugs report
("build/nutch-spotbugs.html") covering Nutch core and plugins
new 69deffa NUTCH-2817 Avoid check for equality of URL path and file part
using ==/!= - replace check whether URL path and file are identical by check
whether URL has a query - clean up code and improve log messages
new ae844b6 Merge branch 'derhecht-patch-2', closes #545
The 35 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
Summary of changes:
 conf/date-styles.txt.template                      |  52 +++++++++
 .../nutch/indexer/more/MoreIndexingFilter.java     | 123 ++++++++++++++-------
 2 files changed, 132 insertions(+), 43 deletions(-)
 create mode 100644 conf/date-styles.txt.template
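
Note on NUTCH-2720 (fa319a6, "make string lowercase not depend on system
locale"): the sketch below is not the code from that commit. It only
illustrates the general Java technique of lowercasing with Locale.ROOT, which
is presumably what the commit message refers to. The class name, the method
name and the sample value "NOINDEX" are hypothetical and chosen for
illustration.

```java
import java.util.Locale;

/** Hypothetical illustration; not taken from the Nutch code base. */
public class LocaleSafeLowercase {

    /** Lowercases a meta tag name or value independently of the default locale. */
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        String directive = "NOINDEX";

        // Under Turkish casing rules 'I' becomes dotless 'ı' (U+0131),
        // so the lowercased value no longer equals "noindex": prints false.
        System.out.println(directive.toLowerCase(turkish).equals("noindex"));

        // Locale.ROOT applies locale-neutral casing rules: prints true.
        System.out.println(normalize(directive).equals("noindex"));
    }
}
```

The no-argument String.toLowerCase() uses the JVM default locale, so the same
comparison can succeed on one machine and fail on another. Passing an explicit
locale avoids that dependency.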