This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.

    from 466cac5 Merge pull request #548 from sebastian-nagel/NUTCH-2817-spotbugs-object-equality
     new 7f51c25 NUTCH-2743 Add list of Nutch properties (nutch-default.xml) to documentation
                 - modify ant build.xml to copy nutch-default.xml into docs/api/resources/
                 - adapt XSLT table layout
                 - remove obsolete nutch-conf.xsl
                 - fix typos and normalize spelling in nutch-default.xml
     new 1b27ab8 NUTCH-2434 Add methods to reset parameters HTMLMetaTags (apply patch contributed by Markus)
     new 814f8b9 NUTCH-1194 Generator: CrawlDB lock should be released earlier
                 - release CrawlDb lock after select step, in case, generated items are not marked in CrawlDb (generate.update.crawldb is false)
     new 06b2271 NUTCH-2785 FreeGenerator: command-line option to define number of generated fetch lists
                 - add command-line option `-numFetchers` to FreeGenerator
                 - in local mode: generate one single fetch list
     new aed6fa7 NUTCH-2002 parse and index checkers to check robots.txt
                 - applied Julien's patch to recent code base
                 - also check redirects whether they are allowed
                 - add command-line parameter `-checkRobotsTxt` enabling this check
     new 72b941f NUTCH-2753 Add -listen option to command-line help of CrawlDbReader and LinkDbReader
     new 495f0ea NUTCH-2758 Add plugin READMEs to binary release packages
     new 3759019 NUTCH-1945 Test for XLSX parser
                 - add Tika unit test for XLSX files
                 - bundle instance variables and utility methods in class TikaParserTest
                 - clean up javadoc comments
     new 79f3c0a NUTCH-2419 Some URL filters and normalizers do not respect command-line override for rule file
     new 83011a0 NUTCH-2419 Some URL filters and normalizers do not respect command-line override for rule file
     new 5087151 NUTCH-2720 ROBOTS metatag ignored when capitalized
     new fa319a6 NUTCH-2720 ROBOTS metatag ignored when capitalized
                 - move string "robots" to constant in metadata.Nutch
                 - make string lowercase not depend on system locale
     new ea6b2f0 NUTCH-2496 Speed up link inversion step in crawling script
     new 6c65498 NUTCH-2790 indexer-csv: escape field leading quote character
     new 41d3eb1 NUTCH-2787 CrawlDb JSON dump does not export metadata primitive data types correctly
                 - add JsonSerializer to write common Writable types (null, boolean, numbers)
                 - remaining "unknown" Writables are written after calling toString()
     new e8673d1 NUTCH-2788 ParseData: improve presentation of Metadata in method toString()
                 - switch to multi-line presentation of Metadata in ParseData::toString
                 - default implementation of Metadata::toString is still single-line
                 - replace StringBuffer by StringBuilder in modified methods
     new f08c9db NUTCH-2789 Docker README: update links to point to cwiki
     new 75e4e63 NUTCH-2789 Documentation: update links to point to cwiki
     new 5649513 NUTCH-2791 Handle GCS URLs in stats commands
     new 38f6f56 NUTCH-2794 Add additional ciphers to HTTP base's default cipher suite
     new 4b505f2 Prepare for new development after release of 1.17
                 - bump version number (1.17-SNAPSHOT -> 1.18-SNAPSHOT)
                 - add 1.17 changes / release notes
                 - update links to Hadoop and Solr API docs
                 - update current year in API docs etc.
     new 6fb5ebb [NUTCH-2796] Upgrade to crawler-commons 1.1
     new 7b16354 [NUTCH-2730] SitemapProcessor to treat sitemap URLs as Set instead of List
                 - sitemap links from robots.txt are treated as set by crawler-commons (since crawler-commons 1.1)
                 - sitemaps referenced in sitemap index are deduplicated
     new 50eba77 NUTCH-2782: protocol-http / lib-http: support TLSv1.3
     new 4cc6048 NUTCH-2805: Rename plugin urlfilter-domainblacklist (#540)
     new 669e5a1 NUTCH-2799 Add .asf.yaml file
                 - add project description in one sentence
                 - add github topics
                 - set github mailing list notifications as configured before
     new 0d6447a NUTCH-2799 Add .asf.yaml file
                 - update pull request template regarding Jira linking: issue id should be in square brackets (`[NUTCH-XXXX]`)
     new 2c3d864 NUTCH-1190 MoreIndexingFilter: move data formats used to parse "lastModified" to a config file
     new d3d3b31 [NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as fall-back
                 - if no agent names are given as command-line arguments use values of http.agent.name and http.robots.agents as agent names to be checked
                 - update command-line help
     new a73bd14 [NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as fall-back
                 - clarify comment regarding bypassing the confidence check for a non-empty http.agent.name
     new a51b0f5 NUTCH-2810 FreeGenerator to actually apply configured number of fetch lists
     new b4b81f7 NUTCH-2811 : Setup Github workflows for prs (#543)
     new e7a3da3 NUTCH-2816 Add Spotbugs target to ant build
                 - called on-demand as ant target "spotbugs"
                 - creates spotbugs report ("build/nutch-spotbugs.html") covering Nutch core and plugins
     new 69deffa NUTCH-2817 Avoid check for equality of URL path and file part using ==/!=
                 - replace check whether URL path and file are identical by check whether URL has a query
                 - clean up code and improve log messages
     new ae844b6 Merge branch 'derhecht-patch-2', closes #545

The 35 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 conf/date-styles.txt.template                      |  52 +++++++++
 .../nutch/indexer/more/MoreIndexingFilter.java     | 123 ++++++++++++++-------
 2 files changed, 132 insertions(+), 43 deletions(-)
 create mode 100644 conf/date-styles.txt.template
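
For readers unfamiliar with the issue behind NUTCH-2817 (69deffa): in java.net.URL, getFile() returns the path plus any query string, while getPath() returns the path alone. Comparing the two with ==/!= tests object identity rather than content, which is what Spotbugs flags. The following is a minimal sketch of the replacement check described in the commit message, not the actual Nutch code; the class name UrlQueryCheck and the method hasQuery are hypothetical and chosen only for illustration.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical helper class, for illustration only (not part of Nutch).
final class UrlQueryCheck {

    private UrlQueryCheck() {
    }

    /**
     * Returns true if the URL carries a query part.
     * This asks the question directly instead of comparing
     * url.getFile() and url.getPath() by reference.
     */
    static boolean hasQuery(URL url) {
        return url.getQuery() != null;
    }

    public static void main(String[] args) throws MalformedURLException {
        URL plain = URI.create("https://example.org/index.html").toURL();
        URL withQuery = URI.create("https://example.org/search?q=nutch").toURL();

        System.out.println(hasQuery(plain));      // false
        System.out.println(hasQuery(withQuery));  // true

        // getPath() excludes the query, getFile() includes it.
        System.out.println(withQuery.getPath());  // /search
        System.out.println(withQuery.getFile());  // /search?q=nutch
    }
}
```

Checking getQuery() against null states the intent explicitly and does not depend on whether the two accessor methods happen to return the same String instance.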