This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.


    from 466cac5  Merge pull request #548 from 
sebastian-nagel/NUTCH-2817-spotbugs-object-equality
     new 7f51c25  NUTCH-2743 Add list of Nutch properties (nutch-default.xml) 
to documentation - modify ant build.xml to copy nutch-default.xml into 
docs/api/resources/ - adapt XSLT table layout - remove obsolete nutch-conf.xsl 
- fix typos and normalize spelling in nutch-default.xml
     new 1b27ab8  NUTCH-2434 Add methods to reset parameters HTMLMetaTags 
(apply patch contributed by Markus)
     new 814f8b9  NUTCH-1194 Generator: CrawlDB lock should be released earlier 
- release CrawlDb lock after select step in case generated items are not 
marked in CrawlDb (generate.update.crawldb is false)
     new 06b2271  NUTCH-2785 FreeGenerator: command-line option to define 
number of generated fetch lists - add command-line option `-numFetchers` to 
FreeGenerator - in local mode: generate one single fetch list
     new aed6fa7  NUTCH-2002 parse and index checkers to check robots.txt - 
applied Julien's patch to recent code base - also check redirects whether they 
are allowed - add command-line parameter `-checkRobotsTxt` enabling this check
     new 72b941f  NUTCH-2753 Add -listen option to command-line help of 
CrawlDbReader and LinkDbReader
     new 495f0ea  NUTCH-2758 Add plugin READMEs to binary release packages
     new 3759019  NUTCH-1945 Test for XLSX parser - add Tika unit test for XLSX 
files - bundle instance variables and utility methods in class TikaParserTest - 
clean up javadoc comments
     new 79f3c0a  NUTCH-2419 Some URL filters and normalizers do not respect 
command-line override for rule file
     new 83011a0  NUTCH-2419 Some URL filters and normalizers do not respect 
command-line override for rule file
     new 5087151  NUTCH-2720 ROBOTS metatag ignored when capitalized
     new fa319a6  NUTCH-2720 ROBOTS metatag ignored when capitalized - move 
string "robots" to constant in metadata.Nutch - make string lowercase not 
depend on system locale
     new ea6b2f0  NUTCH-2496 Speed up link inversion step in crawling script
     new 6c65498  NUTCH-2790 indexer-csv: escape field leading quote character
     new 41d3eb1  NUTCH-2787 CrawlDb JSON dump does not export metadata 
primitive data types correctly - add JsonSerializer to write common Writable 
types (null, boolean, numbers) - remaining "unknown" Writables are written 
after calling toString()
     new e8673d1  NUTCH-2788 ParseData: improve presentation of Metadata in 
method toString() - switch to multi-line presentation of Metadata in 
ParseData::toString - default implementation of Metadata::toString is still 
single-line - replace StringBuffer by StringBuilder in modified methods
     new f08c9db  NUTCH-2789 Docker README: update links to point to cwiki
     new 75e4e63  NUTCH-2789 Documentation: update links to point to cwiki
     new 5649513  NUTCH-2791 Handle GCS URLs in stats commands
     new 38f6f56  NUTCH-2794 Add additional ciphers to HTTP base's default 
cipher suite
     new 4b505f2  Prepare for new development after release of 1.17 - bump 
version number (1.17-SNAPSHOT -> 1.18-SNAPSHOT) - add 1.17 changes / release 
notes - update links to Hadoop and Solr API docs - update current year in API 
docs etc.
     new 6fb5ebb  [NUTCH-2796] Upgrade to crawler-commons 1.1
     new 7b16354  [NUTCH-2730] SitemapProcessor to treat sitemap URLs as Set 
instead of List - sitemap links from robots.txt are treated as set by 
crawler-commons (since crawler-commons 1.1) - sitemaps referenced in sitemap 
index are deduplicated
     new 50eba77  NUTCH-2782: protocol-http / lib-http: support TLSv1.3
     new 4cc6048  NUTCH-2805: Rename plugin urlfilter-domainblacklist (#540)
     new 669e5a1  NUTCH-2799 Add .asf.yaml file - add project description in 
one sentence - add github topics - set github mailing list notifications as 
configured before
     new 0d6447a  NUTCH-2799 Add .asf.yaml file - update pull request template 
regarding Jira linking: issue id should be in square brackets (`[NUTCH-XXXX]`)
     new 2c3d864  NUTCH-1190 MoreIndexingFilter: move data formats used to 
parse "lastModified" to a config file
     new d3d3b31  [NUTCH-2801] RobotsRulesParser command-line checker to use 
http.robots.agents as fall-back - if no agent names are given as command-line 
arguments use values of http.agent.name and http.robots.agents as agent names 
to be checked - update command-line help
     new a73bd14  [NUTCH-2801] RobotsRulesParser command-line checker to use 
http.robots.agents as fall-back - clarify comment regarding bypassing the 
confidence check for a non-empty http.agent.name
     new a51b0f5  NUTCH-2810 FreeGenerator to actually apply configured number 
of fetch lists
     new b4b81f7  NUTCH-2811 : Setup Github workflows for prs (#543)
     new e7a3da3  NUTCH-2816 Add Spotbugs target to ant build - called 
on-demand as ant target "spotbugs" - creates spotbugs report 
("build/nutch-spotbugs.html") covering Nutch core and plugins
     new 69deffa  NUTCH-2817 Avoid check for equality of URL path and file part 
using ==/!= - replace check whether URL path and file are identical by check 
whether URL has a query - clean up code and improve log messages
     new ae844b6  Merge branch 'derhecht-patch-2', closes #545

The 35 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 conf/date-styles.txt.template                      |  52 +++++++++
 .../nutch/indexer/more/MoreIndexingFilter.java     | 123 ++++++++++++++-------
 2 files changed, 132 insertions(+), 43 deletions(-)
 create mode 100644 conf/date-styles.txt.template
