commits
Messages by Thread
[nutch-site] 01/01: Add .asf.yaml file to Nutch website
lewismc
[nutch-site] branch main updated: Remove broken site
lewismc
[nutch] branch master updated (75daf3e -> ff800c5)
snagel
[nutch] branch master updated (64fb604 -> 75daf3e)
snagel
[nutch] branch master updated (25ccf89 -> 64fb604)
snagel
[nutch] branch master updated: Upgrade to crawler-commons 1.2
snagel
[nutch] branch master updated: NUTCH-2902 Jexl parsing error on statements (contributed by Max Ockner) - use JexlScript instead of JexlExpression in Generator, CrawlDb/HostDb reader, Jexl exchange and indexing filter
snagel
[nutch] branch master updated: NUTCH-2899 Remove needless warning about missing o/a/rat/anttasks/antlib.xml - avoid needless warning by moving taskdef into task element
snagel
[nutch] branch master updated: NUTCH-2862 Do not include Ivy jar in source release package
snagel
[nutch] branch master updated: quick IntelliJ IDEA setup docs added (#698)
lewismc
[nutch] branch master updated (eeb9863 -> c48b8d1)
snagel
[nutch] branch master updated: NUTCH-2894 Java plugin compilation classpath: priorize plugin dependencies
snagel
[nutch] branch master updated: fireant upgrade dependency elasticsearch-rest-high-level-client in src/plugin/indexer-elastic/ivy.xml from 7.11.1 to 7.13.2 (#688)
lewismc
[nutch-site] branch main updated: Attempt to implement single page templating.
lewismc
[nutch-site] branch main created (now ae6f9f2)
lewismc
[nutch-site] 01/01: NUTCH-2826 Migrate Nutch Site from Apache CMS to Hugo
lewismc
[nutch] branch master updated: NUTCH-2885 Upgrade to Log4j2 (#692)
lewismc
[nutch-webapp] branch master updated: Add missing files
lewismc
[nutch-webapp] branch master created (now da3c282)
lewismc
[nutch] branch master updated: fireant upgrade dependency httpcore in ivy/ivy.xml from 4.4.9 to 4.4.14 (#681)
lewismc
[nutch] branch master updated: NUTCH-2882 Configure NutchUiServer for DEPLOYMENT and improve logging (#690)
lewismc
[nutch] branch master updated: NUTCH-2881 bug in 'nutch' symlink in docker container (#689)
lewismc
[nutch] branch master updated: NUTCH-2868 urlnormalizer-protocol fails with StringIndexOutOfBoundsException when reading invalid line in configuration file - log invalid line and skip over it - more verbose logging which configuration file is read - add unit test to proof that invalid configuration lines are skipped
snagel
[nutch] branch master updated: fireant upgrade dependency junit in ivy/ivy.xml from 4.13.1 to 4.13.2 (#666)
lewismc
[nutch] branch master updated: NUTCH-2869 Add @Override annotations to Nutch plugins - add/complete @Override annotions for methods implementing interfaces - plugins implementing the ScoringFilter interface: extend AbstractScoringFilter and get rid of default method implementations - URL filters/normalizers: remove unused methods including a CrawlDatum parameter - improve Javadoc and documentation in build and config files
snagel
[nutch] branch master updated: NUTCH-2864 Upgrade Dockerfile to use JDK 11 (#647)
lewismc
[nutch] branch master updated: NUTCH-2866 Fix MetaData.toString() to return "key=value ..."
snagel
svn commit: r1074658 - /websites/production/nutch/content/
snagel
svn commit: r1074656 - in /websites/staging/nutch/trunk/content: ./ apidocs/apidocs-1.18/resources/ apidocs/apidocs-1.18/resources/configuration.xsl apidocs/apidocs-1.18/resources/nutch-default.xml
buildbot
svn commit: r1889891 [2/3] - in /nutch/cms_site/trunk/content/apidocs/apidocs-1.18/resources: ./ configuration.xsl nutch-default.xml
snagel
svn commit: r1889891 [1/3] - in /nutch/cms_site/trunk/content/apidocs/apidocs-1.18/resources: ./ configuration.xsl nutch-default.xml
snagel
svn commit: r1889891 [3/3] - in /nutch/cms_site/trunk/content/apidocs/apidocs-1.18/resources: ./ configuration.xsl nutch-default.xml
snagel
[nutch] branch master updated (2837039 -> 6c02da0)
snagel
[nutch] branch master updated: NUTCH-2855 Update org.elasticsearch.client (#577)
lewismc
[nutch] branch master updated: NUTCH-2857 Upgrade from JDK1.8 --> JDK11 (#573)
lewismc
[nutch] branch master updated: NUTCH-2596 Upgrade from org.mortbay.jetty to org.eclipse.jetty - remove Jetty (serving JSP pages) for HTTP protocol plugin tests - replace JSP pages by header/content strings hold in unit test classes
snagel
[nutch] branch master updated: NUTCH-2850 Method ignores exceptional return value (#570)
lewismc
[nutch] branch master updated: NUTCH-2851 Random object created and used only once (#571)
lewismc
[nutch] branch master updated: NUTCH-2849 Replace remaining package.html files with package-info.java (#569)
lewismc
[nutch] branch master updated: NUTCH-2847 HttpDateFormat: Simplify based on new Java 8 DateTime API
snagel
[nutch] branch master updated (3483a41 -> 491b5c2)
snagel
[nutch] branch master updated (7ffc667 -> 3483a41)
snagel
[nutch] branch master updated: NUTCH-2845 Complete rules of urlfilter-suffix, add more excluded file suffixes for - images - audio and video formats - software packages and archives - fonts
snagel
[nutch] branch master updated: NUTCH-2840 Fix 'report-vulnerabilities' ant target in build.xml (#561)
lewismc
[nutch] branch master updated: NUTCH-2819 Move spotbugs "installation" directory to avoid that spotbugs is shipped in Nutch runtime (#565)
lewismc
[nutch] branch master updated: Prepare for Nutch 1.19-SNAPSHOT development
lewismc
svn commit: r1070562 - /websites/production/nutch/content/
lewismc
svn commit: r45580 - /release/nutch/1.17/
lewismc
svn commit: r1070514 - in /websites/staging/nutch/trunk/content: ./ apidocs/apidocs-1.18/ apidocs/apidocs-1.18/org/ apidocs/apidocs-1.18/org/apache/ apidocs/apidocs-1.18/org/apache/nutch/ apidocs/apidocs-1.18/org/apache/nutch/analysis/ apidocs/apidocs-...
buildbot
svn commit: r1885887 - in /nutch/cms_site/trunk/content: ./ apidocs/apidocs-1.18/ apidocs/apidocs-1.18/org/ apidocs/apidocs-1.18/org/apache/ apidocs/apidocs-1.18/org/apache/nutch/ apidocs/apidocs-1.18/org/apache/nutch/analysis/ apidocs/apidocs-1.18/org...
lewismc
svn commit: r45570 - in /release/nutch/1.18: apache-nutch-1.18-bin.zip.md5 apache-nutch-1.18-src.tar.gz.md5 apache-nutch-1.18-src.zip.md5
lewismc
svn commit: r45569 - /release/nutch/1.18/apache-nutch-1.18-bin.tar.gz.md5
lewismc
svn commit: r45568 - /dev/nutch/1.18/ /release/nutch/1.18/
lewismc
svn commit: r45520 [3/3] - /dev/nutch/1.18/
lewismc
svn commit: r45520 [2/3] - /dev/nutch/1.18/
lewismc
svn commit: r45520 [1/3] - /dev/nutch/1.18/
lewismc
[nutch] annotated tag release-1.18 updated (43f3550 -> a8ef299)
lewismc
[nutch] branch branch-1.18 updated: Prepare for Nutch 1.18 release
lewismc
[nutch] branch branch-1.18 created (now e9f125c)
lewismc
[nutch] 01/01: Prepare for Nutch 1.18 release
lewismc
[nutch] branch master updated: NUTCH-2841 Upgrade xercesImpl dependency (#563)
lewismc
[nutch] branch master updated: NUTCH-2837 Update multiple dependencies (#560)
lewismc
[nutch] branch master updated: NUTCH-2836 Upgrade various commons dependencies (#559)
lewismc
[nutch] branch master updated: Add possibility to setup deduplication group mode in crawl script (#557)
lewismc
[nutch] branch master updated: NUTCH-2835 Upgrade commons-jexl from 2 --> 3 (#558)
lewismc
[nutch] branch master updated: NUTCH-2833 Upgrade to Tika 1.25
snagel
[nutch] branch master updated: NUTCH-2582 Set pool size of XML SAX parsers used for MIME detection in Tika - add method in MimeUtil to set MimeTypesReader pool size - actually adjust pool size to number of Fetcher threads / 2 (minimum pool size is 10 in case there are less than 20 Fetcher threads) - double pool size (10 -> 20) of Tika XMLReaderUtils in tika-config.xml
snagel
[nutch] branch master updated: NUTCH-2809 Upgrade any23 plugin dependency to 2.4 (#553)
lewismc
[nutch] branch master updated: NUTCH-2829 Fix ant target "clean-cache" - make target "clean-cache" depend on "ivy-init" so that ivy-related resources are defined
snagel
[nutch] branch master updated (0b46ac2 -> 680df6b)
snagel
[nutch] branch master updated: NUTCH-2823 IllegalStateException in IndexWriters.describe() when validating url param for SolrIndexer
snagel
[nutch] branch master updated: NUTCH-2818 Fix Apache Rat task to check sources for license headers - automatize download of Apache Rat jar file - write report to build/apache-rat-report.txt
snagel
[nutch] branch master updated: NUTCH-2814 HttpDateFormat's internal time zone may change after parsing a date - reset time zone to GMT after parsing a date
snagel
[nutch] branch master updated: NUTCH-2697 Upgrade Ivy to fix the issue of an unset packaging.type property NUTCH-2671 Upgrade ant ivy library - upgrade Ivy (2.4.0 -> 2.5.0) - upgrade all plugins build-ivy.xml to use the ivy jar 2.5.0 installed in $NUTCH_HOME/ivy/ for preparing lists of dependencies registered in plugin.xml
snagel
[nutch] branch master updated (466cac5 -> ae844b6)
snagel
[nutch] 34/35: NUTCH-2817 Avoid check for equality of URL path and file part using ==/!= - replace check whether URL path and file are identical by check whether URL has a query - clean up code and improve log messages
snagel
[nutch] 12/35: NUTCH-2720 ROBOTS metatag ignored when capitalized - move string "robots" to constant in metadata.Nutch - make string lowercase not depend on system locale
snagel
[nutch] 33/35: NUTCH-2816 Add Spotbugs target to ant build - called on-demand as ant target "spotbugs" - creates spotbugs report ("build/nutch-spotbugs.html") covering Nutch core and plugins
snagel
[nutch] 09/35: NUTCH-2419 Some URL filters and normalizers do not respect command-line override for rule file
snagel
[nutch] 08/35: NUTCH-1945 Test for XLSX parser - add Tika unit test for XLSX files - bundle instance variables and utility methods in class TikaParserTest - clean up javadoc comments
snagel
[nutch] 22/35: [NUTCH-2796] Upgrade to crawler-commons 1.1
snagel
[nutch] 21/35: Prepare for new development after release of 1.17 - bump version number (1.17-SNAPSHOT -> 1.18-SNAPSHOT) - add 1.17 changes / release notes - update links to Hadoop and Solr API docs - update current year in API docs etc.
snagel
[nutch] 32/35: NUTCH-2811 : Setup Github workflows for prs (#543)
snagel
[nutch] 03/35: NUTCH-1194 Generator: CrawlDB lock should be released earlier - release CrawlDb lock after select step, in case, generated items are not marked in CrawlDb (generate.update.crawldb is false)
snagel
[nutch] 25/35: NUTCH-2805: Rename plugin urlfilter-domainblacklist (#540)
snagel
[nutch] 17/35: NUTCH-2789 Docker README: update links to point to cwiki
snagel
[nutch] 24/35: NUTCH-2782: protocol-http / lib-http: support TLSv1.3
snagel
[nutch] 14/35: NUTCH-2790 indexer-csv: escape field leading quote character
snagel
[nutch] 01/35: NUTCH-2743 Add list of Nutch properties (nutch-default.xml) to documentation - modify ant build.xml to copy nutch-default.xml into docs/api/resources/ - adapt XSLT table layout - remove obsolete nutch-conf.xsl - fix typos and normalize spelling in nutch-default.xml
snagel
[nutch] 16/35: NUTCH-2788 ParseData: improve presentation of Metadata in method toString() - switch to multi-line presentation of Metadata in ParseData::toString - default implementation of Metadata::toString is still single-line - replace StringBuffer by StringBuilder in modified methods
snagel
[nutch] 10/35: NUTCH-2419 Some URL filters and normalizers do not respect command-line override for rule file
snagel
[nutch] 28/35: NUTCH-1190 MoreIndexingFilter: move data formats used to parse "lastModified" to a config file
snagel
[nutch] 29/35: [NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as fall-back - if no agent names are given as command-line arguments use values of http.agent.name and http.robots.agents as agent names to be checked - update command-line help
snagel
[nutch] 05/35: NUTCH-2002 parse and index checkers to check robots.txt - applied Julien's patch to recent code base - also check redirects whether they are allowed - add command-line parameter `-checkRobotsTxt` enabling this check
snagel
[nutch] 20/35: NUTCH-2794 Add additional ciphers to HTTP base's default cipher suite
snagel
[nutch] 18/35: NUTCH-2789 Documentation: update links to point to cwiki
snagel
[nutch] 06/35: NUTCH-2753 Add -listen option to command-line help of CrawlDbReader and LinkDbReader
snagel
[nutch] 15/35: NUTCH-2787 CrawlDb JSON dump does not export metadata primitive data types correctly - add JsonSerializer to write common Writable types (null, boolean, numbers) - remaining "unknown" Writables are written after calling toString()
snagel
[nutch] 31/35: NUTCH-2810 FreeGenerator to actually apply configured number of fetch lists
snagel
[nutch] 23/35: [NUTCH-2730] SitemapProcessor to treat sitemap URLs as Set instead of List - sitemap links from robots.txt are treated as set by crawler-commons (since crawler-commons 1.1) - sitemaps referenced in sitemap index are deduplicated
snagel
[nutch] 02/35: NUTCH-2434 Add methods to reset parameters HTMLMetaTags (apply patch contributed by Markus)
snagel
[nutch] 07/35: NUTCH-2758 Add plugin READMEs to binary release packages
snagel
[nutch] 27/35: NUTCH-2799 Add .asf.yaml file - update pull request template regarding Jira linking: issue id should be in square brackets (`[NUTCH-XXXX]`)
snagel
[nutch] 26/35: NUTCH-2799 Add .asf.yaml file - add project description in one sentence - add github topics - set github mailing list notifications as configured before
snagel
[nutch] 30/35: [NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as fall-back - clarify comment regarding bypassing the confidence check for a non-empty http.agent.name
snagel
[nutch] 13/35: NUTCH-2496 Speed up link inversion step in crawling script
snagel
[nutch] 19/35: NUTCH-2791 Handle GCS URLs in stats commands
snagel
[nutch] 11/35: NUTCH-2720 ROBOTS metatag ignored when capitalized
snagel
[nutch] 35/35: Merge branch 'derhecht-patch-2', closes #545
snagel
[nutch] 04/35: NUTCH-2785 FreeGenerator: command-line option to define number of generated fetch lists - add command-line option `-numFetchers` to FreeGenerator - in local mode: generate one single fetch list
snagel
[nutch] branch master updated: NUTCH-2817 Avoid check for equality of URL path and file part using ==/!= - replace check whether URL path and file are identical by check whether URL has a query - clean up code and improve log messages
snagel
[nutch] branch master updated: NUTCH-2816 Add Spotbugs target to ant build - called on-demand as ant target "spotbugs" - creates spotbugs report ("build/nutch-spotbugs.html") covering Nutch core and plugins
snagel
[nutch] branch master updated: NUTCH-2810 FreeGenerator to actually apply configured number of fetch lists
snagel
[nutch] branch master updated (e33aaa1 -> 2f5a8ad)
snagel
svn commit: r1063816 - /websites/production/nutch/content/
snagel
svn commit: r1880552 - /nutch/cms_site/trunk/content/credits.md
snagel
svn commit: r1063815 - in /websites/staging/nutch/trunk/content: ./ credits.html
buildbot
svn commit: r1063814 - in /websites/staging/nutch/trunk/content: ./ credits.html
buildbot
svn commit: r1880551 - /nutch/cms_site/trunk/content/credits.md
snagel
svn commit: r1063813 - in /websites/staging/nutch/trunk/content: ./ assets/js/README.html bot.html credits.html downloads.html index.html javadoc.html mailing_lists.html version_control.html
buildbot
svn commit: r1880550 - in /nutch/cms_site/trunk: content/credits.md templates/std.html
snagel
[nutch] branch master updated (5399ce0 -> e33aaa1)
balakuntala
[nutch] branch master updated (f0161ea -> 5399ce0)
snagel
[nutch] branch master updated: NUTCH-2805: Rename plugin urlfilter-domainblacklist (#540)
balakuntala
[nutch] branch master updated: NUTCH-2782: protocol-http / lib-http: support TLSv1.3
snagel
[nutch] branch master updated (a1adce7 -> ff15671)
snagel
svn commit: r1062491 - /websites/production/nutch/content/
snagel
svn commit: r1879450 - in /nutch/cms_site/trunk: content/assets/css/bootstrap.css templates/std.html
snagel
svn commit: r1062490 - in /websites/staging/nutch/trunk/content: ./ assets/css/bootstrap.css assets/js/README.html bot.html credits.html downloads.html index.html javadoc.html mailing_lists.html version_control.html
buildbot
svn commit: r40271 - /release/nutch/1.16/
snagel
svn commit: r1062489 - /websites/production/nutch/content/
snagel
svn commit: r1062488 - in /websites/staging/nutch/trunk/content: ./ version_control.html
buildbot
svn commit: r1879446 - /nutch/cms_site/trunk/content/version_control.md
snagel
svn commit: r1062485 - in /websites/staging/nutch/trunk/content: ./ version_control.html
buildbot
svn commit: r1879444 - /nutch/cms_site/trunk/content/version_control.md
snagel
svn commit: r1062484 - in /websites/staging/nutch/trunk/content: ./ assets/js/README.html bot.html credits.html downloads.html index.html javadoc.html mailing_lists.html version_control.html
buildbot
svn commit: r1879443 - in /nutch/cms_site/trunk: content/version_control.md templates/std.html
snagel
svn commit: r1879442 - in /nutch/cms_site/trunk/content: bot.md credits.md downloads.md index.md javadoc.md mailing_lists.md
snagel
svn commit: r1062483 - in /websites/staging/nutch/trunk/content: ./ bot.html credits.html downloads.html javadoc.html mailing_lists.html
buildbot
svn commit: r1879440 - in /nutch/cms_site/trunk: content/assets/css/bootstrap.css templates/std.html
snagel
svn commit: r1062481 - in /websites/staging/nutch/trunk/content: ./ assets/css/bootstrap.css assets/js/README.html bot.html credits.html downloads.html index.html javadoc.html mailing_lists.html version_control.html
buildbot
svn commit: r1062480 - in /websites/staging/nutch/trunk/content: ./ assets/css/ assets/img/ assets/js/
buildbot
svn commit: r1879439 - in /nutch/cms_site/trunk: content/assets/css/bootstrap.css content/assets/img/nutch_logo_tm.gif content/assets/img/nutch_logo_tm.png content/index.md content/javadoc.md templates/std.html
snagel
svn commit: r1062474 - in /websites/staging/nutch/trunk/content: ./ apidocs/apidocs-1.17/ apidocs/apidocs-1.17/org/ apidocs/apidocs-1.17/org/apache/ apidocs/apidocs-1.17/org/apache/nutch/ apidocs/apidocs-1.17/org/apache/nutch/analysis/ apidocs/apidocs-...
buildbot
svn commit: r1879431 - in /nutch/cms_site/trunk/content: ./ apidocs/apidocs-1.17/ apidocs/apidocs-1.17/org/ apidocs/apidocs-1.17/org/apache/ apidocs/apidocs-1.17/org/apache/nutch/ apidocs/apidocs-1.17/org/apache/nutch/analysis/ apidocs/apidocs-1.17/org...
snagel
[nutch] branch master updated: Prepare for new development after release of 1.17 - bump version number (1.17-SNAPSHOT -> 1.18-SNAPSHOT) - add 1.17 changes / release notes - update links to Hadoop and Solr API docs - update current year in API docs etc.
snagel
svn commit: r40263 - /dev/nutch/1.17/ /release/nutch/1.17/
snagel
svn commit: r40079 [1/3] - /dev/nutch/1.17/
snagel
svn commit: r40079 [3/3] - /dev/nutch/1.17/
snagel
svn commit: r40079 [2/3] - /dev/nutch/1.17/
snagel
[nutch] annotated tag release-1.17 updated (e68bd87 -> eff98db)
snagel
[nutch] 02/02: Nutch 1.17 release - update current year in API docs etc. - update version number - add changes / release notes
snagel
[nutch] 01/02: Nutch 1.17 release - update links to Hadoop and Solr API docs
snagel
[nutch] branch branch-1.17 updated (77fa56e -> 1386c5a)
snagel
[nutch] annotated tag release-1.17 updated (77fa56e -> e68bd87)
snagel
[nutch] branch branch-1.17 created (now 77fa56e)
snagel
[nutch] 02/02: Nutch 1.16 release - update current year in API docs etc. - update version number - add changes / release notes
snagel
[nutch] 01/02: Nutch 1.16 release - update links to Hadoop and Solr API docs
snagel
[nutch] branch master updated: NUTCH-2794 Add additional ciphers to HTTP base's default cipher suite
markus
[nutch] branch master updated: NUTCH-2791 Handle GCS URLs in stats commands
snagel
[nutch] branch master updated: NUTCH-2790 indexer-csv: escape field leading quote character
snagel
[nutch] branch master updated: NUTCH-2496 Speed up link inversion step in crawling script
snagel
[nutch] branch master updated (9139d6e -> 1cb64df)
snagel
[nutch] branch master updated (e61a8a3 -> 9139d6e)
snagel
[nutch] branch master updated: NUTCH-1945 Test for XLSX parser - add Tika unit test for XLSX files - bundle instance variables and utility methods in class TikaParserTest - clean up javadoc comments
snagel
[nutch] branch master updated: NUTCH-2758 Add plugin READMEs to binary release packages
snagel
[nutch] branch master updated: NUTCH-2753 Add -listen option to command-line help of CrawlDbReader and LinkDbReader
snagel
[nutch] branch master updated: NUTCH-2002 parse and index checkers to check robots.txt - applied Julien's patch to recent code base - also check redirects whether they are allowed - add command-line parameter `-checkRobotsTxt` enabling this check
snagel
[nutch] branch master updated: NUTCH-2785 FreeGenerator: command-line option to define number of generated fetch lists - add command-line option `-numFetchers` to FreeGenerator - in local mode: generate one single fetch list
snagel
[nutch] branch master updated: NUTCH-1194 Generator: CrawlDB lock should be released earlier - release CrawlDb lock after select step, in case, generated items are not marked in CrawlDb (generate.update.crawldb is false)
snagel
[nutch] branch master updated: NUTCH-2434 Add methods to reset parameters HTMLMetaTags (apply patch contributed by Markus)
snagel
[nutch] branch master updated: NUTCH-2743 Add list of Nutch properties (nutch-default.xml) to documentation - modify ant build.xml to copy nutch-default.xml into docs/api/resources/ - adapt XSLT table layout - remove obsolete nutch-conf.xsl - fix typos and normalize spelling in nutch-default.xml
snagel
[nutch] branch master updated: NUTCH-2784 Tool to list Nutch properties and configured values
snagel
[nutch] branch master updated: NUTCH-2495: Use -deleteGone instead of clean job in crawl script while indexing
snagel
[nutch] branch master updated: NUTCH-2776 Fetcher to temporarily deduplicate followed redirects - cache followed redirect targets for a configurable time (`fetcher.redirect.dedupcache.seconds`) - if a redirect target is found in cache it's skipped
snagel
[nutch] branch master updated: NUTCH-2772 Debugging parse filter to show serialized DOM tree
snagel
[nutch] branch master updated: NUTCH-2778 indexer-elastic to properly log errors - add log output in BulkProcessor.Listener - do not throw an exception in BulkProcessor.Listener (ignored anyway)
snagel
[nutch] branch master updated: NUTCH-2501 allow to set Java heap size when using crawl script in distributed mode - fix examples of `-D property=value` in bin/crawl : there must be a blank after `-D` because these arguments are first parsed by bin/crawl
snagel
[nutch] branch master updated: NUTCH-2501 allow to set Java heap size when using crawl script in distributed mode - bin/crawl - add hint how to set map and reduce task memory via -D ... options - use -D options for all steps (Nutch tools), fixes NUTCH-2379 - fix quoting of -D options, eg. -D plugin.includes='protocol-xyz|parse-xyz' - use -D options for all steps (Nutch tools) - bin/nutch - document that environment variables are only used in local mode
snagel
[nutch] branch master updated: NUTCH-2781 Increase default Java heap size - increase default value for NUTCH_HEAPSIZE to 4096 MB (from 1000 MB) - remove -Dmapred.child.java.opts=-Xmx1000m from default options in bin/crawl
snagel
[nutch] branch master updated: NUTCH-2783 Use (more) parametrized logging - replace logging messages with string concatenations by parametrized calls - remove LOG.isInfoEnabled() where parametrized logging is used and no or minor extra calls are done to get logging parameters (similar for other log levels) - replace needless .toString() and Integer.toString(intVal)
snagel
[nutch] branch master updated (49eb1bd -> 52eec66)
snagel
[nutch] branch master updated: NUTCH-2779 Upgrade to Tika 1.24.1
snagel
[nutch] branch master updated (6f51618 -> dcbb0f2)
snagel
[nutch] branch master updated (0cd0022 -> 6f51618)
snagel
[nutch] branch master updated: NUTCH-2777 - Upgrade to Hadoop 3.1
snagel
[nutch] branch master updated: NUTCH-2775 Fetcher to guarantee minimum delay even if robots.txt defines shorter Crawl-delay - guaranteed minimum delay is configured by `fetcher.min.crawl.delay` (default set equal to `fetcher.server.delay`)
snagel
[nutch] branch master updated: NUTCH-2773 SegmentReader (-dump or -get): show HTML content as UTF-8 - if called with command-line flag `-recode` (or if property `segment.reader.content.recode` is true): try to recode the HTML page content to UTF-8 using the already detected charset - fix passing forward properties (-Dprop=val) to Hadoop job/tasks * always use same Hadoop Configuration * use single instance of SegmentReader for -get and -list * remove duplicating member and local variables
snagel
[nutch] branch master updated: NUTCH-2774 Annotate methods implementing the Hadoop API by @Override - annotate classes implementing Hadoop interfaces - annotate few classes implementing Nutch interfaces - remove empty method implementations when super classes already provide a default implementation
snagel
[nutch] branch master updated (ebc2152 -> 4443cc1)
snagel
[nutch] branch master updated: NUTCH-2763 protocol-okhttp (store.http.headers): add whitespace in status line after status code also when message is empty
snagel
[nutch] branch master updated: NUTCH-2768 FetcherThread: unnecessary usage of class casts
snagel
[nutch] branch master updated (142a026 -> ac4f2f4)
snagel
[nutch] branch master updated: NUTCH-2762 Replace http:// URLs by https:// (build files and documentation) - change URLs of linked resources in build files and documentation from http:// to https:// where possible - update resource locations where necessary
snagel
[nutch] branch master updated (0a2ffa7 -> ea862f4)
snagel
[nutch] branch master updated (a118c85 -> 0a2ffa7)
snagel
[nutch] branch master updated: NUTCH-2759 bin/crawl: Rename option --num-slaves - renamed to --num-fetchers
snagel