Repository: nutch
Updated Branches:
  refs/heads/master 240d7f8e1 -> 506540da3


Prep for Nutch 1.12 release


Project: http://git-wip-us.apache.org/repos/asf/nutch/repo
Commit: http://git-wip-us.apache.org/repos/asf/nutch/commit/506540da
Tree: http://git-wip-us.apache.org/repos/asf/nutch/tree/506540da
Diff: http://git-wip-us.apache.org/repos/asf/nutch/diff/506540da

Branch: refs/heads/master
Commit: 506540da36c24e92e7cb40cc99215a487f5882b0
Parents: 240d7f8
Author: Lewis John McGibbney <[email protected]>
Authored: Sat May 28 13:02:42 2016 -0700
Committer: Lewis John McGibbney <[email protected]>
Committed: Sat May 28 13:02:42 2016 -0700

----------------------------------------------------------------------
 CHANGES.txt | 158 +++++++++++++++++++++----------------------------------
 1 file changed, 61 insertions(+), 97 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/nutch/blob/506540da/CHANGES.txt
----------------------------------------------------------------------
diff --git a/CHANGES.txt b/CHANGES.txt
index fb5f544..454ee58 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,3 +1,7 @@
+Nutch Change Log
+
+Comments
+
 Fellow committers, Nutch 1.12 contains a breaking change NUTCH-2220. Please use the note below and
 in the release announcement and keep it on top in this CHANGES.txt for the Nutch 1.12 release.
 
@@ -8,103 +12,63 @@ in the release announcement and keep it on top in this CHANGES.txt for the Nutch
 * db.ignore.internal.links and db.ignore.external.links now operate on the CrawlDB only
 * linkdb.ignore.internal.links and linkdb.ignore.external.links now operate on the LinkDB only
 
-Nutch Change Log
-
-* NUTCH-2248 CSS Parser Plugin (Joseph Naegele via mattmann)
-
-* NUTCH-2252 Allow phantomjs as a browser for selenium options (Kim Whitehall via mattmann)
-
-* GitHub-106 Option to include inlinks in commonscrawl dump (thammegowda via mattmann)
-
-* NUTCH-2256 Inconsistent log level (songwanging via snagel)
-
-* NUTCH-2254 Indexer: character set issue with -addBinaryContent and -base64 (Federico Bonelli, snagel)
-
-* NUTCH-2250 CommonCrawlDumper : Invalid format and skipped parts (Thamme Gowda N.,lewismc via mattmann)
-
-* NUTCH-2245 Developed the NGram Model on the existing Unigram Cosine Similarity Model (bhavyasanghavi via sujen)
-
-* NUTCH-2191 Add HtmlUnit plugin in Nutch. (karanjeets and markus17 via mattmann)
-
-* NUTCH-2241 Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration (karanjeets via mattmann)
-
-* NUTCH-2213 CommonCrawlDataDumper saves gzipped body in extracted form (jnioche via mattmann)
-
-* NUTCH-2144 Added an extension point and a plugin to accept external links (Thamme Gowda N. via mattmann)
-
-* NUTCH-1712 Use MultipleInputs in Injector to make it a single mapreduce job (tejasp, snagel)
-
-* NUTCH-2231 Jexl support in generator job (markus)
-
-* NUTCH-2232 DeduplicationJob should decode URL's before length is compared (Ron van der Vegt via markus)
-
-* NUTCH-2229 Allow Jexl expressions on CrawlDatum's fixed attributes (markus)
-
-* NUTCH-2227 RegexParseFilter (markus)
-
-* NUTCH-2221 Introduce db.ignore.internal.links to FetcherThread (markus)
-
-* NUTCH-2220 Rename db.* options used only by the linkdb to linkdb.* (markus)
-
-* NUTCH-2228 Plugin index-replace unit test broken on Java 8 (snagel via markus)
-
-* NUTCH-2219 Criteria order to be configurable in DeduplicationJob (Ron van der Vegt via markus)
-
-* NUTCH-2218 Update CrawlComplete util to use Commons CLI (Joyce)
-
-* NUTCH-2223 Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection (Tien Nguyen Manh via markus)
-
-* NUTCH-2224 Average bytes/second calculated incorrectly in fetcher (Tien Nguyen Manh via markus)
-
-* NUTCH-2225 Parsed time calculated incorrectly (Tien Nguyen Manh via markus)
-
-* NUTCH-961 Expose Tika's Boilerpipe support (Gabriele Kahlout, Vincent Slot, markus)
-
-* NUTCH-1233 Rely on Tika for outlink extraction (markus)
-
-* NUTCH-2210 Upgrade to Tika 1.12 (markus)
-
-* NUTCH-2209 Improved Tokenization for Similarity Scoring plugin (Sujen)
-
-* NUTCH-2211 Added filterchecker and normalizerchecker to bin/nutch script (markus)
-
-* NUTCH-2197 Add Solr 5 cloud indexer support (Jurian Broertjes via markus)
-
-* NUTCH-2206 Provide example scoring.similarity.stopword.file (sujen)
-
-* NUTCH-2204 Remove junit lib from runtime (snagel)
-
-* NUTCH-2201 Remove loops program from webgraph package (markus)
-
-* NUTCH-1325 HostDB for Nutch (Gui Forget, markus, tejasp)
-
-* NUTCH-2203 Suffix URL filter can't handle trailing/leading whitespaces (Jurian Broertjes via markus)
-
-* NUTCH-2194 Run IndexingFilterChecker as simple Telnet server (markus)
-
-* NUTCH-2196 IndexingFilterChecker to optionally normalize (markus)
-
-* NUTCH-2195 IndexingFilterChecker to optionally follow N redirects (markus)
-
-* NUTCH-2190 Protocol normalizer (markus)
-
-* NUTCH-1838 Host and domain based regex and automaton filtering (markus)
-
-* NUTCH-2178 DeduplicationJob to optionally group on host or domain (markus)
-
-* NUTCH-1449 Optionally delete documents skipped by IndexingFilters (markus)
-
-* NUTCH-2189 Domain filter must deactivate if no rules are present (markus)
-
-* NUTCH-2182 Make reverseUrlDirs file dumper option hash the URL for consistency (joyce)
-
-* NUTCH-2183 Improvement to SegmentChecker for skipping non-segments present in segments directory (lewismc)
-
-* NUTCH-2180 FileDumper skips Corrupt Segments (Harshavardhan Manjunatha via lewismc)
-
-* NUTCH-2042 parse-html increase chunk size used to detect charset (snagel)
-
-* NUTCH-2172 index-more: document format of contenttype-mapping.txt (Nicola Tonellotto, snagel)
+Sub-task
+
+    [NUTCH-2250] - CommonCrawlDumper : Invalid format + skipped parts
+
+Bug
+
+    [NUTCH-2042] - parse-html increase chunk size used to detect charset
+    [NUTCH-2180] - FileDumper dumps data, but breaks midway on corrupt segments
+    [NUTCH-2189] - Domain filter must deactivate if no rules are present
+    [NUTCH-2203] - Suffix URL filter can't handle trailing/leading whitespaces
+    [NUTCH-2206] - Provide example scoring.similarity.stopword.file
+    [NUTCH-2213] - CommonCrawlDataDumper saves gzipped body in extracted form
+    [NUTCH-2223] - Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection
+    [NUTCH-2224] - Average bytes/second calculated incorrectly in fetcher
+    [NUTCH-2225] - Parsed time calculated incorrectly
+    [NUTCH-2228] - Plugin index-replace unit test broken on Java 8
+    [NUTCH-2232] - DeduplicationJob should decode URL's before length is compared
+    [NUTCH-2241] - Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration
+    [NUTCH-2256] - Inconsistent log level practice
+
+Improvement
+
+    [NUTCH-1233] - Rely on Tika for outlink extraction
+    [NUTCH-1712] - Use MultipleInputs in Injector to make it a single mapreduce job
+    [NUTCH-2172] - index-more: document format of contenttype-mapping.txt
+    [NUTCH-2178] - DeduplicationJob to optionally group on host or domain
+    [NUTCH-2182] - Make reverseUrlDirs file dumper option hash the URL for consistency
+    [NUTCH-2183] - Improvement to SegmentChecker for skipping non-segments present in segments directory
+    [NUTCH-2187] - Change FileDumper SHAs to all uppercase
+    [NUTCH-2195] - IndexingFilterChecker to optionally follow N redirects
+    [NUTCH-2196] - IndexingFilterChecker to optionally normalize
+    [NUTCH-2197] - Add solr5 solrcloud indexer support
+    [NUTCH-2204] - Remove junit lib from runtime
+    [NUTCH-2218] - Switch CrawlCompletion arg parsing to Commons CLI
+    [NUTCH-2221] - Introduce db.ignore.internal.links to FetcherThread
+    [NUTCH-2229] - Allow Jexl expressions on CrawlDatum's fixed attributes
+    [NUTCH-2231] - Jexl support in generator job
+    [NUTCH-2252] - Allow phantomjs as a browser for selenium options
+    [NUTCH-2263] - Support for mingram and maxgram at Unigram Cosine Similarity Model
+
+New Feature
+
+    [NUTCH-961] - Expose Tika's boilerpipe support
+    [NUTCH-1325] - HostDB for Nutch
+    [NUTCH-2144] - Plugin to override db.ignore.external to exempt interesting external domain URLs
+    [NUTCH-2190] - Protocol normalizer
+    [NUTCH-2191] - Add protocol-htmlunit
+    [NUTCH-2194] - Run IndexingFilterChecker as simple Telnet server
+    [NUTCH-2219] - Criteria order to be configurable in DeduplicationJob
+    [NUTCH-2227] - RegexParseFilter
+    [NUTCH-2245] - Developed the NGram Model on the existing Unigram Cosine Similarity Model
+
+Task
+
+    [NUTCH-2201] - Remove loops program from webgraph package
+    [NUTCH-2211] - Filter and normalizer checkers missing in bin/nutch
+    [NUTCH-2220] - Rename db.* options used only by the linkdb to linkdb.*
 
 Nutch 1.11 Release 03/12/2015 (dd/mm/yyyy)
 Release Report: http://s.apache.org/nutch11