CHANGES.txt

snagel Thu, 09 Aug 2018 02:53:38 -0700

Added: release/nutch/1.15/CHANGES.txt
==============================================================================
--- release/nutch/1.15/CHANGES.txt (added)
+++ release/nutch/1.15/CHANGES.txt Thu Aug  9 09:53:23 2018
@@ -0,0 +1,2887 @@
+# Nutch Change Log
+
+Nutch 1.15 Release 25/07/2018 (dd/mm/yyyy)
+Release Report: https://s.apache.org/nczS
+
+Breaking Changes
+
+    - Indexer plugins are now configured in a single XML file (conf/index-writers.xml),
+      see https://wiki.apache.org/nutch/IndexWriters. Setting or overriding configuration
+      parameters via Nutch properties is no longer possible.
+
+Bug
+
+    [NUTCH-1993] - Nutch does not use backup parsers
+    [NUTCH-2071] - A parser failure on a single document may fail crawling job 
if parser.timeout=-1
+    [NUTCH-2145] - parse/index checker fail to fetch valid percent-encoded URLs
+    [NUTCH-2161] - Interrupted failed and/or killed tasks fail to clean up 
temp directories in HDFS
+    [NUTCH-2273] - Selenium and InteractiveSelenium Do Not Support HTTPS
+    [NUTCH-2310] - Protocol-Selenium does not support HTTPS protocol
+    [NUTCH-2321] - Indexing filter checker leaks threads
+    [NUTCH-2324] - Issue in setting default linkdb path
+    [NUTCH-2447] - Work-around SSLProtocolException: handshake alert: 
unrecognized_name
+    [NUTCH-2454] - REST API fix for usage of hostdb in generator
+    [NUTCH-2461] - Generate passes the data to when maxCount == 0
+    [NUTCH-2466] - Sitemap processor to follow redirects
+    [NUTCH-2467] - Sitemap type field can be null
+    [NUTCH-2485] - ParserFactory swallows exception
+    [NUTCH-2486] - Compiler Warning: Unchecked / unsafe operations in 
MimeTypeIndexingFilter
+    [NUTCH-2489] - Dependency collision with lucene-analyzers-common in 
scoring-similarity plugin
+    [NUTCH-2490] - Sitemap processing: Sitemap index files not working
+    [NUTCH-2494] - Fetcher: java.lang.IllegalArgumentException: Wrong FS: s3
+    [NUTCH-2499] - Elastic REST Indexer: Duplicate values
+    [NUTCH-2505] - nutch does not delete the .locked file, when the generator 
partition got an exception
+    [NUTCH-2508] - Misleading documentation about http.proxy.exception.list
+    [NUTCH-2509] - Inconsistent behavior in SitemapProcessor
+    [NUTCH-2513] - ant eclipse target fails with "protocol switch unsafe"
+    [NUTCH-2517] - mergesegs corrupts segment data
+    [NUTCH-2518] - Must check return value of job.waitForCompletion()
+    [NUTCH-2520] - Wrong Accept-Charset sent when http.accept.charset is not 
defined
+    [NUTCH-2521] - SitemapProcessor to use property sitemap.redir.max
+    [NUTCH-2523] - UpdateHostDB blocks usage of plugins unintentionally
+    [NUTCH-2524] - bin/crawl: fix check for HostDb in distributed mode
+    [NUTCH-2533] - Injector: NullPointerException if seed URL dir contains 
non-file entries
+    [NUTCH-2535] - CrawlDbReader -stats: ClassCastException
+    [NUTCH-2544] - Nutch 1.15 no longer compatible with AWS EMR and S3
+    [NUTCH-2547] - urlnormalizer-basic fails on special characters in 
path/query
+    [NUTCH-2549] - protocol-http does not behave the same as browsers
+    [NUTCH-2550] - Fetcher fails to follow redirects
+    [NUTCH-2551] - NullPointerException in generator
+    [NUTCH-2552] - CrawlDbReader -topN fails
+    [NUTCH-2553] - Fetcher not to modify URLs to be fetched
+    [NUTCH-2554] - parserchecker can't fetch some URLs
+    [NUTCH-2565] - MergeDB incorrectly handles unfetched CrawlDatums
+    [NUTCH-2568] - Caught exception is immediately rethrown
+    [NUTCH-2569] - ClassNotFoundException when running in (pseudo-)distributed 
mode
+    [NUTCH-2570] - Deduplication job fails to install deduplicated CrawlDb
+    [NUTCH-2571] - SegmentReader -list fails to read segment
+    [NUTCH-2572] - HostDb: updatehostdb does not set values
+    [NUTCH-2574] - Generator: hostCount >= maxCount comparison wrong
+    [NUTCH-2581] - Caching of redirected robots.txt may overwrite correct 
robots.txt rules
+    [NUTCH-2589] - HTML redirections are not followed when using parse-tika
+    [NUTCH-2590] - SegmentReader -get fails
+    [NUTCH-2592] - Fetcher to log reason of failed fetches
+    [NUTCH-2593] - Single mode doesn't work in RabbitMQ indexer
+    [NUTCH-2597] - NPE in updatehostdb
+    [NUTCH-2601] - Elasticsearch Rest and Amazon CloudSearch have the same implementation class in index-writers.xml
+    [NUTCH-2607] - ParserChecker should call 
ScoringFilters.passScoreAfterParsing() on all parses
+    [NUTCH-2609] - urlnormalizer-basic to normalize path of file: URLs
+    [NUTCH-2614] - NPE in CrawlDbReader -stats on empty CrawlDb
+    [NUTCH-2616] - Review routing of deletions by Exchange component
+    [NUTCH-2618] - protocol-okhttp not to use http.timeout for max duration to 
fetch document
+    [NUTCH-2620] - urlfilter-validator incorrectly assumes that top-level 
domains are not longer than 4 characters
+    [NUTCH-2624] - protocol-okhttp resource leak
+
+New Feature
+
+    [NUTCH-1129] - Any23 Nutch plugin
+    [NUTCH-1541] - Indexer plugin to write CSV
+    [NUTCH-2412] - Exchange component for indexing job
+    [NUTCH-2492] - Add more configuration parameters to crawl script
+
+Improvement
+
+    [NUTCH-1106] - Options to skip URLs based on length
+    [NUTCH-1480] - SolrIndexer to write to multiple servers.
+    [NUTCH-2012] - Merge parsechecker and indexchecker
+    [NUTCH-2375] - Upgrade the code base from org.apache.hadoop.mapred to 
org.apache.hadoop.mapreduce
+    [NUTCH-2390] - No documentation on pluggable indexing
+    [NUTCH-2411] - Index-metadata to support indexing multiple values for a 
field
+    [NUTCH-2416] - Fetcher to log thread ID
+    [NUTCH-2432] - Protocol httpclient to disable cookies if 
http.enable.cookie.header is false
+    [NUTCH-2441] - ARG_SEGMENT usage
+    [NUTCH-2491] - Integrate sitemap processing and HostDB into crawl script
+    [NUTCH-2493] - Add configuration parameter for sitemap processing to 
crawler script
+    [NUTCH-2497] - Elastic REST Indexer: Allow multiple hosts
+    [NUTCH-2502] - Any23 Plugin: Add Content-Type filtering
+    [NUTCH-2503] - Add option to run tests for a single plugin
+    [NUTCH-2510] - Crawl script modification. HostDb : generate, optional 
usage and description
+    [NUTCH-2516] - Hadoop imports use wildcards
+    [NUTCH-2519] - Log mapreduce job counters in local mode
+    [NUTCH-2526] - NPE in scoring-opic when indexing document without CrawlDb 
datum
+    [NUTCH-2527] - URL filter: provide rules to exclude localhost and private 
address spaces
+    [NUTCH-2530] - Rename property db.max.anchor.length -> linkdb.max.anchor.length
+    [NUTCH-2534] - CrawlDbReader -stats: make score quantiles configurable
+    [NUTCH-2539] - Incorrect naming of db.url.filters and db.url.normalizers in nutch-default.xml
+    [NUTCH-2543] - readdb & readlinkdb to implement AbstractChecker
+    [NUTCH-2545] - Upgrade to Any23 2.2
+    [NUTCH-2566] - Fix exception log messages
+    [NUTCH-2576] - HTTP protocol plugin based on okhttp
+    [NUTCH-2577] - protocol-selenium can't handle https
+    [NUTCH-2578] - Avoid lock by MimeUtil in constructor of protocol.Content
+    [NUTCH-2579] - Fetcher to use parsed URL to call 
ProtocolFactory.getProtocol(url)
+    [NUTCH-2580] - Improvements for Rabbitmq support
+    [NUTCH-2583] - Upgrading Nutch's dependencies
+    [NUTCH-2584] - Upgrade parse-tika to use Tika 1.18
+    [NUTCH-2594] - Documentation for indexer plugins
+    [NUTCH-2595] - Upgrade crawler-commons dependency to 0.10
+    [NUTCH-2600] - Refactoring indexer-solr
+    [NUTCH-2611] - Add line-breaks when parsing HTML block-level elements
+    [NUTCH-2617] - Disable Exchange component by default
+    [NUTCH-2619] - protocol-okhttp: allow to keep partially fetched docs as 
truncated
+
+Task
+
+    [NUTCH-1219] - Upgrade all jobs to new MapReduce API
+    [NUTCH-1228] - Change mapred.task.timeout to mapreduce.task.timeout in 
fetcher
+
+Sub-task
+
+    [NUTCH-1223] - Migrate WebGraph to MapReduce API
+    [NUTCH-1224] - Migrate FreeGenerator to MapReduce API
+    [NUTCH-1226] - Migrate CrawlDbReader to MapReduce API
+    [NUTCH-2152] - CommonCrawl dump via Service endpoint
+    [NUTCH-2555] - URL normalization problem: path not starting with a '/'
+    [NUTCH-2556] - protocol-http makes invalid HTTP/1.0 requests
+    [NUTCH-2557] - protocol-http fails to follow redirections when an HTTP 
response body is invalid
+    [NUTCH-2558] - protocol-http cannot handle a missing HTTP status line
+    [NUTCH-2559] - protocol-http cannot handle colons after the HTTP status 
code
+    [NUTCH-2560] - protocol-http throws an error when an http header spans 
over multiple lines
+    [NUTCH-2561] - protocol-http can be made to read arbitrarily large HTTP 
responses
+    [NUTCH-2562] - protocol-http fails to read large chunked HTTP responses
+    [NUTCH-2563] - HTTP header spellchecking issues
+    [NUTCH-2575] - protocol-http does not respect the maximum content-size for 
chunked responses
+    [NUTCH-2622] - Unbundle LGPL-licensed jars from binary release
+
+
+Nutch 1.14 Release 18/12/2017 (dd/mm/yyyy)
+
+    - the bin/crawl script now expects the path to the seed to be preceded by -s (NUTCH-2046)
+
+Bug
+
+    [NUTCH-2071] - A parser failure on a single document may fail crawling job
+    [NUTCH-2235] - Classpath discrepancy with protocol-selenium in deploy mode
+    [NUTCH-2269] - Clean not working after crawl
+    [NUTCH-2295] - Nutch master docker container broken
+    [NUTCH-2297] - CrawlDbReader -stats wrong values for earliest fetch time 
and shortest interval
+    [NUTCH-2316] - Library conflict with Parser-Tika Plugin and Lib Folder
+    [NUTCH-2317] - Plugin jars don't get added to classpath while running in 
local
+    [NUTCH-2322] - URL not available for Jexl operations
+    [NUTCH-2354] - Upgrade Hadoop dependencies to 2.7.4
+    [NUTCH-2365] - HTTP Redirects to SubDomains don't get crawled if 
db.ignore.external.links.mode == byDomain
+    [NUTCH-2371] - Injector to support noFilter and noNormalize
+    [NUTCH-2372] - Javadocs build failing.
+    [NUTCH-2386] - BasicURLNormalizer does not encode curly braces
+    [NUTCH-2391] - Spurious Duplications for MD5
+    [NUTCH-2394] - Possible bugs in the source code
+    [NUTCH-2398] - Fetcher saving redirected robots.txt under redirect target 
URL
+    [NUTCH-2399] - indexer-elastic does not index multi-value fields (only the 
first value is indexed)
+    [NUTCH-2401] - headings plugin does not trim values
+    [NUTCH-2403] - Nutch Selenium: Wrong documentation about PhantomJS
+    [NUTCH-2413] - Parsing fetcher to respect property "parse.filter.urls"
+    [NUTCH-2420] - Bug in variable generate.max.count and fetcher.server.delay
+    [NUTCH-2436] - Remove empty comment, and redundant semicolon from 
CommandRunner
+    [NUTCH-2442] - Injector to stop if job fails to avoid loss of CrawlDb
+    [NUTCH-2444] - HostDB CSV dumper to emit field header by default
+    [NUTCH-2446] - URLFiltersCheck fix
+    [NUTCH-2448] - Allow Sending an empty http.agent.version
+    [NUTCH-2451] - protocol-ftp to resolve relative URL when following 
redirects
+    [NUTCH-2452] - Problem retrieving encoded URLs via FTP?
+    [NUTCH-2456] - Allow to index pages/URLs not contained in CrawlDb
+    [NUTCH-2458] - TikaParser doesn't work with tika-config.xml set
+    [NUTCH-2464] - Plugin headings: Headers That Contain HTML Elements Are Not 
Parsed
+    [NUTCH-2465] - Broken Eclipse project. Classpaths and interactiveselenium 
should be fixed.
+    [NUTCH-2472] - Sitemap processor does not honour db.ignore.external.links
+    [NUTCH-2473] - Elasticsearch REST Indexer broken due to wrong dependency
+    [NUTCH-2474] - CrawlDbReader -stats fails with ClassCastException
+    [NUTCH-2478] - // is not a valid base URL
+    [NUTCH-2483] - Remove/replace indirect dependencies to org.json
+
+Improvement
+
+    [NUTCH-1763] - Improving comments on the Injector Class
+    [NUTCH-2034] - CrawlDB filtered documents counter.
+    [NUTCH-2035] - Regex filter using case sensitive rules.
+    [NUTCH-2046] - The crawl script should be able to skip an initial 
injection.
+    [NUTCH-2135] - Ant Eclipse build does not include 
protocol-interactiveselenium
+    [NUTCH-2193] - Upgrade feed parser plugin to use rome 1.5
+    [NUTCH-2216] - db.ignore.*.links to optionally follow internal redirects
+    [NUTCH-2281] - Support non-default FileSystem
+    [NUTCH-2296] - Elasticsearch Indexing Over Rest
+    [NUTCH-2320] - URLFilterChecker to run as TCP Telnet service
+    [NUTCH-2335] - Injector not to filter and normalize existing URLs in 
CrawlDb
+    [NUTCH-2362] - Upgrade MaxMind GeoIP version in index-geoip
+    [NUTCH-2368] - Variable generate.max.count and fetcher.server.delay
+    [NUTCH-2370] - FileDumper: save JSON mapping file -> URL
+    [NUTCH-2376] - Improve configurability of HTTP Accept* header fields
+    [NUTCH-2378] - ChildFirst plugin classloader
+    [NUTCH-2380] - indexer-elastic version upgrade to 5.3.0
+    [NUTCH-2397] - Parser to add paragraph line breaks
+    [NUTCH-2400] - Solr 6.6.0 compatibility
+    [NUTCH-2406] - Sum up constants, make minor changes
+    [NUTCH-2408] - CrawlDb: allow update from unparsed segments
+    [NUTCH-2409] - Injector: complete command-line help and counters
+    [NUTCH-2414] - Allow LanguageIndexingFilter to actually filter documents 
by language.
+    [NUTCH-2430] - Complete plugin build configuration
+    [NUTCH-2431] - URLFilterchecker to implement Tool-interface
+    [NUTCH-2439] - Upgrade to Apache Tika 1.17
+    [NUTCH-2443] - Extract links from the video tag with the parse-html plugin
+    [NUTCH-2445] - Fetcher following outlinks to keep track of already fetched 
items
+    [NUTCH-2463] - Enable sampling CrawlDB
+    [NUTCH-2468] - should filter out invalid URLs by default
+    [NUTCH-2470] - CrawlDbReader -stats to show quantiles of score
+    [NUTCH-2477] - Refactor *Checker classes to use base class for common code
+    [NUTCH-2480] - Upgrade crawler-commons dependency to 0.9
+
+New Feature
+
+    [NUTCH-1465] - Support sitemaps in Nutch
+    [NUTCH-1932] - Automatically remove orphaned pages
+    [NUTCH-2333] - Indexer for RabbitMQ
+    [NUTCH-2338] - URLNormalizerChecker to run as TCP Telnet service
+    [NUTCH-2415] - Create a JEXL based IndexingFilter
+    [NUTCH-2433] - Html Parser: keep htmltag where the outlinks are found
+    [NUTCH-2435] - New configuration allowing to choose whether to store 
'parse_text' directory or not.
+    [NUTCH-2484] - Extend indexer-elastic-rest to support languages
+
+Task
+
+    [NUTCH-2181] - Add Webpage for 3rd Party Connectors/Libraries to Apache 
Nutch
+
+
+Nutch 1.13 Release 28/03/2017 (dd/mm/yyyy)
+Release Report: https://s.apache.org/wq3x
+
+Sub-task
+
+    [NUTCH-2246] - Refactor /seed endpoint for backward compatibility
+
+Bug
+
+    [NUTCH-1553] - Property 'indexer.delete.robots.noindex' not working when 
using parser-html.
+    [NUTCH-2242] - lastModified not always set
+    [NUTCH-2291] - Fix mrunit dependencies
+    [NUTCH-2337] - urlnormalizer-basic to strip empty port
+    [NUTCH-2345] - FetchItemQueue logs are logged with wrong class name
+    [NUTCH-2349] - urlnormalizer-basic NPE for ill-formed URL "http:/"
+    [NUTCH-2357] - Index metadata throw Exception because writable object 
cannot be cast to Text
+    [NUTCH-2359] - Parsefilter-regex raises IndexOutOfBoundsException when 
rules are ill-formed
+    [NUTCH-2364] - http.agent.rotate: IllegalArgumentException / last element 
of agent names ignored
+    [NUTCH-2366] - Deprecated Job constructor in hostdb/ReadHostDb.java
+
+Improvement
+
+    [NUTCH-1308] - Add main() to ZipParser
+    [NUTCH-2164] - Inconsistent 'Modified Time' in crawl db
+    [NUTCH-2234] - Upgrade to elasticsearch 2.3.3
+    [NUTCH-2236] - Upgrade to Hadoop 2.7.2
+    [NUTCH-2262] - Utilize parameterized logging notation across Fetcher
+    [NUTCH-2272] - Index checker server to optionally keep client connection 
open
+    [NUTCH-2286] - CrawlDbReader -stats to show fetch time and interval
+    [NUTCH-2287] - Indexer-elastic plugin should use Elasticsearch 
BulkProcessor and BackoffPolicy
+    [NUTCH-2299] - Remove obsolete properties protocol.plugin.check.*
+    [NUTCH-2300] - Fetcher to optionally save robots.txt
+    [NUTCH-2327] - Seeds injected in REST workflow must be ingested into HDFS
+    [NUTCH-2329] - Update Slf4j logging for Java 8 and upgrade miredot plugin 
version
+    [NUTCH-2336] - SegmentReader to implement Tool
+    [NUTCH-2352] - Log with Generic Class Name at Nutch 1.x
+    [NUTCH-2355] - Protocol plugins to set cookie if Cookie metadata field is 
present
+    [NUTCH-2367] - Get single record from HostDB
+
+New Feature
+
+    [NUTCH-2132] - Publisher/Subscriber model for Nutch to emit events
+
+Task
+
+    [NUTCH-2171] - Upgrade Nutch Trunk to Java 1.8
+
+ 
+Nutch 1.12 Release 28/05/2016 (dd/mm/yyyy)
+Release Report: https://s.apache.org/nutch1.12
+
+Comments
+
+Fellow committers, Nutch 1.12 contains a breaking change (NUTCH-2220). Please use the note below
+in the release announcement and keep it at the top of this CHANGES.txt for the Nutch 1.12 release.
+
+* replace your old conf/nutch-default.xml with the conf/nutch-default.xml from the Nutch 1.12 release
+* if you use LinkDB (e.g. invertlinks) and modified parameters db.max.inlinks and/or db.max.anchor.length
+  and/or db.ignore.internal.links, rename those parameters to linkdb.max.inlinks,
+  linkdb.max.anchor.length and linkdb.ignore.internal.links
+* db.ignore.internal.links and db.ignore.external.links now operate on the CrawlDB only
+* linkdb.ignore.internal.links and linkdb.ignore.external.links now operate on the LinkDB only
+
+Sub-task
+
+    [NUTCH-2250] - CommonCrawlDumper : Invalid format + skipped parts
+
+Bug
+
+    [NUTCH-2042] - parse-html increase chunk size used to detect charset
+    [NUTCH-2180] - FileDumper dumps data, but breaks midway on corrupt segments
+    [NUTCH-2189] - Domain filter must deactivate if no rules are present
+    [NUTCH-2203] - Suffix URL filter can't handle trailing/leading whitespaces
+    [NUTCH-2206] - Provide example scoring.similarity.stopword.file
+    [NUTCH-2213] - CommonCrawlDataDumper saves gzipped body in extracted form
+    [NUTCH-2223] - Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika 
mimetype detection
+    [NUTCH-2224] - Average bytes/second calculated incorrectly in fetcher
+    [NUTCH-2225] - Parsed time calculated incorrectly
+    [NUTCH-2228] - Plugin index-replace unit test broken on Java 8
+    [NUTCH-2232] - DeduplicationJob should decode URLs before length is compared
+    [NUTCH-2241] - Unstable Selenium plugin in Nutch. Fixed bugs and enhanced 
configuration
+    [NUTCH-2256] - Inconsistent log level practice
+
+Improvement
+
+    [NUTCH-1233] - Rely on Tika for outlink extraction
+    [NUTCH-1712] - Use MultipleInputs in Injector to make it a single 
mapreduce job
+    [NUTCH-2172] - index-more: document format of contenttype-mapping.txt
+    [NUTCH-2178] - DeduplicationJob to optionally group on host or domain
+    [NUTCH-2182] - Make reverseUrlDirs file dumper option hash the URL for 
consistency
+    [NUTCH-2183] - Improvement to SegmentChecker for skipping non-segments 
present in segments directory
+    [NUTCH-2187] - Change FileDumper SHAs to all uppercase
+    [NUTCH-2195] - IndexingFilterChecker to optionally follow N redirects
+    [NUTCH-2196] - IndexingFilterChecker to optionally normalize
+    [NUTCH-2197] - Add solr5 solrcloud indexer support
+    [NUTCH-2204] - Remove junit lib from runtime
+    [NUTCH-2218] - Switch CrawlCompletion arg parsing to Commons CLI
+    [NUTCH-2221] - Introduce db.ignore.internal.links to FetcherThread
+    [NUTCH-2229] - Allow Jexl expressions on CrawlDatum's fixed attributes
+    [NUTCH-2231] - Jexl support in generator job
+    [NUTCH-2252] - Allow phantomjs as a browser for selenium options
+    [NUTCH-2263] - Support for mingram and maxgram at Unigram Cosine 
Similarity Model
+
+New Feature
+
+    [NUTCH-961] - Expose Tika's boilerpipe support
+    [NUTCH-1325] - HostDB for Nutch
+    [NUTCH-2144] - Plugin to override db.ignore.external to exempt interesting 
external domain URLs
+    [NUTCH-2190] - Protocol normalizer
+    [NUTCH-2191] - Add protocol-htmlunit
+    [NUTCH-2194] - Run IndexingFilterChecker as simple Telnet server
+    [NUTCH-2219] - Criteria order to be configurable in DeduplicationJob
+    [NUTCH-2227] - RegexParseFilter
+    [NUTCH-2245] - Developed the NGram Model on the existing Unigram Cosine 
Similarity Model
+
+Task
+
+    [NUTCH-2201] - Remove loops program from webgraph package
+    [NUTCH-2211] - Filter and normalizer checkers missing in bin/nutch
+    [NUTCH-2220] - Rename db.* options used only by the linkdb to linkdb.*
+
+Nutch 1.11 Release 03/12/2015 (dd/mm/yyyy)
+Release Report: http://s.apache.org/nutch11
+
+* NUTCH-2176 Clean up of log4j.properties (markus)
+
+* NUTCH-2107 plugin.xml to validate against plugin.dtd (snagel)
+
+* NUTCH-2177 Generator produces only one partition even in distributed mode 
(jnioche, snagel)
+
+* NUTCH-2158 Upgrade to Tika 1.11 (jnioche, snagel)
+
+* NUTCH-2175 Typos in property descriptions in nutch-default.xml (Roannel Fernández Hernández via snagel)
+
+* NUTCH-2069 Ignore external links based on domain (jnioche)
+
+* NUTCH-2173 String.join in FileDumper breaks the build (joyce)
+
+* NUTCH-2166 Add reverse URL format to dump tool (joyce)
+
+* NUTCH-2157 Addressing Miredot REST API Warnings (Sujen Shah)
+
+* NUTCH-2165 FileDumper Util hard codes part-# folder name (joyce)
+
+* NUTCH-2167 Backport TableUtil from 2.x for URL reversing (joyce)
+
+* NUTCH-2160 Upgrade Selenium Java to 2.48.2 (lewismc, kwhitehall)
+
+* NUTCH-2120 Remove MapWritable from trunk codebase (lewismc)
+
+* NUTCH-1911 Improve DomainStatistics tool command line parsing (joyce)
+
+* NUTCH-2064 URLNormalizer basic to encode reserved chars and decode 
non-reserved chars (markus, snagel)
+
+* NUTCH-2159 Ensure that all WebApp files are copied into generated artifacts 
for 1.X Webapp (lewismc)
+
+* NUTCH-2154 Nutch REST API (DB) suffering NullPointerException (Aron Ahmadia, 
Sujen Shah via mattmann)
+
+* NUTCH-2150 Add protocolstats utility (Michael Joyce via mattmann)
+
+* NUTCH-2146 hashCode on the Outlink class (jorgelbg via mattmann)
+
+* NUTCH-2155 Create a "crawl completeness" utility (Michael Joyce via mattmann)
+
+* NUTCH-1988 Make nested output directory dump optional... again (Michael 
Joyce via lewismc)
+
+* NUTCH-1800 Documentation for Nutch 1.X and 2.X REST APIs (lewismc)
+
+* NUTCH-2149 REST endpoint to read Nutch sequence files (Sujen Shah)
+
+* NUTCH-2139 Basic plugin to index inlinks and outlinks (jorgelbg)
+
+* NUTCH-2128 Review and update mapred --> mapreduce config params in crawl 
script (lewismc)
+
+* NUTCH-2141 Change the InteractiveSelenium plugin handler Interface to return 
page content
+  (Balaji Gurumurthy via mattmann)
+
+* NUTCH-2129 Add protocol status tracking to crawl datum (Michael Joyce via 
mattmann)
+
+* NUTCH-2142 Nutch File Dump - FileNotFoundException (Invalid Argument) Error 
(Karanjeet Singh via mattmann)
+
+* NUTCH-2136 Implement a different version of Naive Bayes Parse Filter 
(Asitang Mishra)
+
+* NUTCH-2109 Create a brute force click-all-ajax-links utility function for selenium interactive plugin (Asitang Mishra)
+
+* NUTCH-2108 Add a function to the selenium interactive plugin interface to do 
multiple manipulation of driver and then return the data (Asitang Mishra)
+
+* NUTCH-2124 Fetcher following same redirect again and again (Yogendra Kumar 
Soni via snagel)
+
+* NUTCH-2123 Seed List REST API returns Text but headers indicate/require JSON
+  (Aron Ahmadia, Sujen Shah via mattmann)
+
+* NUTCH-2086 Nutch 1.X Webui (Sujen Shah, mattmann via lewismc)
+
+* NUTCH-2121 Update javadoc link for Hadoop 2.4.0 in default.properties (Sujen 
Shah)
+
+* NUTCH-2119 Eclipse shows build path errors on building Nutch (Sujen Shah)
+
+* NUTCH-2117 NutchServer CLI Option for CMD_PORT is incorrect and should be 
CMD_HOST (zhangmianhongni via lewismc)
+
+* NUTCH-2115 Add total counts to mimetype stats (Jimmy Joyce via lewismc)
+
+* NUTCH-2111 Delete temporary files location for selenium tmp files after 
driver quits (Kim Whitehall via lewismc)
+
+* NUTCH-2095 WARC exporter for the CommonCrawlDataDumper (jorgelbg)
+
+* NUTCH-2102 WARC Exporter (jnioche)
+
+* NUTCH-2106 Runtime to contain Selenium and dependencies only once (snagel)
+
+* NUTCH-2104 Add documentation to the protocol-selenium plugin Readme file 
+  re: selenium grid implementation (Kim Whitehall via mattmann)
+
+* NUTCH-2099 Refactoring the REST endpoints for integration with 
+  webui (Sujen Shah via mattmann)
+
+* NUTCH-2098 Add null SeedUrl constructor (Aron Ahmadia via mattmann)
+
+* NUTCH-2093 Indexing filters to use current signatures (markus)
+
+* NUTCH-2092 Unit Test for NutchServer (Sujen Shah via mattmann)
+
+* NUTCH-2096 Explicitly indicate browser binary to use when selecting
+  selenium remote option in config (Kim Whitehall via mattmann)
+
+* NUTCH-2090 Refactor Seed Resource in REST API (Sujen Shah
+  via mattmann)
+
+* NUTCH-2088 Add URL Processing Check to Interactive Selenium 
+  Handlers (Michael Joyce via mattmann)
+
+* NUTCH-2077 Upgrade to Tika 1.10 (Michael Joyce via lewismc)
+
+* NUTCH-1517 CloudSearch indexer (jnioche)
+
+* NUTCH-2085 Upgrade Guava (markus)
+
+* NUTCH-2084 SegmentMerger to report missing input dirs (markus)
+
+* NUTCH-2083 Implement functionality to shadow nutch-selenium-grid-plugin from 
Mo Omer (lewismc)
+
+* NUTCH-2049 Upgrade to Hadoop 2.4 (lewismc)
+
+* NUTCH-1486 Upgrade to Solr 4.10.2 (lewismc, markus)
+
+* NUTCH-2048 parse-tika: fix dependencies in plugin.xml (Michael Joyce via 
snagel)
+
+* NUTCH-2066 Parameterize Generate REST endpoint (Sujen Shah via mattmann)
+
+* NUTCH-2072 Deflate encoding support is broken when http.content.limit is set 
to -1 (Tanguy Moal via mattmann)
+
+* NUTCH-2062 Add Plugin for interacting with Selenium WebDriver (Michael 
Joyce, mattmann)
+
+* NUTCH-1785 Ability to index raw content (markus, lewismc)
+
+* NUTCH-2063 Add -mimeStats flag to FileDumper tool (Mike Joyce via lewismc)
+
+* NUTCH-2021 Use protocol-selenium to Capture Screenshots of the Page as it is 
Fetched (lewismc)
+
+* NUTCH-2058 Indexer plugin that allows RegEx replacements on the 
NutchDocument 
+  field values (Peter Ciuffetti via mattmann)
+
+* NUTCH-2059 protocol-httpclient, protocol-http unit test errors on Jenkins 
(Peter Ciuffetti via mattmann)
+
+* NUTCH-1980 Jexl expressions for CrawlDbReader (markus)
+
+* NUTCH-1692 SegmentReader was broken in distributed mode (markus, tejasp)
+
+* NUTCH-1684 ParseMeta to be added before fetch schedulers are run (markus)
+
+* NUTCH-2038 fix for NUTCH-2038: Naive Bayes classifier based html Parse 
filter (for filtering outlinks) 
+  (Asitang Mishra, snagel via mattmann)
+
+* NUTCH-2041 indexer fails if linkdb is missing (snagel)
+
+* NUTCH-2016 Remove unused class OldFetcher (snagel)
+
+* NUTCH-2000 Link inversion fails with .locked already exists (jnioche, snagel)
+
+* NUTCH-2036 Adding some continuous crawl goodies to the crawl script (jorge, 
snagel)
+
+* NUTCH-2039 Relevance based scoring filter (Sujen Shah, lewismc via mattmann)
+
+* NUTCH-2037 Job endpoint to support Indexing from the REST API (Sujen Shah 
via mattmann)
+
+* NUTCH-2017 Remove debug log from MimeUtil (snagel)
+
+* NUTCH-2027 seed list REST endpoint for Nutch 1.10 (Asitang Mishra via 
mattmann)
+
+* NUTCH-2031 Create Admin End point for Nutch 1.x REST service (Sujen Shah via 
mattmann)
+
+* NUTCH-2015 Make FetchNodeDb optional (off by default) if NutchServer is not 
used (Sujen Shah via mattmann)
+
+* NUTCH-208 http: proxy exception list: (Matthias Günter, siren, markus, lewismc)
+
+* NUTCH-2007 add test libs to classpath of bin/nutch junit (snagel)
+
+* NUTCH-1995 Add support for wildcard to http.robot.rules.whitelist (totaro)
+
+* NUTCH-2013 Fetcher: missing logs "fetching ..." on stdout (snagel)
+
+* NUTCH-2014 Fetcher hang-up on completion (snagel)
+
+* NUTCH-2011 Endpoint to support realtime JSON output from the fetcher (Sujen 
Shah via mattmann)
+
+* NUTCH-2006 IndexingFiltersChecker to take custom metadata as input (jnioche)
+
+* NUTCH-2008 IndexerMapReduce to use single instance of NutchIndexAction for 
deletions (snagel)
+
+* NUTCH-1998 Add support for user-defined file extension to 
CommonCrawlDataDumper (totaro via mattmann)
+
+* NUTCH-1873 Solr IndexWriter/Job to report number of docs indexed. (snagel 
via lewismc)
+ 
+* NUTCH-1934 Refactor Fetcher in trunk (lewismc)
+
+* NUTCH-2004 ParseChecker does not handle redirects (mjoyce via lewismc)
+
+Nutch 1.10 Release 29/04/2015 (dd/mm/yyyy)
+Release Report: http://s.apache.org/nutch10
+
+* NUTCH-1969 URL Normalizer properly handling slashes (markus via mattmann)
+
+* NUTCH-2001 Sub Collection Field Name incorrect in nutch-default.xml 
+  (Jeff Cocking via mattmann)
+
+* NUTCH-1997 Add CBOR "magic header" to CommonCrawlDataDumper 
+  output (Giuseppe Totaro, Luke Sh via mattmann)
+
+* NUTCH-1991 Tika mime detection not using Nutch supplied tika-mimetypes.xml 
for content based 
+  detection (Iain Lopata, snagel via mattmann)
+
+* NUTCH-1994 Upgrade to Apache Tika 1.8 (lewismc)
+
+* NUTCH-1996 Make protocol-selenium README part of plugin (lewismc)
+
+* NUTCH-1990 Use URI.normalize() in BasicURLNormalizer (snagel, jnioche)
+
+* NUTCH-1973 Job Administration end point for the REST service (Sujen Shah via 
mattmann)
+
+* NUTCH-1697 SegmentMerger to implement Tool (markus, snagel)
+
+* NUTCH-1987 Make bin/crawl indexer agnostic (Michael Joyce, snagel via mattmann)
+ 
+* NUTCH-1989 Handling invalid URLs in CommonCrawlDataDumper (Giuseppe Totaro 
via mattmann)
+
+* NUTCH-1988 Make nested output directory dump optional (Michael Joyce via 
mattmann)
+
+* NUTCH-1927 Create a whitelist of IPs/hostnames to allow skipping of 
RobotRules parsing (mattmann, snagel)
+
+* NUTCH-1986 Clarify Elastic Search Indexer Plugin Settings (Michael Joyce via 
mattmann)
+
+* NUTCH-1906 Typo in CrawlDbReader command line help (Michael Joyce via 
mattmann)
+
+* NUTCH-1911 Improve DomainStatistics tool command line parsing (Michael Joyce 
via mattmann)
+
+* NUTCH-1854 bin/crawl fails with a parsing fetcher (Asitang Mishra via snagel)
+
+* NUTCH-1981 Upgrade to icu4j 55.1 (Marko Asplund via snagel)
+
+* NUTCH-1960 JUnit test for dump method of CommonCrawlDataDumper (Giuseppe 
Totaro via mattmann)
+
+* NUTCH-1983 CommonCrawlDumper and FileDumper don't dump correct JSON 
(mattmann)
+
+* NUTCH-1972 Dockerfile for Nutch 1.x (Michael Joyce via mattmann)
+
+* NUTCH-1771 Indexer fails if a segment is corrupted or incomplete (Diaa, 
Chong Li via snagel) 
+
+* NUTCH-1975 New configuration for CommonCrawlDataDumper tool (Giuseppe Totaro 
via mattmann)
+
+* NUTCH-1979 CrawlDbReader to implement Tool (markus)
+
+* NUTCH-1970 Pretty print JSON output in config resource (Tyler Palsulich, mattmann)
+
+* NUTCH-1976 Allow Users to Set Hostname for Server (Tyler Palsulich via 
mattmann)
+
+* NUTCH-1941 Optional rolling http.agent.name's (Asitang Mishra, lewismc via 
snagel)
+
+* NUTCH-1959 Improving CommonCrawlFormat implementations (Giuseppe Totaro via 
mattmann)
+
+* NUTCH-1974 keyPrefix option for CommonCrawlDataDumper tool (Giuseppe Totaro 
via mattmann)
+
+* NUTCH-1968 File Name too long issue of DumpFileUtil.java file (Xin Zhang, 
Renxia Wang via mattmann)
+
+* NUTCH-1966 Configuration endpoint for 1x REST API (Sujen Shah via mattmann)
+
+* NUTCH-1967 Possible SIooBE in MimeAdaptiveFetchSchedule (markus)
+
+* NUTCH-1957 FileDumper output file name collisions (Renxia Wang via mattmann)
+
+* NUTCH-1955 ByteWritable missing in NutchWritable (markus)
+
+* NUTCH-1956 Members to be public in URLCrawlDatum (markus)
+ 
+* NUTCH-1954 FilenameTooLong error appears in CommonCrawlDumper (mattmann)
+
+* NUTCH-1949 Dump out the Nutch data into the Common Crawl format (Giuseppe 
Totaro via lewismc)
+
+* NUTCH-1950 File name too long (Jiaheng Zhang, Chong Li via mattmann)
+
+* NUTCH-1921 Optionally disable HTTP if-modified-since header (markus)
+
+* NUTCH-1933 nutch-selenium plugin (Mo Omer, Mohammad Al-Moshin, lewismc)
+
+* NUTCH-827 HTTP POST Authentication (Jasper van Veghel, yuanyun.cn, snagel, 
lewismc)
+
+* NUTCH-1724 LinkDBReader to support regex output filtering (markus)
+
+* NUTCH-1939 Fetcher fails to follow redirects (Leo Ye via snagel)
+
+* NUTCH-1913 LinkDB to implement db.ignore.external.links (markus, snagel)
+
+* NUTCH-1925 Upgrade to Apache Tika 1.7 (Tyler Palsulich via markus)
+
+* NUTCH-1323 AjaxNormalizer (markus)
+
+* NUTCH-1918 TikaParser specifies a default namespace when generating DOM 
(jnioche)
+
+* NUTCH-1889 Store all values from Tika metadata in Nutch metadata (jnioche)
+
+* NUTCH-865 Format source code in unique style (lewismc)
+
+* NUTCH-1893 Parse-tika fails to parse feed files (Mengying Wang via snagel)
+
+* NUTCH-1920 Upgrade Nutch to use Java 1.7 (lewismc)
+
+* NUTCH-1919 Getting timeout when server returns Content-Length: 0 (jnioche)
+
+* NUTCH-1912 Dump tool -mimetype parameter needs to be optional to prevent NPE 
(Tyler Palsulich via lewismc)
+
+* NUTCH-1881 ant target resolve-default to keep test libs (snagel)
+
+* NUTCH-1660 Index filter for Page's latitude and longitude (Yasin Kılınç, 
lewismc)
+
+* NUTCH-1140 index-more plugin, resetTitle creates multiple values in title 
field (Joe Liedtke, kaveh minooie via snagel)
+
+* NUTCH-1904 Schema for Solr4 doesn't include _version_ field (mattmann)
+
+* NUTCH-1897 Easier debugging of plugin XML errors (markus)
+
+* NUTCH-1823 Upgrade to elasticsearch 1.4.1 (Phu Kieu, markus via lewismc)
+
+* NUTCH-1592 TikaParser can uppercase the element names while generating the 
DOM (jnioche)
+
+* NUTCH-1877 Suffix URL filter to ignore query string by default (markus via 
snagel)
+
+* NUTCH-1890 Major Typo in Documentation for Integrating Nutch and Solr (Boadu 
Akoto Charles Jnr, mattmann)
+
+* NUTCH-1887 Specify HTMLMapper to use in TikaParser (jnioche)
+
+* NUTCH-1884 NullPointerException in parsechecker and indexchecker with 
symlinks in file URL (Mengying Wang, snagel)
+
+* NUTCH-1825 protocol-http may hang for certain web pages (Phu Kieu via snagel)
+
+* NUTCH-1483 Can't crawl filesystem with protocol-file plugin (Rogério 
Pereira Araújo, Mengying Wang, snagel)
+
+* NUTCH-1885 Protocol-file should treat symbolic links as redirects (Mengying 
Wang, snagel)
+
+* NUTCH-1880 URLUtil should not add additional slashes for file URLs (snagel)
+
+* NUTCH-1879 Regex URL normalizer should remove multiple slashes after file: 
protocol (snagel)
+
+* NUTCH-1883 bin/crawl: use function to run bin/nutch and check exit value 
(snagel)
+
+* NUTCH-1865 Enable use of SNAPSHOT's with Nutch Ivy dependency management 
(lewismc)
+
+* NUTCH-1882 ant eclipse target to add output path to src/test (snagel)
+
+* NUTCH-1876 Upgrade to Crawler Commons 0.5 (jnioche)
+
+* NUTCH-1874 FileDumper comment typos (Arthur Cinader via lewismc)
+
+* NUTCH-1164 Write JUnit tests for protocol-http (nimafl via snagel)
+
+* NUTCH-1868 Document and improve CLI for FileDumper tool (lewismc)
+
+* NUTCH-1869 Add a flag to -mimeType flag to FileDumper (lewismc)
+
+* NUTCH-1867 CrawlDbReader: use setFloat to pass min score (lewismc, snagel)
+
+* NUTCH-1826, NUTCH-1864 indexchecker fails if solr.server.url not configured 
(lewismc, snagel)
+
+* NUTCH-1866 ant eclipse target should not delete runtime (nimafl via lewismc)
+
+* NUTCH-1857 readdb -dump -format csv should use comma (lewismc)
+
+* NUTCH-1853 Add commented out WebGraph executions to ./bin/crawl (lewismc)
+
+* NUTCH-1844 testresources/testcrawl not referenced anywhere in code (mattmann)
+
+* NUTCH-1839 Improve WebGraph CLI parsing (lewismc)
+
+* NUTCH-1526 Create SegmentContentDumperTool for easily extracting out file 
contents from SegmentDirs (mattmann, lewismc, Julien Le Dem)
+
+* NUTCH-1840 the describe function in SolrIndexWriter is not correct (kaveh 
minooie via jnioche)
+
+* NUTCH-1837 Upgrade to Tika 1.6 (jnioche)
+
+* NUTCH-1829 Generator : unable to distinguish real errors (Mathieu Bouchard 
via jnioche)
+ 
+* NUTCH-1835 Nutch's Solr schema doesn't work with Solr 4.9 because of the 
RealTimeGet handler (mattmann)
+
+* NUTCH-1833 Include version number within nutch binary usage statement (Rishi 
Verma via mattmann)
+
+* NUTCH-1832 Make Nutch work without an indexer (mattmann)
+
+* NUTCH-1828 bin/crawl : incorrect handling of nutch errors (Mathieu Bouchard 
via jnioche)
+
+* NUTCH-1775 IndexingFilter: document origin of passed CrawlDatum (snagel)
+
+* NUTCH-1693 TextMD5Signature computed on textual content (Tien Nguyen Manh, 
markus via snagel)
+
+* NUTCH-1409 remove deprecated properties db.{default,max}.fetch.interval, 
generate.max.per.host.by.ip (Matthias Agethle via snagel)
+
+Nutch 1.9 Release Change Log - 12/08/2014 (dd/mm/yyyy)
+Release Report - http://s.apache.org/1.9-release
+
+* NUTCH-1561 improve usability of parse-metatags and index-metadata (snagel)
+
+* NUTCH-1708 use same id when indexing and deleting redirects (snagel)
+
+* NUTCH-1818 Add deps-test-compile task for building plugins (jnioche)
+
+* NUTCH-1817 Remove pom.xml from source (jnioche)
+
+* NUTCH-926 Redirections from META tag don't get filtered (snagel)
+
+* NUTCH-1422 Bypass signature comparison when a document is redirected (snagel)
+
+* NUTCH-1502 Test for CrawlDatum state transitions (snagel)
+
+* NUTCH-1804 Move JUnit dependency to test scope (jnioche)
+
+* NUTCH-1811 bin/nutch junit to use junit 4 test runner (snagel)
+
+* NUTCH-1799 ANT Eclipse task discovers all plugin jars automatically (jnioche)
+
+* NUTCH-578 URL fetched with 403 is generated over and over again (snagel)
+
+* NUTCH-1776 Log incorrect plugin.folder file path (Diaa via snagel)
+
+* NUTCH-1566 bin/nutch to allow whitespace in paths (tejasp, snagel)
+
+* NUTCH-1605 MIME type detector recognizes xlsx as zip file (snagel)
+
+* NUTCH-1802 Move TestbedProxy to test environment (jnioche)
+
+* NUTCH-1803 Put test dependencies in a separate lib dir (jnioche)
+
+* NUTCH-385 Improve description of thread related configuration for Fetcher 
(jnioche,lufeng)
+
+* NUTCH-1633 slf4j is provided by hadoop and should not be included in the job 
file (kaveh minooie via jnioche)
+
+* NUTCH-1787 update and complete API doc overview page (snagel)
+
+* NUTCH-1767 remove special treatment of "params" in relative links (snagel)
+
+* NUTCH-1718 redefine http.robots.agent as "additional agent names" (snagel, 
Tejas Patil, Daniel Kugel)
+
+* NUTCH-1794 IndexingFilterChecker to optionally dumpText (markus)
+
+* NUTCH-1590 [SECURITY] Frame injection vulnerability in published Javadoc 
(jnioche)
+
+* NUTCH-1793 HttpRobotRulesParser not configured properly (jnioche)
+
+* NUTCH-1647 protocol-http throws 'unzipBestEffort returned null' for 
redirected pages (jnioche)
+
+* NUTCH-1736 Can't fetch page if http response header contains 
Transfer-Encoding: chunked (ysc via jnioche)
+
+* NUTCH-1782 NodeWalker to return current node (markus)
+
+* NUTCH-1758 IndexChecker to send document to IndexWriters (jnioche)
+
+* NUTCH-1786 CrawlDb should follow db.url.normalizers and db.url.filters (Diaa 
via markus)
+
+* NUTCH-1757 ParserChecker to take custom metadata as input (jnioche)
+
+* NUTCH-1676 Add rudimentary SSL support to protocol-http (jnioche, markus)
+
+* NUTCH-1772 Injector does not need merging if no pre-existing crawldb 
(jnioche)
+
+* NUTCH-1752 Cache robots.txt rules per protocol:host:port (snagel)
+
+* NUTCH-1613 Timeouts in protocol-httpclient when crawling same host with >2 
threads (brian44 via jnioche)
+
+* NUTCH-1766 Generator to unlock crawldb and remove tempdir if generate job 
fails (Diaa via jnioche)
+
+* NUTCH-207 Bandwidth target for fetcher rather than a thread count (jnioche)
+
+* NUTCH-1182 fetcher to log hung threads (snagel)
+
+* NUTCH-1759 Upgrade to Crawler Commons 0.4 (jnioche)
+
+* NUTCH-1764 readdb to show command-line help if no action (-stats, -dump, 
etc.) given (Diaa via snagel)
+
+* NUTCH-1700 Remove deprecated code from creativecommons plugin (lewismc)
+
+* NUTCH-1761 Crawl script fails to find job file if not started from inside 
bin dir (David Hosking, jnioche)
+
+* NUTCH-1603 ZIP parser complains about truncated PDF file (snagel)
+
+* NUTCH-1720 Duplicate lines in HttpBase.java (Walter Tietze via jnioche)
+
+* NUTCH-1750 Improvement of Fetcher's reportStatus (jnioche)
+
+* NUTCH-1747 Use AtomicInteger as semaphore in Fetcher (jnioche)
+
+* NUTCH-1735 code dedup fetcher queue redirects (snagel)
+
+* NUTCH-1745 Upgrade to ElasticSearch 1.1.0 (jnioche)
+
+* NUTCH-1645 Junit Test Case for Adaptive Fetch Schedule class (Yasin 
Kılınç, lufeng, Sertac TURKEL via snagel)
+
+* NUTCH-1737 Upgrade to recent JUnit 4.x (lewismc)
+
+* NUTCH-1733 parse-html to support HTML5 charset definitions (snagel)
+
+* NUTCH-1671 indexchecker to add digest field (snagel, lufeng)
+
+Nutch 1.8      - 11/03/2014 (dd/mm/yyyy)
+Release Report - http://s.apache.org/oHY
+
+* NUTCH-1706 IndexerMapReduce does not remove db_redir_temp (markus, snagel)
+
+* NUTCH-1113 SegmentMerger can now be safely used to merge segments (Edward 
Drapkin, markus, snagel)
+
+* NUTCH-1729 Upgrade to Tika 1.5 (jnioche)
+
+* NUTCH-1707 DummyIndexingWriter (markus)
+
+* NUTCH-1721 Upgrade to Crawler commons 0.3 (tejasp)
+
+* NUTCH-1253 Incompatible neko and xerces versions (snagel, lewismc)
+
+* NUTCH-1715 RobotRulesParser adds additional '*' to the robots name (tejasp)
+
+* NUTCH-356 Plugin repository cache can lead to memory leak (Enrico Triolo, 
Doğacan Güney via markus)
+
+* NUTCH-1413 Record response time (Yasin Kılınç, Talat Uyarer, snagel)
+
+* NUTCH-1680 CrawlDbReader to dump minRetry value (markus)
+
+* NUTCH-1699 Tika Parser - Image Parse Bug (Mehmet Zahid Yüzügüldü, snagel 
via lewismc)
+
+* NUTCH-1695 Add NutchDocument.toString() to ease debugging (markus)
+
+* NUTCH-1675 NutchField to support long (markus)
+
+* NUTCH-1670 set same crawldb directory in mergedb parameter (lufeng via 
tejasp)
+
+* NUTCH-1080 Type safe members, arguments for better readability (tejasp)
+
+* NUTCH-1360 Support the storing of IP address connected to when web crawling 
(lewismc, ferdy and Yasin Kılınç)
+
+* NUTCH-1681 In URLUtil.java, toUNICODE method does not work correctly 
(İlhami KALKAN, snagel via markus)
+
+* NUTCH-1668 Remove package org.apache.nutch.indexer.solr (jnioche)
+
+* NUTCH-1621 Remove deprecated class o.a.n.crawl.Crawler (Rui Gao via jnioche)
+
+* NUTCH-656 Generic Deduplicator (jnioche, snagel)
+
+* NUTCH-1100 Avoid NPE in SOLRDedup (markus)
+
+* NUTCH-1666 Optimisation for BasicURLNormalizer (jnioche)
+
+* NUTCH-1656 ParseMeta not passed to CrawlDatum for not_modified (markus)
+
+* NUTCH-1606 Check that Factory classes use the cache in a thread safe way 
(jnioche)
+
+* NUTCH-1653 AbstractScoringFilter (jnioche)
+
+* NUTCH-1562 Order of execution for scoring filters (jnioche, snagel)
+
+* NUTCH-1640 Reuse ParseUtil instance in ParseSegment (Mitesh Singh Jat via 
jnioche)
+
+* NUTCH-1639 bin/crawl fails on mac os (various contributors via snagel)
+
+* NUTCH-1646 IndexerMapReduce to consider DB status (markus)
+
+* NUTCH-1636 Indexer to normalize and filter repr URL (Iain Lopata via snagel)
+
+* NUTCH-1637 URLUtil is missing getProtocol (markus)
+
+* NUTCH-1622 Create Outlinks with metadata (jnioche)
+
+* NUTCH-1629 Injector skips empty lines in seed files (kaveh minooie via 
jnioche)
+
+* NUTCH-911 protocol-file to return proper protocol status (Peter Lundberg via 
snagel)
+
+* NUTCH-806 Merge CrawlDBScanner with CrawlDBReader (jnioche)
+
+* NUTCH-1587 misspelled property "threshold" in conf/log4j.properties (snagel)
+
+* NUTCH-1604 ProtocolFactory not thread-safe (jnioche)
+
+* NUTCH-1595 Upgrade to Tika 1.4 (jnioche, markus)
+
+* NUTCH-1598 ElasticSearchIndexer to read ImmutableSettings from config 
(markus)
+
+* NUTCH-1520 SegmentMerger loses records (markus)
+
+* NUTCH-1602 improve the readability of metadata in readdb dump normal (lufeng)
+
+* NUTCH-1596 HeadingsParseFilter not thread safe (snagel via markus)
+
+* NUTCH-1597 HeadingsParseFilter to trim and remove excess whitespace (markus)
+
+* NUTCH-1601 ElasticSearchIndexer fails to properly delete documents (markus)
+
+* NUTCH-1600 Injector overwrite does not always work properly (markus)
+
+* NUTCH-1581 CrawlDB csv output to include metadata (markus)
+
+* NUTCH-1327 QueryStringNormalizer (markus)
+
+* NUTCH-1593 Normalize option missing in SegmentMerger's usage (markus)
+
+* NUTCH-1580 index-static returns object instead of value for index.static 
(Antoinette, lewismc, snagel)
+
+* NUTCH-1126 JUnit test for urlfilter-prefix (Talat UYARER via markus)
+
+Apache Nutch 1.7 Release - 06/20/2013 (mm/dd/yyyy)
+Release report - http://s.apache.org/1zE
+
+* NUTCH-1585 Ensure duplicate tags do not exist in microformat-reltag tag set. 
(lewismc)
+
+* NUTCH-1583 Headings plugin to support multivalued headings (markus)
+
+* NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays db_unfetched 
in CrawlDb (snagel)
+
+* NUTCH-1527 Elasticsearch indexer (lufeng + markus)
+
+* NUTCH-1475 Index-More Plugin -- A better fall back value for date field 
(James Sullivan, snagel via lewismc)
+
+* NUTCH-1560 index-metadata to add all values of multivalued metadata (snagel)
+
+* NUTCH-1467 Not able to parse multiValued metatags (kiran via snagel)
+
+* NUTCH-1430 Freegenerator records overwrite CrawlDB records with 
AdaptiveFetchSchedule (markus)
+
+* NUTCH-1522 Upgrade to Tika 1.3 (jnioche)
+
+* NUTCH-1578 Upgrade to Hadoop 1.2.0 (markus)
+
+* NUTCH-1577 Add target for creating eclipse project (tejasp)
+
+* NUTCH-1513 Support Robots.txt for Ftp urls (tejasp)
+
+* NUTCH-1249 and NUTCH-1275 : Resolve all issues flagged up by adding javac 
-Xlint argument (tejasp)
+
+* NUTCH-1053 Parsing of RSS feeds fails (tejasp)
+
+* NUTCH-956 solrindex issues: add field tld to Solr schema (Alexis via 
lewismc, snagel)
+
+* NUTCH-1277 Fix [fallthrough] javac warnings (tejasp)
+
+* NUTCH-1514 Phase out the deprecated configuration properties (if possible) 
(tejasp)
+
+* NUTCH-1334 NPE in FetcherOutputFormat (jnioche via tejasp)
+
+* NUTCH-1549 Fix deprecated use of Tika MimeType API in o.a.n.util.MimeUtil 
(tejasp)
+
+* NUTCH-346 Improve readability of logs/hadoop.log (Renaud Richardet via 
tejasp)
+
+* NUTCH-829 duplicate hadoop temp files (Mike Baranczak, lewismc, tejasp)
+
+* NUTCH-1501 Harmonize behavior of parsechecker and indexchecker (snagel + 
lewismc)
+
+* NUTCH-1031 Delegate parsing of robots.txt to crawler-commons (tejasp)
+
+* NUTCH-1547 BasicIndexingFilter - Problem to index full title (Feng)
+
+* NUTCH-1389 parsechecker and indexchecker to report truncated content (snagel)
+
+* NUTCH-1419 parsechecker and indexchecker to report protocol status (snagel + 
lewismc)
+
+* NUTCH-1047 Pluggable indexing backends (jnioche)
+
+* NUTCH-1536 Ant build file has hardcoded conf dir location (zm via lewismc)
+
+* NUTCH-1420 Get rid of the dreaded ï¿½ (markus via lewismc)
+
+* NUTCH-1521 CrawlDbFilter pass null url to urlNormalizers (Lufeng via lewismc)
+
+* NUTCH-1284 Add site fetcher.max.crawl.delay as log output by default (tejasp)
+
+* NUTCH-1453 Substantiate tests for IndexingFilters (lufeng via lewismc)
+
+* NUTCH-840 Port tests from parse-html to parse-tika (lewismc, jnioche)
+
+* NUTCH-1509 Implement read/write in NutchField (markus)
+
+* NUTCH-1507 Remove FetcherOutput (markus)
+
+* NUTCH-1506 Add UPDATE action to NutchIndexAction (markus)
+
+* NUTCH-1500 bin/crawl fails on step solrindex with wrong path to segment 
(Tristan Buckner, snagel)
+
+* NUTCH-1274 Fix [cast] javac warnings (tejasp via lewismc)
+
+* NUTCH-1494 RSS feed plugin seems broken (Sourajit Basak, tejasp and lewismc)
+
+* NUTCH-1127 JUnit test for urlfilter-validator (tejasp via lewismc)
+
+* NUTCH-1119 JUnit test for index-static (tejasp via lewismc)
+
+* NUTCH-1510 Upgrade to Hadoop 1.1.1 (markus)
+
+* NUTCH-1118 JUnit test for index-basic (tejasp via lewismc)
+
+* NUTCH-1331 limit crawler to defined depth (jnioche)
+
+Release 1.6 - 23/11/2012
+
+* NUTCH-1370 Expose exact number of urls injected @runtime (snagel via lewismc)
+
+* NUTCH-1117 JUnit test for index-anchor (lewismc)
+
+* NUTCH-1451 Upgrade automaton jar to 1.11-8 (lewismc)
+
+* NUTCH-1488 bin/nutch to run junit from any directory (snagel via lewismc)
+
+* NUTCH-1493 Error adding field 'contentLength'='' during solrindex using 
index-more (Nathan Gass via lewismc)
+
+* NUTCH-1491 Strip UTF-8 non-character codepoints in title (Nathan Gass via 
markus)
+
+* NUTCH-1421 RegexURLNormalizer to only skip rules with invalid patterns 
(snagel)
+
+* NUTCH-1341 NotModified time set to now but page not modified (markus)
+
+* NUTCH-1215 UpdateDB should not require segment as input (markus)
+
+* NUTCH-1383 IndexingFiltersChecker to show error message instead of null 
pointer exception (snagel)
+
+* NUTCH-1476 SegmentReader getStats should set parsed = -1 if no parsing took 
place (snagel)
+
+* NUTCH-1252 SegmentReader -get shows wrong data (snagel)
+
+* NUTCH-1344 BasicURLNormalizer to normalize https same as http (snagel)
+
+* NUTCH-706 Url regex normalizer: pattern for session id removal not to match 
"newsId" (Meghna Kukreja via snagel)
+
+* NUTCH-1415 release packages to contain top level folder apache-nutch-x.x 
(snagel)
+
+* NUTCH-1441 AnchorIndexingFilter should use plain HashSet (ferdy via lewismc)
+
+* NUTCH-1470 Ensure test files are included for runtime testing (lewismc)
+
+* NUTCH-1434 Indexer to delete robots noindex (markus)
+
+* NUTCH-1443 Solr schema version is invalid (markus)
+
+* NUTCH-1417 Remove o.a.n.metadata.Office (lewismc)
+
+* NUTCH-1376 Add description parameter to every ant task (lewismc)
+
+* NUTCH-1440 reconfigure non-existent stopwords_en.txt in schema-solr4.xml 
(shekhar sharma via lewismc)
+
+* NUTCH-1439 Define boost field as type float in schema-solr4.xml (shekhar 
sharma via lewismc)
+
+* NUTCH-1433 Upgrade to Tika 1.2 (jnioche)
+
+* NUTCH-1388 Optionally maintain custom fetch interval despite 
AdaptiveFetchSchedule (markus)
+
+* NUTCH-1430 Freegenerator records overwrite CrawlDB records with 
AdaptiveFetchSchedule (markus)
+
+* NUTCH-1087 Deprecate crawl command and replace with example script (jnioche)
+
+* NUTCH-1306 Add option to not commit and clarify existing solr.commit.size 
(ferdy)
+
+* NUTCH-1405 Allow to overwrite CrawlDatum's with injected entries (markus)
+
+* NUTCH-1412 Upgrade commons lang (markus)
+
+* NUTCH-1251 SolrDedup to use proper Lucene catch-all query (Arkadi Kosmynin 
via markus)
+
+* NUTCH-1407 BasicIndexingFilter to optionally add domain field (markus)
+
+* NUTCH-1408 RobotRulesParser main doesn't take URL's (markus)
+
+* NUTCH-1300 Indexer to filter normalize URL's (markus)
+
+* NUTCH-1330 WebGraph OutlinkDB to preserve back up (markus)
+
+* NUTCH-1319 HostNormalizer plugin (markus)
+
+* NUTCH-1386 Headings filter not to add empty values (markus)
+
+* NUTCH-1356 ParseUtil use ExecutorService instead of manual thread handling 
(ferdy via markus)
+
+* NUTCH-1352 Improve regex urlfilters/normalizers synchronization (ferdy via 
markus)
+
+* NUTCH-1024 Dynamically set fetchInterval by MIME-type (markus)
+
+* NUTCH-1364 Add a counter in Generator for malformed urls (lewismc)
+
+* NUTCH-1262 Map `duplicating` content-types to a single type (markus)
+
+* NUTCH-1385 More robust plug-in order properties in nutch-site.xml (Andy Xue 
via markus)
+
+* NUTCH-1336 Optionally not index db_notmodified pages (markus)
+
+* NUTCH-1346 Follow outlinks to ignore external (markus)
+
+* NUTCH-1320 IndexChecker and ParseChecker choke on IDN's (markus)
+
+* NUTCH-1351 DomainStatistics to aggregate by TLD (markus)
+
+* NUTCH-1381 Allow to override default subcollection field name (markus)
+
+* NUTCH-XX Commit to add configuration for separation of ant distribution 
targets (lewismc + jnioche)
+
+Release 1.5.1 - 07/10/2012
+
+* NUTCH-1404 Nutch script fails to find job file in deploy mode (sidabatra, 
jnioche)
+
+* NUTCH-1415 release packages to contain top level folder apache-nutch-x.x 
(snagel via lewismc)
+
+* NUTCH-1400 Remove developer -core option for bin/nutch (jnioche)
+
+* NUTCH-1384 Typo in ParseSegment's run-method (Matthias Agethle via markus)
+
+* NUTCH-1398 Upgrade to Hadoop 1.0.3 (jnioche)
+
+Release 1.5 - 04/15/2012
+
+* NUTCH-1208 Don't include KEYS file in bin distribution (jnioche)
+
+* NUTCH-1234 Upgrade to Tika 1.1 (jnioche, markus)
+
+* NUTCH-809 Parse-metatags plugin (jnioche)
+
+* NUTCH-1310 Nutch to send HTTP-accept header (markus)
+
+* NUTCH-1305 Domain(blacklist)URLFilter to trim entries (markus)
+
+* NUTCH-1307 Improve formatting of ant targets for clearer project help 
(lewismc)
+
+* NUTCH-1299 LinkRank inverter to ignore records without Node (markus)
+
+* NUTCH-1258 MoreIndexingFilter should be able to read Content-Type from both 
parse metadata and content metadata (jnioche, markus)
+
+* NUTCH-1293 IndexingFiltersChecker to store detected content type in 
crawldatum metadata (markus)
+
+* NUTCH-1291 Fetcher to stringify exception on // unexpected exception (markus)
+
+* NUTCH-965 Skip parsing for truncated documents (alexis, lewismc, ferdy)
+
+* NUTCH-1210 DomainBlacklistFilter (markus)
+
+* NUTCH-1193 Incorrect url transform to lowercase: parameter solr (Eduardo dos 
Santos Leggiero via lewismc)
+
+* NUTCH-1272 Wrong property name for index-static in nutch-default.xml (Daniel 
Baur via jnioche)
+
+* NUTCH-1259 Store detected content-type in crawldatum metadata (jnioche, 
markus)
+
+* NUTCH-1266 Subcollection to optionally write to configured fields (markus)
+
+* NUTCH-1005 Parse headings plugin (markus)
+
+* NUTCH-1264 Configurable indexing plugin index-metadata (jnioche)
+
+* NUTCH-1242 Allow disabling of URL Filters in ParseSegment (Edward Drapkin 
via markus)
+
+* NUTCH-1256 WebGraph to dump host + score (markus)
+
+* NUTCH-1260 Fetcher should log fetching of redirects (Sebastian Nagel via 
markus)
+
+* NUTCH-1255 Change ivy.xml of all plugins to remove "nutch.root" property 
(ferdy)
+
+* NUTCH-1248 Generator to select on status (markus)
+
+* NUTCH-1177 Generator to select on retry interval (markus)
+
+* NUTCH-1246 Upgrade to Hadoop 1.0.0 (jnioche)
+
+* NUTCH-1139 Indexer to delete gone documents (markus)
+
+* NUTCH-1244 CrawlDBDumper to filter by regex (markus)
+
+* NUTCH-1237 Improve javac arguments for more verbose output (lewismc)
+
+* NUTCH-1236 Add link to site documentation to download older versions of 
Nutch (lewismc)
+
+* NUTCH-1146 Prevent generation of _SUCCESS files in output (jnioche)
+
+* NUTCH-1232 Remove site field from index-basic (markus)
+
+* NUTCH-1239 Webgraph should remove deleted pages from segment input (markus)
+
+* NUTCH-1238 Fetcher throughput threshold must start before feeder finished 
(markus)
+
+* NUTCH-1138 remove LogUtil from trunk and nutch gora (lewismc)
+
+* NUTCH-1231 Upgrade to Tika 1.0 (markus)
+
+* NUTCH-1230 MimeType API deprecated and breaks with Tika 1.0 (markus)
+
+* NUTCH-1235 Upgrade to new Hadoop 0.20.205.0 (markus)
+
+* NUTCH-1217 Update NOTICE.txt to drop some copyrights (lewismc)
+
+* NUTCH-1129 Add freegenerator, domainstats and crawldbscanner to log4j 
(markus)
+
+* NUTCH-1184 Fetcher to parse and follow Nth degree outlinks (markus)
+
+* NUTCH-1221 Migrate DomainStatistics to MapReduce API (markus)
+
+* NUTCH-1216 Add trivial comment to lib/native/README.txt (lewismc)
+
+* NUTCH-1214 DomainStats tool should be named for what it's doing (markus)
+
+* NUTCH-1213 Pass additional SolrParams when indexing to Solr (ab)
+
+* NUTCH-1211 URLFilterChecker command line help doesn't inform user of 
+  STDIN requirements (mattmann)
+
+* NUTCH-1209 Output from ParserChecker Url missing a newline (mattmann)
+
+* NUTCH-1207 ParserChecker to output signature (markus)
+
+* NUTCH-1090 InvertLinks should inform when ignoring internal links (Marek 
Backmann via markus)
+
+* NUTCH-1174 Outlinks are not properly normalized (markus)
+
+* NUTCH-1203 ParseSegment to show number of milliseconds per parse (markus)
+
+* NUTCH-1185 Decrease solr.commit.size to 250 (markus)
+
+* NUTCH-1180 UpdateDB to backup previous CrawlDB (markus)
+
+* NUTCH-1173 DomainStats doesn't count db_not_modified (markus)
+
+* NUTCH-1155 Host/domain limit in generator is generate.max.count+1 (markus)
+
+* NUTCH-1061 Migrate MoreIndexingFilter from Apache ORO to java.util.regex 
(markus)
+
+* NUTCH-1178 Incorrect CSV header CrawlDatumCsvOutputFormat (markus)
+
+* NUTCH-1142 Normalization and filtering in WebGraph (markus)
+
+* NUTCH-1153 LinkRank not to log all keys and not to write Hadoop _SUCCESS 
file (markus)
+
+Release 1.4 - 11/4/2011
+
+* NUTCH-1195 Add Solr 4x (trunk) example schema (ab)
+
+* NUTCH-1192 Add '/runtime' to svn ignore (ferdy)
+
+* NUTCH-1097 application/xhtml+xml should be enabled in plugin.xml of 
parse-html; allow multiple mimetypes for plugin.xml (Ferdy via lewismc)
+
+* NUTCH-797 Fix parse-tika and parse-html to use relative URL resolution per 
RFC-3986
+  (Robert Hohman, ab)
+
+* NUTCH-1154 Upgrade to Tika 0.10. NOTE: Tika's new RTF parser may ignore more
+  text in malformed documents than previously - see TIKA-748 for details. (ab)
+
+* NUTCH-1109 Add Sonar targets to Ant build.xml (lewismc) 
+
+* NUTCH-1152 Upgrade SolrJ to version 3.4.0 (ab)
+
+* NUTCH-1136 Ant pmd target is broken (lewismc)
+
+* NUTCH-1058 Upgrade Solr schema to version 1.4 (markus)
+
+* NUTCH-1137 LinkDB invertlinks other options ignored when using -dir option 
(Sebastian Nagel, markus)
+
+* NUTCH-1141 Configurable Fetcher queue depth (jnioche)
+
+* NUTCH-1091 Remove commons logging dependency from Nutch branch and trunk 
(lewismc)
+
+* NUTCH-672 allow unit tests to be run from bin/nutch (Todd Lipton via lewismc)
+
+* NUTCH-937 Put plugins in classes/plugins in job file (Claudio Martella, 
Ferdy Galema, jnioche)
+
+* NUTCH-623 Change plugin source directory "languageidentifier" to 
"language-identifier" (lewismc)
+
+* NUTCH-1074 topN is ignored with maxNumSegments and generate.max.count 
(Robert Thomson via markus)
+
+* NUTCH-1078 Upgrade all instances of commons logging to slf4j (with log4j 
backend) (lewismc)
+
+* NUTCH-1115 Option to disable fixing embedded URL parameters in 
DomContentUtils (markus)
+
+* NUTCH-1114 Attr file missing in domain filter (markus)
+
+* NUTCH-1067 Configure minimum throughput for fetcher (markus)
+
+* NUTCH-1102 Fetcher to rely on fetcher.parse directive (markus)
+
+* NUTCH-1110 UpdateDB must not write _success file (markus)
+
+* NUTCH-1105 Max content length option for index-basic (markus)
+
+* NUTCH-940 static field plugin (Claudio Martella via lewismc)
+
+* NUTCH-914 Implement Apache Project Branding Requirements (lewismc)
+
+* NUTCH-1095 remove i18n from Nutch site to archive and legacy section of wiki 
(lewismc)
+
+* NUTCH-1101 Option to purge db_gone records with updatedb (markus)
+
+* NUTCH-1096 Empty (not null) ContentLength results in failure of fetch (Ferdy 
Galema via jnioche)
+
+* NUTCH-1073 Rename parameters 'fetcher.threads.per.host.by.ip' and 
'fetcher.threads.per.host' (jnioche)
+
+* NUTCH-1089 Short compressed pages caused exception in protocol-httpclient 
(Simone Frenzel via jnioche)
+
+* NUTCH-1085 Nutch script does not require HADOOP_HOME (jnioche)
+
+* NUTCH-1075 Delegate language identification to Tika (jnioche)
+
+* NUTCH-1049 Add classes to bin/nutch script (markus)
+
+* NUTCH-1051 Export WebGraph node scores for Solr.ExternalFileField (markus)
+
+* NUTCH-1083 ParserChecker implements Tools (jnioche)
+
+* NUTCH-1082 IndexingFiltersChecker utility does not list multi valued fields 
(markus)
+
+* NUTCH-1004 Do not index empty values for title field (markus)
+
+* NUTCH-914 Implement Apache Project Branding Requirements (lewismc via 
jnioche)
+
+* NUTCH-1069 Readlinkdb broken on Hadoop > 0.20 (markus)
+
+* NUTCH-1044 Redirected URLs and possibly all of their outlinked URLs have 
invalid scores (jnioche)
+
+* NUTCH-1028 Log urls when parsing (markus)
+
+* NUTCH-1065 New mvn.template (lewismc)
+
+* NUTCH-1072 Display number and size of queues in Fetcher status (jnioche)
+
+* NUTCH-1071 Crawldb update displays total number of URLs per status (jnioche)
+
+* NUTCH-1045 MimeUtil to rely on default config provided by Tika (jnioche)
+
+* NUTCH-1057 Fetcher thread time out configurable (markus)
+
+* NUTCH-1037 Option to deduplicate anchors prior to indexing (markus)
+
+* NUTCH-1050 Add segmentDir option to WebGraph (markus)
+
+* NUTCH-1055 upgrade package.html file in language identifier plugin (lewismc)
+
+* NUTCH-1059 Remove convdb command from /bin/nutch (lewismc)
+
+* NUTCH-1019 Edit comment in org.apache.nutch.crawl.Crawl to reflect removal 
of legacy (lewismc)
+
+* NUTCH-1023 Trivial error in error message for 
org.apache.nutch.crawl.LinkDbReader (lewismc)
+
+* NUTCH-1043 Add pattern for filtering .js in default url filters (jnioche)
+
+* NUTCH-1054 LinkDB optional during indexing (jnioche)
+
+* NUTCH-1029 Readdb throws EOFException (markus)
+
+* NUTCH-1036 Solr jobs should increment counters in Reporter (markus)
+
+* NUTCH-987 Support HTTP auth for Solr communication (markus)
+
+* NUTCH-1027 Degrade log level of `can't find rules for scope` (markus)
+
+* NUTCH-783 IndexingFiltersChecker utility (jnioche via markus)
+
+* NUTCH-1030 WebgraphDB program requires manually added directories (markus)
+
+* NUTCH-1011 Normalize duplicate slashes in URL's (markus)
+
+* NUTCH-993 NullPointerException at FetcherOutputFormat.checkOutputSpecs 
(Christian Guegi via jnioche)
+
+* NUTCH-1013 Migrate RegexURLNormalizer from Apache ORO to java.util.regex 
(markus)
+
+* NUTCH-1016 Strip UTF-8 non-character codepoints and add logging for 
SolrWriter (markus)
+
+* NUTCH-1012 Cannot handle illegal charset $charset (markus)
+
+* NUTCH-1022 Upgrade version number of Nutch agent in conf (markus)
+
+* NUTCH-295 Description for fetcher.threads.fetch property (kubes via markus)
+
+* NUTCH-1000 Add option not to commit to Solr (markus)
+
+* NUTCH-1006 MetaEquiv with single quotes not accepted (markus)
+
+* NUTCH-1010 ContentLength not trimmed (markus)
+
+Release 1.3 - 6/4/2011
+
+* NUTCH-995 Generate POM file using the Ivy makepom task (mattmann, jnioche, 
Gabriele Kahlout)
+
+* NUTCH-1003 task 'package' does not reflect the new organisation of the code 
(jnioche)
+
+* NUTCH-994 Fine tune Solr schema (markus)
+
+* NUTCH-997 IndexingFilters to store Date objects instead of Strings (jnioche)
+
+* NUTCH-996 Indexer adds solr.commit.size+1 docs (markus)
+
+* NUTCH-983 Upgrade SolrJ to 3.1 (markus, jnioche)
+
+* NUTCH-989 Index-basic plugin and Solr schema now use date fieldType for 
tstamp field (markus)
+
+* NUTCH-888 Remove parse-rss and add tests for rss to parse-tika (jnioche)
+
+* NUTCH-991 SolrDedup must issue a commit (markus)
+
+* NUTCH-986 SolrDedup fails due to incorrect date format (markus)
+
+* NUTCH-977 SolrMappingReader uses hardcoded configuration parameter name for 
mapping file (markus)
+
+* NUTCH-976 Rename properties solrindex.* to solr.* (markus)
+
+* NUTCH-890 Fix IllegalAccessError with slf4j used in Solrj (markus)
+
+* NUTCH-891 Subcollection plugin won't require blacklist any more (markus)
+
+* NUTCH-972 CrawlDbMerger doesn't break on non-existent input (Gabriele 
Kahlout via jnioche)
+
+* NUTCH-967 Upgrade to Tika 0.9 (jnioche)
+
+* NUTCH-975 Fix missing/wrong headers in source files (markus)
+
+* NUTCH-963 Add support for deleting Solr documents with STATUS_DB_GONE in 
CrawlDB (Claudio Martella, markus)
+
+* NUTCH-825 Publish nutch artifacts to central maven repository (mattmann, 
jnioche)
+
+* NUTCH-962 max. redirects not handled correctly: fetcher stops at max-1 
redirects (Sebastian Nagel via ab)
+
+* NUTCH-921 Reduce dependency of Nutch on config files (ab)
+
+* NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab)
+
+* NUTCH-872 Change the default fetcher.parse to FALSE (ab)
+
+* NUTCH-564 External parser supports encoding attribute (Antony Bowesman, mattmann)
+
+* NUTCH-964 Upgraded Xerces to 2.9.1, ERROR conf.Configuration - Failed to set setXIncludeAware (markus)
+
+* NUTCH-927 Fetcher.timelimit.mins is invalid when depth is greater than 1 (Wade Lau via jnioche)
+
+* NUTCH-824 Crawling - File Error 404 when fetching file with a hexadecimal character in the file name (Michela Becchi via jnioche)
+
+* NUTCH-954 Strict application of Content-Length limit for http protocols (Alexis Detreglode via jnioche)
+
+* NUTCH-950 DomainURLFilter throws NPE on bogus urls (Alexis Detreglode via jnioche)
+
+* NUTCH-935 basicurlnormalizer removes unnecessary /./ in URLs (Stondet via markus)
+
+* NUTCH-912 MoreIndexingFilter does not parse docx and xlsx date formats (Markus Jelsma, jnioche)
+
+* NUTCH-886 A .gitignore file for Nutch (dogacan)
+
+* NUTCH-930 Remove remaining dependencies on Lucene API (ab)
+
+* NUTCH-883 Remove unused parameters from nutch-default.xml (jnioche) 
+
+* NUTCH-936 LanguageIdentifier should not set empty lang field on NutchDocument (Markus Jelsma via jnioche)
+
+* NUTCH-787 ScoringFilters should not override the injected score (jnioche)
+
+* NUTCH-949 Conflicting ANT jars in classpath (jnioche)
+
+* NUTCH-863 Benchmark and a testbed proxy server (ab)
+
+* NUTCH-844 Improve NutchConfiguration (ab)
+
+* NUTCH-845 Native hadoop libs not available through maven (ab)
+
+* NUTCH-843 Separate the build and runtime environments (ab)
+
+* NUTCH-821 Use ivy in nutch builds (Enis Soztutar, jnioche)
+
+* NUTCH-837 Remove search servers and Lucene dependencies (ab)
+
+* NUTCH-836 Remove deprecated parse plugins (jnioche)
+
+* NUTCH-939 Added -dir command line option to SolrIndexer (Claudio Martella via ab)
+
+* NUTCH-948 Remove Lucene dependencies (ab)
+
+Release 1.2 - 09/18/2010
+
+* NUTCH-901 Make index-more plug-in configurable (Markus Jelsma via mattmann)
+
+* NUTCH-908 Infinite Loop and Null Pointer Bugs in Searching (kubes via mattmann)
+
+* NUTCH-906 Nutch OpenSearch sometimes raises DOMExceptions (Asheesh Laroia via ab)
+
+* NUTCH-862 HttpClient null pointer exception (Sebastian Nagel via ab)
+
+* NUTCH-905 Configurable file protocol parent directory crawling (Thorsten Scherler, mattmann, ab)
+
+* NUTCH-877 Allow setting of slop values for non-quote phrase queries on query-basic plugin (kubes via jnioche)
+
+* NUTCH-716 Make subcollection index field multivalued (Dmitry Lihachev via jnioche)
+
+* NUTCH-878 ScoringFilters should not override the injected score 
+
+* NUTCH-870 Injector should add the metadata before calling injectedScore (jnioche via mattmann)
+
+* NUTCH-858 No longer able to set per-field boosts on lucene documents (ab)
+
+* NUTCH-869 Add parse-html back (jnioche)
+
+* NUTCH-871 MoreIndexingFilter missing date format (Max Lynch via mattmann)
+
+* NUTCH-696 Timeout for Parser (ab, jnioche)
+
+* NUTCH-857 DistributedBeans should not close their RPC counterparts (kubes)
+  
+* NUTCH-855 ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags
+  and their subsequent indexing (Scott Gonyea via mattmann)
+
+* NUTCH-677 Segment merge filtering based on segment content (Marcin Okraszewski via mattmann)
+
+* NUTCH-774 Retry interval in crawl date is set to 0 (Reinhard Schwab via mattmann)
+
+* NUTCH-697 Generate log output for solr indexer and dedup (Dmitry Lihachev, Jeroen van Vianen via mattmann)
+
+* NUTCH-850 SolrDeleteDuplicates needs to clone the SolrRecord objects (jnioche)
+
+* NUTCH-838 Add timing information to all Tool classes (Jeroen van Vianen, mattmann)
+
+* NUTCH-835 Document deduplication failed using MD5Signature (Sebastian Nagel via ab)
+
+* NUTCH-831 Allow configuration of how fields crawled by Nutch are stored / indexed /
+  tokenized (Jeroen van Vianen via mattmann)
+
+* NUTCH-278 Fetcher-status might need clarification: kbit/s instead of kb/s shown (Alex McLintock via mattmann)
+
+* NUTCH-833 Website is still Lucene branded (mattmann, Alex McLintock)
+
+* NUTCH-832 Website menu has lots of broken links - in particular the API docs (Alex McLintock via mattmann)
+
+Release 1.1 - 2010-06-06
+
+* NUTCH-819 Included Solr schema.xml and solrindex-mapping.xml don't play together (ab)
+
+* NUTCH-818 Bugfix: Parse-tika uses minorCodes instead of majorCodes in ParseStatus (jnioche)
+
+* NUTCH-816 Add zip target to build.xml (mattmann)
+
+* NUTCH-732 Subcollection plugin not working (Filipe Antunes, ab)
+
+* NUTCH-815 Invalid blank line before If-Modified-Since header (Pascal Dimassimo via ab)
+
+* NUTCH-814 SegmentMerger bug (Rob Bradshaw, ab)
+
+* NUTCH-812 Crawl.java incorrectly uses the Generator API resulting in NPE (Phil Barnett via mattmann and ab)
+
+* NUTCH-810 Upgrade to Tika 0.7 (jnioche)
+
+* NUTCH-785 Copy metadata from origin URL when redirecting in Fetcher + call scfilters.initialScore on newly created URL (jnioche)
+
+* NUTCH-779 Mechanism for passing metadata from parse to crawldb (jnioche)
+
+* NUTCH-784 CrawlDBScanner (jnioche)
+
+* NUTCH-762 Generator can generate several segments in one parse of the crawlDB (jnioche)
+
+* NUTCH-740 Configuration option to override default language for fetched pages (Marcin Okraszewski via jnioche)
+
+* NUTCH-803 Upgrade to Hadoop 0.20.2 (ab)
+
+* NUTCH-787 Upgrade Lucene to 3.0.1. (Dawid Weiss via ab)
+
+* NUTCH-796 Zero results problems difficult to troubleshoot due to lack of logging (ab)
+
+* NUTCH-801 Remove RTF and MP3 parse plugins (jnioche)
+
+* NUTCH-798 Upgrade to SOLR1.4 and its dependencies (jnioche)
+
+* NUTCH-799 SOLRIndexer to commit once all reducers have finished (jnioche)
+
+* NUTCH-782 Ability to order htmlparsefilters (jnioche)
+
+* NUTCH-719 fetchQueues.totalSize incorrect in Fetcher (Steven Denny via jnioche)
+
+* NUTCH-790 Some external javadoc links are broken (siren)
+
+* NUTCH-766 Tika parser (jnioche via mattmann)
+
+* NUTCH-786 Improvement to the list of suffix domains (jnioche)
+
+* NUTCH-775 Enhance searcher interface (siren)
+
+* NUTCH-781 Update Tika to v0.6 (jnioche)
+
+* NUTCH-269 CrawlDbReducer: OOME because no upper-bound on inlinks count (stack + jnioche)
+
+* NUTCH-655 Injecting Crawl metadata (jnioche)
+
+* NUTCH-658 Use counters to report fetching and parsing status (jnioche)
+
+* NUTCH-777 Upgrading to jetty6 broke unit tests (mattmann)
+
+* NUTCH-767 Update Tika to v0.5 for the MimeType detection (Julien Nioche via ab)
+
+* NUTCH-769 Fetcher to skip queues for URLS getting repeated exceptions
+  (Julien Nioche via ab)
+
+* NUTCH-768 - Upgrade Nutch 1.0 to use Hadoop 0.20.1, also upgrades Xerces to 
+  version 2.9.1. (kubes)
+  
+* NUTCH-712 ParseOutputFormat should catch java.net.MalformedURLException
+  coming from normalizers (Julien Nioche via ab)
+
+* NUTCH-741 Job file includes multiple copies of nutch config files
+  (Kirby Bohling via ab)
+
+* NUTCH-739 SolrDeleteDuplications too slow when using hadoop (Dmitry Lihachev via ab)
+
+* NUTCH-738 Close SegmentUpdater when FetchedSegments is closed
+  (Martina Koch, Kirby Bohling via ab)
+
+* NUTCH-746 NutchBeanConstructor does not close NutchBean upon contextDestroyed,
+  causing resource leak in the container. (Kirby Bohling via ab)
+
+* NUTCH-772 Upgrade Nutch to use Lucene 2.9.1 (ab)
+
+* NUTCH-760 Allow field mapping from Nutch to Solr index (David Stuart, ab)
+
+* NUTCH-761 Avoid cloning CrawlDatum in CrawlDbReducer (Julien Nioche, ab)
+
+* NUTCH-753 Prevent new Fetcher from retrieving the robots twice (Julien Nioche via ab)
+
+* NUTCH-773 - Some minor bugs in AbstractFetchSchedule (Reinhard Schwab via ab)
+
+* NUTCH-765 - Allow Crawl class to call either Solr or Lucene Indexer (kubes)
+
+* NUTCH-735 - crawl-tool.xml must be read before nutch-site.xml when
+  invoked using crawl command (Susam Pal via dogacan)
+
+* NUTCH-721 - Fetcher2 Slow (Julien Nioche via dogacan)
+
+* NUTCH-702 - Lazy Instantiation of Metadata in CrawlDatum (Julien Nioche via dogacan)
+
+* NUTCH-707 - Generation of multiple segments in multiple runs returns only 1 segment
+  (Michael Chen, ab)
+
+* NUTCH-730 - NPE in LinkRank if no nodes with which to create the WebGraph
+  (Dennis Kubes via ab)
+
+* NUTCH-731 - Redirection of robots.txt in RobotRulesParser (Julien Nioche via ab)
+
+* NUTCH-757 - RequestUtils getBooleanParameter() always returns false
+  (Niall Pemberton via ab)
+
+* NUTCH-754 - Use GenericOptionsParser instead of FileSystem.parseArgs() (Julien
+  Nioche via ab)
+
+* NUTCH-756 - CrawlDatum.set() does not reset Metadata if it is null (Julien Nioche
+  via ab)
+
+* NUTCH-679 - Fetcher2 implementing Tool (Julien Nioche via ab)
+
+* NUTCH-758 - Set subversion eol-style to "native" (Niall Pemberton via ab)
+
+Release 1.0 - 2009-03-23
+
+ 1. NUTCH-474 - Fetcher2 crawlDelay and blocking fix (Dogacan Guney via ab)
+
+ 2. NUTCH-443 - Allow parsers to return multiple Parse objects.
+    (Dogacan Guney et al, via ab)
+
+ 3. NUTCH-393 - Indexer should handle null documents returned by filters.
+    (Eelco Lempsink via ab)
+
+ 4. NUTCH-456 - Parse msexcel plugin speedup (Heiko Dietze via siren)
+
+ 5. NUTCH-446 - RobotRulesParser should ignore Crawl-delay values of other
+    bots in robots.txt (Dogacan Guney via siren)
+
+ 6. NUTCH-482 - Remove redundant plugin lib-log4j (siren)
+ 
+ 7. NUTCH-483 - Remove redundant commons-logging jar from ontology plugin
+    (siren)
+
+ 8. NUTCH-161 - Change Plain text parser to
+    use parser.character.encoding.default property for fall back encoding
+    (KuroSaka TeruHiko, siren)
+
+ 9. NUTCH-61 - Support for adaptive re-fetch interval and detection of
+    unmodified content. (ab)
+
+10. NUTCH-392 - OutputFormat implementations should pass on Progressable.
+    (cutting via ab)
+
+11. NUTCH-495 - Unnecessary delays in Fetcher2 (dogacan)
+
+12. NUTCH-443 - Allow parsers to return multiple Parse objects; this will speed
+    up the rss parser (dogacan via mattmann). This update is a fix and semantics
+    change from the original patch for NUTCH-443. The original patch did not tell
+    the Indexer to read crawl_parse too so that it can pick up sub-urls' fetch
+    datums. This patch addresses that issue. Now, if Fetcher gets null content,
+    instead of pushing empty content, it filters the null content.
+    
+13. NUTCH-485 - Change HtmlParseFilters to return ParseResult object instead of
+    Parse object. (Gal Nitzan via dogacan)
+
+14. NUTCH-489 - URLFilter-suffix management of the url path when the url contains
+    some query parameters. (Emmanuel Joke via dogacan)
+
+15. NUTCH-502 - Bug in SegmentReader causes infinite loop. 
+    (Ilya Vishnevsky via dogacan)
+    
+16. NUTCH-444 Possibly use a different library to parse RSS feed for improved 
+    performance and compatibility. This patch introduced a new plugin, feed,
+    that includes an index filter and a parse plugin for feeds that uses ROME.
+    There was discussion to remove parse-rss, in light of the feed plugin, 
+    however, this patch does not explicitly remove parse-rss. (dogacan, mattmann)
+
+17. NUTCH-471 - Fix synchronization in NutchBean creation. 
+    (Enis Soztutar via dogacan)
+
+18. Upgrade to Lucene 2.2.0 and Hadoop 0.12.3. (ab)
+
+19. NUTCH-468 - Scoring filter should distribute score to all outlinks at 
+    once. (dogacan)
+
+20. NUTCH-504 - NUTCH-443 broke parsing during fetching. (dogacan)
+
+21. NUTCH-497 -  Extreme Nested Tags causes StackOverflowException in 
+       DomContentUtils...Spider Trap. (kubes)
+
+22. NUTCH-434 - Replace usage of ObjectWritable with something based on 
+    GenericWritable. (dogacan)
+
+23. NUTCH-499 - Refactor LinkDb and LinkDbMerger to reuse code. (dogacan)
+
+24. NUTCH-498 - Use Combiner in LinkDb to increase speed of linkdb generation.
+    (Espen Amble Kolstad via dogacan)
+
+25. NUTCH-507 - lib-lucene-analyzers jar definition is wrong in plugin.xml.
+    (Emmanuel Joke via dogacan)
+
+26. NUTCH-503 - Generator exits incorrectly for small fetchlists. 
+    (Vishal Shah via dogacan)
+
+27. NUTCH-505 - Outlink urls should be validated. (dogacan)
+
+28. NUTCH-510 - IndexMerger delete working dir. (Enis Soztutar via dogacan)
+
+29. NUTCH-513 - suffix-urlfilter.txt does not have a template. (dogacan)
+
+30. NUTCH-515 - Next fetch time is set incorrectly. (dogacan)
+
+30. NUTCH-506 - Nutch should delegate compression to Hadoop. (dogacan)
+
+31. NUTCH-517 - build encoding should be UTF-8. (Enis Soztutar via dogacan).
+
+32. NUTCH-518 - Fix OpicScoringFilter to respect scoring filter chaining.
+    (Enis Soztutar via dogacan)
+
+33. NUTCH-516 - Next fetch time is not set when it is a 
+    CrawlDatum.STATUS_FETCH_GONE. (Emmanuel Joke via dogacan)
+
+34. NUTCH-525 - DeleteDuplicates generates ArrayIndexOutOfBoundsException 
+    when trying to rerun dedup on a segment. (Vishal Shah via dogacan)
+
+35. NUTCH-514 - Indexer should only index pages with fetch status SUCCESS.
+    (dogacan) Note: There is a bigger problem, i.e how to deal
+    with redirected pages, and this issue can be considered as a band-aid 
+    for the time being. See NUTCH-273 and NUTCH-353 for more details. 
+
+36. NUTCH-533 - LinkDbMerger: url normalized is not updated in the key and 
+    inlinks list. (Emmanuel Joke via dogacan)
+
+37. NUTCH-535 - ParseData's contentMeta accumulates unnecessary values during
+    parse. (dogacan)
+
+38. NUTCH-522 - Use URLValidator in the Injector. (Emmanuel Joke, dogacan)
+
+39. NUTCH-536 - Reduce number of warnings in nutch core. (dogacan)
+
+40. NUTCH-439 - Top Level Domains Indexing / Scoring. Also adds 
+    domain-related utilities. (Enis Soztutar via dogacan)
+
+41. NUTCH-544 - Upgrade Carrot2 clustering plugin to the newest stable 
+    release (2.1). (Dawid Weiss via dogacan)
+
+42. NUTCH-545 - Configuration and OnlineClusterer get initialized in every
+    request. (Dawid Weiss via dogacan)
+
+43. NUTCH-532 - CrawlDbMerger: wrong computation of last fetch time. 
+    (Emmanuel Joke via dogacan)
+
+44. NUTCH-550 - Parse fails if db.max.outlinks.per.page is -1. (dogacan)
+
+45. NUTCH-546 - file URLs are filtered out by the crawler. (dogacan)
+
+46. NUTCH-554 - Generator throws IOException on invalid urls.
+    (Brian Whitman via ab)
+
+47. NUTCH-529 - NodeWalker.skipChildren doesn't work for more than 1 child.
+    (Emmanuel Joke via dogacan)
+
+48. NUTCH-25 - needs 'character encoding' detector.
+    (Doug Cook, dogacan, Marcin Okraszewski, Renaud Richardet via dogacan)
+
+49. NUTCH-508 - ${hadoop.log.dir} and ${hadoop.log.file} are not propagated
+    to the tasktracker. (Mathijs Homminga, Emmanuel Joke via dogacan)
+    
+50. NUTCH-562 - Port mime type framework to use Tika mime detection framework.
+    (mattmann)
+    
+51. NUTCH-488 - Avoid parsing unnecessary links and get a more relevant outlink
+    list. (Emmanuel Joke, Marcin Okraszewski via kubes)
+
+52. NUTCH-501 -  Implement a different caching mechanism for objects cached in
+    configuration. (dogacan)
+
+53. NUTCH-552 - Upgrade Nutch to Hadoop 0.15.x. (kubes)
+
+54. NUTCH-565 - Arc File to Nutch Segments Converter. (kubes)
+
+55. NUTCH-547 - Redirection handling: YahooSlurp's algorithm.
+    (dogacan, kubes via dogacan)
+
+56. NUTCH-548 - Move URLNormalizer from Outlink to ParseOutputFormat.
+    (Emmanuel Joke via dogacan)
+
+57. NUTCH-538 - Delete unused classes under o.a.n.util. (dogacan)
+
+58. NUTCH-494 - FindBugs: CrawlDbReader and DeleteDuplicates. (dogacan)
+
+59. NUTCH-574 - Including inlink anchor text in index can create irrelevant 
+    search results.  Created index-anchor plugin, removed functionality from 
+    index-basic plugin. For backwards compatibility, add index-anchor plugin to
+    nutch-site.xml plugin.includes. (kubes)
+
+60. NUTCH-581 - DistributedSearch does not update search servers added to 
+    search-servers.txt on the fly.  (Rohan Mehta via kubes)
+
+61. NUTCH-586 - Add option to run compiled classes without job file
+    (enis via ab)
+
+62. NUTCH-559 - NTLM, Basic and Digest Authentication schemes for web/proxy
+    server. (Susam Pal via dogacan)
+
+63. NUTCH-534 - SegmentMerger: add -normalize option (Emmanuel Joke via ab)
+
+64. NUTCH-528 - CrawlDbReader: add some new stats + dump into a CSV format
+    (Emmanuel Joke via ab)
+
+65. NUTCH-597 - NPE in Fetcher2 (Remco Verhoef via ab)
+
+66. NUTCH-584 - urls missing from fetchlist (Ruslan Ermilov, ab)
+
+67. NUTCH-580 - Remove deprecated hadoop api calls (FS) (siren)
+
+68. NUTCH-587 - Upgrade to Hadoop 0.15.3 (kubes)
+
+69. NUTCH-604 - Upgrade to Lucene 2.3.0 (ab)
+
+70. NUTCH-602 - Allow configurable number of handlers for search servers
+    (hartbecke via kubes)
+
+71. NUTCH-607 - Update build.xml to include tika jar when building war (kubes)
+
+72. NUTCH-608 - Upgrade nutch to use released apache-tika-0.1-incubating (mattmann)
+
+73. NUTCH-606 - Refactoring of Generator, run all urls through checks (kubes)
+
+74. NUTCH-605 - Change deprecated configuration methods for Hadoop (kubes)
+
+75. NUTCH-603 - Add more default url normalizations (kubes)
+
+76. NUTCH-611 - Upgrade Nutch to use Hadoop 0.16 (kubes)
+
+77. NUTCH-44 - Too many search results, limits max results returned from a 
+    single search. (Emilijan Mirceski and Susam Pal via kubes)
+
+78. NUTCH-567 - Proper (?) handling of URIs in TagSoup. TagSoup library is
+    updated to 1.2 version. (dogacan)
+
+79. NUTCH-613 - Empty summaries and cached pages (kubes via ab)
+
+80. NUTCH-612 - URL filtering was disabled in Generator when invoked
+    from Crawl (Susam Pal via ab)
+
+81. NUTCH-601 - Recrawling on existing crawl directory (Susam Pal via ab)
+
+82. NUTCH-575 - NPE in OpenSearchServlet (John H. Lee via ab)
+
+83. NUTCH-126 - Fetching https does not work with a proxy (Fritz Elfert via ab)
+
+84. NUTCH-615 - Redirected URL-s fetched without setting fetchInterval.
+    Guard against reprUrl being null. (Emmanuel Joke, ab)
+
+85. NUTCH-616 - Reset Fetch Retry counter when fetch is successful (Emmanuel
+    Joke, ab)
+
+86. NUTCH-220 - Upgrade to PDFBox 0.7.3 (ab)
+
+87. NUTCH-223 - Crawl.java uses Integer.MAX_VALUE (Jeff Ritchie via ab)
+
+88. NUTCH-598 - Remove deprecated use of ToolBase. Use generics in Hadoop API.
+    (Emmanuel Joke, dogacan, ab)
+
+89. NUTCH-620 - BasicURLNormalizer should collapse runs of slashes with a
+    single slash. (Mark DeSpain via ab)
+
+90. NUTCH-500 - Add hadoop masters configuration file into conf folder. 
+    (Emmanuel Joke via kubes)
+
+91. NUTCH-596 - ParseSegments parse content even if it's not
+    CrawlDatum.STATUS_FETCH_SUCCESS (dogacan)
+    
+92. NUTCH-618 - Tika error "Media type alias already exists" (mattmann,kubes)
+
+93. NUTCH-634 - Upgrade Nutch to Hadoop 0.17.1 (Michael Gottesman, Lincoln
+    Ritter, ab)
+
+94. NUTCH-641 - IndexSorter incorrectly copies stored fields (ab)
+
+95. NUTCH-645 - Parse-swf unit test failing (ab)
+
+96. NUTCH-642 - Unit tests fail when run in non-local mode (ab)
+
+97. NUTCH-639 - Change LuceneDocumentWrapper visibility from
+    private to _public_ (Guillaume Smet via dogacan)
+
+98. NUTCH-651 - Remove bin/{start|stop}-balancer.sh from svn
+    tracking. (dogacan)
+
+99. NUTCH-375 - Add support for Content-Encoding: deflate
+    (Pascal Beis, ab)
+
+100. NUTCH-633 - ParseSegment no longer allows reparsing.
+     (dogacan)
+
+101. NUTCH-653 - Upgrade to hadoop 0.18. (dogacan)
+
+102. NUTCH-621 - Nutch needs to declare its crypto usage (mattmann)
+
+103. NUTCH-654 - urlfilter-regex's main does not work.
+     (dogacan)
+
+104. NUTCH-640 - confusing description "set it to Integer.MAX_VALUE".
+     (dogacan)
+     
+105. NUTCH-662 - Upgrade Nutch to use Lucene 2.4. (kubes)
+
+106. NUTCH-663 - Upgrade Nutch to use Hadoop 0.19 (kubes)
+
+107. NUTCH-647 - Resolve URLs tool (kubes)
+
+108. NUTCH-665 - Search Load Testing Tool (kubes)
+
+109. NUTCH-667 - Input Format for working with Content in Hadoop Streaming
+                 (kubes)
+
+110. NUTCH-635 -  LinkAnalysis Tool for Nutch. (kubes)
+
+111. NUTCH-646 -  New Indexing Framework for Nutch. (kubes)
+
+112. NUTCH-668 -  Domain URL Filter. (kubes)
+
+113. NUTCH-594 -  Serve Nutch search results in multiple formats including 
+                  XML and JSON. (kubes)
+
+114. NUTCH-442 - Integrate Solr/Nutch. (dogacan, original version by siren) 
+
+115. NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
+                 fetch interval correctly. (dogacan)
+
+116. NUTCH-627 - Minimize host address lookup (Otis Gospodnetic)
+
+117. NUTCH-678 - Hadoop 0.19 requires an update of jets3t.
+                 (julien nioche via dogacan)
+
+118. NUTCH-681 - parse-mp3 compilation problem. 
+                 (Wildan Maulana via dogacan)
+
+119. NUTCH-676 - MapWritable is written inefficiently and confusingly.
+                 (dogacan)
+
+120. NUTCH-579 - Feed plugin only indexes one post per feed due to identical
+                 digest. (dogacan)
+
+121. NUTCH-571 - parse-mp3 plugin doesn't always index album of mp3.
+                 (Joseph Chen, dogacan)
+
+122. NUTCH-682 - SOLR indexer does not set boost on the document.
+                 (julien nioche via dogacan)
+
+123. NUTCH-279 - Additions to urlnormalizer-regex (Stefan Neufeind, ab)
+
+124. NUTCH-671 - JSP errors in Nutch searcher webapp (Edwin Chu via ab)
+
+125. NUTCH-643 - ClassCastException in PDF parser (Guillaume Smet, ab)
+
+126. NUTCH-636 - Httpclient plugin https doesn't work on IBM JRE
+     (Curtis d'Entremont, ab)
+
+127. NUTCH-683 - NUTCH-676 broke CrawlDbMerger. (dogacan)
+
+128. NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException
+     (Stefan Will, siren)
+     
+129. NUTCH-691 - Update jakarta poi jars to the most relevant version
+     (Dmitry Lihachev via siren)
+
+130. NUTCH-563 - Include custom fields in BasicQueryFilter
+     (Julien Nioche via siren)
+     
+131. NUTCH-695 - Incorrect mime type detection by MoreIndexingFilter plugin
+     (Dmitry Lihachev via siren)
+     
+132. NUTCH-694 - Distributed Search Server fails (siren)
+
+133. NUTCH-626 - Fetcher2 breaks out the domain with db.ignore.external.links
+     set at cross domain redirects (Remco Verhoef, dogacan via siren)
+
+134. NUTCH-247 - Robot parser to restrict (kubes, siren)
+
+135. NUTCH-698 - CrawlDb is corrupted after a few crawl cycles (dogacan
+     via siren)
+     
+136. NUTCH-699 - Add an "official" solr schema for solr integration (dogacan,
+     Dmitry Lihachev via siren)
+
+137. NUTCH-703 - Upgrade to Hadoop 0.19.1 (ab)
+
+138. NUTCH-419 - Unavailable robots.txt kills fetch (Carsten Lehmann,
+     Doug Cook via ab)
+     
+139. NUTCH-700 - Neko1.9.11 goes into a loop (Julien Nioche, siren)
+
+140. NUTCH-669 - Consolidate code for Fetcher and Fetcher2 (siren)
+
+141. NUTCH-711 - Indexer failing after upgrade to Hadoop 0.19.1 (ab)
+
+142. NUTCH-684 - Dedup support for Solr. (dogacan)
+
+143. NUTCH-715 - Subcollection plugin doesn't work with default
+     subcollections.xml file (Dmitry Lihachev via siren)
+     
+144. NUTCH-722 - Nutch contains JAI jars that we cannot redistribute
+
+Release 0.9 - 2007-04-02
+
+ 1. Changed log4j configuration to log to stdout on commandline
+    tools (siren)
+
+ 2. NUTCH-344 - Fix for thread blocking issue (Greg Kim via siren)
+ 
+ 3. NUTCH-260 - Update hadoop version to 0.5.0 (Renaud Richardet,
+    siren)
+
+ 4. Optionally skip pages with abnormally large values of Crawl-Delay
+    (Dennis Kubes via ab)
+
+ 5. Change readdb -stats to use CombiningCollector (ab)
+
+ 6. NUTCH-348 - Fix Generator to select highest scoring pages (Chris
+    Schneider and Stefan Groschupf via ab)
+
+ 7. NUTCH-347 - Adjust plugin build script not to emit warnings when copying
+    dependent jars (siren)
+    
+ 8. NUTCH-338 - Remove the text parser as an option for parsing PDF files
+    in parse-plugins.xml (Chris A. Mattmann via siren)
+    
+ 9. NUTCH-105 - Network error during robots.txt fetch causes file to
+    be ignored (Greg Kim via siren)
+    
+10. NUTCH-367 - DistributedSearch throws ClassCastException (siren)
+
+11. NUTCH-332 - Fix the problem of doubling scores caused by links pointing
+    to the current page (e.g. anchors). (Stefan Groschupf via ab)
+
+12. NUTCH-365 - Flexible URL normalization (ab)
+
+13. NUTCH-336 - Differentiate between newly discovered pages and newly
+    injected pages (Chris Schneider via ab) NOTE: this changes the
+    scoring API, filter implementations need to be updated.
+
+14. NUTCH-337 - Fetcher ignores the fetcher.parse value (Stefan Groschupf
+    via ab)
+
+15. NUTCH-350 - Urls blocked by http.max.delays incorrectly marked as GONE
+    (Stefan Groschupf via ab)
+
+16. NUTCH-374 - When http.content.limit is set to -1 and
+    Response.CONTENT_ENCODING is gzip or x-gzip, nothing can be fetched
+    (King Kong via pkosiorowski)
+
+17. NUTCH-383 - upgrade to Hadoop 0.7.1 and Lucene 2.0.0. (ab)
+
+  ****************************** WARNING !!! ********************************
+  * This upgrade breaks data format compatibility. A tool 'convertdb'       *
+  * was added to migrate existing CrawlDb-s to the new format. Segment data *
+  * can be partially migrated using 'mergesegs', however segments will      *
+  * require re-parsing (and consequently re-indexing).                      *
+  ****************************** WARNING !!! ********************************
+
+18. NUTCH-371 - DeleteDuplicates now correctly implements both parts of
+    the algorithm. (ab)
+
+19. NUTCH-391 - ParseUtil logs file contents to log file when it cannot
+    find parser (siren)
+
+20. NUTCH-379 - ParseUtil does not pass through the content's URL to the
+    ParserFactory (Chris A. Mattmann via siren)
+
+21. NUTCH-361, NUTCH-136 - When jobtracker is 'local' generate only one
+    partition. (ab)
+
+22. NUTCH-399 - Change CommandRunner to use concurrent api from jdk (siren)
+
+23. NUTCH-395 - Increase fetching speed (siren)
+
+24. NUTCH-388 - nutch-default.xml has outdated example for urlfilter.order
+    (reported by Jared Dunne)
+
+25. NUTCH-404 - Fix LinkDB Usage - implementation mismatch (siren)
+


[... 726 lines stripped ...]

svn commit: r28631 [4/5] - in /release/nutch: 1.14/CHANGES.txt 1.15/CHANGES.txt
