My personal favorites list:
In a day or so I will count all votes and post them.
NUTCH-141 jobdetails.jsp doesn't work on webbrowser "safari"
+1
NUTCH-140 Add alias capability in parse-plugins.xml file that
allows mimeType->extensionId mapping
NUTCH-139 Standard metadata property names in the ParseData metadata
+1
NUTCH-138 non-Latin-1 characters cannot be submitted for search
+1
NUTCH-137 footer is not displayed in search result page
NUTCH-136 mapreduce segment generator generates 50 % less than
expected urls
NUTCH-34 Parsing different content formats
NUTCH-3 multi values of header discarded
+1
NUTCH-134 Summarizer doesn't select the best snippets
NUTCH-132 Add ability to sort on more than one column
NUTCH-131 Non-documented variable: mapred.child.heap.size
NUTCH-98 RobotRulesParser interprets robots.txt incorrectly
NUTCH-129 rtf-parser does not work when opened with wordpad files
and saved
NUTCH-120 one "bad" link on a page kills parsing
+1
NUTCH-128 second configuration nodes overwrites first node
NUTCH-127 incorrect values using -du, or ls does not return items
NUTCH-126 Fetching via https does not work with a proxy (patch)
+1
NUTCH-125 OpenOffice Parser plugin
+1
NUTCH-110 OpenSearchServlet outputs illegal xml characters
+1
NUTCH-36 Chinese in Nutch
NUTCH-123 Cache.jsp sometimes generates NullPointerException
+1 (may already be fixed)
NUTCH-39 pagination in search result
NUTCH-49 Flag for generate to fetch only new pages to complement
the -refetchonly flag
NUTCH-94 MapFile.Writer throwing 'File exists error'.
NUTCH-117 Crawl crashes with java.io.IOException: already exists:
C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
NUTCH-122 block numbers need a better random number generator
NUTCH-82 Nutch Commands should run on Windows without external tools
NUTCH-121 SegmentReader for mapred
NUTCH-119 Regexp to extract outlinks incorrect
+1
NUTCH-118 FAQ link points to invalid URL
NUTCH-115 jobtracker.jsp shows too much information
NUTCH-103 Vivisimo like treeview and url redirect
NUTCH-108 tasktracker crashes when reconnecting to a new jobtracker.
NUTCH-113 Disable permanent DNS-to-IP caching for JVM 1.4
NUTCH-111 ndfs.replication is not documented within the
nutch-default.xml configuration file.
NUTCH-100 New plugin urlfilter-db
+1
NUTCH-101 RobotRulesParser
NUTCH-96 MapFile.Writer throws directory exists exception if run
multiple times in the same JVM or server JVM.
NUTCH-106 Datanode corruption
NUTCH-105 Network error during robots.txt fetch causes file to be
ignored
NUTCH-104 Nutch query parser does not support CJK bi-gram
segmentation.
NUTCH-102 jobtracker does not start when webapps is in src
NUTCH-95 DeleteDuplicates depends on the order of input segments
NUTCH-92 DistributedSearch incorrectly scores results
NUTCH-87 Efficient site-specific crawling for a large number of sites
NUTCH-91 empty encoding causes exception
+1
NUTCH-90 reduce logging output of IndexSegment
NUTCH-52 Parser plugin for MS Excel files
NUTCH-86 LanguageIdentifier API enhancements
NUTCH-84 Fetcher for constrained crawls
NUTCH-74 French Analyzer Plugin
+1
NUTCH-83 Release deliverable as zip
NUTCH-81 Webapp only works when deployed in root
NUTCH-79 Fault tolerant searching.
NUTCH-64 no results after a restart of a search--server (without
tomcat restart)
NUTCH-76 NDFS DataNode advertises localhost as its address
NUTCH-75 Patch for WebDBReader to get more detailed information
about WebDBs
NUTCH-73 A page for CSV results
NUTCH-72 Query basic filter with correction feature
NUTCH-70 duplicate pages - virtual hosts in db.
NUTCH-68 A tool to generate arbitrary fetchlists
+1
NUTCH-62 Add html META tag information into metaData in index-more
plugin
++1!
NUTCH-61 Adaptive re-fetch interval. Detecting unmodified content
++1! But is it ready to use?
NUTCH-55 Create dmoz.org search plugin - incorporate the dmoz.org
title/category/description if available &
NUTCH-59 meta data support in webdb
NUTCH-25 needs 'character encoding' detector
NUTCH-44 too many search results
NUTCH-42 enhance search.jsp such that it can also returns XML
NUTCH-50 Benchmarks & Performance goals
NUTCH-13 If dns points to 127.0.0.1, the url is also crawled
NUTCH-48 "Did you mean" query enhancement/refinement feature request
+1
NUTCH-47 Configure host filter to do wildcard prefixes - *.redhat.com
NUTCH-45 Log corrupt segments in SegmentMergeTool
NUTCH-26 New Http Authentication mechanism
NUTCH-24 Cannot handle incorrectly cased Content-Type
NUTCH-23 content text/xml parser
NUTCH-18 Windows servers include illegal characters in URLs
NUTCH-16 boost documents matching a url pattern
NUTCH-14 NullPointerException NutchBean.getSummary
NUTCH-12 WebDBReader options to print incoming links