I really wonder whether this is some kind of Nutch + Cygwin problem. Check this out: I changed the paths to Windows-style paths instead of the Cygwin-mounted ones (maybe the /cygdrive/c mount point is the problem). Note that I use forward slashes in the Windows-style paths. I no longer get the "Input path doesnt exist" error, though the crawl still fails.
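For reference, the path rewrite I did by hand can be sketched as a small shell function. This is only a sketch: `to_mixed_path` is a made-up name, and it assumes the default /cygdrive/<letter> mount prefix. Inside Cygwin itself, `cygpath -m` should give the same kind of forward-slash Windows path.

```shell
# Convert /cygdrive/<letter>/rest to <letter>:/rest (forward slashes kept).
# Hypothetical helper; assumes the default /cygdrive mount prefix.
# Paths that do not start with /cygdrive/<letter> are printed unchanged.
to_mixed_path() {
    printf '%s\n' "$1" | sed -E 's|^/cygdrive/([A-Za-z])(/.*)?$|\1:\2|'
}
```

It would be used along the lines of `nutch crawl "$(to_mixed_path /cygdrive/c/nutch-2007-07-26_04-01-20/content/urls.txt)" -dir ... -depth 3 -topN 200`.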
[EMAIL PROTECTED] /cygdrive/c/nutch-2007-07-26_04-01-20/logs
$ nutch crawl C:/nutch-2007-07-26_04-01-20/content/urls.txt -dir c:/nutch-2007-07-26_04-01-20/content/sf911truth -depth 3 -topN 200
crawl started in: c:/nutch-2007-07-26_04-01-20/content/sf911truth
rootUrlDir = C:/nutch-2007-07-26_04-01-20/content/urls.txt
threads = 10
depth = 3
topN = 200
Injector: starting
Injector: crawlDb: c:/nutch-2007-07-26_04-01-20/content/sf911truth/crawldb
Injector: urlDir: C:/nutch-2007-07-26_04-01-20/content/urls.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: c:/nutch-2007-07-26_04-01-20/content/sf911truth/segments/20070727003008
Generator: filtering: false
Generator: topN: 200
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: c:/nutch-2007-07-26_04-01-20/content/sf911truth/segments/20070727003008
Fetcher: threads: 10
fetching http://www.sf911truth.org/
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:499)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)

[EMAIL PROTECTED] /cygdrive/c/nutch-2007-07-26_04-01-20/logs
$ cat hadoop.log
2007-07-27 00:30:03,171 INFO crawl.Crawl - crawl started in: c:/nutch-2007-07-26_04-01-20/content/sf911truth
2007-07-27 00:30:03,187 INFO crawl.Crawl - rootUrlDir = C:/nutch-2007-07-26_04-01-20/content/urls.txt
2007-07-27 00:30:03,187 INFO crawl.Crawl - threads = 10
2007-07-27 00:30:03,187 INFO crawl.Crawl - depth = 3
2007-07-27 00:30:03,187 INFO crawl.Crawl - topN = 200
2007-07-27 00:30:03,281 INFO crawl.Injector - Injector: starting
2007-07-27 00:30:03,281 INFO crawl.Injector - Injector: crawlDb: c:/nutch-2007-07-26_04-01-20/content/sf911truth/crawldb
2007-07-27 00:30:03,281 INFO crawl.Injector - Injector: urlDir: C:/nutch-2007-07-26_04-01-20/content/urls.txt
2007-07-27 00:30:03,296 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2007-07-27 00:30:04,031 INFO plugin.PluginRepository - Plugins: looking in: C:\nutch-2007-07-26_04-01-20\plugins
2007-07-27 00:30:04,296 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2007-07-27 00:30:04,296 INFO plugin.PluginRepository - Registered Plugins:
2007-07-27 00:30:04,296 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2007-07-27 00:30:04,296 INFO plugin.PluginRepository - Site Query Filter (query-site)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Feed Parse/Index/Query Plug-in (feed)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - XML Libraries (lib-xml)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - URL Query Filter (query-url)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Registered Extension-Points:
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2007-07-27 00:30:04,375 WARN regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2007-07-27 00:30:06,046 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
2007-07-27 00:30:06,640 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2007-07-27 00:30:07,500 INFO crawl.Injector - Injector: done
2007-07-27 00:30:08,500 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2007-07-27 00:30:08,500 INFO crawl.Generator - Generator: starting
2007-07-27 00:30:08,500 INFO crawl.Generator - Generator: segment: c:/nutch-2007-07-26_04-01-20/content/sf911truth/segments/20070727003008
2007-07-27 00:30:08,500 INFO crawl.Generator - Generator: filtering: false
2007-07-27 00:30:08,500 INFO crawl.Generator - Generator: topN: 200
2007-07-27 00:30:08,531 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
2007-07-27 00:30:08,984 INFO plugin.PluginRepository - Plugins: looking in: C:\nutch-2007-07-26_04-01-20\plugins
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Registered Plugins:
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Site Query Filter (query-site)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Feed Parse/Index/Query Plug-in (feed)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - XML Libraries (lib-xml)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - URL Query Filter (query-url)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Registered Extension-Points:
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2007-07-27 00:30:09,218 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2007-07-27 00:30:09,218 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000.0
2007-07-27 00:30:09,218 INFO crawl.AbstractFetchSchedule - maxInterval=7776000.0
2007-07-27 00:30:09,234 WARN regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
2007-07-27 00:30:09,296 INFO plugin.PluginRepository - Plugins: looking in: C:\nutch-2007-07-26_04-01-20\plugins
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Registered Plugins:
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Site Query Filter (query-site)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Feed Parse/Index/Query Plug-in (feed)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - XML Libraries (lib-xml)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - URL Query Filter (query-url)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Registered Extension-Points:
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2007-07-27 00:30:09,500 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2007-07-27 00:30:09,500 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000.0
2007-07-27 00:30:09,500 INFO crawl.AbstractFetchSchedule - maxInterval=7776000.0
2007-07-27 00:30:10,187 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
2007-07-27 00:30:10,687 INFO plugin.PluginRepository - Plugins: looking in: C:\nutch-2007-07-26_04-01-20\plugins
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Registered Plugins:
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Site Query Filter (query-site)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Feed Parse/Index/Query Plug-in (feed)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - XML Libraries (lib-xml)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - URL Query Filter (query-url)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Registered Extension-Points:
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2007-07-27 00:30:10,875 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-07-27 00:30:10,875 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-07-27 00:30:10,875 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2007-07-27 00:30:10,875 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2007-07-27 00:30:10,875 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2007-07-27 00:30:10,875 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-07-27 00:30:10,875 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2007-07-27 00:30:10,890 WARN regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
2007-07-27 00:30:11,625 INFO crawl.Generator - Generator: done.
2007-07-27 00:30:11,625 INFO fetcher.Fetcher - Fetcher: starting
2007-07-27 00:30:11,625 INFO fetcher.Fetcher - Fetcher: segment: c:/nutch-2007-07-26_04-01-20/content/sf911truth/segments/20070727003008
2007-07-27 00:30:12,078 INFO fetcher.Fetcher - Fetcher: threads: 10
2007-07-27 00:30:12,093 INFO plugin.PluginRepository - Plugins: looking in: C:\nutch-2007-07-26_04-01-20\plugins
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Registered Plugins:
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Site Query Filter (query-site)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Feed Parse/Index/Query Plug-in (feed)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - XML Libraries (lib-xml)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - URL Query Filter (query-url)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Registered Extension-Points:
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2007-07-27 00:30:12,234 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2007-07-27 00:30:12,234 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-07-27 00:30:12,234 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2007-07-27 00:30:12,265 INFO fetcher.Fetcher - fetching http://www.sf911truth.org/
2007-07-27 00:30:12,312 FATAL api.RobotRulesParser - Agent we advertise (microlith-nutch) not listed first in 'http.robots.agents' property!
2007-07-27 00:30:12,312 INFO http.Http - http.proxy.host = null
2007-07-27 00:30:12,312 INFO http.Http - http.proxy.port = 8080
2007-07-27 00:30:12,312 INFO http.Http - http.timeout = 10000
2007-07-27 00:30:12,312 INFO http.Http - http.content.limit = 65536
2007-07-27 00:30:12,312 INFO http.Http - http.agent = microlith-nutch/Nutch-1.0-dev (crawler nutch-2007-07-26_04-01-20; http://hopoo.dyndns.org; kai(underscore)testing(att)yahoo(dotcom))
2007-07-27 00:30:12,312 INFO http.Http - protocol.plugin.check.blocking = true
2007-07-27 00:30:12,312 INFO http.Http - protocol.plugin.check.robots = true
2007-07-27 00:30:12,312 INFO http.Http - fetcher.server.delay = 3000
2007-07-27 00:30:12,312 INFO http.Http - http.max.delays = 100
2007-07-27 00:30:13,578 WARN regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
2007-07-27 00:30:13,640 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2007-07-27 00:30:14,406 INFO plugin.PluginRepository - Plugins: looking in: C:\nutch-2007-07-26_04-01-20\plugins
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Registered Plugins:
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Site Query Filter (query-site)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Feed Parse/Index/Query Plug-in (feed)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - XML Libraries (lib-xml)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - URL Query Filter (query-url)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Registered Extension-Points:
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2007-07-27 00:30:14,718 WARN mapred.LocalJobRunner - job_8r2j8
java.lang.IllegalArgumentException: Illegal Capacity: -1
        at java.util.ArrayList.<init>(ArrayList.java:111)
        at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:149)
        at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:94)
        at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:311)
        at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:41)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:326)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:155)
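Since most of hadoop.log above is the plugin list repeated, here is a rough way to pull out only the WARN/ERROR/FATAL entries. It is a sketch: `log_problems` is a made-up helper name, and it assumes each entry starts with a timestamp followed by the level, as in the log above. Stack-trace lines such as the "Illegal Capacity: -1" trace do not start with a timestamp, so they are not printed.

```shell
# Print only WARN/ERROR/FATAL entries from a log4j-style log file.
# Hypothetical helper; assumes lines look like
#   2007-07-27 00:30:12,312 FATAL api.RobotRulesParser - ...
log_problems() {
    grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9:,]+ (WARN|ERROR|FATAL) ' "$1"
}
```

Usage would be `log_problems hadoop.log` from the logs directory.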
