I can't get the Nutch nightly build to work! An error that I was getting
under Cygwin is now also haunting me under BSD. Am I doing something very
wrong? I have tried this from scratch two or three times now, and I still get
this error in my hadoop.log:
Illegal Capacity: -1
In these earlier posts I was trying the nightly build under Cygwin:
http://www.mail-archive.com/[email protected]/msg08955.html
http://www.mail-archive.com/[email protected]/msg08950.html
Now I have installed a Nutch nightly under BSD as follows:
$ cd /usr/tmp2
$ svn co http://svn.apache.org/repos/asf/lucene/nutch/trunk -r \{2007-08-08\}
$ mv trunk nutch-trunk
$ cd nutch-trunk
$ ant clean
$ ant -verbose
set NUTCH_HOME to /usr/tmp2/nutch-trunk
modify conf/nutch-site.xml
modify conf/crawl-urlfilter.txt
modify conf/log4j.properties
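One thing that may be worth checking after editing conf/nutch-site.xml: the hadoop.log further down complains that the advertised agent (currentNutch) is not listed first in 'http.robots.agents'. Below is a rough shell sketch of that kind of check. It is not Nutch's own logic: the helper name agent_listed_first and the example list value are made up for illustration, and comparing by exact match is an assumption.

```shell
# Hypothetical helper: succeeds if the first comma-separated entry of the
# list ($2) is exactly the advertised agent name ($1), ignoring spaces.
agent_listed_first() {
  first=$(printf '%s' "$2" | cut -d, -f1 | tr -d ' ')
  [ "$first" = "$1" ]
}

# Example values: "currentNutch" comes from the log; the list is illustrative.
if agent_listed_first "currentNutch" "currentNutch,*"; then
  echo "agent is listed first"
else
  echo "agent is NOT listed first"
fi
```

Whether this warning has anything to do with the failed job is not clear from the log alone.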
Just to be sure, I also ran this in /usr/tmp2/nutch-trunk:
$ svn up -r HEAD
$ ant clean
$ ant
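Before starting the crawl, it may be worth confirming that NUTCH_HOME really points at the checkout, since a mistyped path is easy to overlook. A small sketch follows. The function name check_nutch_home is hypothetical, and it only assumes that bin/nutch sits under the Nutch home, as in the bin/nutch command used for the crawl.

```shell
# Hypothetical sanity check: succeeds only if the given directory (or
# $NUTCH_HOME when no argument is passed) exists and has an executable bin/nutch.
check_nutch_home() {
  dir=${1:-${NUTCH_HOME:-}}
  if [ -z "$dir" ]; then
    echo "NUTCH_HOME is not set" >&2
    return 1
  fi
  if [ ! -d "$dir" ]; then
    echo "not a directory: $dir" >&2
    return 1
  fi
  if [ ! -x "$dir/bin/nutch" ]; then
    echo "no executable bin/nutch under $dir" >&2
    return 1
  fi
  echo "ok: $dir"
}
```

With the checkout described here, the call would be:
$ check_nutch_home /usr/tmp2/nutch-trunk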
I am unable to do an "intranet" style crawl. Here's what it looks like on the
console:
$ bin/nutch crawl /usr/tmp2/urls.txt -dir /usr/tmp2/100sites -depth 4 -topN 5
crawl started in: /usr/tmp2/100sites
rootUrlDir = /usr/tmp2/urls.txt
threads = 10
depth = 4
topN = 5
Injector: starting
Injector: crawlDb: /usr/tmp2/100sites/crawldb
Injector: urlDir: /usr/tmp2/urls.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: /usr/tmp2/100sites/segments/20070809141119
Generator: filtering: false
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: /usr/tmp2/100sites/segments/20070809141119
Fetcher: threads: 10
fetching http://ec.europa.eu/grants/index_en.htm
fetching http://ec.europa.eu/information_society/media/index_en.htm
fetching http://filmfinancing.org/
fetching http://dedo.delaware.gov/filmoffice/default.shtml
fetching http://filmnanaimo.com/
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:499)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)
Here's hadoop.log:
2007-08-09 13:55:57,428 INFO crawl.Crawl - crawl started in: /usr/tmp2/100sites
2007-08-09 13:55:57,430 INFO crawl.Crawl - rootUrlDir = /usr/tmp2/urls.txt
2007-08-09 13:55:57,431 INFO crawl.Crawl - threads = 10
2007-08-09 13:55:57,431 INFO crawl.Crawl - depth = 4
2007-08-09 13:55:57,431 INFO crawl.Crawl - topN = 5
2007-08-09 13:55:57,542 INFO crawl.Injector - Injector: starting
2007-08-09 13:55:57,542 INFO crawl.Injector - Injector: crawlDb:
/usr/tmp2/100sites/crawldb
2007-08-09 13:55:57,542 INFO crawl.Injector - Injector: urlDir:
/usr/tmp2/urls.txt
2007-08-09 13:55:57,543 INFO crawl.Injector - Injector: Converting injected
urls to crawl db entries.
2007-08-09 13:55:58,613 INFO plugin.PluginRepository - Plugins: looking in:
/usr/tmp2/nutch-trunk/build/plugins
2007-08-09 13:55:58,842 INFO plugin.PluginRepository - Plugin Auto-activation
mode: [true]
2007-08-09 13:55:58,842 INFO plugin.PluginRepository - Registered Plugins:
2007-08-09 13:55:58,842 INFO plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2007-08-09 13:55:58,842 INFO plugin.PluginRepository - Site Query Filter
(query-site)
2007-08-09 13:55:58,842 INFO plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2007-08-09 13:55:58,842 INFO plugin.PluginRepository - Html Parse Plug-in
(parse-html)
2007-08-09 13:55:58,850 INFO plugin.PluginRepository - Pass-through URL
Normalizer (urlnormalizer-pass)
2007-08-09 13:55:58,850 INFO plugin.PluginRepository - Regex URL Filter
Framework (lib-regex-filter)
2007-08-09 13:55:58,850 INFO plugin.PluginRepository - Feed
Parse/Index/Query Plug-in (feed)
2007-08-09 13:55:58,850 INFO plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2007-08-09 13:55:58,850 INFO plugin.PluginRepository - Basic Summarizer
Plug-in (summary-basic)
2007-08-09 13:55:58,850 INFO plugin.PluginRepository - Text Parse Plug-in
(parse-text)
2007-08-09 13:55:58,850 INFO plugin.PluginRepository - JavaScript Parser
(parse-js)
2007-08-09 13:55:58,850 INFO plugin.PluginRepository - Basic Query Filter
(query-basic)
2007-08-09 13:55:58,850 INFO plugin.PluginRepository - Regex URL Filter
(urlfilter-regex)
2007-08-09 13:55:58,850 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2007-08-09 13:55:58,850 INFO plugin.PluginRepository - XML Libraries
(lib-xml)
2007-08-09 13:55:58,850 INFO plugin.PluginRepository - URL Query Filter
(query-url)
2007-08-09 13:55:58,850 INFO plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2007-08-09 13:55:58,850 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2007-08-09 13:55:58,850 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2007-08-09 13:55:58,850 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Registered
Extension-Points:
2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Nutch Summarizer
(org.apache.nutch.searcher.Summarizer)
2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2007-08-09 13:55:58,851 INFO plugin.PluginRepository - HTML Parse Filter
(org.apache.nutch.parse.HtmlParseFilter)
2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Nutch Online Search
Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Ontology Model
Loader (org.apache.nutch.ontology.Ontology)
2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Nutch Query Filter
(org.apache.nutch.searcher.QueryFilter)
2007-08-09 13:55:58,979 WARN regex.RegexURLNormalizer - can't find rules for
scope 'inject', using default
2007-08-09 13:56:00,516 INFO crawl.Injector - Injector: Merging injected urls
into crawl db.
2007-08-09 13:56:01,723 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2007-08-09 13:56:02,640 INFO crawl.Injector - Injector: done
2007-08-09 13:56:03,643 INFO crawl.Generator - Generator: Selecting
best-scoring urls due for fetch.
2007-08-09 13:56:03,644 INFO crawl.Generator - Generator: starting
2007-08-09 13:56:03,644 INFO crawl.Generator - Generator: segment:
/usr/tmp2/100sites/segments/20070809135603
2007-08-09 13:56:03,644 INFO crawl.Generator - Generator: filtering: false
2007-08-09 13:56:03,644 INFO crawl.Generator - Generator: topN: 5
2007-08-09 13:56:03,712 INFO crawl.Generator - Generator: jobtracker is
'local', generating exactly one partition.
2007-08-09 13:56:04,474 INFO plugin.PluginRepository - Plugins: looking in:
/usr/tmp2/nutch-trunk/build/plugins
2007-08-09 13:56:04,654 INFO plugin.PluginRepository - Plugin Auto-activation
mode: [true]
2007-08-09 13:56:04,654 INFO plugin.PluginRepository - Registered Plugins:
2007-08-09 13:56:04,654 INFO plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2007-08-09 13:56:04,654 INFO plugin.PluginRepository - Site Query Filter
(query-site)
2007-08-09 13:56:04,654 INFO plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2007-08-09 13:56:04,654 INFO plugin.PluginRepository - Html Parse Plug-in
(parse-html)
2007-08-09 13:56:04,654 INFO plugin.PluginRepository - Pass-through URL
Normalizer (urlnormalizer-pass)
2007-08-09 13:56:04,654 INFO plugin.PluginRepository - Regex URL Filter
Framework (lib-regex-filter)
2007-08-09 13:56:04,654 INFO plugin.PluginRepository - Feed
Parse/Index/Query Plug-in (feed)
2007-08-09 13:56:04,655 INFO plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2007-08-09 13:56:04,655 INFO plugin.PluginRepository - Basic Summarizer
Plug-in (summary-basic)
2007-08-09 13:56:04,655 INFO plugin.PluginRepository - Text Parse Plug-in
(parse-text)
2007-08-09 13:56:04,655 INFO plugin.PluginRepository - JavaScript Parser
(parse-js)
2007-08-09 13:56:04,655 INFO plugin.PluginRepository - Basic Query Filter
(query-basic)
2007-08-09 13:56:04,655 INFO plugin.PluginRepository - Regex URL Filter
(urlfilter-regex)
2007-08-09 13:56:04,655 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2007-08-09 13:56:04,655 INFO plugin.PluginRepository - XML Libraries
(lib-xml)
2007-08-09 13:56:04,655 INFO plugin.PluginRepository - URL Query Filter
(query-url)
2007-08-09 13:56:04,655 INFO plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2007-08-09 13:56:04,655 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2007-08-09 13:56:04,655 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2007-08-09 13:56:04,655 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2007-08-09 13:56:04,655 INFO plugin.PluginRepository - Registered
Extension-Points:
2007-08-09 13:56:04,655 INFO plugin.PluginRepository - Nutch Summarizer
(org.apache.nutch.searcher.Summarizer)
2007-08-09 13:56:04,655 INFO plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-08-09 13:56:04,655 INFO plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-08-09 13:56:04,655 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-08-09 13:56:04,656 INFO plugin.PluginRepository - Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2007-08-09 13:56:04,656 INFO plugin.PluginRepository - HTML Parse Filter
(org.apache.nutch.parse.HtmlParseFilter)
2007-08-09 13:56:04,656 INFO plugin.PluginRepository - Nutch Online Search
Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-08-09 13:56:04,657 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2007-08-09 13:56:04,657 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2007-08-09 13:56:04,657 INFO plugin.PluginRepository - Ontology Model
Loader (org.apache.nutch.ontology.Ontology)
2007-08-09 13:56:04,657 INFO plugin.PluginRepository - Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-08-09 13:56:04,657 INFO plugin.PluginRepository - Nutch Query Filter
(org.apache.nutch.searcher.QueryFilter)
2007-08-09 13:56:04,703 INFO crawl.FetchScheduleFactory - Using FetchSchedule
impl: org.apache.nutch.crawl.DefaultFetchSchedule
2007-08-09 13:56:04,704 INFO crawl.AbstractFetchSchedule -
defaultInterval=2592000.0
2007-08-09 13:56:04,705 INFO crawl.AbstractFetchSchedule -
maxInterval=7776000.0
2007-08-09 13:56:04,712 WARN regex.RegexURLNormalizer - can't find rules for
scope 'partition', using default
2007-08-09 13:56:05,063 INFO plugin.PluginRepository - Plugins: looking in:
/usr/tmp2/nutch-trunk/build/plugins
2007-08-09 13:56:05,207 INFO plugin.PluginRepository - Plugin Auto-activation
mode: [true]
2007-08-09 13:56:05,207 INFO plugin.PluginRepository - Registered Plugins:
2007-08-09 13:56:05,207 INFO plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2007-08-09 13:56:05,207 INFO plugin.PluginRepository - Site Query Filter
(query-site)
2007-08-09 13:56:05,207 INFO plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2007-08-09 13:56:05,207 INFO plugin.PluginRepository - Html Parse Plug-in
(parse-html)
2007-08-09 13:56:05,207 INFO plugin.PluginRepository - Pass-through URL
Normalizer (urlnormalizer-pass)
2007-08-09 13:56:05,207 INFO plugin.PluginRepository - Regex URL Filter
Framework (lib-regex-filter)
2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Feed
Parse/Index/Query Plug-in (feed)
2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Basic Summarizer
Plug-in (summary-basic)
2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Text Parse Plug-in
(parse-text)
2007-08-09 13:56:05,208 INFO plugin.PluginRepository - JavaScript Parser
(parse-js)
2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Basic Query Filter
(query-basic)
2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Regex URL Filter
(urlfilter-regex)
2007-08-09 13:56:05,208 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2007-08-09 13:56:05,208 INFO plugin.PluginRepository - XML Libraries
(lib-xml)
2007-08-09 13:56:05,208 INFO plugin.PluginRepository - URL Query Filter
(query-url)
2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2007-08-09 13:56:05,208 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2007-08-09 13:56:05,208 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Registered
Extension-Points:
2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Nutch Summarizer
(org.apache.nutch.searcher.Summarizer)
2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-08-09 13:56:05,209 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-08-09 13:56:05,209 INFO plugin.PluginRepository - Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2007-08-09 13:56:05,209 INFO plugin.PluginRepository - HTML Parse Filter
(org.apache.nutch.parse.HtmlParseFilter)
2007-08-09 13:56:05,209 INFO plugin.PluginRepository - Nutch Online Search
Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-08-09 13:56:05,209 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2007-08-09 13:56:05,209 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2007-08-09 13:56:05,209 INFO plugin.PluginRepository - Ontology Model
Loader (org.apache.nutch.ontology.Ontology)
2007-08-09 13:56:05,209 INFO plugin.PluginRepository - Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-08-09 13:56:05,209 INFO plugin.PluginRepository - Nutch Query Filter
(org.apache.nutch.searcher.QueryFilter)
2007-08-09 13:56:05,261 INFO crawl.FetchScheduleFactory - Using FetchSchedule
impl: org.apache.nutch.crawl.DefaultFetchSchedule
2007-08-09 13:56:05,261 INFO crawl.AbstractFetchSchedule -
defaultInterval=2592000.0
2007-08-09 13:56:05,261 INFO crawl.AbstractFetchSchedule -
maxInterval=7776000.0
2007-08-09 13:56:06,425 INFO crawl.Generator - Generator: Partitioning
selected urls by host, for politeness.
2007-08-09 13:56:07,195 INFO plugin.PluginRepository - Plugins: looking in:
/usr/tmp2/nutch-trunk/build/plugins
2007-08-09 13:56:07,335 INFO plugin.PluginRepository - Plugin Auto-activation
mode: [true]
2007-08-09 13:56:07,335 INFO plugin.PluginRepository - Registered Plugins:
2007-08-09 13:56:07,335 INFO plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Site Query Filter
(query-site)
2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Html Parse Plug-in
(parse-html)
2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Pass-through URL
Normalizer (urlnormalizer-pass)
2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Regex URL Filter
Framework (lib-regex-filter)
2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Feed
Parse/Index/Query Plug-in (feed)
2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Basic Summarizer
Plug-in (summary-basic)
2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Text Parse Plug-in
(parse-text)
2007-08-09 13:56:07,336 INFO plugin.PluginRepository - JavaScript Parser
(parse-js)
2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Basic Query Filter
(query-basic)
2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Regex URL Filter
(urlfilter-regex)
2007-08-09 13:56:07,336 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2007-08-09 13:56:07,336 INFO plugin.PluginRepository - XML Libraries
(lib-xml)
2007-08-09 13:56:07,336 INFO plugin.PluginRepository - URL Query Filter
(query-url)
2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2007-08-09 13:56:07,336 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2007-08-09 13:56:07,337 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2007-08-09 13:56:07,337 INFO plugin.PluginRepository - Registered
Extension-Points:
2007-08-09 13:56:07,337 INFO plugin.PluginRepository - Nutch Summarizer
(org.apache.nutch.searcher.Summarizer)
2007-08-09 13:56:07,337 INFO plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-08-09 13:56:07,337 INFO plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-08-09 13:56:07,338 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-08-09 13:56:07,338 INFO plugin.PluginRepository - Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2007-08-09 13:56:07,338 INFO plugin.PluginRepository - HTML Parse Filter
(org.apache.nutch.parse.HtmlParseFilter)
2007-08-09 13:56:07,338 INFO plugin.PluginRepository - Nutch Online Search
Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-08-09 13:56:07,338 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2007-08-09 13:56:07,338 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2007-08-09 13:56:07,338 INFO plugin.PluginRepository - Ontology Model
Loader (org.apache.nutch.ontology.Ontology)
2007-08-09 13:56:07,339 INFO plugin.PluginRepository - Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-08-09 13:56:07,339 INFO plugin.PluginRepository - Nutch Query Filter
(org.apache.nutch.searcher.QueryFilter)
2007-08-09 13:56:07,358 WARN regex.RegexURLNormalizer - can't find rules for
scope 'partition', using default
2007-08-09 13:56:08,184 INFO crawl.Generator - Generator: done.
2007-08-09 13:56:08,184 INFO fetcher.Fetcher - Fetcher: starting
2007-08-09 13:56:08,185 INFO fetcher.Fetcher - Fetcher: segment:
/usr/tmp2/100sites/segments/20070809135603
2007-08-09 13:56:08,988 INFO fetcher.Fetcher - Fetcher: threads: 10
2007-08-09 13:56:08,992 INFO plugin.PluginRepository - Plugins: looking in:
/usr/tmp2/nutch-trunk/build/plugins
2007-08-09 13:56:09,154 INFO plugin.PluginRepository - Plugin Auto-activation
mode: [true]
2007-08-09 13:56:09,154 INFO plugin.PluginRepository - Registered Plugins:
2007-08-09 13:56:09,154 INFO plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2007-08-09 13:56:09,154 INFO plugin.PluginRepository - Site Query Filter
(query-site)
2007-08-09 13:56:09,155 INFO plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2007-08-09 13:56:09,155 INFO plugin.PluginRepository - Html Parse Plug-in
(parse-html)
2007-08-09 13:56:09,155 INFO plugin.PluginRepository - Pass-through URL
Normalizer (urlnormalizer-pass)
2007-08-09 13:56:09,155 INFO plugin.PluginRepository - Regex URL Filter
Framework (lib-regex-filter)
2007-08-09 13:56:09,155 INFO plugin.PluginRepository - Feed
Parse/Index/Query Plug-in (feed)
2007-08-09 13:56:09,155 INFO plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2007-08-09 13:56:09,155 INFO plugin.PluginRepository - Basic Summarizer
Plug-in (summary-basic)
2007-08-09 13:56:09,155 INFO plugin.PluginRepository - Text Parse Plug-in
(parse-text)
2007-08-09 13:56:09,155 INFO plugin.PluginRepository - JavaScript Parser
(parse-js)
2007-08-09 13:56:09,155 INFO plugin.PluginRepository - Basic Query Filter
(query-basic)
2007-08-09 13:56:09,156 INFO plugin.PluginRepository - Regex URL Filter
(urlfilter-regex)
2007-08-09 13:56:09,156 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2007-08-09 13:56:09,156 INFO plugin.PluginRepository - XML Libraries
(lib-xml)
2007-08-09 13:56:09,156 INFO plugin.PluginRepository - URL Query Filter
(query-url)
2007-08-09 13:56:09,156 INFO plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2007-08-09 13:56:09,156 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2007-08-09 13:56:09,156 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2007-08-09 13:56:09,156 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2007-08-09 13:56:09,156 INFO plugin.PluginRepository - Registered
Extension-Points:
2007-08-09 13:56:09,156 INFO plugin.PluginRepository - Nutch Summarizer
(org.apache.nutch.searcher.Summarizer)
2007-08-09 13:56:09,156 INFO plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-08-09 13:56:09,156 INFO plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-08-09 13:56:09,157 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-08-09 13:56:09,157 INFO plugin.PluginRepository - Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2007-08-09 13:56:09,157 INFO plugin.PluginRepository - HTML Parse Filter
(org.apache.nutch.parse.HtmlParseFilter)
2007-08-09 13:56:09,157 INFO plugin.PluginRepository - Nutch Online Search
Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-08-09 13:56:09,157 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2007-08-09 13:56:09,157 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2007-08-09 13:56:09,157 INFO plugin.PluginRepository - Ontology Model
Loader (org.apache.nutch.ontology.Ontology)
2007-08-09 13:56:09,157 INFO plugin.PluginRepository - Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-08-09 13:56:09,157 INFO plugin.PluginRepository - Nutch Query Filter
(org.apache.nutch.searcher.QueryFilter)
2007-08-09 13:56:09,211 INFO fetcher.Fetcher - fetching
http://ec.europa.eu/grants/index_en.htm
2007-08-09 13:56:09,213 INFO fetcher.Fetcher - fetching
http://ec.europa.eu/information_society/media/index_en.htm
2007-08-09 13:56:09,213 INFO fetcher.Fetcher - fetching
http://filmfinancing.org/
2007-08-09 13:56:09,240 FATAL api.RobotRulesParser - Agent we advertise
(currentNutch) not listed first in 'http.robots.agents' property!
2007-08-09 13:56:09,240 INFO http.Http - http.proxy.host = null
2007-08-09 13:56:09,241 INFO http.Http - http.proxy.port = 8080
2007-08-09 13:56:09,241 INFO http.Http - http.timeout = 10000
2007-08-09 13:56:09,244 INFO http.Http - http.content.limit = 65536
2007-08-09 13:56:09,244 INFO http.Http - http.agent =
currentNutch/Nutch-1.0-dev (crawler v0.9 from trunk; http://hopoo.dyndns.org;
kai(underscore)testing(att)yahoo(dotcom))
2007-08-09 13:56:09,244 INFO http.Http - protocol.plugin.check.blocking = true
2007-08-09 13:56:09,244 INFO http.Http - protocol.plugin.check.robots = true
2007-08-09 13:56:09,244 INFO http.Http - fetcher.server.delay = 3000
2007-08-09 13:56:09,245 INFO http.Http - http.max.delays = 100
2007-08-09 13:56:09,212 INFO fetcher.Fetcher - fetching http://filmnanaimo.com/
2007-08-09 13:56:09,213 INFO fetcher.Fetcher - fetching
http://dedo.delaware.gov/filmoffice/default.shtml
2007-08-09 13:56:09,249 FATAL api.RobotRulesParser - Agent we advertise
(currentNutch) not listed first in 'http.robots.agents' property!
2007-08-09 13:56:09,249 INFO http.Http - http.proxy.host = null
2007-08-09 13:56:09,249 INFO http.Http - http.proxy.port = 8080
2007-08-09 13:56:09,249 INFO http.Http - http.timeout = 10000
2007-08-09 13:56:09,249 INFO http.Http - http.content.limit = 65536
2007-08-09 13:56:09,249 INFO http.Http - http.agent =
currentNutch/Nutch-1.0-dev (crawler v0.9 from trunk; http://hopoo.dyndns.org;
kai(underscore)testing(att)yahoo(dotcom))
2007-08-09 13:56:09,249 INFO http.Http - protocol.plugin.check.blocking = true
2007-08-09 13:56:09,249 INFO http.Http - protocol.plugin.check.robots = true
2007-08-09 13:56:09,249 INFO http.Http - fetcher.server.delay = 3000
2007-08-09 13:56:09,258 INFO http.Http - http.max.delays = 100
2007-08-09 13:56:09,259 FATAL api.RobotRulesParser - Agent we advertise
(currentNutch) not listed first in 'http.robots.agents' property!
2007-08-09 13:56:09,259 INFO http.Http - http.proxy.host = null
2007-08-09 13:56:09,259 INFO http.Http - http.proxy.port = 8080
2007-08-09 13:56:09,259 INFO http.Http - http.timeout = 10000
2007-08-09 13:56:09,259 INFO http.Http - http.content.limit = 65536
2007-08-09 13:56:09,259 INFO http.Http - http.agent =
currentNutch/Nutch-1.0-dev (crawler v0.9 from trunk; http://hopoo.dyndns.org;
kai(underscore)testing(att)yahoo(dotcom))
2007-08-09 13:56:09,259 INFO http.Http - protocol.plugin.check.blocking = true
2007-08-09 13:56:09,259 INFO http.Http - protocol.plugin.check.robots = true
2007-08-09 13:56:09,259 INFO http.Http - fetcher.server.delay = 3000
2007-08-09 13:56:09,259 INFO http.Http - http.max.delays = 100
2007-08-09 13:56:10,094 WARN regex.RegexURLNormalizer - can't find rules for
scope 'outlink', using default
2007-08-09 13:56:10,123 INFO crawl.SignatureFactory - Using Signature impl:
org.apache.nutch.crawl.MD5Signature
2007-08-09 13:56:17,661 INFO plugin.PluginRepository - Plugins: looking in:
/usr/tmp2/nutch-trunk/build/plugins
2007-08-09 13:56:17,797 INFO plugin.PluginRepository - Plugin Auto-activation
mode: [true]
2007-08-09 13:56:17,797 INFO plugin.PluginRepository - Registered Plugins:
2007-08-09 13:56:17,797 INFO plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2007-08-09 13:56:17,798 INFO plugin.PluginRepository - Site Query Filter
(query-site)
2007-08-09 13:56:17,798 INFO plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2007-08-09 13:56:17,798 INFO plugin.PluginRepository - Html Parse Plug-in
(parse-html)
2007-08-09 13:56:17,798 INFO plugin.PluginRepository - Pass-through URL
Normalizer (urlnormalizer-pass)
2007-08-09 13:56:17,798 INFO plugin.PluginRepository - Regex URL Filter
Framework (lib-regex-filter)
2007-08-09 13:56:17,798 INFO plugin.PluginRepository - Feed
Parse/Index/Query Plug-in (feed)
2007-08-09 13:56:17,798 INFO plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2007-08-09 13:56:17,798 INFO plugin.PluginRepository - Basic Summarizer
Plug-in (summary-basic)
2007-08-09 13:56:17,798 INFO plugin.PluginRepository - Text Parse Plug-in
(parse-text)
2007-08-09 13:56:17,798 INFO plugin.PluginRepository - JavaScript Parser
(parse-js)
2007-08-09 13:56:17,798 INFO plugin.PluginRepository - Basic Query Filter
(query-basic)
2007-08-09 13:56:17,798 INFO plugin.PluginRepository - Regex URL Filter
(urlfilter-regex)
2007-08-09 13:56:17,798 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2007-08-09 13:56:17,798 INFO plugin.PluginRepository - XML Libraries
(lib-xml)
2007-08-09 13:56:17,798 INFO plugin.PluginRepository - URL Query Filter
(query-url)
2007-08-09 13:56:17,798 INFO plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2007-08-09 13:56:17,799 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2007-08-09 13:56:17,799 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2007-08-09 13:56:17,799 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2007-08-09 13:56:17,799 INFO plugin.PluginRepository - Registered
Extension-Points:
2007-08-09 13:56:17,799 INFO plugin.PluginRepository - Nutch Summarizer
(org.apache.nutch.searcher.Summarizer)
2007-08-09 13:56:17,800 INFO plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-08-09 13:56:17,800 INFO plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-08-09 13:56:17,800 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-08-09 13:56:17,800 INFO plugin.PluginRepository - Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2007-08-09 13:56:17,800 INFO plugin.PluginRepository - HTML Parse Filter
(org.apache.nutch.parse.HtmlParseFilter)
2007-08-09 13:56:17,800 INFO plugin.PluginRepository - Nutch Online Search
Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-08-09 13:56:17,800 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2007-08-09 13:56:17,800 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2007-08-09 13:56:17,800 INFO plugin.PluginRepository - Ontology Model
Loader (org.apache.nutch.ontology.Ontology)
2007-08-09 13:56:17,800 INFO plugin.PluginRepository - Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-08-09 13:56:17,800 INFO plugin.PluginRepository - Nutch Query Filter
(org.apache.nutch.searcher.QueryFilter)
2007-08-09 13:56:18,911 WARN mapred.LocalJobRunner - job_blv6jf
java.lang.IllegalArgumentException: Illegal Capacity: -1
at java.util.ArrayList.<init>(ArrayList.java:111)
at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:149)
at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:94)
at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:311)
at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:41)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:326)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:155)
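Most of the log above is plugin registration output, so for anyone trying to reproduce this, the lines that matter can be pulled out with something like the sketch below. The function name filter_problems is made up. It assumes every log entry starts with a date, a time and then the level, as in the excerpt above, and treats lines without a leading date as continuations of the previous entry.

```shell
# Print only WARN/ERROR/FATAL entries from a log4j-style log, together with
# the continuation lines (wrapped text, stack traces) that follow each one.
# Assumes entries look like: "YYYY-MM-DD HH:MM:SS,mmm LEVEL logger - message".
filter_problems() {
  awk '
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / {
      # A new entry starts here; decide whether to keep it by its level field.
      keep = ($3 == "WARN" || $3 == "ERROR" || $3 == "FATAL")
    }
    keep { print }
  ' "$1"
}
```

Assuming the log sits in logs/hadoop.log under the Nutch directory (the location is an assumption), the call would be:
$ filter_problems logs/hadoop.log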