On 9/6/07, Bolle, Jeffrey F. <[EMAIL PROTECTED]> wrote:
> Has there been any success with this? I am running into the exact same
> problem, but on a Fedora 6 machine.
>
> <switch to after having a quick look at the source>
>
> This is the line that tries to create an arraylist with an initial
> capacity of -1. How inappropriate.
> List<Entry<Text, CrawlDatum>> targets = new ArrayList<Entry<Text, CrawlDatum>>(outlinksToStore);
>
> Oh, outlinksToStore, that is likely related to the
> db.max.outlinks.per.page configuration variable. Funny enough, that
> was set to -1 since I want all outlinks to be processed and, as the
> description says, if the value is >=0 at most db.max.outlinks.per.page
> outlinks will be processed for a page; otherwise, all outlinks will be
> processed.
>
> My crawl appears to be running just fine now, after I set a very large
> value into db.max.outlinks.per.page. Someone should look into fixing
> that.
>
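For the archive, here is a minimal sketch of the failure described above and of one possible guard. It is not the actual Nutch code: the class OutlinkCapacityDemo, the helper outlinksToStore(int, int) and the outlink count of 25 are made up for illustration. The only behaviour taken from the thread is that a negative db.max.outlinks.per.page means "process all outlinks" and that passing -1 to the ArrayList constructor throws.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: names below are hypothetical, not taken from Nutch.
class OutlinkCapacityDemo {

    // A negative limit means "keep every outlink"; otherwise keep at most
    // maxOutlinksPerPage of the outlinks that were actually found.
    static int outlinksToStore(int maxOutlinksPerPage, int outlinkCount) {
        if (maxOutlinksPerPage < 0) {
            return outlinkCount;
        }
        return Math.min(maxOutlinksPerPage, outlinkCount);
    }

    public static void main(String[] args) {
        int maxOutlinksPerPage = -1; // stands in for db.max.outlinks.per.page
        int outlinkCount = 25;       // assumed number of outlinks on a page

        // Using the raw setting as the initial capacity reproduces the error.
        try {
            new ArrayList<String>(maxOutlinksPerPage);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints "Illegal Capacity: -1"
        }

        // Resolving the "no limit" case first always gives a valid capacity.
        List<String> targets =
            new ArrayList<String>(outlinksToStore(maxOutlinksPerPage, outlinkCount));
        System.out.println(targets.size()); // prints "0": capacity is not size
    }
}
```

Until a fix is committed, setting db.max.outlinks.per.page to a large non-negative value, as described above, avoids the exception.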
My bad :). I am going to enter a JIRA and commit a fix for this soon.

> Jeff
>
>
> -----Original Message-----
> From: Kai_testing Middleton [mailto:[EMAIL PROTECTED]
> Sent: Thursday, August 09, 2007 5:33 PM
> To: nutch user
> Subject: nutch nightly: IllegalArgumentException: Illegal Capacity: -1
>
> I can't seem to get the nightly build to work! It looks like an error
> that I was getting under cygwin is also haunting me under BSD. Am I
> doing something very wrong? I have tried this from scratch about two
> or three times now and I still get this error in my hadoop.log:
> Illegal Capacity: -1
>
>
> In these posts I was trying the nightly build with cygwin:
>
> http://www.mail-archive.com/[email protected]/msg08955.html
>
> http://www.mail-archive.com/[email protected]/msg08950.html
>
> Now, I have installed a nutch nightly under BSD as follows:
>
> $ cd /usr/tmp2
> $ svn co http://svn.apache.org/repos/asf/lucene/nutch/trunk -r \{2007-08-08\}
> $ mv trunk nutch-trunk
> $ cd nutch-trunk
> $ ant clean
> $ ant -verbose
> set NUTCH_HOME to /usr/tmp2/nutch_trunk
> modify conf/nutch-site.xml
> modify conf/crawl-urlfilter.txt
> modify conf/log4j.properties
>
> Just to be sure, I also ran this in /usr/tmp2/nutch-trunk:
> $ svn up -r HEAD
> $ ant clean
> $ ant
>
> I am unable to do an "intranet" style crawl. Here's what it looks like
> on the console:
>
> $ bin/nutch crawl /usr/tmp2/urls.txt -dir /usr/tmp2/100sites -depth 4 -topN 5
> crawl started in: /usr/tmp2/100sites
> rootUrlDir = /usr/tmp2/urls.txt
> threads = 10
> depth = 4
> topN = 5
> Injector: starting
> Injector: crawlDb: /usr/tmp2/100sites/crawldb
> Injector: urlDir: /usr/tmp2/urls.txt
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting > Generator: segment: /usr/tmp2/100sites/segments/20070809141119 > Generator: filtering: false > Generator: topN: 5 > Generator: jobtracker is 'local', generating exactly one partition. > Generator: Partitioning selected urls by host, for politeness. > Generator: done. > Fetcher: starting > Fetcher: segment: /usr/tmp2/100sites/segments/20070809141119 > Fetcher: threads: 10 > fetching http://ec.europa.eu/grants/index_en.htm > fetching http://ec.europa.eu/information_society/media/index_en.htm > fetching http://filmfinancing.org/ > fetching http://dedo.delaware.gov/filmoffice/default.shtml > fetching http://filmnanaimo.com/ > Exception in thread "main" java.io.IOException: Job failed! > at > org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604) > at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:499) > at org.apache.nutch.crawl.Crawl.main(Crawl.java:124) > > Here's hadoop.log: > > 2007-08-09 13:55:57,428 INFO crawl.Crawl - crawl started in: > /usr/tmp2/100sites > 2007-08-09 13:55:57,430 INFO crawl.Crawl - rootUrlDir = > /usr/tmp2/urls.txt > 2007-08-09 13:55:57,431 INFO crawl.Crawl - threads = 10 > 2007-08-09 13:55:57,431 INFO crawl.Crawl - depth = 4 > 2007-08-09 13:55:57,431 INFO crawl.Crawl - topN = 5 > 2007-08-09 13:55:57,542 INFO crawl.Injector - Injector: starting > 2007-08-09 13:55:57,542 INFO crawl.Injector - Injector: crawlDb: > /usr/tmp2/100sites/crawldb > 2007-08-09 13:55:57,542 INFO crawl.Injector - Injector: urlDir: > /usr/tmp2/urls.txt > 2007-08-09 13:55:57,543 INFO crawl.Injector - Injector: Converting > injected urls to crawl db entries. 
> 2007-08-09 13:55:58,613 INFO plugin.PluginRepository - Plugins: > looking in: /usr/tmp2/nutch-trunk/build/plugins > 2007-08-09 13:55:58,842 INFO plugin.PluginRepository - Plugin > Auto-activation mode: [true] > 2007-08-09 13:55:58,842 INFO plugin.PluginRepository - Registered > Plugins: > 2007-08-09 13:55:58,842 INFO plugin.PluginRepository - CyberNeko > HTML Parser (lib-nekohtml) > 2007-08-09 13:55:58,842 INFO plugin.PluginRepository - Site Query > Filter (query-site) > 2007-08-09 13:55:58,842 INFO plugin.PluginRepository - Basic URL > Normalizer (urlnormalizer-basic) > 2007-08-09 13:55:58,842 INFO plugin.PluginRepository - Html Parse > Plug-in (parse-html) > 2007-08-09 13:55:58,850 INFO plugin.PluginRepository - > Pass-through URL Normalizer (urlnormalizer-pass) > 2007-08-09 13:55:58,850 INFO plugin.PluginRepository - Regex URL > Filter Framework (lib-regex-filter) > 2007-08-09 13:55:58,850 INFO plugin.PluginRepository - Feed > Parse/Index/Query Plug-in (feed) > 2007-08-09 13:55:58,850 INFO plugin.PluginRepository - Basic > Indexing Filter (index-basic) > 2007-08-09 13:55:58,850 INFO plugin.PluginRepository - Basic > Summarizer Plug-in (summary-basic) > 2007-08-09 13:55:58,850 INFO plugin.PluginRepository - Text Parse > Plug-in (parse-text) > 2007-08-09 13:55:58,850 INFO plugin.PluginRepository - JavaScript > Parser (parse-js) > 2007-08-09 13:55:58,850 INFO plugin.PluginRepository - Basic Query > Filter (query-basic) > 2007-08-09 13:55:58,850 INFO plugin.PluginRepository - Regex URL > Filter (urlfilter-regex) > 2007-08-09 13:55:58,850 INFO plugin.PluginRepository - HTTP > Framework (lib-http) > 2007-08-09 13:55:58,850 INFO plugin.PluginRepository - XML > Libraries (lib-xml) > 2007-08-09 13:55:58,850 INFO plugin.PluginRepository - URL Query > Filter (query-url) > 2007-08-09 13:55:58,850 INFO plugin.PluginRepository - Regex URL > Normalizer (urlnormalizer-regex) > 2007-08-09 13:55:58,850 INFO plugin.PluginRepository - Http > Protocol Plug-in (protocol-http) > 
2007-08-09 13:55:58,850 INFO plugin.PluginRepository - the nutch > core extension points (nutch-extensionpoints) > 2007-08-09 13:55:58,850 INFO plugin.PluginRepository - OPIC > Scoring Plug-in (scoring-opic) > 2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Registered > Extension-Points: > 2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Nutch > Summarizer (org.apache.nutch.searcher.Summarizer) > 2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Nutch > Scoring (org.apache.nutch.scoring.ScoringFilter) > 2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Nutch > Protocol (org.apache.nutch.protocol.Protocol) > 2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Nutch URL > Normalizer (org.apache.nutch.net.URLNormalizer) > 2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Nutch URL > Filter (org.apache.nutch.net.URLFilter) > 2007-08-09 13:55:58,851 INFO plugin.PluginRepository - HTML Parse > Filter (org.apache.nutch.parse.HtmlParseFilter) > 2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Nutch > Online Search Results Clustering Plugin > (org.apache.nutch.clustering.OnlineClusterer) > 2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Nutch > Indexing Filter (org.apache.nutch.indexer.IndexingFilter) > 2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Nutch > Content Parser (org.apache.nutch.parse.Parser) > 2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Ontology > Model Loader (org.apache.nutch.ontology.Ontology) > 2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Nutch > Analysis (org.apache.nutch.analysis.NutchAnalyzer) > 2007-08-09 13:55:58,851 INFO plugin.PluginRepository - Nutch Query > Filter (org.apache.nutch.searcher.QueryFilter) > 2007-08-09 13:55:58,979 WARN regex.RegexURLNormalizer - can't find > rules for scope 'inject', using default > 2007-08-09 13:56:00,516 INFO crawl.Injector - Injector: Merging > injected urls into crawl db. 
> 2007-08-09 13:56:01,723 WARN util.NativeCodeLoader - Unable to load > native-hadoop library for your platform... using builtin-java classes > where applicable > 2007-08-09 13:56:02,640 INFO crawl.Injector - Injector: done > 2007-08-09 13:56:03,643 INFO crawl.Generator - Generator: Selecting > best-scoring urls due for fetch. > 2007-08-09 13:56:03,644 INFO crawl.Generator - Generator: starting > 2007-08-09 13:56:03,644 INFO crawl.Generator - Generator: segment: > /usr/tmp2/100sites/segments/20070809135603 > 2007-08-09 13:56:03,644 INFO crawl.Generator - Generator: filtering: > false > 2007-08-09 13:56:03,644 INFO crawl.Generator - Generator: topN: 5 > 2007-08-09 13:56:03,712 INFO crawl.Generator - Generator: jobtracker > is 'local', generating exactly one partition. > 2007-08-09 13:56:04,474 INFO plugin.PluginRepository - Plugins: > looking in: /usr/tmp2/nutch-trunk/build/plugins > 2007-08-09 13:56:04,654 INFO plugin.PluginRepository - Plugin > Auto-activation mode: [true] > 2007-08-09 13:56:04,654 INFO plugin.PluginRepository - Registered > Plugins: > 2007-08-09 13:56:04,654 INFO plugin.PluginRepository - CyberNeko > HTML Parser (lib-nekohtml) > 2007-08-09 13:56:04,654 INFO plugin.PluginRepository - Site Query > Filter (query-site) > 2007-08-09 13:56:04,654 INFO plugin.PluginRepository - Basic URL > Normalizer (urlnormalizer-basic) > 2007-08-09 13:56:04,654 INFO plugin.PluginRepository - Html Parse > Plug-in (parse-html) > 2007-08-09 13:56:04,654 INFO plugin.PluginRepository - > Pass-through URL Normalizer (urlnormalizer-pass) > 2007-08-09 13:56:04,654 INFO plugin.PluginRepository - Regex URL > Filter Framework (lib-regex-filter) > 2007-08-09 13:56:04,654 INFO plugin.PluginRepository - Feed > Parse/Index/Query Plug-in (feed) > 2007-08-09 13:56:04,655 INFO plugin.PluginRepository - Basic > Indexing Filter (index-basic) > 2007-08-09 13:56:04,655 INFO plugin.PluginRepository - Basic > Summarizer Plug-in (summary-basic) > 2007-08-09 13:56:04,655 INFO 
plugin.PluginRepository - Text Parse > Plug-in (parse-text) > 2007-08-09 13:56:04,655 INFO plugin.PluginRepository - JavaScript > Parser (parse-js) > 2007-08-09 13:56:04,655 INFO plugin.PluginRepository - Basic Query > Filter (query-basic) > 2007-08-09 13:56:04,655 INFO plugin.PluginRepository - Regex URL > Filter (urlfilter-regex) > 2007-08-09 13:56:04,655 INFO plugin.PluginRepository - HTTP > Framework (lib-http) > 2007-08-09 13:56:04,655 INFO plugin.PluginRepository - XML > Libraries (lib-xml) > 2007-08-09 13:56:04,655 INFO plugin.PluginRepository - URL Query > Filter (query-url) > 2007-08-09 13:56:04,655 INFO plugin.PluginRepository - Regex URL > Normalizer (urlnormalizer-regex) > 2007-08-09 13:56:04,655 INFO plugin.PluginRepository - Http > Protocol Plug-in (protocol-http) > 2007-08-09 13:56:04,655 INFO plugin.PluginRepository - the nutch > core extension points (nutch-extensionpoints) > 2007-08-09 13:56:04,655 INFO plugin.PluginRepository - OPIC > Scoring Plug-in (scoring-opic) > 2007-08-09 13:56:04,655 INFO plugin.PluginRepository - Registered > Extension-Points: > 2007-08-09 13:56:04,655 INFO plugin.PluginRepository - Nutch > Summarizer (org.apache.nutch.searcher.Summarizer) > 2007-08-09 13:56:04,655 INFO plugin.PluginRepository - Nutch > Scoring (org.apache.nutch.scoring.ScoringFilter) > 2007-08-09 13:56:04,655 INFO plugin.PluginRepository - Nutch > Protocol (org.apache.nutch.protocol.Protocol) > 2007-08-09 13:56:04,655 INFO plugin.PluginRepository - Nutch URL > Normalizer (org.apache.nutch.net.URLNormalizer) > 2007-08-09 13:56:04,656 INFO plugin.PluginRepository - Nutch URL > Filter (org.apache.nutch.net.URLFilter) > 2007-08-09 13:56:04,656 INFO plugin.PluginRepository - HTML Parse > Filter (org.apache.nutch.parse.HtmlParseFilter) > 2007-08-09 13:56:04,656 INFO plugin.PluginRepository - Nutch > Online Search Results Clustering Plugin > (org.apache.nutch.clustering.OnlineClusterer) > 2007-08-09 13:56:04,657 INFO plugin.PluginRepository - Nutch > Indexing 
Filter (org.apache.nutch.indexer.IndexingFilter) > 2007-08-09 13:56:04,657 INFO plugin.PluginRepository - Nutch > Content Parser (org.apache.nutch.parse.Parser) > 2007-08-09 13:56:04,657 INFO plugin.PluginRepository - Ontology > Model Loader (org.apache.nutch.ontology.Ontology) > 2007-08-09 13:56:04,657 INFO plugin.PluginRepository - Nutch > Analysis (org.apache.nutch.analysis.NutchAnalyzer) > 2007-08-09 13:56:04,657 INFO plugin.PluginRepository - Nutch Query > Filter (org.apache.nutch.searcher.QueryFilter) > 2007-08-09 13:56:04,703 INFO crawl.FetchScheduleFactory - Using > FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule > 2007-08-09 13:56:04,704 INFO crawl.AbstractFetchSchedule - > defaultInterval=2592000.0 > 2007-08-09 13:56:04,705 INFO crawl.AbstractFetchSchedule - > maxInterval=7776000.0 > 2007-08-09 13:56:04,712 WARN regex.RegexURLNormalizer - can't find > rules for scope 'partition', using default > 2007-08-09 13:56:05,063 INFO plugin.PluginRepository - Plugins: > looking in: /usr/tmp2/nutch-trunk/build/plugins > 2007-08-09 13:56:05,207 INFO plugin.PluginRepository - Plugin > Auto-activation mode: [true] > 2007-08-09 13:56:05,207 INFO plugin.PluginRepository - Registered > Plugins: > 2007-08-09 13:56:05,207 INFO plugin.PluginRepository - CyberNeko > HTML Parser (lib-nekohtml) > 2007-08-09 13:56:05,207 INFO plugin.PluginRepository - Site Query > Filter (query-site) > 2007-08-09 13:56:05,207 INFO plugin.PluginRepository - Basic URL > Normalizer (urlnormalizer-basic) > 2007-08-09 13:56:05,207 INFO plugin.PluginRepository - Html Parse > Plug-in (parse-html) > 2007-08-09 13:56:05,207 INFO plugin.PluginRepository - > Pass-through URL Normalizer (urlnormalizer-pass) > 2007-08-09 13:56:05,207 INFO plugin.PluginRepository - Regex URL > Filter Framework (lib-regex-filter) > 2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Feed > Parse/Index/Query Plug-in (feed) > 2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Basic > Indexing Filter 
(index-basic) > 2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Basic > Summarizer Plug-in (summary-basic) > 2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Text Parse > Plug-in (parse-text) > 2007-08-09 13:56:05,208 INFO plugin.PluginRepository - JavaScript > Parser (parse-js) > 2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Basic Query > Filter (query-basic) > 2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Regex URL > Filter (urlfilter-regex) > 2007-08-09 13:56:05,208 INFO plugin.PluginRepository - HTTP > Framework (lib-http) > 2007-08-09 13:56:05,208 INFO plugin.PluginRepository - XML > Libraries (lib-xml) > 2007-08-09 13:56:05,208 INFO plugin.PluginRepository - URL Query > Filter (query-url) > 2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Regex URL > Normalizer (urlnormalizer-regex) > 2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Http > Protocol Plug-in (protocol-http) > 2007-08-09 13:56:05,208 INFO plugin.PluginRepository - the nutch > core extension points (nutch-extensionpoints) > 2007-08-09 13:56:05,208 INFO plugin.PluginRepository - OPIC > Scoring Plug-in (scoring-opic) > 2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Registered > Extension-Points: > 2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Nutch > Summarizer (org.apache.nutch.searcher.Summarizer) > 2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Nutch > Scoring (org.apache.nutch.scoring.ScoringFilter) > 2007-08-09 13:56:05,208 INFO plugin.PluginRepository - Nutch > Protocol (org.apache.nutch.protocol.Protocol) > 2007-08-09 13:56:05,209 INFO plugin.PluginRepository - Nutch URL > Normalizer (org.apache.nutch.net.URLNormalizer) > 2007-08-09 13:56:05,209 INFO plugin.PluginRepository - Nutch URL > Filter (org.apache.nutch.net.URLFilter) > 2007-08-09 13:56:05,209 INFO plugin.PluginRepository - HTML Parse > Filter (org.apache.nutch.parse.HtmlParseFilter) > 2007-08-09 13:56:05,209 INFO plugin.PluginRepository - Nutch > Online Search 
Results Clustering Plugin > (org.apache.nutch.clustering.OnlineClusterer) > 2007-08-09 13:56:05,209 INFO plugin.PluginRepository - Nutch > Indexing Filter (org.apache.nutch.indexer.IndexingFilter) > 2007-08-09 13:56:05,209 INFO plugin.PluginRepository - Nutch > Content Parser (org.apache.nutch.parse.Parser) > 2007-08-09 13:56:05,209 INFO plugin.PluginRepository - Ontology > Model Loader (org.apache.nutch.ontology.Ontology) > 2007-08-09 13:56:05,209 INFO plugin.PluginRepository - Nutch > Analysis (org.apache.nutch.analysis.NutchAnalyzer) > 2007-08-09 13:56:05,209 INFO plugin.PluginRepository - Nutch Query > Filter (org.apache.nutch.searcher.QueryFilter) > 2007-08-09 13:56:05,261 INFO crawl.FetchScheduleFactory - Using > FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule > 2007-08-09 13:56:05,261 INFO crawl.AbstractFetchSchedule - > defaultInterval=2592000.0 > 2007-08-09 13:56:05,261 INFO crawl.AbstractFetchSchedule - > maxInterval=7776000.0 > 2007-08-09 13:56:06,425 INFO crawl.Generator - Generator: Partitioning > selected urls by host, for politeness. 
> 2007-08-09 13:56:07,195 INFO plugin.PluginRepository - Plugins: > looking in: /usr/tmp2/nutch-trunk/build/plugins > 2007-08-09 13:56:07,335 INFO plugin.PluginRepository - Plugin > Auto-activation mode: [true] > 2007-08-09 13:56:07,335 INFO plugin.PluginRepository - Registered > Plugins: > 2007-08-09 13:56:07,335 INFO plugin.PluginRepository - CyberNeko > HTML Parser (lib-nekohtml) > 2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Site Query > Filter (query-site) > 2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Basic URL > Normalizer (urlnormalizer-basic) > 2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Html Parse > Plug-in (parse-html) > 2007-08-09 13:56:07,336 INFO plugin.PluginRepository - > Pass-through URL Normalizer (urlnormalizer-pass) > 2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Regex URL > Filter Framework (lib-regex-filter) > 2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Feed > Parse/Index/Query Plug-in (feed) > 2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Basic > Indexing Filter (index-basic) > 2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Basic > Summarizer Plug-in (summary-basic) > 2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Text Parse > Plug-in (parse-text) > 2007-08-09 13:56:07,336 INFO plugin.PluginRepository - JavaScript > Parser (parse-js) > 2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Basic Query > Filter (query-basic) > 2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Regex URL > Filter (urlfilter-regex) > 2007-08-09 13:56:07,336 INFO plugin.PluginRepository - HTTP > Framework (lib-http) > 2007-08-09 13:56:07,336 INFO plugin.PluginRepository - XML > Libraries (lib-xml) > 2007-08-09 13:56:07,336 INFO plugin.PluginRepository - URL Query > Filter (query-url) > 2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Regex URL > Normalizer (urlnormalizer-regex) > 2007-08-09 13:56:07,336 INFO plugin.PluginRepository - Http > Protocol Plug-in (protocol-http) > 
2007-08-09 13:56:07,336 INFO plugin.PluginRepository - the nutch > core extension points (nutch-extensionpoints) > 2007-08-09 13:56:07,337 INFO plugin.PluginRepository - OPIC > Scoring Plug-in (scoring-opic) > 2007-08-09 13:56:07,337 INFO plugin.PluginRepository - Registered > Extension-Points: > 2007-08-09 13:56:07,337 INFO plugin.PluginRepository - Nutch > Summarizer (org.apache.nutch.searcher.Summarizer) > 2007-08-09 13:56:07,337 INFO plugin.PluginRepository - Nutch > Scoring (org.apache.nutch.scoring.ScoringFilter) > 2007-08-09 13:56:07,337 INFO plugin.PluginRepository - Nutch > Protocol (org.apache.nutch.protocol.Protocol) > 2007-08-09 13:56:07,338 INFO plugin.PluginRepository - Nutch URL > Normalizer (org.apache.nutch.net.URLNormalizer) > 2007-08-09 13:56:07,338 INFO plugin.PluginRepository - Nutch URL > Filter (org.apache.nutch.net.URLFilter) > 2007-08-09 13:56:07,338 INFO plugin.PluginRepository - HTML Parse > Filter (org.apache.nutch.parse.HtmlParseFilter) > 2007-08-09 13:56:07,338 INFO plugin.PluginRepository - Nutch > Online Search Results Clustering Plugin > (org.apache.nutch.clustering.OnlineClusterer) > 2007-08-09 13:56:07,338 INFO plugin.PluginRepository - Nutch > Indexing Filter (org.apache.nutch.indexer.IndexingFilter) > 2007-08-09 13:56:07,338 INFO plugin.PluginRepository - Nutch > Content Parser (org.apache.nutch.parse.Parser) > 2007-08-09 13:56:07,338 INFO plugin.PluginRepository - Ontology > Model Loader (org.apache.nutch.ontology.Ontology) > 2007-08-09 13:56:07,339 INFO plugin.PluginRepository - Nutch > Analysis (org.apache.nutch.analysis.NutchAnalyzer) > 2007-08-09 13:56:07,339 INFO plugin.PluginRepository - Nutch Query > Filter (org.apache.nutch.searcher.QueryFilter) > 2007-08-09 13:56:07,358 WARN regex.RegexURLNormalizer - can't find > rules for scope 'partition', using default > 2007-08-09 13:56:08,184 INFO crawl.Generator - Generator: done. 
> 2007-08-09 13:56:08,184 INFO fetcher.Fetcher - Fetcher: starting > 2007-08-09 13:56:08,185 INFO fetcher.Fetcher - Fetcher: segment: > /usr/tmp2/100sites/segments/20070809135603 > 2007-08-09 13:56:08,988 INFO fetcher.Fetcher - Fetcher: threads: 10 > 2007-08-09 13:56:08,992 INFO plugin.PluginRepository - Plugins: > looking in: /usr/tmp2/nutch-trunk/build/plugins > 2007-08-09 13:56:09,154 INFO plugin.PluginRepository - Plugin > Auto-activation mode: [true] > 2007-08-09 13:56:09,154 INFO plugin.PluginRepository - Registered > Plugins: > 2007-08-09 13:56:09,154 INFO plugin.PluginRepository - CyberNeko > HTML Parser (lib-nekohtml) > 2007-08-09 13:56:09,154 INFO plugin.PluginRepository - Site Query > Filter (query-site) > 2007-08-09 13:56:09,155 INFO plugin.PluginRepository - Basic URL > Normalizer (urlnormalizer-basic) > 2007-08-09 13:56:09,155 INFO plugin.PluginRepository - Html Parse > Plug-in (parse-html) > 2007-08-09 13:56:09,155 INFO plugin.PluginRepository - > Pass-through URL Normalizer (urlnormalizer-pass) > 2007-08-09 13:56:09,155 INFO plugin.PluginRepository - Regex URL > Filter Framework (lib-regex-filter) > 2007-08-09 13:56:09,155 INFO plugin.PluginRepository - Feed > Parse/Index/Query Plug-in (feed) > 2007-08-09 13:56:09,155 INFO plugin.PluginRepository - Basic > Indexing Filter (index-basic) > 2007-08-09 13:56:09,155 INFO plugin.PluginRepository - Basic > Summarizer Plug-in (summary-basic) > 2007-08-09 13:56:09,155 INFO plugin.PluginRepository - Text Parse > Plug-in (parse-text) > 2007-08-09 13:56:09,155 INFO plugin.PluginRepository - JavaScript > Parser (parse-js) > 2007-08-09 13:56:09,155 INFO plugin.PluginRepository - Basic Query > Filter (query-basic) > 2007-08-09 13:56:09,156 INFO plugin.PluginRepository - Regex URL > Filter (urlfilter-regex) > 2007-08-09 13:56:09,156 INFO plugin.PluginRepository - HTTP > Framework (lib-http) > 2007-08-09 13:56:09,156 INFO plugin.PluginRepository - XML > Libraries (lib-xml) > 2007-08-09 13:56:09,156 INFO 
plugin.PluginRepository - URL Query > Filter (query-url) > 2007-08-09 13:56:09,156 INFO plugin.PluginRepository - Regex URL > Normalizer (urlnormalizer-regex) > 2007-08-09 13:56:09,156 INFO plugin.PluginRepository - Http > Protocol Plug-in (protocol-http) > 2007-08-09 13:56:09,156 INFO plugin.PluginRepository - the nutch > core extension points (nutch-extensionpoints) > 2007-08-09 13:56:09,156 INFO plugin.PluginRepository - OPIC > Scoring Plug-in (scoring-opic) > 2007-08-09 13:56:09,156 INFO plugin.PluginRepository - Registered > Extension-Points: > 2007-08-09 13:56:09,156 INFO plugin.PluginRepository - Nutch > Summarizer (org.apache.nutch.searcher.Summarizer) > 2007-08-09 13:56:09,156 INFO plugin.PluginRepository - Nutch > Scoring (org.apache.nutch.scoring.ScoringFilter) > 2007-08-09 13:56:09,156 INFO plugin.PluginRepository - Nutch > Protocol (org.apache.nutch.protocol.Protocol) > 2007-08-09 13:56:09,157 INFO plugin.PluginRepository - Nutch URL > Normalizer (org.apache.nutch.net.URLNormalizer) > 2007-08-09 13:56:09,157 INFO plugin.PluginRepository - Nutch URL > Filter (org.apache.nutch.net.URLFilter) > 2007-08-09 13:56:09,157 INFO plugin.PluginRepository - HTML Parse > Filter (org.apache.nutch.parse.HtmlParseFilter) > 2007-08-09 13:56:09,157 INFO plugin.PluginRepository - Nutch > Online Search Results Clustering Plugin > (org.apache.nutch.clustering.OnlineClusterer) > 2007-08-09 13:56:09,157 INFO plugin.PluginRepository - Nutch > Indexing Filter (org.apache.nutch.indexer.IndexingFilter) > 2007-08-09 13:56:09,157 INFO plugin.PluginRepository - Nutch > Content Parser (org.apache.nutch.parse.Parser) > 2007-08-09 13:56:09,157 INFO plugin.PluginRepository - Ontology > Model Loader (org.apache.nutch.ontology.Ontology) > 2007-08-09 13:56:09,157 INFO plugin.PluginRepository - Nutch > Analysis (org.apache.nutch.analysis.NutchAnalyzer) > 2007-08-09 13:56:09,157 INFO plugin.PluginRepository - Nutch Query > Filter (org.apache.nutch.searcher.QueryFilter) > 2007-08-09 
13:56:09,211 INFO fetcher.Fetcher - fetching > http://ec.europa.eu/grants/index_en.htm > 2007-08-09 13:56:09,213 INFO fetcher.Fetcher - fetching > http://ec.europa.eu/information_society/media/index_en.htm > 2007-08-09 13:56:09,213 INFO fetcher.Fetcher - fetching > http://filmfinancing.org/ > 2007-08-09 13:56:09,240 FATAL api.RobotRulesParser - Agent we advertise > (currentNutch) not listed first in 'http.robots.agents' property! > 2007-08-09 13:56:09,240 INFO http.Http - http.proxy.host = null > 2007-08-09 13:56:09,241 INFO http.Http - http.proxy.port = 8080 > 2007-08-09 13:56:09,241 INFO http.Http - http.timeout = 10000 > 2007-08-09 13:56:09,244 INFO http.Http - http.content.limit = 65536 > 2007-08-09 13:56:09,244 INFO http.Http - http.agent = > currentNutch/Nutch-1.0-dev (crawler v0.9 from trunk; > http://hopoo.dyndns.org; kai(underscore)testing(att)yahoo(dotcom)) > 2007-08-09 13:56:09,244 INFO http.Http - > protocol.plugin.check.blocking = true > 2007-08-09 13:56:09,244 INFO http.Http - protocol.plugin.check.robots > = true > 2007-08-09 13:56:09,244 INFO http.Http - fetcher.server.delay = 3000 > 2007-08-09 13:56:09,245 INFO http.Http - http.max.delays = 100 > 2007-08-09 13:56:09,212 INFO fetcher.Fetcher - fetching > http://filmnanaimo.com/ > 2007-08-09 13:56:09,213 INFO fetcher.Fetcher - fetching > http://dedo.delaware.gov/filmoffice/default.shtml > 2007-08-09 13:56:09,249 FATAL api.RobotRulesParser - Agent we advertise > (currentNutch) not listed first in 'http.robots.agents' property! 
> 2007-08-09 13:56:09,249 INFO http.Http - http.proxy.host = null > 2007-08-09 13:56:09,249 INFO http.Http - http.proxy.port = 8080 > 2007-08-09 13:56:09,249 INFO http.Http - http.timeout = 10000 > 2007-08-09 13:56:09,249 INFO http.Http - http.content.limit = 65536 > 2007-08-09 13:56:09,249 INFO http.Http - http.agent = > currentNutch/Nutch-1.0-dev (crawler v0.9 from trunk; > http://hopoo.dyndns.org; kai(underscore)testing(att)yahoo(dotcom)) > 2007-08-09 13:56:09,249 INFO http.Http - > protocol.plugin.check.blocking = true > 2007-08-09 13:56:09,249 INFO http.Http - protocol.plugin.check.robots > = true > 2007-08-09 13:56:09,249 INFO http.Http - fetcher.server.delay = 3000 > 2007-08-09 13:56:09,258 INFO http.Http - http.max.delays = 100 > 2007-08-09 13:56:09,259 FATAL api.RobotRulesParser - Agent we advertise > (currentNutch) not listed first in 'http.robots.agents' property! > 2007-08-09 13:56:09,259 INFO http.Http - http.proxy.host = null > 2007-08-09 13:56:09,259 INFO http.Http - http.proxy.port = 8080 > 2007-08-09 13:56:09,259 INFO http.Http - http.timeout = 10000 > 2007-08-09 13:56:09,259 INFO http.Http - http.content.limit = 65536 > 2007-08-09 13:56:09,259 INFO http.Http - http.agent = > currentNutch/Nutch-1.0-dev (crawler v0.9 from trunk; > http://hopoo.dyndns.org; kai(underscore)testing(att)yahoo(dotcom)) > 2007-08-09 13:56:09,259 INFO http.Http - > protocol.plugin.check.blocking = true > 2007-08-09 13:56:09,259 INFO http.Http - protocol.plugin.check.robots > = true > 2007-08-09 13:56:09,259 INFO http.Http - fetcher.server.delay = 3000 > 2007-08-09 13:56:09,259 INFO http.Http - http.max.delays = 100 > 2007-08-09 13:56:10,094 WARN regex.RegexURLNormalizer - can't find > rules for scope 'outlink', using default > 2007-08-09 13:56:10,123 INFO crawl.SignatureFactory - Using Signature > impl: org.apache.nutch.crawl.MD5Signature > 2007-08-09 13:56:17,661 INFO plugin.PluginRepository - Plugins: > looking in: /usr/tmp2/nutch-trunk/build/plugins > 2007-08-09 
13:56:17,797 INFO plugin.PluginRepository - Plugin > Auto-activation mode: [true] > 2007-08-09 13:56:17,797 INFO plugin.PluginRepository - Registered > Plugins: > 2007-08-09 13:56:17,797 INFO plugin.PluginRepository - CyberNeko > HTML Parser (lib-nekohtml) > 2007-08-09 13:56:17,798 INFO plugin.PluginRepository - Site Query > Filter (query-site) > 2007-08-09 13:56:17,798 INFO plugin.PluginRepository - Basic URL > Normalizer (urlnormalizer-basic) > 2007-08-09 13:56:17,798 INFO plugin.PluginRepository - Html Parse > Plug-in (parse-html) > 2007-08-09 13:56:17,798 INFO plugin.PluginRepository - > Pass-through URL Normalizer (urlnormalizer-pass) > 2007-08-09 13:56:17,798 INFO plugin.PluginRepository - Regex URL > Filter Framework (lib-regex-filter) > 2007-08-09 13:56:17,798 INFO plugin.PluginRepository - Feed > Parse/Index/Query Plug-in (feed) > 2007-08-09 13:56:17,798 INFO plugin.PluginRepository - Basic > Indexing Filter (index-basic) > 2007-08-09 13:56:17,798 INFO plugin.PluginRepository - Basic > Summarizer Plug-in (summary-basic) > 2007-08-09 13:56:17,798 INFO plugin.PluginRepository - Text Parse > Plug-in (parse-text) > 2007-08-09 13:56:17,798 INFO plugin.PluginRepository - JavaScript > Parser (parse-js) > 2007-08-09 13:56:17,798 INFO plugin.PluginRepository - Basic Query > Filter (query-basic) > 2007-08-09 13:56:17,798 INFO plugin.PluginRepository - Regex URL > Filter (urlfilter-regex) > 2007-08-09 13:56:17,798 INFO plugin.PluginRepository - HTTP > Framework (lib-http) > 2007-08-09 13:56:17,798 INFO plugin.PluginRepository - XML > Libraries (lib-xml) > 2007-08-09 13:56:17,798 INFO plugin.PluginRepository - URL Query > Filter (query-url) > 2007-08-09 13:56:17,798 INFO plugin.PluginRepository - Regex URL > Normalizer (urlnormalizer-regex) > 2007-08-09 13:56:17,799 INFO plugin.PluginRepository - Http > Protocol Plug-in (protocol-http) > 2007-08-09 13:56:17,799 INFO plugin.PluginRepository - the nutch > core extension points (nutch-extensionpoints) > 2007-08-09 
13:56:17,799 INFO plugin.PluginRepository - OPIC > Scoring Plug-in (scoring-opic) > 2007-08-09 13:56:17,799 INFO plugin.PluginRepository - Registered > Extension-Points: > 2007-08-09 13:56:17,799 INFO plugin.PluginRepository - Nutch > Summarizer (org.apache.nutch.searcher.Summarizer) > 2007-08-09 13:56:17,800 INFO plugin.PluginRepository - Nutch > Scoring (org.apache.nutch.scoring.ScoringFilter) > 2007-08-09 13:56:17,800 INFO plugin.PluginRepository - Nutch > Protocol (org.apache.nutch.protocol.Protocol) > 2007-08-09 13:56:17,800 INFO plugin.PluginRepository - Nutch URL > Normalizer (org.apache.nutch.net.URLNormalizer) > 2007-08-09 13:56:17,800 INFO plugin.PluginRepository - Nutch URL > Filter (org.apache.nutch.net.URLFilter) > 2007-08-09 13:56:17,800 INFO plugin.PluginRepository - HTML Parse > Filter (org.apache.nutch.parse.HtmlParseFilter) > 2007-08-09 13:56:17,800 INFO plugin.PluginRepository - Nutch > Online Search Results Clustering Plugin > (org.apache.nutch.clustering.OnlineClusterer) > 2007-08-09 13:56:17,800 INFO plugin.PluginRepository - Nutch > Indexing Filter (org.apache.nutch.indexer.IndexingFilter) > 2007-08-09 13:56:17,800 INFO plugin.PluginRepository - Nutch > Content Parser (org.apache.nutch.parse.Parser) > 2007-08-09 13:56:17,800 INFO plugin.PluginRepository - Ontology > Model Loader (org.apache.nutch.ontology.Ontology) > 2007-08-09 13:56:17,800 INFO plugin.PluginRepository - Nutch > Analysis (org.apache.nutch.analysis.NutchAnalyzer) > 2007-08-09 13:56:17,800 INFO plugin.PluginRepository - Nutch Query > Filter (org.apache.nutch.searcher.QueryFilter) > 2007-08-09 13:56:18,911 WARN mapred.LocalJobRunner - job_blv6jf > java.lang.IllegalArgumentException: Illegal Capacity: -1 > at java.util.ArrayList.<init>(ArrayList.java:111) > at > org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java > :149) > at > org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputForma > t.java:94) > at > 
org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:311)
>     at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:41)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:326)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:155)
>

--
Doğacan Güney
