I am sending my Hadoop log file. I have also applied patch559V0.5. At fetch time I get these messages:
---------------------------------------------------------
Fetcher: starting
Fetcher: segment: crawl/segments/20080103125023
Fetcher: threads: 10
fetching http://www.w3schools.com/
http.proxy.host = netmon.iitb.ac.in
http.proxy.port = 80
http.timeout = 100000
http.content.limit = 65536
http.agent = digi/Nutch-0.9 (digvijay; http://www.google.com; [EMAIL PROTECTED])
protocol.plugin.check.blocking = true
protocol.plugin.check.robots = true
fetcher.server.delay = 5000
http.max.delays = 100
Configured Client
fetch of http://www.w3schools.com/ failed with: Http code=407, url= http://www.w3schools.com/
Fetcher: done
----------------------------------------------------------------------------
2008-01-03 12:50:04,275 INFO crawl.Injector - Injector: starting 2008-01-03 12:50:04,347 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb 2008-01-03 12:50:04,347 INFO crawl.Injector - Injector: urlDir: urls 2008-01-03 12:50:04,895 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries. 2008-01-03 12:50:11,140 INFO plugin.PluginRepository - Plugins: looking in: /home/digvijay/Nutch/nutch-0.9/plugins 2008-01-03 12:50:12,171 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true] 2008-01-03 12:50:12,171 INFO plugin.PluginRepository - Registered Plugins: 2008-01-03 12:50:12,171 INFO plugin.PluginRepository - Pdf Parse Plug-in (parse-pdf) 2008-01-03 12:50:12,171 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient) 2008-01-03 12:50:12,171 INFO plugin.PluginRepository - Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi) 2008-01-03 12:50:12,171 INFO plugin.PluginRepository - HTTP Framework (lib-http) 2008-01-03 12:50:12,171 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex) 2008-01-03 12:50:12,171 INFO plugin.PluginRepository - MSWord Parse Plug-in (parse-msword) 2008-01-03 12:50:12,171 INFO plugin.PluginRepository - XML Libraries (lib-xml) 2008-01-03 12:50:12,172 INFO plugin.PluginRepository - MSExcel Parse Plug-in (parse-msexcel) 2008-01-03 12:50:12,172 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic) 2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Zip Parse Plug-in (parse-zip) 2008-01-03 12:50:12,172 INFO plugin.PluginRepository - URL Query Filter (query-url) 2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Parse MS Documents Framework (lib-parsems) 2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter) 2008-01-03 12:50:12,172 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints) 2008-01-03 12:50:12,172 INFO plugin.PluginRepository - MSPowerPoint Parse 
Plug-in (parse-mspowerpoint) 2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Basic Query Filter (query-basic) 2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic) 2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html) 2008-01-03 12:50:12,172 INFO plugin.PluginRepository - RSS Parse Plug-in (parse-rss) 2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic) 2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic) 2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Site Query Filter (query-site) 2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text) 2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass) 2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex) 2008-01-03 12:50:12,172 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml) 2008-01-03 12:50:12,172 INFO plugin.PluginRepository - OpenOffice/OpenDocument Parse Plug-in (parse-oo) 2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Log4j (lib-log4j) 2008-01-03 12:50:12,172 INFO plugin.PluginRepository - JavaScript Parser (parse-js) 2008-01-03 12:50:12,173 INFO plugin.PluginRepository - SWF Parse Plug-in (parse-swf) 2008-01-03 12:50:12,173 INFO plugin.PluginRepository - Registered Extension-Points: 2008-01-03 12:50:12,173 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer) 2008-01-03 12:50:12,173 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer) 2008-01-03 12:50:12,173 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol) 2008-01-03 12:50:12,173 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer) 2008-01-03 12:50:12,173 INFO plugin.PluginRepository - Nutch URL Filter 
(org.apache.nutch.net.URLFilter) 2008-01-03 12:50:12,173 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter) 2008-01-03 12:50:12,174 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer) 2008-01-03 12:50:12,174 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter) 2008-01-03 12:50:12,174 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser) 2008-01-03 12:50:12,174 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter) 2008-01-03 12:50:12,174 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter) 2008-01-03 12:50:12,174 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology) 2008-01-03 12:50:12,381 WARN regex.RegexURLNormalizer - can't find rules for scope 'inject', using default 2008-01-03 12:50:13,048 INFO crawl.Injector - Injector: Merging injected urls into crawl db. 2008-01-03 12:50:19,684 INFO crawl.Injector - Injector: done 2008-01-03 12:50:23,951 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch. 2008-01-03 12:50:23,952 INFO crawl.Generator - Generator: starting 2008-01-03 12:50:23,952 INFO crawl.Generator - Generator: segment: crawl/segments/20080103125023 2008-01-03 12:50:23,952 INFO crawl.Generator - Generator: filtering: true 2008-01-03 12:50:23,999 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition. 
2008-01-03 12:50:25,101 INFO plugin.PluginRepository - Plugins: looking in: /home/digvijay/Nutch/nutch-0.9/plugins 2008-01-03 12:50:25,251 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true] 2008-01-03 12:50:25,251 INFO plugin.PluginRepository - Registered Plugins: 2008-01-03 12:50:25,251 INFO plugin.PluginRepository - Pdf Parse Plug-in (parse-pdf) 2008-01-03 12:50:25,251 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient) 2008-01-03 12:50:25,251 INFO plugin.PluginRepository - Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi) 2008-01-03 12:50:25,251 INFO plugin.PluginRepository - HTTP Framework (lib-http) 2008-01-03 12:50:25,251 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex) 2008-01-03 12:50:25,251 INFO plugin.PluginRepository - MSWord Parse Plug-in (parse-msword) 2008-01-03 12:50:25,251 INFO plugin.PluginRepository - XML Libraries (lib-xml) 2008-01-03 12:50:25,251 INFO plugin.PluginRepository - MSExcel Parse Plug-in (parse-msexcel) 2008-01-03 12:50:25,251 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic) 2008-01-03 12:50:25,251 INFO plugin.PluginRepository - Zip Parse Plug-in (parse-zip) 2008-01-03 12:50:25,251 INFO plugin.PluginRepository - URL Query Filter (query-url) 2008-01-03 12:50:25,251 INFO plugin.PluginRepository - Parse MS Documents Framework (lib-parsems) 2008-01-03 12:50:25,251 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter) 2008-01-03 12:50:25,251 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints) 2008-01-03 12:50:25,251 INFO plugin.PluginRepository - MSPowerPoint Parse Plug-in (parse-mspowerpoint) 2008-01-03 12:50:25,252 INFO plugin.PluginRepository - Basic Query Filter (query-basic) 2008-01-03 12:50:25,252 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic) 2008-01-03 12:50:25,252 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html) 2008-01-03 
12:50:25,252 INFO plugin.PluginRepository - RSS Parse Plug-in (parse-rss) 2008-01-03 12:50:25,252 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic) 2008-01-03 12:50:25,252 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic) 2008-01-03 12:50:25,252 INFO plugin.PluginRepository - Site Query Filter (query-site) 2008-01-03 12:50:25,252 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text) 2008-01-03 12:50:25,252 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass) 2008-01-03 12:50:25,252 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex) 2008-01-03 12:50:25,252 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml) 2008-01-03 12:50:25,252 INFO plugin.PluginRepository - OpenOffice/OpenDocument Parse Plug-in (parse-oo) 2008-01-03 12:50:25,252 INFO plugin.PluginRepository - Log4j (lib-log4j) 2008-01-03 12:50:25,252 INFO plugin.PluginRepository - JavaScript Parser (parse-js) 2008-01-03 12:50:25,252 INFO plugin.PluginRepository - SWF Parse Plug-in (parse-swf) 2008-01-03 12:50:25,252 INFO plugin.PluginRepository - Registered Extension-Points: 2008-01-03 12:50:25,252 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer) 2008-01-03 12:50:25,252 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer) 2008-01-03 12:50:25,253 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol) 2008-01-03 12:50:25,253 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer) 2008-01-03 12:50:25,253 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter) 2008-01-03 12:50:25,253 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter) 2008-01-03 12:50:25,253 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer) 2008-01-03 12:50:25,253 
INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter) 2008-01-03 12:50:25,253 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser) 2008-01-03 12:50:25,253 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter) 2008-01-03 12:50:25,253 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter) 2008-01-03 12:50:25,253 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology) 2008-01-03 12:50:25,354 WARN regex.RegexURLNormalizer - can't find rules for scope 'partition', using default 2008-01-03 12:50:25,409 INFO plugin.PluginRepository - Plugins: looking in: /home/digvijay/Nutch/nutch-0.9/plugins 2008-01-03 12:50:25,544 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true] 2008-01-03 12:50:25,544 INFO plugin.PluginRepository - Registered Plugins: 2008-01-03 12:50:25,544 INFO plugin.PluginRepository - Pdf Parse Plug-in (parse-pdf) 2008-01-03 12:50:25,544 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient) 2008-01-03 12:50:25,544 INFO plugin.PluginRepository - Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi) 2008-01-03 12:50:25,544 INFO plugin.PluginRepository - HTTP Framework (lib-http) 2008-01-03 12:50:25,544 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex) 2008-01-03 12:50:25,544 INFO plugin.PluginRepository - MSWord Parse Plug-in (parse-msword) 2008-01-03 12:50:25,544 INFO plugin.PluginRepository - XML Libraries (lib-xml) 2008-01-03 12:50:25,544 INFO plugin.PluginRepository - MSExcel Parse Plug-in (parse-msexcel) 2008-01-03 12:50:25,544 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic) 2008-01-03 12:50:25,545 INFO plugin.PluginRepository - Zip Parse Plug-in (parse-zip) 2008-01-03 12:50:25,545 INFO plugin.PluginRepository - URL Query Filter (query-url) 2008-01-03 12:50:25,545 INFO plugin.PluginRepository - 
Parse MS Documents Framework (lib-parsems) 2008-01-03 12:50:25,545 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter) 2008-01-03 12:50:25,545 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints) 2008-01-03 12:50:25,545 INFO plugin.PluginRepository - MSPowerPoint Parse Plug-in (parse-mspowerpoint) 2008-01-03 12:50:25,545 INFO plugin.PluginRepository - Basic Query Filter (query-basic) 2008-01-03 12:50:25,545 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic) 2008-01-03 12:50:25,545 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html) 2008-01-03 12:50:25,545 INFO plugin.PluginRepository - RSS Parse Plug-in (parse-rss) 2008-01-03 12:50:25,545 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic) 2008-01-03 12:50:25,545 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic) 2008-01-03 12:50:25,545 INFO plugin.PluginRepository - Site Query Filter (query-site) 2008-01-03 12:50:25,545 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text) 2008-01-03 12:50:25,545 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass) 2008-01-03 12:50:25,546 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex) 2008-01-03 12:50:25,546 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml) 2008-01-03 12:50:25,546 INFO plugin.PluginRepository - OpenOffice/OpenDocument Parse Plug-in (parse-oo) 2008-01-03 12:50:25,546 INFO plugin.PluginRepository - Log4j (lib-log4j) 2008-01-03 12:50:25,546 INFO plugin.PluginRepository - JavaScript Parser (parse-js) 2008-01-03 12:50:25,546 INFO plugin.PluginRepository - SWF Parse Plug-in (parse-swf) 2008-01-03 12:50:25,546 INFO plugin.PluginRepository - Registered Extension-Points: 2008-01-03 12:50:25,546 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer) 2008-01-03 12:50:25,546 INFO plugin.PluginRepository - Nutch URL Normalizer 
(org.apache.nutch.net.URLNormalizer) 2008-01-03 12:50:25,546 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol) 2008-01-03 12:50:25,546 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer) 2008-01-03 12:50:25,546 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter) 2008-01-03 12:50:25,546 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter) 2008-01-03 12:50:25,546 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer) 2008-01-03 12:50:25,546 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter) 2008-01-03 12:50:25,546 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser) 2008-01-03 12:50:25,546 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter) 2008-01-03 12:50:25,547 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter) 2008-01-03 12:50:25,547 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology) 2008-01-03 12:50:26,043 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness. 
2008-01-03 12:50:26,327 INFO plugin.PluginRepository - Plugins: looking in: /home/digvijay/Nutch/nutch-0.9/plugins 2008-01-03 12:50:26,428 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true] 2008-01-03 12:50:26,428 INFO plugin.PluginRepository - Registered Plugins: 2008-01-03 12:50:26,429 INFO plugin.PluginRepository - Pdf Parse Plug-in (parse-pdf) 2008-01-03 12:50:26,429 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient) 2008-01-03 12:50:26,429 INFO plugin.PluginRepository - Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi) 2008-01-03 12:50:26,429 INFO plugin.PluginRepository - HTTP Framework (lib-http) 2008-01-03 12:50:26,429 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex) 2008-01-03 12:50:26,429 INFO plugin.PluginRepository - MSWord Parse Plug-in (parse-msword) 2008-01-03 12:50:26,429 INFO plugin.PluginRepository - XML Libraries (lib-xml) 2008-01-03 12:50:26,429 INFO plugin.PluginRepository - MSExcel Parse Plug-in (parse-msexcel) 2008-01-03 12:50:26,429 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic) 2008-01-03 12:50:26,429 INFO plugin.PluginRepository - Zip Parse Plug-in (parse-zip) 2008-01-03 12:50:26,429 INFO plugin.PluginRepository - URL Query Filter (query-url) 2008-01-03 12:50:26,429 INFO plugin.PluginRepository - Parse MS Documents Framework (lib-parsems) 2008-01-03 12:50:26,429 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter) 2008-01-03 12:50:26,429 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints) 2008-01-03 12:50:26,429 INFO plugin.PluginRepository - MSPowerPoint Parse Plug-in (parse-mspowerpoint) 2008-01-03 12:50:26,429 INFO plugin.PluginRepository - Basic Query Filter (query-basic) 2008-01-03 12:50:26,429 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic) 2008-01-03 12:50:26,429 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html) 2008-01-03 
12:50:26,429 INFO plugin.PluginRepository - RSS Parse Plug-in (parse-rss) 2008-01-03 12:50:26,429 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic) 2008-01-03 12:50:26,429 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic) 2008-01-03 12:50:26,430 INFO plugin.PluginRepository - Site Query Filter (query-site) 2008-01-03 12:50:26,430 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text) 2008-01-03 12:50:26,430 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass) 2008-01-03 12:50:26,430 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex) 2008-01-03 12:50:26,430 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml) 2008-01-03 12:50:26,430 INFO plugin.PluginRepository - OpenOffice/OpenDocument Parse Plug-in (parse-oo) 2008-01-03 12:50:26,430 INFO plugin.PluginRepository - Log4j (lib-log4j) 2008-01-03 12:50:26,430 INFO plugin.PluginRepository - JavaScript Parser (parse-js) 2008-01-03 12:50:26,430 INFO plugin.PluginRepository - SWF Parse Plug-in (parse-swf) 2008-01-03 12:50:26,430 INFO plugin.PluginRepository - Registered Extension-Points: 2008-01-03 12:50:26,430 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer) 2008-01-03 12:50:26,430 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer) 2008-01-03 12:50:26,430 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol) 2008-01-03 12:50:26,430 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer) 2008-01-03 12:50:26,430 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter) 2008-01-03 12:50:26,430 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter) 2008-01-03 12:50:26,430 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer) 2008-01-03 12:50:26,430 
INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter) 2008-01-03 12:50:26,430 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser) 2008-01-03 12:50:26,430 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter) 2008-01-03 12:50:26,430 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter) 2008-01-03 12:50:26,431 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology) 2008-01-03 12:50:26,447 WARN regex.RegexURLNormalizer - can't find rules for scope 'partition', using default 2008-01-03 12:50:27,312 INFO crawl.Generator - Generator: done. 2008-01-03 12:50:38,553 INFO fetcher.Fetcher - Fetcher: starting 2008-01-03 12:50:38,554 INFO fetcher.Fetcher - Fetcher: segment: crawl/segments/20080103125023 2008-01-03 12:50:39,097 INFO fetcher.Fetcher - Fetcher: threads: 10 2008-01-03 12:50:39,111 INFO plugin.PluginRepository - Plugins: looking in: /home/digvijay/Nutch/nutch-0.9/plugins 2008-01-03 12:50:39,253 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true] 2008-01-03 12:50:39,253 INFO plugin.PluginRepository - Registered Plugins: 2008-01-03 12:50:39,253 INFO plugin.PluginRepository - Pdf Parse Plug-in (parse-pdf) 2008-01-03 12:50:39,253 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient) 2008-01-03 12:50:39,253 INFO plugin.PluginRepository - Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi) 2008-01-03 12:50:39,253 INFO plugin.PluginRepository - HTTP Framework (lib-http) 2008-01-03 12:50:39,254 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex) 2008-01-03 12:50:39,254 INFO plugin.PluginRepository - MSWord Parse Plug-in (parse-msword) 2008-01-03 12:50:39,254 INFO plugin.PluginRepository - XML Libraries (lib-xml) 2008-01-03 12:50:39,254 INFO plugin.PluginRepository - MSExcel Parse Plug-in (parse-msexcel) 2008-01-03 12:50:39,254 
INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic) 2008-01-03 12:50:39,255 INFO plugin.PluginRepository - Zip Parse Plug-in (parse-zip) 2008-01-03 12:50:39,255 INFO plugin.PluginRepository - URL Query Filter (query-url) 2008-01-03 12:50:39,255 INFO plugin.PluginRepository - Parse MS Documents Framework (lib-parsems) 2008-01-03 12:50:39,255 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter) 2008-01-03 12:50:39,255 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints) 2008-01-03 12:50:39,255 INFO plugin.PluginRepository - MSPowerPoint Parse Plug-in (parse-mspowerpoint) 2008-01-03 12:50:39,255 INFO plugin.PluginRepository - Basic Query Filter (query-basic) 2008-01-03 12:50:39,255 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic) 2008-01-03 12:50:39,255 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html) 2008-01-03 12:50:39,255 INFO plugin.PluginRepository - RSS Parse Plug-in (parse-rss) 2008-01-03 12:50:39,255 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic) 2008-01-03 12:50:39,255 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic) 2008-01-03 12:50:39,255 INFO plugin.PluginRepository - Site Query Filter (query-site) 2008-01-03 12:50:39,255 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text) 2008-01-03 12:50:39,255 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass) 2008-01-03 12:50:39,255 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex) 2008-01-03 12:50:39,255 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml) 2008-01-03 12:50:39,255 INFO plugin.PluginRepository - OpenOffice/OpenDocument Parse Plug-in (parse-oo) 2008-01-03 12:50:39,255 INFO plugin.PluginRepository - Log4j (lib-log4j) 2008-01-03 12:50:39,255 INFO plugin.PluginRepository - JavaScript Parser (parse-js) 2008-01-03 12:50:39,255 INFO plugin.PluginRepository - SWF Parse Plug-in 
(parse-swf)
2008-01-03 12:50:39,255 INFO plugin.PluginRepository - Registered Extension-Points:
2008-01-03 12:50:39,256 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2008-01-03 12:50:39,256 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-01-03 12:50:39,256 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2008-01-03 12:50:39,256 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-01-03 12:50:39,256 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2008-01-03 12:50:39,256 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-01-03 12:50:39,256 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2008-01-03 12:50:39,256 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-01-03 12:50:39,256 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2008-01-03 12:50:39,256 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-01-03 12:50:39,256 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-01-03 12:50:39,256 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2008-01-03 12:50:39,334 INFO fetcher.Fetcher - fetching http://www.w3schools.com/
2008-01-03 12:50:39,498 FATAL api.RobotRulesParser - Agent we advertise (digi) not listed first in 'http.robots.agents' property!
2008-01-03 12:50:39,498 INFO httpclient.Http - http.proxy.host = netmon.iitb.ac.in
2008-01-03 12:50:39,498 INFO httpclient.Http - http.proxy.port = 80
2008-01-03 12:50:39,498 INFO httpclient.Http - http.timeout = 100000
2008-01-03 12:50:39,498 INFO httpclient.Http - http.content.limit = 65536
2008-01-03 12:50:39,498 INFO httpclient.Http - http.agent = digi/Nutch-0.9 (digvijay; http://www.google.com; [EMAIL PROTECTED])
2008-01-03 12:50:39,499 INFO httpclient.Http - protocol.plugin.check.blocking = true
2008-01-03 12:50:39,499 INFO httpclient.Http - protocol.plugin.check.robots = true
2008-01-03 12:50:39,499 INFO httpclient.Http - fetcher.server.delay = 5000
2008-01-03 12:50:39,499 INFO httpclient.Http - http.max.delays = 100
2008-01-03 12:50:39,506 INFO httpclient.Http - Configured Client
2008-01-03 12:50:47,682 INFO auth.AuthChallengeProcessor - basic authentication scheme selected
2008-01-03 12:50:47,684 INFO httpclient.HttpMethodDirector - No credentials available for BASIC 'Squid proxy-caching web server'@netmon.iitb.ac.in:80
2008-01-03 12:50:57,359 INFO auth.AuthChallengeProcessor - basic authentication scheme selected
2008-01-03 12:50:57,359 INFO httpclient.HttpMethodDirector - No credentials available for BASIC 'Squid proxy-caching web server'@netmon.iitb.ac.in:80
2008-01-03 12:50:57,407 INFO fetcher.Fetcher - fetch of http://www.w3schools.com/ failed with: Http code=407, url=http://www.w3schools.com/
2008-01-03 12:50:58,508 INFO plugin.PluginRepository - Plugins: looking in: /home/digvijay/Nutch/nutch-0.9/plugins
2008-01-03 12:50:58,639 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-01-03 12:50:58,639 INFO plugin.PluginRepository - Registered Plugins:
2008-01-03 12:50:58,639 INFO plugin.PluginRepository - Pdf Parse Plug-in (parse-pdf)
2008-01-03 12:50:58,639 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
2008-01-03 12:50:58,639 INFO plugin.PluginRepository - Jakarta POI - Java API To Access
Microsoft Format Files (lib-jakarta-poi) 2008-01-03 12:50:58,640 INFO plugin.PluginRepository - HTTP Framework (lib-http) 2008-01-03 12:50:58,640 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex) 2008-01-03 12:50:58,640 INFO plugin.PluginRepository - MSWord Parse Plug-in (parse-msword) 2008-01-03 12:50:58,640 INFO plugin.PluginRepository - XML Libraries (lib-xml) 2008-01-03 12:50:58,640 INFO plugin.PluginRepository - MSExcel Parse Plug-in (parse-msexcel) 2008-01-03 12:50:58,640 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic) 2008-01-03 12:50:58,640 INFO plugin.PluginRepository - Zip Parse Plug-in (parse-zip) 2008-01-03 12:50:58,640 INFO plugin.PluginRepository - URL Query Filter (query-url) 2008-01-03 12:50:58,640 INFO plugin.PluginRepository - Parse MS Documents Framework (lib-parsems) 2008-01-03 12:50:58,640 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter) 2008-01-03 12:50:58,640 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints) 2008-01-03 12:50:58,640 INFO plugin.PluginRepository - MSPowerPoint Parse Plug-in (parse-mspowerpoint) 2008-01-03 12:50:58,640 INFO plugin.PluginRepository - Basic Query Filter (query-basic) 2008-01-03 12:50:58,640 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic) 2008-01-03 12:50:58,640 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html) 2008-01-03 12:50:58,640 INFO plugin.PluginRepository - RSS Parse Plug-in (parse-rss) 2008-01-03 12:50:58,640 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic) 2008-01-03 12:50:58,640 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic) 2008-01-03 12:50:58,641 INFO plugin.PluginRepository - Site Query Filter (query-site) 2008-01-03 12:50:58,641 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text) 2008-01-03 12:50:58,641 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass) 2008-01-03 12:50:58,641 INFO 
plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex) 2008-01-03 12:50:58,641 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml) 2008-01-03 12:50:58,641 INFO plugin.PluginRepository - OpenOffice/OpenDocument Parse Plug-in (parse-oo) 2008-01-03 12:50:58,641 INFO plugin.PluginRepository - Log4j (lib-log4j) 2008-01-03 12:50:58,641 INFO plugin.PluginRepository - JavaScript Parser (parse-js) 2008-01-03 12:50:58,641 INFO plugin.PluginRepository - SWF Parse Plug-in (parse-swf) 2008-01-03 12:50:58,641 INFO plugin.PluginRepository - Registered Extension-Points: 2008-01-03 12:50:58,641 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer) 2008-01-03 12:50:58,641 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer) 2008-01-03 12:50:58,641 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol) 2008-01-03 12:50:58,641 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer) 2008-01-03 12:50:58,641 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter) 2008-01-03 12:50:58,641 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter) 2008-01-03 12:50:58,641 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer) 2008-01-03 12:50:58,641 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter) 2008-01-03 12:50:58,641 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser) 2008-01-03 12:50:58,641 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter) 2008-01-03 12:50:58,641 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter) 2008-01-03 12:50:58,641 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology) 2008-01-03 12:50:59,120 INFO 
fetcher.Fetcher - Fetcher: done
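
The plugin registry listing is repeated several times in the log above, which makes the few entries about the failed fetch hard to spot. Below is a minimal sketch in Python (not part of Nutch) that splits a flattened log like this one back into entries and drops the plugin.PluginRepository lines. The function name interesting_entries and the skip_loggers parameter are made up for illustration, and the sample text is copied from the log above.

```python
import re

# Each log4j entry starts with a timestamp such as "2008-01-03 12:50:39,498".
# A lookahead is used so the timestamp stays attached to its entry when splitting.
ENTRY_START = re.compile(r"(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} )")

# timestamp, level, logger name, then the message after " - ".
ENTRY = re.compile(r"^(\S+ \S+) (\w+) +(\S+) - (.*)$")


def interesting_entries(raw_log, skip_loggers=("plugin.PluginRepository",)):
    """Return (timestamp, level, logger, message) tuples, skipping noisy loggers.

    The name and the default skip list are assumptions made for this sketch.
    """
    # Undo arbitrary line wrapping first: entries in a pasted log may be
    # broken in the middle, even inside the timestamp.
    text = " ".join(raw_log.split())
    entries = []
    for chunk in ENTRY_START.split(text):
        match = ENTRY.match(chunk.strip())
        if match is None:
            continue  # empty leading chunk or text that is not a log entry
        timestamp, level, logger, message = match.groups()
        if logger in skip_loggers:
            continue
        entries.append((timestamp, level, logger, message))
    return entries


if __name__ == "__main__":
    sample = (
        "2008-01-03 12:50:39,256 INFO plugin.PluginRepository - Ontology Model "
        "Loader (org.apache.nutch.ontology.Ontology) 2008-01-03 12:50:39,334 INFO "
        "fetcher.Fetcher - fetching http://www.w3schools.com/ 2008-01-03 "
        "12:50:39,498 FATAL api.RobotRulesParser - Agent we advertise (digi) not "
        "listed first in 'http.robots.agents' property! 2008-01-03\n"
        "12:50:57,407 INFO fetcher.Fetcher - fetch of http://www.w3schools.com/ "
        "failed with: Http code=407, url=http://www.w3schools.com/"
    )
    for timestamp, level, logger, message in interesting_entries(sample):
        print(timestamp, level, logger, "-", message)
```

Run against the full log, this should leave mainly the Injector, Generator and Fetcher progress lines, the RegexURLNormalizer warnings, the FATAL about 'http.robots.agents', and the proxy messages ("No credentials available for BASIC ..." followed by Http code=407).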
