Hi,

There was a problem in my config files and I have fixed it, but I am still getting the same error.

This is the part of the log saying that it is NOT INCLUDING CERTAIN PLUGINS, INCLUDING protocol-smb:

2007-08-20 10:15:28,891 DEBUG plugin.PluginRepository - parsing: /var/www/html/nutch9loc/plugins/urlnormalizer-regex/plugin.xml
2007-08-20 10:15:28,918 DEBUG plugin.PluginRepository - plugin: id=urlnormalizer-regex name=Regex URL Normalizer version=1.0.0 provider=nutch.orgclass=null
2007-08-20 10:15:28,918 DEBUG plugin.PluginRepository - impl: point=org.apache.nutch.net.URLNormalizer class=org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
2007-08-20 10:15:28,919 DEBUG plugin.PluginRepository - parsing: /var/www/html/nutch9loc/plugins/parse-rss/plugin.xml
2007-08-20 10:15:28,928 DEBUG plugin.PluginRepository - plugin: id=parse-rss name=RSS Parse Plug-in version=1.0.0 provider=edu.usc.cs.cs599class=null
2007-08-20 10:15:28,928 DEBUG plugin.PluginRepository - impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.rss.RSSParser
2007-08-20 10:15:28,929 DEBUG plugin.PluginRepository - parsing: /var/www/html/nutch9loc/plugins/creativecommons/plugin.xml
2007-08-20 10:15:28,940 DEBUG plugin.PluginRepository - plugin: id=creativecommons name=Creative Commons Plugins version=1.0.0 provider=nutch.orgclass=null
2007-08-20 10:15:28,941 DEBUG plugin.PluginRepository - impl: point=org.apache.nutch.parse.HtmlParseFilter class=org.creativecommons.nutch.CCParseFilter
2007-08-20 10:15:28,942 DEBUG plugin.PluginRepository - impl: point=org.apache.nutch.indexer.IndexingFilter class=org.creativecommons.nutch.CCIndexingFilter
2007-08-20 10:15:28,943 DEBUG plugin.PluginRepository - impl: point=org.apache.nutch.searcher.QueryFilter class=org.creativecommons.nutch.CCQueryFilter
2007-08-20 10:15:28,943 DEBUG plugin.PluginRepository - parsing: /var/www/html/nutch9loc/plugins/urlnormalizer-pass/plugin.xml
2007-08-20 10:15:28,955 DEBUG plugin.PluginRepository - plugin: id=urlnormalizer-pass name=Pass-through URL Normalizer version=1.0.0 provider=nutch.orgclass=null
2007-08-20 10:15:28,955 DEBUG plugin.PluginRepository - impl: point=org.apache.nutch.net.URLNormalizer class=org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer
2007-08-20 10:15:28,957 DEBUG plugin.PluginRepository - not including: creativecommons
2007-08-20 10:15:28,958 DEBUG plugin.PluginRepository - not including: subcollection
2007-08-20 10:15:28,958 DEBUG plugin.PluginRepository - not including: protocol-httpclient
2007-08-20 10:15:28,958 DEBUG plugin.PluginRepository - not including: lib-regex-filter
2007-08-20 10:15:28,958 DEBUG plugin.PluginRepository - not including: lib-lucene-analyzers
2007-08-20 10:15:28,959 DEBUG plugin.PluginRepository - not including: parse-pdf
2007-08-20 10:15:28,959 DEBUG plugin.PluginRepository - not including: parse-msexcel
2007-08-20 10:15:28,959 DEBUG plugin.PluginRepository - not including: lib-http
2007-08-20 10:15:28,959 DEBUG plugin.PluginRepository - not including: parse-swf
2007-08-20 10:15:28,959 DEBUG plugin.PluginRepository - not including: parse-ext
2007-08-20 10:15:28,960 DEBUG plugin.PluginRepository - not including: lib-log4j
2007-08-20 10:15:28,960 DEBUG plugin.PluginRepository - not including: ontology
2007-08-20 10:15:28,960 DEBUG plugin.PluginRepository - not including: protocol-ftp
2007-08-20 10:15:28,960 DEBUG plugin.PluginRepository - not including: parse-zip
2007-08-20 10:15:28,960 DEBUG plugin.PluginRepository - not including: nutch-extensionpoints
2007-08-20 10:15:28,961 DEBUG plugin.PluginRepository - not including: index-more
2007-08-20 10:15:28,961 DEBUG plugin.PluginRepository - not including: clustering-carrot2
2007-08-20 10:15:28,961 DEBUG plugin.PluginRepository - not including: urlfilter-suffix
2007-08-20 10:15:28,961 DEBUG plugin.PluginRepository - not including: query-more
2007-08-20 10:15:28,961 DEBUG plugin.PluginRepository - not including: microformats-reltag
2007-08-20 10:15:28,961 DEBUG plugin.PluginRepository - not including: language-identifier
2007-08-20 10:15:28,962 DEBUG plugin.PluginRepository - not including: urlfilter-prefix
2007-08-20 10:15:28,962 DEBUG plugin.PluginRepository - not including: lib-nekohtml
2007-08-20 10:15:28,962 DEBUG plugin.PluginRepository - not including: protocol-smb
2007-08-20 10:15:28,962 DEBUG plugin.PluginRepository - not including: parse-mspowerpoint
2007-08-20 10:15:28,962 DEBUG plugin.PluginRepository - not including: parse-msword
2007-08-20 10:15:28,962 DEBUG plugin.PluginRepository - not including: protocol-file
2007-08-20 10:15:28,963 DEBUG plugin.PluginRepository - not including: lib-jakarta-poi
2007-08-20 10:15:28,963 DEBUG plugin.PluginRepository - not including: lib-xml
2007-08-20 10:15:28,963 DEBUG plugin.PluginRepository - not including: lib-parsems
2007-08-20 10:15:28,963 DEBUG plugin.PluginRepository - not including: parse-rss
2007-08-20 10:15:28,963 DEBUG plugin.PluginRepository - not including: parse-oo
2007-08-20 10:15:28,964 DEBUG plugin.PluginRepository - not including: urlfilter-automaton
2007-08-20 10:15:28,964 DEBUG plugin.PluginRepository - not including: summary-lucene
2007-08-20 10:15:28,990 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.clustering.OnlineClusterer
2007-08-20 10:15:28,990 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.indexer.IndexingFilter
2007-08-20 10:15:28,990 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.ontology.Ontology
2007-08-20 10:15:28,990 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.parse.Parser
2007-08-20 10:15:28,991 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.parse.HtmlParseFilter
2007-08-20 10:15:28,991 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.protocol.Protocol
2007-08-20 10:15:28,991 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.searcher.QueryFilter
2007-08-20 10:15:28,991 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.net.URLFilter
2007-08-20 10:15:28,991 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.net.URLNormalizer
2007-08-20 10:15:28,991 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.analysis.NutchAnalyzer
2007-08-20 10:15:28,992 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.searcher.Summarizer
2007-08-20 10:15:28,992 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.scoring.ScoringFilter
2007-08-20 10:15:28,992 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2007-08-20 10:15:28,992 INFO plugin.PluginRepository - Registered Plugins:
2007-08-20 10:15:28,993 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2007-08-20 10:15:28,993 INFO plugin.PluginRepository - Site Query Filter (query-site)
2007-08-20 10:15:28,993 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2007-08-20 10:15:28,993 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2007-08-20 10:15:28,993 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2007-08-20 10:15:28,993 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2007-08-20 10:15:28,993 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2007-08-20 10:15:28,993 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2007-08-20 10:15:28,994 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2007-08-20 10:15:28,994 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2007-08-20 10:15:28,994 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2007-08-20 10:15:28,994 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2007-08-20 10:15:28,994 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2007-08-20 10:15:28,994 INFO plugin.PluginRepository - URL Query Filter (query-url)
2007-08-20 10:15:28,994 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2007-08-20 10:15:28,994 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2007-08-20 10:15:28,994 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2007-08-20 10:15:28,995 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2007-08-20 10:15:28,995 INFO plugin.PluginRepository - Registered Extension-Points:
2007-08-20 10:15:28,995 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2007-08-20 10:15:28,995 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-08-20 10:15:28,995 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2007-08-20 10:15:28,996 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2007-08-20 10:15:28,996 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2007-08-20 10:15:28,996 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-08-20 10:15:28,996 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-08-20 10:15:28,996 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2007-08-20 10:15:28,996 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2007-08-20 10:15:28,997 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2007-08-20 10:15:28,997 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-08-20 10:15:28,997 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2007-08-20 10:15:29,140 INFO conf.Configuration - found resource crawl-urlfilter.txt at file:/var/www/html/nutch9loc/conf/crawl-urlfilter.txt
2007-08-20 10:15:29,166 DEBUG api.RegexURLFilterBase - Adding rule [^(http|ftp|mailto):]
2007-08-20 10:15:29,171 DEBUG api.RegexURLFilterBase - Adding rule [\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$]
2007-08-20 10:15:29,172 DEBUG api.RegexURLFilterBase - Adding rule [EMAIL PROTECTED]
2007-08-20 10:15:29,172 DEBUG api.RegexURLFilterBase - Adding rule [.*]
2007-08-20 10:15:29,190 DEBUG mapred.MapTask - Started thread: Sort progress reporter for task map_g8nxpr
2007-08-20 10:15:29,191 INFO mapred.LocalJobRunner - file:/var/www/html/nutch9loc/urls/urls:0+205
2007-08-20 10:15:29,207 WARN crawl.Injector - Skipping smb://192.168.0.1/:java.net.MalformedURLException: unknown protocol: smb
2007-08-20 10:15:29,212 WARN crawl.Injector - Skipping smb://192.168.0.1:java.net.MalformedURLException: unknown protocol: smb
2007-08-20 10:15:29,213 WARN crawl.Injector - Skipping smb://192.168.0.101/:java.net.MalformedURLException: unknown protocol: smb
2007-08-20 10:15:29,214 WARN crawl.Injector - Skipping smb://192.168.0.101:java.net.MalformedURLException: unknown protocol: smb
2007-08-20 10:15:29,215 WARN crawl.Injector - Skipping smb:///192.168.0.101/:java.net.MalformedURLException: unknown protocol: smb
2007-08-20 10:15:29,216 WARN crawl.Injector - Skipping smb:///192.168.0.101:java.net.MalformedURLException: unknown protocol: smb
2007-08-20 10:15:29,217 WARN regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2007-08-20 10:15:29,231 DEBUG mapred.MapTask - opened spill0.out
2007-08-20 10:15:29,305 INFO mapred.LocalJobRunner - file:/var/www/html/nutch9loc/urls/urls:0+205

WHY is it not including the plugins listed above? How can I make Nutch include these plugins so that I can crawl Windows file shares?

I still get the same error:

Skipping smb://192.168.0.1:java.net.MalformedURLException: unknown protocol: smb

Please help me.

Thanks,
Bikram


Renaud Richardet-4 wrote:
>
> hi Bikram,
>
> all right, you need to check the format of your config files
> (nutch-default.xml and nutch-site.xml). They should both be well-formed
> XML (opening and closing tags, etc.) and have the correct structure:
>
> <?xml version="1.0"?>
> <configuration>
>   <property>
>     <name>...</name>
>     <value>...</value>
>     <description>...</description>
>   </property>
> </configuration>
>
> see the provided template for nutch-site.xml
>
> HTH,
> Renaud
>
>
> bikram wrote:
>> Hi
>>
>> After rechecking the log I noticed this error too:
>>
>> 2007-08-20 12:11:32,449 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
>> 2007-08-20 12:11:32,450 WARN conf.Configuration - bad conf file: element not <property>
>> 2007-08-20 12:11:32,490 INFO crawl.Crawl - crawl started in: localcrawl4
>> 2007-08-20 12:11:32,491 INFO crawl.Crawl - rootUrlDir = urls
>> 2007-08-20 12:11:32,491 INFO crawl.Crawl - threads = 10
>> 2007-08-20 12:11:32,491 INFO crawl.Crawl - depth = 5
>> 2007-08-20 12:11:32,632 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
>> 2007-08-20 12:11:32,632 WARN conf.Configuration - bad conf file: element not <property>
>> 2007-08-20 12:11:32,641 INFO crawl.Injector - Injector: starting
>> 2007-08-20 12:11:32,642 INFO crawl.Injector - Injector: crawlDb: localcrawl4/crawldb
>> 2007-08-20 12:11:32,642 INFO crawl.Injector - Injector: urlDir: urls
>> 2007-08-20 12:11:32,643 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
>> 2007-08-20 12:11:32,643 DEBUG conf.Configuration - java.io.IOException: config(config)
>>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:102)
>>     at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:77)
>>     at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:88)
>>     at org.apache.nutch.util.NutchJob.<init>(NutchJob.java:27)
>>     at org.apache.nutch.crawl.Injector.inject(Injector.java:152)
>>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
>>
>>
>>
>> 2007-08-20 12:11:32,449 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
>> 2007-08-20 12:11:32,450 WARN conf.Configuration - bad conf file: element not <property>
>>
>> Which configuration file is this message referring to?
>> I double-checked my conf files but could not work out which file
>> and which line is causing this error.
>>
>> Can someone help me fix these errors and get Nutch working?
>>
>>
>> thanx
>> Bikram
>>
>
>

--
View this message in context: http://www.nabble.com/Windows-Share-Crawling---searching-tf4277499.html#a12255199
Sent from the Nutch - User mailing list archive at Nabble.com.
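
For anyone hitting the same "bad conf file" messages: the log does not say which file is at fault, so one way to narrow it down is to check each config file separately. Below is a minimal sketch, not Nutch or Hadoop code, that applies the two structural rules quoted above (top-level element must be <configuration>, each child element must be <property>). The class name ConfCheck and its check methods are hypothetical, and the real loader may apply further rules.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of Nutch or Hadoop.
public class ConfCheck {

    /** Returns the structural problems found in one config document; an empty list means none. */
    static List<String> check(InputStream in) throws Exception {
        List<String> problems = new ArrayList<>();
        // Parsing throws SAXException if the file is not well-formed XML.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        Element root = doc.getDocumentElement();
        if (!"configuration".equals(root.getTagName())) {
            problems.add("top-level element is <" + root.getTagName() + ">, not <configuration>");
        }
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            // Text and comment nodes are skipped; only element children are checked.
            if (n instanceof Element && !"property".equals(((Element) n).getTagName())) {
                problems.add("child element <" + ((Element) n).getTagName() + "> is not <property>");
            }
        }
        return problems;
    }

    /** Convenience overload for checking XML held in a string. */
    static List<String> check(String xml) throws Exception {
        return check(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        // Pass the config files to check as arguments.
        for (String path : args) {
            try (InputStream in = Files.newInputStream(Paths.get(path))) {
                List<String> problems = check(in);
                System.out.println(path + ": " + (problems.isEmpty() ? "OK" : problems.toString()));
            } catch (SAXException e) {
                System.out.println(path + ": not well-formed XML: " + e.getMessage());
            }
        }
    }
}
```

Running it as `java ConfCheck conf/nutch-default.xml conf/nutch-site.xml` (paths relative to the Nutch install, adjust as needed) prints one line per file, which should at least show which file has the wrong top-level or child element.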

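
On the "not including:" lines: they look consistent with each plugin ID being matched in full against the plugin.includes regular expression from the configuration, with library plugins then pulled in as dependencies. The sketch below only illustrates that matching idea. The INCLUDES value is an assumption chosen to agree with the "Registered Plugins" list in the log, not a value read from the poster's nutch-site.xml, and the class name PluginIncludesCheck is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration, not Nutch code.
public class PluginIncludesCheck {

    // Assumed plugin.includes value; the poster's actual setting is not shown in the thread.
    static final String INCLUDES =
        "protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic"
        + "|query-(basic|site|url)|summary-basic|scoring-opic"
        + "|urlnormalizer-(pass|regex|basic)";

    /** True if the whole plugin ID matches the includes regex (full-match semantics assumed). */
    static boolean included(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {"protocol-http", "protocol-smb", "parse-html", "parse-pdf"};
        for (String id : ids) {
            System.out.println((included(INCLUDES, id) ? "including: " : "not including: ") + id);
        }
        // With protocol-smb added to the expression, the same ID matches.
        System.out.println("after adding protocol-smb: "
            + included("protocol-smb|" + INCLUDES, "protocol-smb"));
    }
}
```

If that assumption holds, adding protocol-smb to the plugin.includes value in nutch-site.xml would be the first thing to try.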