Hi all
This is my crawl-urlfilter.txt
==============================================================================
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
# -^(file|ftp|mailto):
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
# -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
# -.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
# -.

# accept anything else
+.*
==============================================================================

This is my nutch-site.xml
==============================================================================
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>
    <name>http.agent.name</name>
    <value>LocalSpider</value>
    <description></description>
  </property>

  <property>
    <name>plugin.folders</name>
    <value>/var/www/html/nutch9loc/plugins</value>
    <description>Directories where nutch plugins are located. Each
    element may be a relative or absolute path. If absolute, it is used
    as is. If relative, it is searched for on the class-path.</description>
  </property>

  <property>
    <name>plugin.includes</name>
    <value>protocol-file|protocol-smb|scoring-opic|parse-(msexcel|mspowerpoint|msword|xml|text|html|pdf)|index-basic|query-(basic|site|url)|scoring-opic|</value>
  </property>

</configuration>
==============================================================================

My urls file contains this
==============================================================================
smb://192.168.0.1/Softwares/
smb://192.168.0.1/
smb://192.168.0.101/BOOKS/
smb://192.168.0.101
smb:///192.168.0.101/books2/
smb:///192.168.0.101
==============================================================================

The two machines, 192.168.0.1 and 192.168.0.101, have these shares:

\Softwares (contains Firefox, Adobe Reader, messengers, Apache, MySQL, Java, *.zip, *.tar, etc.)
\Books (contains .chm files, TIFF files, HTML files)
\books2 (contains .pdf files)

My hadoop.log is attached as hadoop.zip:
http://www.nabble.com/file/p12267824/hadoop.zip

Can someone please tell me why my "protocol-smb" plugin is not working?
The log shows:

2007-08-21 13:39:49,166 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.clustering.OnlineClusterer
2007-08-21 13:39:49,166 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.indexer.IndexingFilter
2007-08-21 13:39:49,166 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.ontology.Ontology
2007-08-21 13:39:49,166 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.parse.Parser
2007-08-21 13:39:49,167 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.parse.HtmlParseFilter
2007-08-21 13:39:49,167 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.protocol.Protocol
2007-08-21 13:39:49,167 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.searcher.QueryFilter
2007-08-21 13:39:49,167 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.net.URLFilter
2007-08-21 13:39:49,167 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.net.URLNormalizer
2007-08-21 13:39:49,167 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.analysis.NutchAnalyzer
2007-08-21 13:39:49,167 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.searcher.Summarizer
2007-08-21 13:39:49,168 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.scoring.ScoringFilter
2007-08-21 13:39:49,168 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2007-08-21 13:39:49,168 INFO plugin.PluginRepository - Registered Plugins:
2007-08-21 13:39:49,169 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2007-08-21 13:39:49,169 INFO plugin.PluginRepository - Site Query Filter (query-site)
2007-08-21 13:39:49,169 INFO plugin.PluginRepository - SMB Protocol Plug-in (protocol-smb)
2007-08-21 13:39:49,169 INFO plugin.PluginRepository - MSPowerPoint Parse Plug-in (parse-mspowerpoint)
2007-08-21 13:39:49,169 INFO plugin.PluginRepository - MSWord Parse Plug-in (parse-msword)
2007-08-21 13:39:49,169 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2007-08-21 13:39:49,170 INFO plugin.PluginRepository - Pdf Parse Plug-in (parse-pdf)
2007-08-21 13:39:49,170 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2007-08-21 13:39:49,170 INFO plugin.PluginRepository - File Protocol Plug-in (protocol-file)
2007-08-21 13:39:49,170 INFO plugin.PluginRepository - MSExcel Parse Plug-in (parse-msexcel)
2007-08-21 13:39:49,170 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2007-08-21 13:39:49,170 INFO plugin.PluginRepository - Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi)
2007-08-21 13:39:49,170 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2007-08-21 13:39:49,171 INFO plugin.PluginRepository - URL Query Filter (query-url)
2007-08-21 13:39:49,171 INFO plugin.PluginRepository - Parse MS Documents Framework (lib-parsems)
2007-08-21 13:39:49,171 INFO plugin.PluginRepository - Log4j (lib-log4j)
2007-08-21 13:39:49,171 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2007-08-21 13:39:49,171 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2007-08-21 13:39:49,172 INFO plugin.PluginRepository - Registered Extension-Points:
2007-08-21 13:39:49,172 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2007-08-21 13:39:49,172 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-08-21 13:39:49,172 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2007-08-21 13:39:49,172 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2007-08-21 13:39:49,172 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2007-08-21 13:39:49,173 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-08-21 13:39:49,173 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-08-21 13:39:49,173 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2007-08-21 13:39:49,173 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2007-08-21 13:39:49,173 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2007-08-21 13:39:49,173 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-08-21 13:39:49,173 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)

So the SMB protocol plug-in (protocol-smb) is registered. Why, then, do I get this error on every fetch?

2007-08-21 13:39:56,269 INFO fetcher.Fetcher - fetching smb://192.168.0.1/
2007-08-21 13:39:56,271 INFO fetcher.Fetcher - fetch of smb://192.168.0.1/ failed with: org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: unknown protocol: smb
2007-08-21 13:39:56,302 INFO fetcher.Fetcher - fetching smb:///192.168.0.101
2007-08-21 13:39:56,308 INFO fetcher.Fetcher - fetching smb://192.168.0.101
2007-08-21 13:39:56,314 INFO fetcher.Fetcher - fetching smb://192.168.0.1/Softwares/
2007-08-21 13:39:56,315 INFO fetcher.Fetcher - fetch of smb:///192.168.0.101 failed with: org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: unknown protocol: smb
2007-08-21 13:39:56,316 INFO fetcher.Fetcher - fetching smb:///192.168.0.101/books2/
2007-08-21 13:39:56,317 INFO fetcher.Fetcher - fetch of smb:///192.168.0.101/books2/ failed with: org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: unknown protocol: smb
2007-08-21 13:39:56,317 INFO fetcher.Fetcher - fetching smb://192.168.0.101/BOOKS/
2007-08-21 13:39:56,346 INFO fetcher.Fetcher - fetch of smb://192.168.0.1/Softwares/ failed with: org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: unknown protocol: smb
2007-08-21 13:39:56,348 INFO fetcher.Fetcher - fetch of smb://192.168.0.101 failed with: org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: unknown protocol: smb
2007-08-21 13:39:56,351 INFO fetcher.Fetcher - fetch of smb://192.168.0.101/BOOKS/ failed with: org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: unknown protocol: smb
2007-08-21 13:39:56,869 INFO mapred.JobClient - map 0% reduce 0%
2007-08-21 13:39:56,968 INFO mapred.LocalJobRunner - file:/var/www/html/nutch9loc/localcrawl1/segments/20070821133952/crawl_generate/part-00000:0+579
2007-08-21 13:39:57,423 DEBUG mapred.MapTask - opened spill0.out

What is wrong here? Are my config files wrong?
Could someone please help me out?

Thanks,
Bikram
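
P.S. To rule out the URL filter as the cause, here is a rough sketch in Python of the first-match rule described in the crawl-urlfilter.txt comments. It is not Nutch code. It assumes each rule is applied as an unanchored regex search, which I have not checked against the Nutch source. The function name `accepts`, the host example.com and the file setup.exe are made up for illustration, the suffix rule is shortened, and the line the archive turned into "[EMAIL PROTECTED]" is left out.

```python
import re

# Active (uncommented) rules from the crawl-urlfilter.txt above, in file order.
# Assumption: the suffix list is shortened for readability, and the rule that
# the archive replaced with "[EMAIL PROTECTED]" is omitted.
RULES = [
    "-^(http|ftp|mailto):",
    r"-\.(gif|GIF|jpg|JPG|png|PNG|css|exe|bmp|BMP)$",
    "+.*",
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose regex matches starts with '+'."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        # Assumption: unanchored search, so '^' and '$' do the anchoring.
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched: per the file's comments, the URL is ignored.
    return False


if __name__ == "__main__":
    for url in [
        "smb://192.168.0.1/Softwares/",
        "smb://192.168.0.101",
        "http://example.com/",                    # hypothetical host
        "smb://192.168.0.1/Softwares/setup.exe",  # hypothetical file
    ]:
        print(url, accepts(url))
```

If that assumption holds, the smb:// seeds pass the filter and only the http and .exe examples are rejected. That fits the log, where the smb:// URLs do reach the fetcher and fail there with "unknown protocol: smb".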
