Hi all


This is my crawl-urlfilter.txt:
==============================================================================
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
# -^(file|ftp|mailto):
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
#
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]


# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
# -.*(/.+?)/.*?\1/.*?\1/


# accept hosts in MY.DOMAIN.NAME
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/


# skip everything else
# -.

# accept anything else 
+.*

==============================================================================
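
To check that the filter file is not what rejects the smb: URLs, here is a minimal Java sketch of the first-match rule described in the file's comments. It is not Nutch's actual filter code: the class and method names are made up for illustration, only a few of the rules above are copied in, and it assumes each pattern is applied with find() semantics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: evaluates '+'/'-' rules in file order.
public class UrlFilterSketch {

    /** One rule from the filter file: '+' accepts, '-' rejects. */
    public static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    /** Skips comments and blank lines; assumes every other line starts with '+' or '-'. */
    public static List<Rule> parse(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            rules.add(new Rule(line.charAt(0) == '+', Pattern.compile(line.substring(1))));
        }
        return rules;
    }

    /** The first matching pattern decides; if nothing matches, the URL is ignored. */
    public static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Rule> rules = parse(Arrays.asList(
                "# skip http:, ftp:, & mailto: urls",
                "-^(http|ftp|mailto):",
                "-\\.(gif|GIF|jpg|JPG|exe)$",
                "+.*"));
        System.out.println(accepts(rules, "smb://192.168.0.1/Softwares/"));          // true
        System.out.println(accepts(rules, "http://example.com/"));                   // false
        System.out.println(accepts(rules, "smb://192.168.0.1/Softwares/setup.exe")); // false
    }
}
```

Under these assumptions the smb: directory URLs fall through to the final +.* rule and are accepted, so the filter rules by themselves should not be what blocks them.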



This is my nutch-site.xml

==============================================================================

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->


<configuration>

<property>
 <name>http.agent.name</name>
 <value>LocalSpider</value>
 <description></description>
</property>


<property>
  <name>plugin.folders</name>
  <value>/var/www/html/nutch9loc/plugins</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the class-path.</description>
</property>

<property>
<name>plugin.includes</name> 
<value>protocol-file|protocol-smb|scoring-opic|parse-(msexcel|mspowerpoint|msword|xml|text|html|pdf)|index-basic|query-(basic|site|url)|scoring-opic|</value>
</property>

</configuration>

==============================================================================


My urls file contains this:

==============================================================================

smb://192.168.0.1/Softwares/
smb://192.168.0.1/
smb://192.168.0.101/BOOKS/
smb://192.168.0.101
smb:///192.168.0.101/books2/
smb:///192.168.0.101

==============================================================================
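
The list mixes the smb:// and smb:/// forms. A minimal sketch with the JDK's java.net.URI (plain Java, nothing Nutch-specific; the class and method names are only for illustration) shows how the two forms are parsed. Whether the SMB plugin itself reads them the same way is an assumption I have not verified.

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch: shows host and path as java.net.URI sees them.
public class SmbUriCheck {

    /** java.net.URI parses generic URI syntax and needs no protocol handler for "smb". */
    public static String describe(String url) {
        URI uri = URI.create(url);
        return "host=" + uri.getHost() + " path=" + uri.getPath();
    }

    public static void main(String[] args) {
        // Two slashes: the IP address is parsed as the host.
        System.out.println(describe("smb://192.168.0.101/BOOKS/"));
        // prints: host=192.168.0.101 path=/BOOKS/

        // Three slashes: the authority is empty, so the IP address becomes part of the path.
        System.out.println(describe("smb:///192.168.0.101/books2/"));
        // prints: host=null path=/192.168.0.101/books2/
    }
}
```

So, at least for generic URI parsing, only the two-slash form carries 192.168.0.101 as the host.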

There are two machines, 192.168.0.1 and 192.168.0.101, with the following shares:

\Softwares (containing Firefox, Adobe Reader, messengers, Apache, MySQL, Java, *.zip, *.tar, etc.)
\Books (contains .chm files, TIFF files, HTML files)
\books2 (contains .pdf files)



My hadoop.log is attached: http://www.nabble.com/file/p12267824/hadoop.zip (hadoop.zip)

Can someone please tell me why my "protocol-smb" plugin is not working?

The log shows:


2007-08-21 13:39:49,166 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.clustering.OnlineClusterer
2007-08-21 13:39:49,166 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.indexer.IndexingFilter
2007-08-21 13:39:49,166 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.ontology.Ontology
2007-08-21 13:39:49,166 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.parse.Parser
2007-08-21 13:39:49,167 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.parse.HtmlParseFilter
2007-08-21 13:39:49,167 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.protocol.Protocol
2007-08-21 13:39:49,167 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.searcher.QueryFilter
2007-08-21 13:39:49,167 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.net.URLFilter
2007-08-21 13:39:49,167 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.net.URLNormalizer
2007-08-21 13:39:49,167 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.analysis.NutchAnalyzer
2007-08-21 13:39:49,167 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.searcher.Summarizer
2007-08-21 13:39:49,168 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.scoring.ScoringFilter
2007-08-21 13:39:49,168 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2007-08-21 13:39:49,168 INFO  plugin.PluginRepository - Registered Plugins:
2007-08-21 13:39:49,169 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2007-08-21 13:39:49,169 INFO  plugin.PluginRepository -         Site Query Filter (query-site)
2007-08-21 13:39:49,169 INFO  plugin.PluginRepository -         SMB Protocol Plug-in (protocol-smb)
2007-08-21 13:39:49,169 INFO  plugin.PluginRepository -         MSPowerPoint Parse Plug-in (parse-mspowerpoint)
2007-08-21 13:39:49,169 INFO  plugin.PluginRepository -         MSWord Parse Plug-in (parse-msword)
2007-08-21 13:39:49,169 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2007-08-21 13:39:49,170 INFO  plugin.PluginRepository -         Pdf Parse Plug-in (parse-pdf)
2007-08-21 13:39:49,170 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2007-08-21 13:39:49,170 INFO  plugin.PluginRepository -         File Protocol Plug-in (protocol-file)
2007-08-21 13:39:49,170 INFO  plugin.PluginRepository -         MSExcel Parse Plug-in (parse-msexcel)
2007-08-21 13:39:49,170 INFO  plugin.PluginRepository -         Text Parse Plug-in (parse-text)
2007-08-21 13:39:49,170 INFO  plugin.PluginRepository -         Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi)
2007-08-21 13:39:49,170 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
2007-08-21 13:39:49,171 INFO  plugin.PluginRepository -         URL Query Filter (query-url)
2007-08-21 13:39:49,171 INFO  plugin.PluginRepository -         Parse MS Documents Framework (lib-parsems)
2007-08-21 13:39:49,171 INFO  plugin.PluginRepository -         Log4j (lib-log4j)
2007-08-21 13:39:49,171 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2007-08-21 13:39:49,171 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2007-08-21 13:39:49,172 INFO  plugin.PluginRepository - Registered Extension-Points:
2007-08-21 13:39:49,172 INFO  plugin.PluginRepository -         Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2007-08-21 13:39:49,172 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-08-21 13:39:49,172 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2007-08-21 13:39:49,172 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2007-08-21 13:39:49,172 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2007-08-21 13:39:49,173 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-08-21 13:39:49,173 INFO  plugin.PluginRepository -         Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-08-21 13:39:49,173 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2007-08-21 13:39:49,173 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2007-08-21 13:39:49,173 INFO  plugin.PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2007-08-21 13:39:49,173 INFO  plugin.PluginRepository -         Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-08-21 13:39:49,173 INFO  plugin.PluginRepository -         Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)


If the smb protocol plugin has been registered, why do I get this error?


2007-08-21 13:39:56,269 INFO  fetcher.Fetcher - fetching smb://192.168.0.1/
2007-08-21 13:39:56,271 INFO  fetcher.Fetcher - fetch of smb://192.168.0.1/ failed with: org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: unknown protocol: smb
2007-08-21 13:39:56,302 INFO  fetcher.Fetcher - fetching smb:///192.168.0.101
2007-08-21 13:39:56,308 INFO  fetcher.Fetcher - fetching smb://192.168.0.101
2007-08-21 13:39:56,314 INFO  fetcher.Fetcher - fetching smb://192.168.0.1/Softwares/
2007-08-21 13:39:56,315 INFO  fetcher.Fetcher - fetch of smb:///192.168.0.101 failed with: org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: unknown protocol: smb
2007-08-21 13:39:56,316 INFO  fetcher.Fetcher - fetching smb:///192.168.0.101/books2/
2007-08-21 13:39:56,317 INFO  fetcher.Fetcher - fetch of smb:///192.168.0.101/books2/ failed with: org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: unknown protocol: smb
2007-08-21 13:39:56,317 INFO  fetcher.Fetcher - fetching smb://192.168.0.101/BOOKS/
2007-08-21 13:39:56,346 INFO  fetcher.Fetcher - fetch of smb://192.168.0.1/Softwares/ failed with: org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: unknown protocol: smb
2007-08-21 13:39:56,348 INFO  fetcher.Fetcher - fetch of smb://192.168.0.101 failed with: org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: unknown protocol: smb
2007-08-21 13:39:56,351 INFO  fetcher.Fetcher - fetch of smb://192.168.0.101/BOOKS/ failed with: org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: unknown protocol: smb
2007-08-21 13:39:56,869 INFO  mapred.JobClient -  map 0% reduce 0%
2007-08-21 13:39:56,968 INFO  mapred.LocalJobRunner - file:/var/www/html/nutch9loc/localcrawl1/segments/20070821133952/crawl_generate/part-00000:0+579
2007-08-21 13:39:57,423 DEBUG mapred.MapTask - opened spill0.out
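
For reference, the "unknown protocol: smb" text is the same message that plain java.net.URL produces when the JVM has no stream handler for a scheme. The sketch below reproduces it outside Nutch. It assumes a stock JVM with no smb handler registered (for example through the java.protocol.handler.pkgs system property); the class and method names are only for illustration, and I am not claiming this is exactly what Nutch does internally.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of Nutch: reports why java.net.URL rejects a URL string.
public class UnknownProtocolCheck {

    /** Returns the MalformedURLException message, or null if the URL was accepted. */
    public static String tryParse(String spec) {
        try {
            new URL(spec); // needs a URLStreamHandler for the scheme
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Assumption: no smb handler is registered in this JVM.
        System.out.println(tryParse("smb://192.168.0.1/")); // unknown protocol: smb
        // "file" has a built-in handler, so this one is accepted.
        System.out.println(tryParse("file:///tmp/"));       // null
    }
}
```

If that is what happens here, the plugin being listed under "Registered Plugins" would not by itself mean that the JVM can construct a java.net.URL for smb: addresses.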


What is wrong here?
Are my config files wrong?

Could someone please help me out?

Thanks,
Bikram





-- 
View this message in context: 
http://www.nabble.com/Windows-Share-Crawling---searching-tf4277499.html#a12267824
Sent from the Nutch - User mailing list archive at Nabble.com.
