Hi,

There was a problem in my config files, and I have fixed it,

but I am still getting the same error.

Here is part of the log. It says that certain plugins are not being
included, among them protocol-smb:

2007-08-20 10:15:28,891 DEBUG plugin.PluginRepository - parsing:
/var/www/html/nutch9loc/plugins/urlnormalizer-regex/plugin.xml
2007-08-20 10:15:28,918 DEBUG plugin.PluginRepository - plugin:
id=urlnormalizer-regex name=Regex URL Normalizer version=1.0.0
provider=nutch.orgclass=null
2007-08-20 10:15:28,918 DEBUG plugin.PluginRepository - impl:
point=org.apache.nutch.net.URLNormalizer
class=org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
2007-08-20 10:15:28,919 DEBUG plugin.PluginRepository - parsing:
/var/www/html/nutch9loc/plugins/parse-rss/plugin.xml
2007-08-20 10:15:28,928 DEBUG plugin.PluginRepository - plugin: id=parse-rss
name=RSS Parse Plug-in version=1.0.0 provider=edu.usc.cs.cs599class=null
2007-08-20 10:15:28,928 DEBUG plugin.PluginRepository - impl:
point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.rss.RSSParser
2007-08-20 10:15:28,929 DEBUG plugin.PluginRepository - parsing:
/var/www/html/nutch9loc/plugins/creativecommons/plugin.xml
2007-08-20 10:15:28,940 DEBUG plugin.PluginRepository - plugin:
id=creativecommons name=Creative Commons Plugins version=1.0.0
provider=nutch.orgclass=null
2007-08-20 10:15:28,941 DEBUG plugin.PluginRepository - impl:
point=org.apache.nutch.parse.HtmlParseFilter
class=org.creativecommons.nutch.CCParseFilter
2007-08-20 10:15:28,942 DEBUG plugin.PluginRepository - impl:
point=org.apache.nutch.indexer.IndexingFilter
class=org.creativecommons.nutch.CCIndexingFilter
2007-08-20 10:15:28,943 DEBUG plugin.PluginRepository - impl:
point=org.apache.nutch.searcher.QueryFilter
class=org.creativecommons.nutch.CCQueryFilter
2007-08-20 10:15:28,943 DEBUG plugin.PluginRepository - parsing:
/var/www/html/nutch9loc/plugins/urlnormalizer-pass/plugin.xml
2007-08-20 10:15:28,955 DEBUG plugin.PluginRepository - plugin:
id=urlnormalizer-pass name=Pass-through URL Normalizer version=1.0.0
provider=nutch.orgclass=null
2007-08-20 10:15:28,955 DEBUG plugin.PluginRepository - impl:
point=org.apache.nutch.net.URLNormalizer
class=org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer
2007-08-20 10:15:28,957 DEBUG plugin.PluginRepository - not including:
creativecommons
2007-08-20 10:15:28,958 DEBUG plugin.PluginRepository - not including:
subcollection
2007-08-20 10:15:28,958 DEBUG plugin.PluginRepository - not including:
protocol-httpclient
2007-08-20 10:15:28,958 DEBUG plugin.PluginRepository - not including:
lib-regex-filter
2007-08-20 10:15:28,958 DEBUG plugin.PluginRepository - not including:
lib-lucene-analyzers
2007-08-20 10:15:28,959 DEBUG plugin.PluginRepository - not including:
parse-pdf
2007-08-20 10:15:28,959 DEBUG plugin.PluginRepository - not including:
parse-msexcel
2007-08-20 10:15:28,959 DEBUG plugin.PluginRepository - not including:
lib-http
2007-08-20 10:15:28,959 DEBUG plugin.PluginRepository - not including:
parse-swf
2007-08-20 10:15:28,959 DEBUG plugin.PluginRepository - not including:
parse-ext
2007-08-20 10:15:28,960 DEBUG plugin.PluginRepository - not including:
lib-log4j
2007-08-20 10:15:28,960 DEBUG plugin.PluginRepository - not including:
ontology
2007-08-20 10:15:28,960 DEBUG plugin.PluginRepository - not including:
protocol-ftp
2007-08-20 10:15:28,960 DEBUG plugin.PluginRepository - not including:
parse-zip
2007-08-20 10:15:28,960 DEBUG plugin.PluginRepository - not including:
nutch-extensionpoints
2007-08-20 10:15:28,961 DEBUG plugin.PluginRepository - not including:
index-more
2007-08-20 10:15:28,961 DEBUG plugin.PluginRepository - not including:
clustering-carrot2
2007-08-20 10:15:28,961 DEBUG plugin.PluginRepository - not including:
urlfilter-suffix
2007-08-20 10:15:28,961 DEBUG plugin.PluginRepository - not including:
query-more
2007-08-20 10:15:28,961 DEBUG plugin.PluginRepository - not including:
microformats-reltag
2007-08-20 10:15:28,961 DEBUG plugin.PluginRepository - not including:
language-identifier
2007-08-20 10:15:28,962 DEBUG plugin.PluginRepository - not including:
urlfilter-prefix
2007-08-20 10:15:28,962 DEBUG plugin.PluginRepository - not including:
lib-nekohtml
2007-08-20 10:15:28,962 DEBUG plugin.PluginRepository - not including:
protocol-smb
2007-08-20 10:15:28,962 DEBUG plugin.PluginRepository - not including:
parse-mspowerpoint
2007-08-20 10:15:28,962 DEBUG plugin.PluginRepository - not including:
parse-msword
2007-08-20 10:15:28,962 DEBUG plugin.PluginRepository - not including:
protocol-file
2007-08-20 10:15:28,963 DEBUG plugin.PluginRepository - not including:
lib-jakarta-poi
2007-08-20 10:15:28,963 DEBUG plugin.PluginRepository - not including:
lib-xml
2007-08-20 10:15:28,963 DEBUG plugin.PluginRepository - not including:
lib-parsems
2007-08-20 10:15:28,963 DEBUG plugin.PluginRepository - not including:
parse-rss
2007-08-20 10:15:28,963 DEBUG plugin.PluginRepository - not including:
parse-oo
2007-08-20 10:15:28,964 DEBUG plugin.PluginRepository - not including:
urlfilter-automaton
2007-08-20 10:15:28,964 DEBUG plugin.PluginRepository - not including:
summary-lucene
2007-08-20 10:15:28,990 DEBUG plugin.PluginRepository - Adding extension
point org.apache.nutch.clustering.OnlineClusterer
2007-08-20 10:15:28,990 DEBUG plugin.PluginRepository - Adding extension
point org.apache.nutch.indexer.IndexingFilter
2007-08-20 10:15:28,990 DEBUG plugin.PluginRepository - Adding extension
point org.apache.nutch.ontology.Ontology
2007-08-20 10:15:28,990 DEBUG plugin.PluginRepository - Adding extension
point org.apache.nutch.parse.Parser
2007-08-20 10:15:28,991 DEBUG plugin.PluginRepository - Adding extension
point org.apache.nutch.parse.HtmlParseFilter
2007-08-20 10:15:28,991 DEBUG plugin.PluginRepository - Adding extension
point org.apache.nutch.protocol.Protocol
2007-08-20 10:15:28,991 DEBUG plugin.PluginRepository - Adding extension
point org.apache.nutch.searcher.QueryFilter
2007-08-20 10:15:28,991 DEBUG plugin.PluginRepository - Adding extension
point org.apache.nutch.net.URLFilter
2007-08-20 10:15:28,991 DEBUG plugin.PluginRepository - Adding extension
point org.apache.nutch.net.URLNormalizer
2007-08-20 10:15:28,991 DEBUG plugin.PluginRepository - Adding extension
point org.apache.nutch.analysis.NutchAnalyzer
2007-08-20 10:15:28,992 DEBUG plugin.PluginRepository - Adding extension
point org.apache.nutch.searcher.Summarizer
2007-08-20 10:15:28,992 DEBUG plugin.PluginRepository - Adding extension
point org.apache.nutch.scoring.ScoringFilter
2007-08-20 10:15:28,992 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2007-08-20 10:15:28,992 INFO  plugin.PluginRepository - Registered Plugins:
2007-08-20 10:15:28,993 INFO  plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2007-08-20 10:15:28,993 INFO  plugin.PluginRepository - Site Query Filter
(query-site)
2007-08-20 10:15:28,993 INFO  plugin.PluginRepository - Basic URL Normalizer
(urlnormalizer-basic)
2007-08-20 10:15:28,993 INFO  plugin.PluginRepository - Html Parse Plug-in
(parse-html)
2007-08-20 10:15:28,993 INFO  plugin.PluginRepository - Pass-through URL
Normalizer (urlnormalizer-pass)
2007-08-20 10:15:28,993 INFO  plugin.PluginRepository - Regex URL Filter
Framework (lib-regex-filter)
2007-08-20 10:15:28,993 INFO  plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2007-08-20 10:15:28,993 INFO  plugin.PluginRepository - Basic Summarizer
Plug-in (summary-basic)
2007-08-20 10:15:28,994 INFO  plugin.PluginRepository - Text Parse Plug-in
(parse-text)
2007-08-20 10:15:28,994 INFO  plugin.PluginRepository - JavaScript Parser
(parse-js)
2007-08-20 10:15:28,994 INFO  plugin.PluginRepository - Regex URL Filter
(urlfilter-regex)
2007-08-20 10:15:28,994 INFO  plugin.PluginRepository - Basic Query Filter
(query-basic)
2007-08-20 10:15:28,994 INFO  plugin.PluginRepository - HTTP Framework
(lib-http)
2007-08-20 10:15:28,994 INFO  plugin.PluginRepository - URL Query Filter
(query-url)
2007-08-20 10:15:28,994 INFO  plugin.PluginRepository - Regex URL Normalizer
(urlnormalizer-regex)
2007-08-20 10:15:28,994 INFO  plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2007-08-20 10:15:28,994 INFO  plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2007-08-20 10:15:28,995 INFO  plugin.PluginRepository - OPIC Scoring Plug-in
(scoring-opic)
2007-08-20 10:15:28,995 INFO  plugin.PluginRepository - Registered
Extension-Points:
2007-08-20 10:15:28,995 INFO  plugin.PluginRepository - Nutch Summarizer
(org.apache.nutch.searcher.Summarizer)
2007-08-20 10:15:28,995 INFO  plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-08-20 10:15:28,995 INFO  plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-08-20 10:15:28,996 INFO  plugin.PluginRepository - Nutch URL Normalizer
(org.apache.nutch.net.URLNormalizer)
2007-08-20 10:15:28,996 INFO  plugin.PluginRepository - Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2007-08-20 10:15:28,996 INFO  plugin.PluginRepository - HTML Parse Filter
(org.apache.nutch.parse.HtmlParseFilter)
2007-08-20 10:15:28,996 INFO  plugin.PluginRepository - Nutch Online Search
Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-08-20 10:15:28,996 INFO  plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2007-08-20 10:15:28,996 INFO  plugin.PluginRepository - Nutch Content Parser
(org.apache.nutch.parse.Parser)
2007-08-20 10:15:28,997 INFO  plugin.PluginRepository - Ontology Model
Loader (org.apache.nutch.ontology.Ontology)
2007-08-20 10:15:28,997 INFO  plugin.PluginRepository - Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-08-20 10:15:28,997 INFO  plugin.PluginRepository - Nutch Query Filter
(org.apache.nutch.searcher.QueryFilter)
2007-08-20 10:15:29,140 INFO  conf.Configuration - found resource
crawl-urlfilter.txt at file:/var/www/html/nutch9loc/conf/crawl-urlfilter.txt
2007-08-20 10:15:29,166 DEBUG api.RegexURLFilterBase - Adding rule
[^(http|ftp|mailto):]
2007-08-20 10:15:29,171 DEBUG api.RegexURLFilterBase - Adding rule
[\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$]
2007-08-20 10:15:29,172 DEBUG api.RegexURLFilterBase - Adding rule [EMAIL 
PROTECTED]
2007-08-20 10:15:29,172 DEBUG api.RegexURLFilterBase - Adding rule [.*]
2007-08-20 10:15:29,190 DEBUG mapred.MapTask - Started thread: Sort progress
reporter for task map_g8nxpr
2007-08-20 10:15:29,191 INFO  mapred.LocalJobRunner -
file:/var/www/html/nutch9loc/urls/urls:0+205
2007-08-20 10:15:29,207 WARN  crawl.Injector - Skipping
smb://192.168.0.1/:java.net.MalformedURLException: unknown protocol: smb
2007-08-20 10:15:29,212 WARN  crawl.Injector - Skipping
smb://192.168.0.1:java.net.MalformedURLException: unknown protocol: smb
2007-08-20 10:15:29,213 WARN  crawl.Injector - Skipping
smb://192.168.0.101/:java.net.MalformedURLException: unknown protocol: smb
2007-08-20 10:15:29,214 WARN  crawl.Injector - Skipping
smb://192.168.0.101:java.net.MalformedURLException: unknown protocol: smb
2007-08-20 10:15:29,215 WARN  crawl.Injector - Skipping
smb:///192.168.0.101/:java.net.MalformedURLException: unknown protocol: smb
2007-08-20 10:15:29,216 WARN  crawl.Injector - Skipping
smb:///192.168.0.101:java.net.MalformedURLException: unknown protocol: smb
2007-08-20 10:15:29,217 WARN  regex.RegexURLNormalizer - can't find rules
for scope 'inject', using default
2007-08-20 10:15:29,231 DEBUG mapred.MapTask - opened spill0.out
2007-08-20 10:15:29,305 INFO  mapred.LocalJobRunner -
file:/var/www/html/nutch9loc/urls/urls:0+205


Why is Nutch not including the plugins listed above?
How can I make Nutch include them so that I can crawl Windows file
shares?
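
As far as I understand it, the "not including" lines come from the
plugin.includes property: a plugin id is kept only if the whole id matches
that regular expression, and libraries needed by the kept plugins are pulled
in through auto-activation. The Python sketch below only imitates that
matching to illustrate the idea. The pattern is an assumption modeled on a
typical default, so it should be compared with the plugin.includes value
actually set in nutch-default.xml and nutch-site.xml; is_included is a
made-up helper name, not a Nutch API.

```python
import re

# ASSUMPTION: a plugin.includes value resembling a stock Nutch default.
# The real value lives in conf/nutch-default.xml or conf/nutch-site.xml.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|"
    "query-(basic|site|url)|summary-basic|scoring-opic|"
    "urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if the entire plugin id matches the includes pattern.

    Hypothetical helper for illustration only.
    """
    return re.fullmatch(includes, plugin_id) is not None


print(is_included("protocol-http"))  # True
print(is_included("protocol-smb"))   # False: no alternative matches it
# Appending the id to the pattern makes it match.
print(is_included("protocol-smb", PLUGIN_INCLUDES + "|protocol-smb"))  # True
```

If this is what is happening, the plugin.includes value that Nutch actually
loads does not contain protocol-smb, so the edited config file may not be the
one being read, or the edit may not have been parsed.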

I still get the same error:

Skipping smb://192.168.0.1:java.net.MalformedURLException: unknown protocol: smb


Please help me.

Thanks,
Bikram




Renaud Richardet-4 wrote:
> 
> Hi Bikram,
> 
> All right, you need to check the format of your config files
> (nutch-default.xml and nutch-site.xml). Both should be well-formed XML
> (matching opening and closing tags, etc.) and have the correct structure:
> 
> <?xml version="1.0"?>
> <configuration>
>   <property>
>     <name>...</name>
>     <value>...</value>
>     <description>...</description>
>   </property>
> </configuration>
> 
> See the provided template for nutch-site.xml.
> 
> HTH,
> Renaud
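
One way to find out which file triggers the "bad conf file" messages is to
run each config file through a small structural check. The Python sketch
below is only an illustration and is not part of Nutch; check_conf is a
made-up helper name, and the two rules it applies are taken from the error
messages quoted in this thread (the top-level element must be
<configuration>, and its children must be <property>).

```python
import xml.etree.ElementTree as ET


def check_conf(xml_text):
    """Return a list of problems in a Nutch/Hadoop-style config document.

    Hypothetical helper for illustration; it is not part of Nutch.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # Unbalanced or unclosed tags end up here.
        return ["not well-formed XML: %s" % err]

    problems = []
    if root.tag != "configuration":
        problems.append(
            "top-level element is <%s>, not <configuration>" % root.tag
        )
    for child in root:
        if child.tag != "property":
            problems.append("element <%s> is not <property>" % child.tag)
    return problems


# A document with the structure shown in the quoted reply.
GOOD = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>example.name</name>
    <value>example value</value>
  </property>
</configuration>
"""

# Well-formed XML, but with the wrong element names.
BAD = "<config><prop/></config>"

print(check_conf(GOOD))  # prints []
for problem in check_conf(BAD):
    print(problem)
```

Passing the contents of each file under conf/ to check_conf in turn should
show which one has the wrong top-level element.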
> 
> 
> bikram wrote:
>> Hi,
>>
>> After rechecking the log, I noticed this error too:
>>
>> 2007-08-20 12:11:32,449 FATAL conf.Configuration - bad conf file:
>> top-level
>> element not <configuration>
>> 2007-08-20 12:11:32,450 WARN  conf.Configuration - bad conf file: element
>> not <property>
>> 2007-08-20 12:11:32,490 INFO  crawl.Crawl - crawl started in: localcrawl4
>> 2007-08-20 12:11:32,491 INFO  crawl.Crawl - rootUrlDir = urls
>> 2007-08-20 12:11:32,491 INFO  crawl.Crawl - threads = 10
>> 2007-08-20 12:11:32,491 INFO  crawl.Crawl - depth = 5
>> 2007-08-20 12:11:32,632 FATAL conf.Configuration - bad conf file:
>> top-level
>> element not <configuration>
>> 2007-08-20 12:11:32,632 WARN  conf.Configuration - bad conf file: element
>> not <property>
>> 2007-08-20 12:11:32,641 INFO  crawl.Injector - Injector: starting
>> 2007-08-20 12:11:32,642 INFO  crawl.Injector - Injector: crawlDb:
>> localcrawl4/crawldb
>> 2007-08-20 12:11:32,642 INFO  crawl.Injector - Injector: urlDir: urls
>> 2007-08-20 12:11:32,643 INFO  crawl.Injector - Injector: Converting
>> injected
>> urls to crawl db entries.
>> 2007-08-20 12:11:32,643 DEBUG conf.Configuration - java.io.IOException:
>> config(config)
>>      at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:102)
>>      at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:77)
>>      at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:88)
>>      at org.apache.nutch.util.NutchJob.<init>(NutchJob.java:27)
>>      at org.apache.nutch.crawl.Injector.inject(Injector.java:152)
>>      at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
>>
>>
>>
>> 2007-08-20 12:11:32,449 FATAL conf.Configuration - bad conf file:
>> top-level
>> element not <configuration>
>> 2007-08-20 12:11:32,450 WARN  conf.Configuration - bad conf file: element
>> not <property>
>>
>> Which configuration file is this message referring to?
>> I double-checked my conf files but could not work out which file
>> and which line is causing this error.
>>
>> Can someone help me fix these errors and get Nutch working?
>>
>>
>> Thanks,
>> Bikram
>>
>>
>>
>>   
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Windows-Share-Crawling---searching-tf4277499.html#a12255199
Sent from the Nutch - User mailing list archive at Nabble.com.
