hi Bikram,
bikram wrote:
hi..
- try to increase the log level of the plugin loader, to see if all plugins
are loaded successfully
add this line in conf/log4j.properties:
log4j.logger.org.apache.nutch.plugin=DEBUG,cmdstdout
but actually, first check the logs in logs/hadoop.log, there should be a
line like
INFO plugin.PluginRepository - File Protocol Plug-in (protocol-file)
and another for the samba plugin. otherwise, your configuration is not
correct.
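As a rough way to automate that check, here is a minimal Java sketch that scans the log for plugin registration lines. It assumes the log lines look like the example above, with the plugin id in parentheses; the class name PluginLogCheck and the method missingPlugins are made up for illustration and are not part of Nutch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch.
public class PluginLogCheck {

    // Returns the plugin ids for which no PluginRepository log line
    // mentions "(id)". Assumes the line format quoted in this thread.
    static List<String> missingPlugins(List<String> logLines, List<String> pluginIds) {
        List<String> missing = new ArrayList<>();
        for (String id : pluginIds) {
            boolean found = false;
            for (String line : logLines) {
                if (line.contains("plugin.PluginRepository") && line.contains("(" + id + ")")) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                missing.add(id);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Default path taken from the thread; pass another path as the first argument.
        Path log = Paths.get(args.length > 0 ? args[0] : "logs/hadoop.log");
        if (!Files.exists(log)) {
            System.out.println("log not found: " + log);
            return;
        }
        List<String> ids = Arrays.asList("protocol-file", "protocol-smb");
        System.out.println("not registered: " + missingPlugins(Files.readAllLines(log), ids));
    }
}
```

If either id is reported as not registered, the plugin.includes setting or the plugin directory is the first thing to look at.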
HTH,
Renaud
Sorry for being so naive, but
how do I increase the log level of the plugin loader?
thanx
Bikram
bikram wrote:
Hi Renaud
Firstly Thanx for the reply...
Yes i have read about the issues and did the following....
1) copied JCIFS jar from protocol-smb to JAVA_HOME/jre/lib/ext
2) Have set the JVM options to "-Djava.protocol.handler.pkgs=jcifs" in the
profile only
but same error
Skipping smb://192.168.0.1:java.net.MalformedURLException: unknown
protocol: smb
Even the file protocol is not working
file:///root/test.txt failed with:
org.apache.nutch.protocol.ProtocolNotFound: protocol not found for
url=file
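The two errors above seem to come from different layers, which a small standalone test can help separate. The sketch below only uses java.net.URL: the JDK resolves a protocol handler by looking for a class named <package>.<protocol>.Handler under each package prefix listed in the java.protocol.handler.pkgs system property, so with the prefix jcifs it would look for jcifs.smb.Handler. The class name UrlHandlerCheck and the method tryParse are made up for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of Nutch or JCIFS.
public class UrlHandlerCheck {

    // Returns null if java.net.URL accepts the string,
    // otherwise the message of the MalformedURLException.
    static String tryParse(String spec) {
        try {
            new URL(spec);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Shows whether the JVM option actually reached this JVM.
        System.out.println("java.protocol.handler.pkgs = "
                + System.getProperty("java.protocol.handler.pkgs"));
        for (String spec : new String[] {"file:///root/test.txt", "smb://192.168.0.1/"}) {
            String err = tryParse(spec);
            System.out.println(spec + " -> " + (err == null ? "ok" : err));
        }
    }
}
```

Run it with the same JVM and the same -Djava.protocol.handler.pkgs=jcifs option that Nutch is started with. If smb still reports an unknown protocol here, the option or the JCIFS jar is probably not visible to the JVM. The file: URL should always parse at this level, so the ProtocolNotFound error for file most likely comes from the Nutch plugin loading rather than from the JVM.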
thanx
Bikram
Renaud Richardet-4 wrote:
hi Bikram,
- have you read the issues described in
http://issues.apache.org/jira/browse/NUTCH-427?
- try to increase the log level of the plugin loader, to see if all
plugins are loaded successfully
HTH,
Renaud
bikram wrote:
Hi all
I am new to nutch..
I have downloaded Nutch 0.9
I want to crawl my local network (Windows shares & Linux shares)
I tried this link as a reference:
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
1) Downloaded the protocol-smb
http://issues.apache.org/jira/browse/NUTCH-427
2) Made the following changes in crawl-urlfilter.txt
# skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
[EMAIL PROTECTED]
# skip everything else
# -.
# accept anything else
+.*
3) Made the following changes in nutch-site.xml
<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|protocol-smb|protocol-file|urlfilter-regex|parse-(text|html|js|pdf|msword|zip|mspowerpoint|msexcel)|index-basic|query-(basic|site|url)</value>
<description></description>
</property>
4) the urls file consists of smb:hostnames/shares entries
5) The windows login details >> username/password/ip address etc are
entered in smb.properties
6) bin/nutch crawl urls -dir localcrawl gives the error
smb://192.168.0.1/:java.net.MalformedURLException: unknown protocol:
smb
7) Tried crawling Files but got following error
file:///var/test.txt failed with:
org.apache.nutch.protocol.ProtocolNotFound: protocol not found for
url=file
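To rule out the URL filter from step 2 as the cause, the rules can be tried outside Nutch. The sketch below assumes the regex filter applies rules top to bottom and that the first matching rule decides ('+' accepts, '-' rejects); it is a simplified stand-in, not the Nutch implementation. The two rules that were garbled in this mail are left out, the suffix list is shortened, and the class name FirstMatchFilter is made up for illustration.

```java
import java.util.regex.Pattern;

// Simplified stand-in for a first-match-wins regex URL filter (assumption).
public class FirstMatchFilter {

    // Rules from step 2, minus the two garbled ones; suffix list shortened.
    static final String[] RULES = {
        "-^(http|ftp|mailto):",
        "-\\.(gif|GIF|jpg|JPG|png|PNG|exe)$",
        "+.*"
    };

    // Returns true if the first rule whose pattern is found in the URL starts with '+'.
    static boolean accepts(String url) {
        for (String rule : RULES) {
            boolean include = rule.charAt(0) == '+';
            if (Pattern.compile(rule.substring(1)).matcher(url).find()) {
                return include;
            }
        }
        return false; // no rule matched
    }

    public static void main(String[] args) {
        String[] urls = {
            "smb://192.168.0.1/", "file:///var/test.txt", "http://example.com/"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + (accepts(url) ? "accepted" : "rejected"));
        }
    }
}
```

Under these assumptions both the smb: and the file: URL pass the filter, which suggests the errors in steps 6 and 7 come from protocol handling rather than from the filter rules.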
Are the above settings correct for crawling local Windows shares?
Can someone guide me on what to do and where I went wrong?
Thanx
Bikram