hi Bikram,

bikram wrote:
hi..

- try increasing the log level of the plugin loader, to see whether all plugins
are loaded successfully. To do that,
add this line to conf/log4j.properties:
log4j.logger.org.apache.nutch.plugin=DEBUG,cmdstdout

But first, check logs/hadoop.log. It should contain a line like
INFO  plugin.PluginRepository -     File Protocol Plug-in (protocol-file)
and a similar one for the samba plugin. If either line is missing, that plugin was not loaded and your configuration is not correct.
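
The log check above can be scripted. This is only a minimal sketch in Python: it assumes the log lines look like the example above, with the plugin id in parentheses at the end of the line, and the function names and the plugin ids passed as `wanted` are illustrative, not part of Nutch.

```python
import re

# Assumed line shape, taken from the example in this thread:
# "INFO  plugin.PluginRepository -     File Protocol Plug-in (protocol-file)"
PLUGIN_LINE = re.compile(r"plugin\.PluginRepository\s+-\s+.*\(([\w.-]+)\)\s*$")


def registered_plugins(log_text):
    """Return the set of plugin ids found in PluginRepository log lines."""
    ids = set()
    for line in log_text.splitlines():
        match = PLUGIN_LINE.search(line)
        if match:
            ids.add(match.group(1))
    return ids


def missing_plugins(log_text, wanted=("protocol-file", "protocol-smb")):
    """Return the wanted plugin ids that never show up in the log text."""
    found = registered_plugins(log_text)
    return [plugin_id for plugin_id in wanted if plugin_id not in found]
```

To use it, read logs/hadoop.log into a string and pass it to `missing_plugins`; an empty list means both protocol plugins were registered.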

HTH,
Renaud


Sorry for being so naive, but
how do I increase the log level of the plugin loader?

Thanks
Bikram



bikram wrote:
Hi Renaud

Firstly, thanks for the reply.

Yes, I have read about the issues and did the following:

1) Copied the JCIFS jar from protocol-smb to JAVA_HOME/jre/lib/ext
2) Set the JVM option "-Djava.protocol.handler.pkgs=jcifs", in the profile only

but I still get the same error:

Skipping smb://192.168.0.1:java.net.MalformedURLException: unknown protocol: smb

Even the file protocol is not working:

file:///root/test.txt failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file

thanx Bikram


Renaud Richardet-4 wrote:
hi Bikram,

- have you read the issues described in http://issues.apache.org/jira/browse/NUTCH-427?
- try to increase the log level of the plugin loader, to see if all plugins are loaded successfully

HTH,
Renaud


bikram wrote:
Hi all

I am new to Nutch.
I have downloaded Nutch 0.9.


I want to crawl my local network (Windows shares & Linux shares).

I used this page as a reference:
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch

1) Downloaded the protocol-smb plugin from

http://issues.apache.org/jira/browse/NUTCH-427

2) Made the following changes in crawl-urlfilter.txt

# skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
[EMAIL PROTECTED]

# skip everything else
# -.

# accept anything else
+.*
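
To see which URLs these rules would let through, here is a rough Python sketch. It assumes the regex URL filter applies the rules top to bottom, lets the first matching rule decide, and rejects a URL that no rule matches; that is my understanding of urlfilter-regex, not something confirmed in this thread. The two rules shown as [EMAIL PROTECTED] are left out because their patterns were lost, and `RULES` and `accepts` are illustrative names.

```python
import re

# (sign, pattern) pairs copied from the rules above, in file order.
# The two rules garbled as [EMAIL PROTECTED] are omitted.
RULES = [
    ("-", r"^(http|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|gz|rpm|tgz"
          r"|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("+", r".*"),
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):  # pattern may match anywhere in the URL
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected
```

With these rules, `accepts("smb://192.168.0.1/")` and `accepts("file:///var/test.txt")` are both true, so under these assumptions the filter is not what rejects the smb and file URLs.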


3) Made the following changes in nutch-site.xml

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-smb|protocol-file|urlfilter-regex|parse-(text|html|js|pdf|msword|zip|mspowerpoint|msexcel)|index-basic|query-(basic|site|url)</value>
  <description></description>
</property>
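
A quick way to sanity-check the plugin.includes value is to test plugin ids against it. The Python sketch below assumes the value is a regular expression that must match a plugin id in full; I believe that is how the plugin loader uses it, but treat it as an assumption. `PLUGIN_INCLUDES` and `is_included` are illustrative names.

```python
import re

# The plugin.includes value from the property above, joined onto one logical line.
PLUGIN_INCLUDES = (
    "nutch-extensionpoints|protocol-smb|protocol-file|urlfilter-regex|"
    "parse-(text|html|js|pdf|msword|zip|mspowerpoint|msexcel)|index-basic|"
    "query-(basic|site|url)"
)


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if the whole plugin id matches the includes expression."""
    return re.fullmatch(includes, plugin_id) is not None
```

Under that assumption, `is_included("protocol-smb")` and `is_included("protocol-file")` are true, while `is_included("protocol-http")` is false. A line break inside the value (for example in the middle of `site`) would change what the expression matches, so the value is best kept on one line.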



4) The urls file consists of smb: URLs (hostnames/shares)

5) The Windows login details (username, password, IP address etc.) are entered in smb.properties

6) bin/nutch crawl urls -dir localcrawl gives the error

smb://192.168.0.1/:java.net.MalformedURLException: unknown protocol: smb

7) Tried crawling local files but got the following error

file:///var/test.txt failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file

Are the above settings correct for crawling local Windows shares?

Can someone guide me on what to do? Where am I going wrong?

Thanx

Bikram


