[EMAIL PROTECTED] wrote:
Hi all,
I am new to Nutch.
I have downloaded Nutch 0.9.
I want to crawl my local network (Windows shares and Linux shares).
I tried this link as a reference:
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
1) Downloaded the protocol-smb plugin from:
http://issues.apache.org/jira/browse/NUTCH-427
2) Made the following changes in crawl-urlfilter.txt:
# skip file:, ftp:, & mailto: urls
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
[EMAIL PROTECTED]
# skip everything else
# -.
# accept anything else
+.*
3) Made the following changes in nutch-site.xml:
<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|protocol-smb|protocol-file|urlfilter-regex|parse-(text|html|js|pdf|msword|zip|mspowerpoint|msexcel)|index-basic|query-(basic|site|url)</value>
<description></description>
</property>
4) The urls file lists the shares as smb://hostname/share URLs.
5) The Windows login details (username, password, IP address, etc.) are entered in smb.properties.
6) Running bin/nutch crawl urls -dir localcrawl gives this error:
smb://192.168.0.1/:java.net.MalformedURLException: unknown protocol: smb
7) Tried crawling local files instead, but got the following error:
file:///var/test.txt failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
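The message in step 6 is raised by java.net.URL itself, which only parses schemes it has a stream handler for. Below is a minimal Java sketch of that behaviour, assuming a plain JVM with no smb handler installed. The DummySmbHandler class is a hypothetical stand-in for illustration only; it is not the handler shipped with the NUTCH-427 patch.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

public class SmbUrlSketch {

    // Hypothetical stand-in for a real smb handler. It only lets the URL
    // be parsed; it cannot open connections.
    static final class DummySmbHandler extends URLStreamHandler {
        @Override
        protected URLConnection openConnection(URL u) throws IOException {
            throw new IOException("sketch only: no real smb support");
        }
    }

    // Parse using only the handlers the JVM already knows about.
    // Returns the exception message, or null when parsing succeeds.
    static String parseWithDefaults(String spec) {
        try {
            new URL(spec);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    // Parse with an explicitly supplied handler for the scheme.
    // Returns the parsed host, or null when parsing still fails.
    static String hostWithHandler(String spec) {
        try {
            return new URL(null, spec, new DummySmbHandler()).getHost();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Without a handler, this prints the same message as in step 6.
        System.out.println(parseWithDefaults("smb://192.168.0.1/"));
        // With a handler supplied, the same string parses.
        System.out.println(hostWithHandler("smb://192.168.0.1/"));
    }
}
```

If the first call reports "unknown protocol: smb", the JVM running the crawl has no smb handler registered. That suggests the plugin and its libraries are not being loaded, rather than a problem with the share or the login details.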
Are the above settings correct for crawling local Windows shares?
Can someone tell me what I am doing wrong?
Thanks,
Bikram
Hi,
protocol-smb is a Nutch plugin. See the following link for help:
http://wiki.apache.org/nutch/WritingPluginExample-0.9
Remember to rebuild with ant after you add this protocol plugin to Nutch.
To check whether the plugin has been activated, use the command:
bin/nutch plugin protocol-smb org.apache.nutch.protocol.smb.[class name here!]
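Before rebuilding, it can also help to confirm that the plugin.includes value really selects the plugins you expect. Here is a rough Java sketch, assuming Nutch treats the value as a regular expression that must match the whole plugin id. The PluginIncludesSketch class and its isIncluded helper are made up for this check and are not part of Nutch.

```java
import java.util.regex.Pattern;

public class PluginIncludesSketch {

    // Value copied from the plugin.includes property quoted in this thread,
    // written as a single unbroken string.
    static final String INCLUDES =
        "nutch-extensionpoints|protocol-smb|protocol-file|urlfilter-regex|"
        + "parse-(text|html|js|pdf|msword|zip|mspowerpoint|msexcel)|"
        + "index-basic|query-(basic|site|url)";

    // Assumption: a plugin is enabled when its whole id matches the regex.
    static boolean isIncluded(String pluginId) {
        return Pattern.compile(INCLUDES).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {"protocol-smb", "protocol-file", "protocol-http", "query-site"};
        for (String id : ids) {
            System.out.println(id + " included: " + isIncluded(id));
        }
    }
}
```

A stray line break or space inside the value in nutch-site.xml changes what the expression matches, so it is worth keeping the whole value on one line.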