Hi all

I am new to Nutch.

I have downloaded Nutch 0.9.

I want to crawl my local network (Windows shares and Linux shares).

I used this link as a reference:
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch


1) Downloaded the protocol-smb plugin:

http://issues.apache.org/jira/browse/NUTCH-427

2) Made the following changes in crawler-urlfilter.txt:

# skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
[EMAIL PROTECTED]

# skip everything else
# -.

# accept anything else 
+.*
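
To sanity-check which URLs these rules let through, here is a minimal Java sketch of a first-match-wins regex filter. It assumes the filter tries the rules top to bottom, uses a partial (find) match, and rejects a URL when no rule matches. The class UrlFilterSketch and its methods are hypothetical stand-ins, not Nutch's own classes, and the two garbled rules above are left out because their patterns are not recoverable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a regex URL filter; NOT Nutch's implementation.
final class UrlFilterSketch {

    // One rule: '+' means accept, '-' means reject.
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    UrlFilterSketch(String... lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("rule must start with + or -: " + trimmed);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    // Assumption: the first rule whose pattern is found in the URL decides.
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false; // assumption: no matching rule means the URL is ignored
    }

    // The recoverable rules from the post, in their original order.
    static UrlFilterSketch fromPost() {
        return new UrlFilterSketch(
            "-^(http|ftp|mailto):",
            "-\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$",
            "+.*");
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = fromPost();
        String[] samples = {
            "file:///var/test.txt",
            "smb://192.168.0.1/share/report.pdf",
            "smb://192.168.0.1/share/logo.gif",
            "http://example.com/"
        };
        for (String url : samples) {
            System.out.println((filter.accepts(url) ? "accept " : "reject ") + url);
        }
    }
}
```

Under these assumptions the file: and smb: URLs pass the filter (apart from the skipped suffixes), so the filter file is probably not what produces the errors in steps 6 and 7.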


3) Made the following changes in nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-smb|protocol-file|urlfilter-regex|parse-(text|html|js|pdf|msword|zip|mspowerpoint|msexcel)|index-basic|query-(basic|site|url)</value>
  <description></description>
</property>
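
A quick way to check which plugin ids this value selects is to test it as a regular expression. The sketch below assumes the plugin.includes value is matched against the whole plugin id, which is how I understand the property but have not verified in the Nutch source. PluginIncludesCheck and included() are hypothetical names used only for this check.

```java
import java.util.regex.Pattern;

// Hypothetical helper; it only exercises the regex from nutch-site.xml above.
final class PluginIncludesCheck {

    // The plugin.includes value from the post, joined onto one line.
    static final String PLUGIN_INCLUDES =
        "nutch-extensionpoints|protocol-smb|protocol-file|urlfilter-regex|"
        + "parse-(text|html|js|pdf|msword|zip|mspowerpoint|msexcel)|"
        + "index-basic|query-(basic|site|url)";

    // Assumption: a plugin is enabled when its whole id matches the regex.
    static boolean included(String pluginId) {
        return Pattern.matches(PLUGIN_INCLUDES, pluginId);
    }

    public static void main(String[] args) {
        String[] ids = {"protocol-smb", "protocol-file", "protocol-http", "parse-pdf", "query-site"};
        for (String id : ids) {
            System.out.println(id + " -> " + (included(id) ? "included" : "excluded"));
        }
    }
}
```

If that assumption holds, protocol-smb and protocol-file are both selected by the expression, which suggests the value itself is fine and the question is whether the plugins are actually found and loaded.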



4) The urls file contains smb:// URLs for the hostnames and shares.

5) The Windows login details (username, password, IP address, etc.) are entered in smb.properties.

6) Running bin/nutch crawl urls -dir localcrawl gives this error:

smb://192.168.0.1/:java.net.MalformedURLException: unknown protocol: smb
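
The same exception can be reproduced outside Nutch with plain Java, which suggests the JVM itself has no stream handler registered for the smb scheme. The sketch below uses only java.net.URL; UrlSchemeCheck and check() are hypothetical names. It assumes no SMB handler library is on the classpath when it runs.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper showing how java.net.URL treats an unregistered scheme.
final class UrlSchemeCheck {

    // Returns "ok: <scheme>" if the JVM can parse the URL, else the error text.
    @SuppressWarnings("deprecation") // the URL(String) constructor is deprecated on recent JDKs
    static String check(String spec) {
        try {
            URL url = new URL(spec);
            return "ok: " + url.getProtocol();
        } catch (MalformedURLException e) {
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // file: has a built-in handler; smb: does not on a stock JVM.
        System.out.println(check("file:///var/test.txt"));
        System.out.println(check("smb://192.168.0.1/"));
    }
}
```

On a stock JVM the smb URL fails here for the same reason as in the crawl. Whether the protocol-smb plugin is meant to register its own handler (for example through the standard java.protocol.handler.pkgs system property) is an assumption that would need checking against the plugin itself.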

7) I also tried crawling local files, but got the following error:

file:///var/test.txt failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file

Are the above settings correct for crawling local Windows shares?

Can someone guide me on what to do? Where am I going wrong?

Thanks,

Bikram
-- 
View this message in context: 
http://www.nabble.com/Windows-Share-Crawling---searching-tf4277499.html#a12175266
Sent from the Nutch - User mailing list archive at Nabble.com.