Hi all,

I am new to Nutch.
I have downloaded Nutch 0.9 and want to crawl my local network (Windows shares and Linux shares). I used this link as a reference:
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch

1) Downloaded the protocol-smb plugin from http://issues.apache.org/jira/browse/NUTCH-427

2) Made the following changes in crawl-urlfilter.txt:

    # skip file:, ftp:, & mailto: urls
    -^(http|ftp|mailto):

    # skip image and other suffixes we can't yet parse
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

    # skip URLs containing certain characters as probable queries, etc.
    [EMAIL PROTECTED]

    # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
    [EMAIL PROTECTED]

    # skip everything else
    # -.

    # accept anything else
    +.*

3) Made the following changes in nutch-site.xml:

    <property>
      <name>plugin.includes</name>
      <value>nutch-extensionpoints|protocol-smb|protocol-file|urlfilter-regex|parse-(text|html|js|pdf|msword|zip|mspowerpoint|msexcel)|index-basic|query-(basic|site|url)</value>
      <description></description>
    </property>

4) The urls file consists of smb:hostnames/shares

5) The Windows login details (username, password, IP address, etc.) are entered in smb.properties.

6) Running

    bin/nutch crawl urls -dir localcrawl

gives this error:

    smb://192.168.0.1/:java.net.MalformedURLException: unknown protocol: smb

7) I also tried crawling local files, but got the following error:

    file:///var/test.txt failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file

Are the above settings correct for crawling local Windows shares? Can someone guide me on what to do and where I am going wrong?

Thanks,
Bikram
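For anyone trying to reproduce the error in step 6 outside Nutch: the "unknown protocol: smb" message is what java.net.URL reports when the JVM has no stream handler registered for a scheme. The sketch below is plain JDK code, not Nutch code. The class name SmbUrlCheck and the DummyHandler class are hypothetical names made up for this illustration, and the handler does no real SMB I/O. It only shows that the same URL string parses once some handler is supplied, which suggests (as an assumption, not a confirmed diagnosis) that the protocol-smb plugin or its handler is not being picked up at runtime.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

public class SmbUrlCheck {

    // Returns null if the JVM can parse the URL with its built-in handlers,
    // otherwise the message of the MalformedURLException.
    static String parseError(String spec) {
        try {
            new URL(spec);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    // Hypothetical stand-in handler: it exists only so that URL parsing
    // succeeds. It does not talk to any SMB server.
    static class DummyHandler extends URLStreamHandler {
        @Override
        protected URLConnection openConnection(URL u) throws IOException {
            throw new IOException("not implemented in this sketch");
        }
    }

    // Returns true if the URL parses when a handler is passed in explicitly.
    static boolean parsesWithHandler(String spec) {
        try {
            new URL(null, spec, new DummyHandler());
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // smb: has no built-in handler in the JDK, file: does.
        System.out.println("smb without handler:  " + parseError("smb://192.168.0.1/"));
        System.out.println("file without handler: " + parseError("file:///var/test.txt"));
        System.out.println("smb with handler:     " + parsesWithHandler("smb://192.168.0.1/"));
    }
}
```

Note that the file: URL parses fine at the JDK level, so the ProtocolNotFound error in step 7 is raised by Nutch itself rather than by java.net.URL; that one is more likely related to which plugins Nutch actually loaded.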
