Hi Renaud,

Firstly, thanks for the reply.

Yes, I have read about the issues and did the following:

1) Copied the JCIFS jar from protocol-smb to JAVA_HOME/jre/lib/ext
2) Set the JVM options to "-Djava.protocol.handler.pkgs=jcifs" in the profile only

but I still get the same error:

Skipping smb://192.168.0.1:java.net.MalformedURLException: unknown protocol: smb

Even file: URLs are not working:

file:///root/test.txt failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file

Thanks,
Bikram


Renaud Richardet-4 wrote:
>
> hi Bikram,
>
> - have you read the issues described in
>   http://issues.apache.org/jira/browse/NUTCH-427?
> - try to increase the log level of the plugin loader, to see if all
>   plugins are loaded successfully
>
> HTH,
> Renaud
>
>
> bikram wrote:
>> Hi all
>>
>> I am new to nutch..
>>
>> I have downloaded Nutch 9.0
>>
>> I want to crawl my local network (Windows shares & Linux shares)
>>
>> tried this link as referance
>> http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
>>
>> 1) Downloaded the protocol-smb
>>
>> http://issues.apache.org/jira/browse/NUTCH-427
>>
>> 2) Made following changes in crawler-urlfilter.txt
>>
>> # skip file:, ftp:, & mailto: urls
>> -^(http|ftp|mailto):
>>
>> # skip image and other suffixes we can't yet parse
>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>>
>> # skip URLs containing certain characters as probable queries, etc.
>> [EMAIL PROTECTED]
>>
>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>> [EMAIL PROTECTED]
>>
>> # skip everything else
>> # -.
>>
>> # accept anything else
>> +.*
>>
>> 3) Made following changes in nutch-site.xml
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>nutch-extensionpoints|protocol-smb|protocol-file|urlfilter-regex|parse-(text|html|js|pdf|msword|zip|mspowerpoint|msexcel)|index-basic|query-(basic|site|url)</value>
>>   <description></description>
>> </property>
>>
>> 4) the urls file consists smb:hostnames/shares
>>
>> 5) The windows login details
>> username/password/ip address etc are
>> entered in smb.properties
>>
>> 6) bin/nutch crawl urls -dir localcrawl give error
>>
>> smb://192.168.0.1/:java.net.MalformedURLException: unknown protocol: smb
>>
>> 7) Tried crawling Files but got following error
>>
>> file:///var/test.txt failed with:
>> org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
>>
>> Is the above setting correct to crawl local windows shares
>>
>> Can some one guide me what to do ... where am i wrong???
>>
>> Thanx
>>
>> Bikram
>>
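
To separate the JVM-level problem from the Nutch plugin problem, a minimal sketch like the one below could be run with the same JVM options. The class name ProtocolCheck is made up for illustration and is not part of Nutch or JCIFS. It only asks java.net.URL whether a stream handler exists for each scheme, which is where "unknown protocol: smb" comes from. It does not exercise Nutch's own plugin lookup, so it says nothing about the ProtocolNotFound error for file: URLs.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper (not part of Nutch or JCIFS): reports whether this JVM
// can construct a URL for a given scheme, i.e. whether a handler is registered.
class ProtocolCheck {

    // Returns true if java.net.URL accepts the spec, false if it throws
    // MalformedURLException (for example "unknown protocol: smb").
    static boolean isSupported(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Show what the JVM actually received for the handler package list.
        System.out.println("java.protocol.handler.pkgs = "
                + System.getProperty("java.protocol.handler.pkgs"));

        // Default to the two URLs from this thread if none are given.
        String[] urls = args.length > 0
                ? args
                : new String[] {"smb://192.168.0.1/", "file:///root/test.txt"};

        for (String u : urls) {
            System.out.println(u + " -> "
                    + (isSupported(u) ? "handler found" : "unknown protocol"));
        }
    }
}
```

It could be started with something like "java -Djava.protocol.handler.pkgs=jcifs -cp .:path/to/jcifs.jar ProtocolCheck", where path/to/jcifs.jar is a placeholder for wherever the jar really is. If the property prints as null, or smb is still reported as an unknown protocol, then the option or the jar is probably not reaching the JVM that bin/nutch starts.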
