OK, I got it. I also used the plugin protocol-smb, and there was an error in its plugin.xml: the plugin id was set to "protocol-file" instead of "protocol-smb", so there were two plugins with the id "protocol-file".
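For reference, the fix described above amounts to giving the SMB plugin its own id in its plugin.xml. Only the id attribute changes; the rest of the file stays as posted below:

```xml
<plugin id="protocol-smb" name="SMB Protocol Plug-in" version="1.0.0"
        provider-name="iDNA Solutions LTD">
  <!-- runtime, requires and extension sections unchanged -->
</plugin>
```

With distinct ids, Nutch can register both protocol-file and protocol-smb, and file: URLs resolve again.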
check out https://issues.apache.org/jira/browse/NUTCH-427

<?xml version="1.0" encoding="UTF-8" ?>
<!--
  Document   : plugin.xml
  Created on : 03 January 2007, 10:41
  Author     : Armel T. Nene
  Description: This file is used by Nutch to configure the SMB protocol
-->
<plugin id="protocol-file" name="SMB Protocol Plug-in" version="1.0.0"
        provider-name="iDNA Solutions LTD">
  <runtime>
    <library name="protocol-smb.jar">
      <export name="*" />
    </library>
    <library name="jcifs-1.2.12.jar" />
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints" />
  </requires>
  <extension id="org.apache.nutch.protocol.smb" name="SMBProtocol"
             point="org.apache.nutch.protocol.Protocol">
    <implementation id="org.apache.nutch.protocol.smb.SMB"
                    class="org.apache.nutch.protocol.smb.SMB">
      <parameter name="protocolName" value="SMB" />
    </implementation>
  </extension>
</plugin>

Ever wrote:
>
> Hi there,
> I have a problem getting the local file system crawled by Nutch. My
> current setup is as follows: a Nutch trunk version that compiled nicely
> without any errors; protocol-file in particular is OK. I also tried to
> add protocol-file.jar to the lib path, but got the same bad result. Is
> there anything more I can look for?
>
> Thank you in advance!
>
> regards
>
> ========== Log Output ====================
> bash-3.2$ ./nutch crawl urls.txt -dir crawl -threads 1
> crawl started in: crawl
> rootUrlDir = urls.txt
> threads = 1
> depth = 5
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls.txt
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20070521184833
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20070521184833
> Fetcher: threads: 1
> fetching file:///C:/temp/test/
> fetch of file:///C:/temp/test/ failed with:
> org.apache.nutch.protocol.ProtocolNotFound: protocol not found for
> url=file
> Fetcher: done
> CrawlDb update: starting
> ....
>
> =====================================
>
> My configuration:
>
> <configuration>
>
>   <property>
>     <name>file.content.limit</name>
>     <value>-1</value>
>   </property>
>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-file|protocol-smb|urlfilter-crawl|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>     <description>
>     </description>
>   </property>
>
> </configuration>
>
> ==================
>
> My crawl-urlfilter.txt:
>
> # skip ftp: & mailto: urls, accept file: & smb:
> -^(ftp|mailto):
> +^(file|smb):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break
> # loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> # accept hosts in MY.DOMAIN.NAME
> # Standard: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> #+^http://([a-z0-9]*\.)*apache.org/
>
> # skip everything else
> -.
>
> ==========
>
> My command-line args (I got them by echoing the last line of ./nutch):
>
> C:\Programme\Java_jdk1.5.0_04/bin/java -Xmx1000m
> -Dhadoop.log.dir=c:\Dipl\Nutch\logs -Dhadoop.log.file=hadoop.log
> -Djava.library.path=c:\Dipl\Nutch\lib\native\Windows_XP-x86-32
> -Djava.protocol.handler.pkgs=jcifs -classpath
> c:\Dipl\Nutch\conf;C;C:\Programme\Java_jdk1.5.0_04\lib\tools.jar;c:\Dipl\Nutch\build;c:\Dipl\Nutch\build\nutch-1.0-dev.job;c:\Dipl\Nutch\build\test\classes;c:\Dipl\Nutch\nutch-*.job;c:\Dipl\Nutch\lib\commons-cli-2.0-SNAPSHOT.jar;c:\Dipl\Nutch\lib\commons-codec-1.3.jar;c:\Dipl\Nutch\lib\commons-httpclient-3.0.1.jar;c:\Dipl\Nutch\lib\commons-lang-2.1.jar;c:\Dipl\Nutch\lib\commons-logging-1.0.4.jar;c:\Dipl\Nutch\lib\commons-logging-api-1.0.4.jar;c:\Dipl\Nutch\lib\hadoop-0.12.2-core.jar;c:\Dipl\Nutch\lib\jakarta-oro-2.0.7.jar;c:\Dipl\Nutch\lib\jets3t-0.5.0.jar;c:\Dipl\Nutch\lib\jetty-5.1.4.jar;c:\Dipl\Nutch\lib\junit-3.8.1.jar;c:\Dipl\Nutch\lib\log4j-1.2.13.jar;c:\Dipl\Nutch\lib\lucene-core-2.1.0.jar;c:\Dipl\Nutch\lib\lucene-misc-2.1.0.jar;c:\Dipl\Nutch\lib\servlet-api.jar;c:\Dipl\Nutch\lib\taglibs-i18n.jar;c:\Dipl\Nutch\lib\xerces-2_6_2-apis.jar;c:\Dipl\Nutch\lib\xerces-2_6_2.jar;c:\Dipl\Nutch\lib\jetty-ext\ant.jar;c:\Dipl\Nutch\lib\jetty-ext\commons-el.jar;c:\Dipl\Nutch\lib\jetty-ext\jasper-compiler.jar;c:\Dipl\Nutch\lib\jetty-ext\jasper-runtime.jar;c:\Dipl\Nutch\lib\jetty-ext\jsp-api.jar;C:\Dipl\Nutch\build\protocol-file\protocol-file.jar
> org.apache.nutch.crawl.Crawl urls.txt -dir crawl
>
> =========
> My urls.txt:
>
> file:///C:/temp/test/

--
View this message in context: http://www.nabble.com/Crawling-Local-file-System-tf3791589.html#a10737391
Sent from the Nutch - User mailing list archive at Nabble.com.
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
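To see why two plugins sharing one id break protocol lookup, here is a hypothetical sketch (not Nutch's actual ExtensionPoint machinery) of a registry keyed by plugin id: the misnamed SMB plugin registers under "protocol-file" and silently replaces the real file-protocol plugin, so no handler remains for file: URLs.

```python
# Hypothetical sketch of a plugin registry keyed by plugin id.
# Nutch's real plugin system is more involved; this only illustrates
# why a duplicate id shadows an earlier plugin.

def register(registry, plugin_id, protocol):
    # A later registration with the same id overwrites the earlier one.
    registry[plugin_id] = protocol

registry = {}
register(registry, "protocol-file", "file")  # the real protocol-file plugin
register(registry, "protocol-file", "smb")   # misnamed SMB plugin clobbers it

# Only one entry survives, so "file" can no longer be resolved:
protocols = set(registry.values())
print("file" in protocols)  # False -> "protocol not found for url=file"
```

Renaming the SMB plugin's id to "protocol-smb" gives each protocol its own registry entry, matching the fix at the top of this message.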
