Hi there,
I have a problem getting Nutch to crawl the local filesystem. My current
setup is as follows: the Nutch trunk version, which compiled cleanly without
any errors; in particular, protocol-file built fine. I also tried adding
protocol-file.jar to the lib path, but got the same failure. Is there
anything else I can check?
Thank you in advance!
Regards
========== Log Output====================
bash-3.2$ ./nutch crawl urls.txt -dir crawl -threads 1
crawl started in: crawl
rootUrlDir = urls.txt
threads = 1
depth = 5
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20070521184833
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20070521184833
Fetcher: threads: 1
fetching file:///C:/temp/test/
fetch of file:///C:/temp/test/ failed with:
org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
Fetcher: done
CrawlDb update: starting
....
=====================================
My Configuration:
<configuration>
<property>
<name>file.content.limit</name>
<value>-1</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-file|protocol-smb|urlfilter-crawl|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>
</description>
</property>
</configuration>
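For reference, a minimal sketch of how the plugin.includes value above can be sanity-checked. It assumes (nothing in this post confirms it) that Nutch treats the value as a single regular expression that must match a plugin id in full. The helper name is_included is hypothetical, and Python's re module stands in for Java's regex engine.

```python
import re

# Value copied from the plugin.includes property above.
PLUGIN_INCLUDES = (
    r"protocol-file|protocol-smb|urlfilter-crawl|parse-(text|html|js|pdf)"
    r"|index-basic|query-(basic|site|url)|summary-basic|scoring-opic"
    r"|urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id: str) -> bool:
    """Hypothetical helper: True if the whole plugin id matches the pattern.

    Assumption: the id must match in full, not merely contain a match.
    """
    return re.fullmatch(PLUGIN_INCLUDES, plugin_id) is not None


if __name__ == "__main__":
    for plugin_id in ("protocol-file", "protocol-smb", "protocol-http", "parse-pdf"):
        print(plugin_id, is_included(plugin_id))
```

Under that assumption the pattern itself admits protocol-file, so the value as written would not be what excludes the plugin.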
==================
my crawl-urlfilter.txt
# skip ftp: & mailto: urls, accept file: & smb: urls
-^(ftp|mailto):
+^(file|smb):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept hosts in MY.DOMAIN.NAME
# Standard: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
#+^http://([a-z0-9]*\.)*apache.org/
# skip everything else
-.
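The rules above can be traced with a short sketch. It assumes the filter applies rules top to bottom, that the first rule whose pattern is found anywhere in the URL decides ('+' accepts, '-' rejects), and that a URL matching no rule is rejected. The function name accepts is hypothetical, Python's re module stands in for Java's regex engine, and only the two prefix rules, the loop rule, and the final catch-all are reproduced.

```python
import re

# (sign, pattern) pairs in file order; '+' accepts, '-' rejects.
RULES = [
    ("-", r"^(ftp|mailto):"),
    ("+", r"^(file|smb):"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("-", r"."),
]


def accepts(url: str) -> bool:
    """Hypothetical helper: apply the first rule whose pattern occurs in the URL."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule is rejected.
    return False


if __name__ == "__main__":
    for url in ("file:///C:/temp/test/", "ftp://example.org/", "http://example.org/"):
        print(url, accepts(url))
```

With these rules file:///C:/temp/test/ is accepted by the second rule, which is consistent with the log above showing the fetcher attempting that URL.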
==========
My command-line arguments (obtained by echoing the last line of ./nutch):
C:\Programme\Java_jdk1.5.0_04/bin/java -Xmx1000m
-Dhadoop.log.dir=c:\Dipl\Nutch\logs -Dhadoop.log.file=hadoop.log
-Djava.library.path=c:\Dipl\Nutch\lib\native\Windows_XP-x86-32
-Djava.protocol.handler.pkgs=jcifs -classpath
c:\Dipl\Nutch\conf;C;C:\Programme\Java_jdk1.5.0_04\lib\tools.jar;c:\Dipl\Nutch\build;c:\Dipl\Nutch\build\nutch-1.0-dev.job;c:\Dipl\Nutch\build\test\classes;c:\Dipl\Nutch\nutch-*.job;c:\Dipl\Nutch\lib\commons-cli-2.0-SNAPSHOT.jar;c:\Dipl\Nutch\lib\commons-codec-1.3.jar;c:\Dipl\Nutch\lib\commons-httpclient-3.0.1.jar;c:\Dipl\Nutch\lib\commons-lang-2.1.jar;c:\Dipl\Nutch\lib\commons-logging-1.0.4.jar;c:\Dipl\Nutch\lib\commons-logging-api-1.0.4.jar;c:\Dipl\Nutch\lib\hadoop-0.12.2-core.jar;c:\Dipl\Nutch\lib\jakarta-oro-2.0.7.jar;c:\Dipl\Nutch\lib\jets3t-0.5.0.jar;c:\Dipl\Nutch\lib\jetty-5.1.4.jar;c:\Dipl\Nutch\lib\junit-3.8.1.jar;c:\Dipl\Nutch\lib\log4j-1.2.13.jar;c:\Dipl\Nutch\lib\lucene-core-2.1.0.jar;c:\Dipl\Nutch\lib\lucene-misc-2.1.0.jar;c:\Dipl\Nutch\lib\servlet-api.jar;c:\Dipl\Nutch\lib\taglibs-i18n.jar;c:\Dipl\Nutch\lib\xerces-2_6_2-apis.jar;c:\Dipl\Nutch\lib\xerces-2_6_2.jar;c:\Dipl\Nutch\lib\jetty-ext\ant.jar;c:\Dipl\Nutch\lib\jetty-ext\commons-el.jar;c:\Dipl\Nutch\lib\jetty-ext\jasper-compiler.jar;c:\Dipl\Nutch\lib\jetty-ext\jasper-runtime.jar;c:\Dipl\Nutch\lib\jetty-ext\jsp-api.jar;C:\Dipl\Nutch\build\protocol-file\protocol-file.jar
org.apache.nutch.crawl.Crawl urls.txt -dir crawl
=========
My urls.txt
file:///C:/temp/test/
--
View this message in context:
http://www.nabble.com/Crawling-Local-file-System-tf3791589.html#a10722948
Sent from the Nutch - User mailing list archive at Nabble.com.