Hi, I'm trying to set up a test using Nutch to crawl the local file system. This is on a Red Hat system. I'm basically following the procedure in these links:
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
http://markmail.org/message/pnmqd7ypguh7qtit

Here's my command line:

bin/nutch crawl fs-urls -dir acrawlfs1.test -depth 4 >& acrawlfs1.log

Here's what I get in the Nutch log file:

[r...@nssdemo nutch-1.0]# cat acrawlfs1.log
crawl started in: acrawlfs1.test
rootUrlDir = fs-urls
threads = 10
depth = 4
Injector: starting
Injector: crawlDb: acrawlfs1.test/crawldb
Injector: urlDir: fs-urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: acrawlfs1.test/segments/20090716101523
Generator: filtering: true
Generator: jobtracker is 'local', generating exactly one partition.
Bad protocol in url:
Bad protocol in url: #file:///data/readings/semanticweb/
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: acrawlfs1.test/segments/20090716101523
Fetcher: threads: 10
QueueFeeder finished: total 1 records.
fetching file:///testfiles/
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
org.apache.nutch.protocol.file.FileError: File Error: 404
        at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:92)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
fetch of file:///testfiles/ failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: acrawlfs1.test/crawldb
CrawlDb update: segments: [acrawlfs1.test/segments/20090716101523]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: acrawlfs1.test/segments/20090716101532
Generator: filtering: true
Generator: jobtracker is 'local', generating exactly one partition.
Bad protocol in url:
Bad protocol in url: #file:///data/readings/semanticweb/
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: acrawlfs1.test/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/opt/nutch-1.0/acrawlfs1.test/segments/20090716101523
LinkDb: done
Indexer: starting
Indexer: done
Dedup: starting
Dedup: adding indexes in: acrawlfs1.test/indexes
Dedup: done
merging indexes to: acrawlfs1.test/index
Adding file:/opt/nutch-1.0/acrawlfs1.test/indexes/part-00000
done merging
crawl finished: acrawlfs1.test
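In case it's useful, this is how I'm planning to poke at the crawldb to see what actually got injected and what ended up recorded for file:///testfiles/ (assuming I'm reading the readdb usage right; crawldb-dump is just a scratch output directory name I made up):

bin/nutch readdb acrawlfs1.test/crawldb -stats
bin/nutch readdb acrawlfs1.test/crawldb -dump crawldb-dump    # writes plain-text records under crawldb-dump/
cat crawldb-dump/part-00000

I can paste that output too if it would help.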
Here's my conf/nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
  </property>
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
</configuration>

And here's my crawl-urlfilter.txt:

[r...@nssdemo nutch-1.0]# cat conf/crawl-urlfilter.txt
# skip http:, ftp:, & mailto: urls
##-^(file|ftp|mailto):
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
# skip URLs containing certain characters as probable queries, etc.
-[...@=]
# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# accept anything else
+.*

And in fs-urls, I have these urls:

file:///testfiles/
#file:///data/readings/semanticweb/

For this test, I have a /testfiles directory with a bunch of .txt files under two subdirectories, /testfiles/Content1 and /testfiles/Content2. The crawl appears to run to completion and creates the directories and files under acrawlfs1.test, but when I open the index directory with Luke, I get an error: a popup window containing just "0". Is the problem caused by that 404 error in the log? If so, why am I getting the 404 in the first place?

Thanks,
Jim
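P.S. In case it helps rule out a path or permissions problem, these are the quick shell checks I'm using on the crawl box (my local paths; the fs-urls/* glob assumes the seed file sits directly in that directory):

ls -ld /testfiles /testfiles/Content1 /testfiles/Content2
find /testfiles -name '*.txt' | head -5
cat -A fs-urls/*    # -A to spot stray whitespace or control characters in the seed entries

Everything looks normal to me there, which is why the 404 has me puzzled.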