Hi,

I'm trying to set up a test using Nutch to crawl the local file system.  This is 
on a Red Hat system.  I'm basically following the procedure in these links:

http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch

http://markmail.org/message/pnmqd7ypguh7qtit

Here's my command line:

bin/nutch crawl fs-urls -dir acrawlfs1.test -depth 4 >& acrawlfs1.log
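
(As a quick pre-flight check before running that, something along these lines 
confirms the seed list and the crawl target are actually in place -- this assumes 
the layout described further down, with the seeds in fs-urls and the content 
under /testfiles:)

# show the seed URLs that will be injected
cat fs-urls/*

# confirm the crawl target and its subdirectories exist and are listable
ls -ld /testfiles /testfiles/Content1 /testfiles/Content2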


Here's what I get in the Nutch log file:

[r...@nssdemo nutch-1.0]# cat acrawlfs1.log
crawl started in: acrawlfs1.test
rootUrlDir = fs-urls
threads = 10
depth = 4
Injector: starting
Injector: crawlDb: acrawlfs1.test/crawldb
Injector: urlDir: fs-urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: acrawlfs1.test/segments/20090716101523
Generator: filtering: true
Generator: jobtracker is 'local', generating exactly one partition.
Bad protocol in url:
Bad protocol in url: #file:///data/readings/semanticweb/
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: acrawlfs1.test/segments/20090716101523
Fetcher: threads: 10
QueueFeeder finished: total 1 records.
fetching file:///testfiles/
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
org.apache.nutch.protocol.file.FileError: File Error: 404
        at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:92)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
fetch of file:///testfiles/ failed with: 
org.apache.nutch.protocol.file.FileError: File Error: 404
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: acrawlfs1.test/crawldb
CrawlDb update: segments: [acrawlfs1.test/segments/20090716101523]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: acrawlfs1.test/segments/20090716101532
Generator: filtering: true
Generator: jobtracker is 'local', generating exactly one partition.
Bad protocol in url:
Bad protocol in url: #file:///data/readings/semanticweb/
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: acrawlfs1.test/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/opt/nutch-1.0/acrawlfs1.test/segments/20090716101523
LinkDb: done
Indexer: starting
Indexer: done
Dedup: starting
Dedup: adding indexes in: acrawlfs1.test/indexes
Dedup: done
merging indexes to: acrawlfs1.test/index
Adding file:/opt/nutch-1.0/acrawlfs1.test/indexes/part-00000
done merging
crawl finished: acrawlfs1.test
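
(In case it helps, the crawldb and the segment can be inspected afterwards with 
the stock readdb/readseg tools -- the segment timestamp below is just the one 
from this run:)

# summary of the crawldb: counts of db_unfetched / db_fetched / db_gone, etc.
bin/nutch readdb acrawlfs1.test/crawldb -stats

# dump the segment to plain text under ./seg-dump for a closer look
bin/nutch readseg -dump acrawlfs1.test/segments/20090716101523 seg-dump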

Here's my conf/nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
  </property>

  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>

</configuration>
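
(For reference, one way to make sure every plugin named in plugin.includes is 
actually present in this build is to list the plugins/ directory of the 
distribution -- assuming the stock nutch-1.0 tarball layout:)

ls plugins | egrep 'protocol-file|urlfilter-regex|parse-text|index-basic'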


And here's my crawl-urlfilter.txt:

[r...@nssdemo nutch-1.0]# cat conf/crawl-urlfilter.txt
#skip http:, ftp:, & mailto: urls
##-^(file|ftp|mailto):

-^(http|ftp|mailto):

#skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

#skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

#accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

#accept anything else
+.*
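
(To sanity-check the regexes against a seed URL, Nutch's URL filter checker 
class can be run through bin/nutch -- note that, at least in 1.0, it reads 
regex-urlfilter.txt rather than crawl-urlfilter.txt by default, so this is only 
an approximate check unless urlfilter.regex.file is pointed at the crawl file:)

# prints the URL prefixed with "+" if it passes the filters, "-" if rejected
echo "file:///testfiles/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined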

And in fs-urls, I have these URLs:

file:///testfiles/

#file:///data/readings/semanticweb/

For this test, I have a /testfiles directory with a bunch of .txt files under 
two subdirectories, /testfiles/Content1 and /testfiles/Content2.
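
(On the chance that it's a plain filesystem issue, a quick way to rule out 
permissions is to check that every directory along the path is readable by the 
user running the crawl -- nothing Nutch-specific here, just ls:)

# each directory in the chain needs read+execute for the crawling user
ls -ld / /testfiles /testfiles/Content1 /testfiles/Content2
ls /testfiles/Content1 | head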

It looks like the crawl runs to the end and creates the directories and files 
under acrawlfs1.test, but when I run Luke on the index directory, I get an 
error: a popup window with just "0" in it.

Is the problem caused by that 404 error in the log?  If so, why am I getting 
that 404 error?

Thanks,
Jim
