Hi,

I am running Nutch 0.9 and am attempting to use it to index files on my
local file system, without much luck. I believe I have configured things
correctly; however, no files are being indexed and no errors are being
reported. Note that I have looked through the various posts on this topic
on the mailing list and tried various variations on the configuration. I am
providing details of my configuration and log files below. I would
appreciate any insight people might have.

Best,
mw

Details:

OS: Windows Vista (note: I have turned off Defender and the firewall)

Command:

bin/nutch crawl urls -dir crawl_results -depth 4 -topN 500 >& logs/crawl.log

The urls file contains only:
file:///C:/MyData/

nutch-site.xml:

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>none</description>
</property>
<property>
  <name>http.agent.email</name>
  <value>none</value>
  <description></description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>
</configuration>

crawl-urlfilters.txt:

# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
# -^(file|ftp|mailto):

# skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
# -.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
# -.

# get everything else
+^file:///C:/MyData/*
-.*
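As a quick sanity check of that final accept rule, here is a sketch using
grep -E as a rough stand-in for the regex-urlfilter plugin's matching (the
sample URLs below are made up, not from my actual data):

```shell
# Feed two sample URLs through the accept pattern.
# Note that '/*' in the rule means "zero or more slashes", so it
# accepts anything beginning with file:///C:/MyData; '/.*' would be
# the stricter "slash followed by anything" form.
printf 'file:///C:/MyData/docs/report.txt\nhttp://example.com/\n' \
  | grep -E '^file:///C:/MyData/*'
```

Only the file: URL comes back, so the pattern itself appears to let the
seed URL through.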