Hi,

I am running Nutch 0.9 and am attempting to use it to index files on my
local file system, without much luck. I believe I have configured things
correctly; however, no files are being indexed and no errors are being
reported. Note that I have looked through the various posts on this topic
on the mailing list and tried various variations on the configuration. I am
providing details of my configuration and log files below. I would
appreciate any insight people might have.

Best,
mw

Details:

OS: Windows Vista (note: I have turned off Defender and the firewall)

Command:

bin/nutch crawl urls -dir crawl_results -depth 4 -topN 500 >& logs/crawl.log

The urls file contains only:
file:///C:/MyData/

nutch-site.xml:

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>none</description>
</property>
<property>
  <name>http.agent.email</name>
  <value>none</value>
  <description></description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>
</configuration>

crawl-urlfilters.txt:

# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
# -^(file|ftp|mailto):

# skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
# -.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
# -.

# get everything else
+^file:///C:/MyData/*
-.*
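As a quick sanity check of that final accept rule, here is a sketch using
grep -E as a rough stand-in for the regex-urlfilter plugin's matching (the
sample URLs below are made up, not from my actual data):

```shell
# Feed two sample URLs through the accept pattern.
# Note that '/*' in the rule means "zero or more slashes", so it
# accepts anything beginning with file:///C:/MyData; '/.*' would be
# the stricter "slash followed by anything" form.
printf 'file:///C:/MyData/docs/report.txt\nhttp://example.com/\n' \
  | grep -E '^file:///C:/MyData/*'
```

Only the file: URL comes back, so the pattern itself appears to let the
seed URL through.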