hi,

Thanks for responding. I just tried the changes you suggested, but there is no change: the log files look exactly the same, except that the dir reference now comes up with only two slashes (file:// rather than file:///).

Any other possible things to try?
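One more data point in case it matters: as far as I can tell, the lines in crawl-urlfilter.txt are Java regular expressions, not shell globs, so in +^file://C:/MyData/* the trailing * only repeats the final slash; it does not mean "everything under MyData". Since the two- vs. three-slash form of the URL also seems to be in play, would a prefix rule that accepts either form be safer? Something like this (untested, just a guess on my part):

```````````````````````````````````````````````````
# hypothetical filter lines -- accept both file://C:/MyData/...
# and file:///C:/MyData/..., then reject everything else
+^file:///?C:/MyData/
-.*
```````````````````````````````````````````````````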
mw

> Date: Fri, 26 Sep 2008 01:19:14 +0530
> From: [EMAIL PROTECTED]
> To: [email protected]
> Subject: Re: FW: Indexing Files on Local File System
>
> hi,
> You should change the url as file://C:/MyData/ and also in
> crawl-urlfilter.txt change the file:// line to
> +^file://C:/MyData/*
>
> On Thu, Sep 25, 2008 at 11:42 PM, Manu Warikoo <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > I am running Nutch 0.9 and am attempting to use it to index files on my
> > local file system, without much luck. I believe I have configured things
> > correctly; however, no files are being indexed and no errors are being
> > reported. Note that I have looked through the various posts on this
> > topic on the mailing list and tried various variations on the
> > configuration.
> >
> > I am providing details of my configuration and log files below. I would
> > appreciate any insight people might have.
> >
> > Best,
> > mw
> >
> > Details:
> >
> > OS: Windows Vista (note I have turned off Defender and the firewall)
> >
> > Command:
> > bin/nutch crawl urls -dir crawl_results -depth 4 -topN 500 >& logs/crawl.log
> >
> > The urls file contains only:
> > ```````````````````````````````````````````````````
> > file:///C:/MyData/
> > ```````````````````````````````````````````````````
> >
> > nutch-site.xml:
> > ```````````````````````````````````````````````````
> > <property>
> >   <name>http.agent.url</name>
> >   <value></value>
> >   <description>none</description>
> > </property>
> > <property>
> >   <name>http.agent.email</name>
> >   <value>none</value>
> >   <description></description>
> > </property>
> >
> > <property>
> >   <name>plugin.includes</name>
> >   <value>protocol-file|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > </property>
> > <property>
> >   <name>file.content.limit</name>
> >   <value>-1</value>
> > </property>
> > </configuration>
> > ```````````````````````````````````````````````````
> >
> > crawl-urlfilter.txt:
> > ```````````````````````````````````````````````````
> > # The url filter file used by the crawl command.
> > # Better for intranet crawling.
> > # Be sure to change MY.DOMAIN.NAME to your domain name.
> > # Each non-comment, non-blank line contains a regular expression
> > # prefixed by '+' or '-'. The first matching pattern in the file
> > # determines whether a URL is included or ignored. If no pattern
> > # matches, the URL is ignored.
> >
> > # skip file:, ftp:, & mailto: urls
> > # -^(file|ftp|mailto):
> >
> > # skip http:, ftp:, & mailto: urls
> > -^(http|ftp|mailto):
> >
> > # skip image and other suffixes we can't yet parse
> > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> >
> > # skip URLs containing certain characters as probable queries, etc.
> > [EMAIL PROTECTED]
> >
> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > # -.*(/.+?)/.*?\1/.*?\1/
> >
> > # accept hosts in MY.DOMAIN.NAME
> > # +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> >
> > # skip everything else
> > # -.
> >
> > # get everything else
> > +^file:///C:/MyData/*
> > -.*
> > ```````````````````````````````````````````````````
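PS: In case it helps whoever replies, the way I have been checking whether anything is fetched at all is to dump the crawldb stats after the run (assuming readdb -stats in 0.9 reports what I think it does, i.e. how many URLs were injected and how many were actually fetched):

```````````````````````````````````````````````````
bin/nutch readdb crawl_results/crawldb -stats
```````````````````````````````````````````````````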