hi,

Thanks for responding. I just tried the changes you suggested, but there is no change: the log files look exactly the same, except that the dir reference now comes up with only two slashes (file:// rather than file:///).

Any other possible things to try?
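One more data point in case it matters: as far as I can tell, the lines in crawl-urlfilter.txt are Java regular expressions, not shell globs, so in +^file://C:/MyData/* the trailing * only repeats the final slash; it does not mean "everything under MyData". Since the two- vs. three-slash form of the URL also seems to be in play, would a prefix rule that accepts either form be safer? Something like this (untested, just a guess on my part):

```````````````````````````````````````````````````
# hypothetical filter lines -- accept both file://C:/MyData/...
# and file:///C:/MyData/..., then reject everything else
+^file:///?C:/MyData/
-.*
```````````````````````````````````````````````````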
mw

> Date: Fri, 26 Sep 2008 01:19:14 +0530
> From: [EMAIL PROTECTED]
> To: [email protected]
> Subject: Re: FW: Indexing Files on Local File System
>
> hi,
> You should change the url as file://C:/MyData/ and also in
> crawl-urlfilter.txt change the file:// line to
> +^file://C:/MyData/*
>
> On Thu, Sep 25, 2008 at 11:42 PM, Manu Warikoo <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > I am running Nutch 0.9 and am attempting to use it to index files on my
> > local file system, without much luck. I believe I have configured things
> > correctly; however, no files are being indexed and no errors are being
> > reported. Note that I have looked through the various posts on this
> > topic on the mailing list and tried various variations on the
> > configuration.
> >
> > I am providing details of my configuration and log files below. I would
> > appreciate any insight people might have.
> >
> > Best,
> > mw
> >
> > Details:
> >
> > OS: Windows Vista (note I have turned off Defender and the firewall)
> >
> > Command:
> > bin/nutch crawl urls -dir crawl_results -depth 4 -topN 500 >& logs/crawl.log
> >
> > The urls file contains only:
> > ```````````````````````````````````````````````````
> > file:///C:/MyData/
> > ```````````````````````````````````````````````````
> >
> > nutch-site.xml:
> > ```````````````````````````````````````````````````
> > <property>
> >   <name>http.agent.url</name>
> >   <value></value>
> >   <description>none</description>
> > </property>
> > <property>
> >   <name>http.agent.email</name>
> >   <value>none</value>
> >   <description></description>
> > </property>
> >
> > <property>
> >   <name>plugin.includes</name>
> >   <value>protocol-file|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > </property>
> > <property>
> >   <name>file.content.limit</name>
> >   <value>-1</value>
> > </property>
> > </configuration>
> > ```````````````````````````````````````````````````
> >
> > crawl-urlfilter.txt:
> > ```````````````````````````````````````````````````
> > # The url filter file used by the crawl command.
> > # Better for intranet crawling.
> > # Be sure to change MY.DOMAIN.NAME to your domain name.
> > # Each non-comment, non-blank line contains a regular expression
> > # prefixed by '+' or '-'. The first matching pattern in the file
> > # determines whether a URL is included or ignored. If no pattern
> > # matches, the URL is ignored.
> >
> > # skip file:, ftp:, & mailto: urls
> > # -^(file|ftp|mailto):
> >
> > # skip http:, ftp:, & mailto: urls
> > -^(http|ftp|mailto):
> >
> > # skip image and other suffixes we can't yet parse
> > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> >
> > # skip URLs containing certain characters as probable queries, etc.
> > [EMAIL PROTECTED]
> >
> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > # -.*(/.+?)/.*?\1/.*?\1/
> >
> > # accept hosts in MY.DOMAIN.NAME
> > # +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> >
> > # skip everything else
> > # -.
> >
> > # get everything else
> > +^file:///C:/MyData/*
> > -.*
> > ```````````````````````````````````````````````````
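PS: In case it helps whoever replies, the way I have been checking whether anything is fetched at all is to dump the crawldb stats after the run (assuming readdb -stats in 0.9 reports what I think it does, i.e. how many URLs were injected and how many were actually fetched):

```````````````````````````````````````````````````
bin/nutch readdb crawl_results/crawldb -stats
```````````````````````````````````````````````````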