Hi,
You should change the URL to file://C:/MyData/ (two slashes, not three)
and, in crawl-urlfilter.txt, change the file:// line to:
+^file://C:/MyData/*
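Concretely, the seed file and the filter entry would then look like this
(a sketch assuming your data really lives under C:/MyData; adjust the
drive and path to your setup):

  urls/seed.txt (any file under your urls directory):
    file://C:/MyData/

  crawl-urlfilter.txt (keep the + line above the final -.* line, since
  the first matching pattern wins):
    +^file://C:/MyData/*

Note that in a regex the trailing * repeats the preceding '/', so this
rule effectively accepts anything that starts with file://C:/MyData;
+^file://C:/MyData/ should behave the same way.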
On Thu, Sep 25, 2008 at 11:42 PM, Manu Warikoo <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am running Nutch 0.9 and am attempting to use it to index files on my
> local file system, without much luck. I believe I have configured things
> correctly; however, no files are being indexed and no errors are being
> reported. Note that I have looked through the various posts on this topic
> on the mailing list and tried various variations on the configuration.
>
> I am providing details of my configuration and log files below. I would
> appreciate any insight people might have.
> Best,
> mw
>
> Details:
> OS: Windows Vista (note I have turned off defender and firewall)
> <command> bin/nutch crawl urls -dir crawl_results -depth 4 -topN 500 >&
> logs/crawl.log
> The urls seed file contains only
> ```````````````````````````````````````````````````
> file:///C:/MyData/
>
> ```````````````````````````````````````````````````
> nutch-site.xml
> `````````````````````````````````````
> <property>
> <name>http.agent.url</name>
> <value></value>
> <description>none</description>
> </property>
> <property>
> <name>http.agent.email</name>
> <value>none</value>
> <description></description>
> </property>
>
> <property>
> <name>plugin.includes</name>
>
> <value>protocol-file|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
> <property>
> <name>file.content.limit</name> <value>-1</value>
> </property>
> </configuration>
> ```````````````````````````````````````````````````
> crawl-urlfilter.txt
> ```````````````````````````````````````````````````
> # The url filter file used by the crawl command.
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
> # skip file:, ftp:, & mailto: urls
> # -^(file|ftp|mailto):
> # skip http:, ftp:, & mailto: urls
> -^(http|ftp|mailto):
> # skip image and other suffixes we can't yet parse
>
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
> # skip URLs with slash-delimited segment that repeats 3+ times, to break
> # loops
> # -.*(/.+?)/.*?\1/.*?\1/
> # accept hosts in MY.DOMAIN.NAME
> # +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> # skip everything else
> # -.
> # get everything else
> +^file:///C:/MyData/*
> -.*
> ```````````````````````````````````````````````````
>