Manu,
The only way I was able to figure out why Nutch was not crawling URLs that I
expected it to crawl was by digging into the code and adding extra logging
lines. I suggest you look at org.apache.nutch.fetcher.Fetcher.run() to get an
idea of what it's doing, and also at Fetcher.handleRedirect(). Put a whole
bunch of extra logging lines in that file to figure out whether a filter or a
normalizer is stripping out URLs that you want crawled; see the sketch after
the XML below for a standalone way to check this. You can also try disabling
all normalizers by adding something like this to your nutch-site.xml file.
Note that I stripped out just about everything; you might only want to strip
out the normalizers. See the original settings in nutch-default.xml.

<property>
  <name>plugin.includes</name>
  <value>protocol-http|parse-(text|html|js)|scoring-opic</value>
</property>
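
If you'd rather not patch Fetcher itself, a small standalone class can run a
URL through the same normalizer and filter chains the Fetcher uses and show
you where it gets dropped. This is just a sketch of mine, not something that
ships with Nutch; the class and method names are from the Nutch 0.9/1.x APIs
as I remember them, so adjust to your version:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilters;
import org.apache.nutch.net.URLNormalizers;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical debugging helper (UrlCheck is my name for it, not a Nutch
// class): feed it a URL and it prints what survives normalization and
// filtering under your nutch-site.xml / crawl-urlfilter.txt settings.
public class UrlCheck {
  public static void main(String[] args) throws Exception {
    String url = args[0];                          // e.g. file:///C:/MyData/
    Configuration conf = NutchConfiguration.create();

    // Run the URL through all active URLNormalizer plugins, fetcher scope.
    URLNormalizers normalizers =
        new URLNormalizers(conf, URLNormalizers.SCOPE_FETCHER);
    String normalized =
        normalizers.normalize(url, URLNormalizers.SCOPE_FETCHER);
    System.out.println("after normalizers: " + normalized);

    // Run the result through all active URLFilter plugins.
    // null here means one of the filters rejected the URL.
    String filtered = new URLFilters(conf).filter(normalized);
    System.out.println("after filters: " + filtered);
  }
}

Compile it with the Nutch jars on the classpath and run it from your Nutch
home so it picks up your conf/ directory. If "after filters" prints null, a
filter pattern is doing the stripping; if the URL changes before that, it's a
normalizer.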


On Thu, Sep 25, 2008 at 1:53 PM, Manu Warikoo <[EMAIL PROTECTED]> wrote:

> hi,
> Thanks for responding.
> Just tried the changes that you suggested; no change. The log files look
> exactly the same, except that now the dir ref comes up with only 2 /.
> Any other possible things?
> mw
>
> > Date: Fri, 26 Sep 2008 01:19:14 +0530
> > From: [EMAIL PROTECTED]
> > To: [email protected]
> > Subject: Re: FW: Indexing Files on Local File System
> >
> > hi,
> > You should change the url to file://C:/MyData/ and also, in
> > crawl-urlfilter.txt, change the file:// line to
> > +^file://C:/MyData/*
> >
> > On Thu, Sep 25, 2008 at 11:42 PM, Manu Warikoo <[EMAIL PROTECTED]> wrote:
> >
> > > Hi,
> > > I am running Nutch 0.9 and am attempting to use it to index files on
> > > my local file system, without much luck. I believe I have configured
> > > things correctly; however, no files are being indexed and no errors
> > > are being reported. Note that I have looked through the various posts
> > > on this topic on the mailing list and tried various variations on the
> > > configuration.
> > >
> > > I am providing details of my configuration and log files below. I
> > > would appreciate any insight people might have.
> > > Best,
> > > mw
> > >
> > > Details:
> > > OS: Windows Vista (note I have turned off Defender and the firewall)
> > > <command> bin/nutch crawl urls -dir crawl_results -depth 4 -topN 500 >& logs/crawl.log
> > >
> > > urls file contains only:
> > > ```````````````````````````````````````````````````
> > > file:///C:/MyData/
> > > ```````````````````````````````````````````````````
> > >
> > > nutch-site.xml:
> > > ```````````````````````````````````````````````````
> > > <property>
> > >   <name>http.agent.url</name>
> > >   <value></value>
> > >   <description>none</description>
> > > </property>
> > > <property>
> > >   <name>http.agent.email</name>
> > >   <value>none</value>
> > >   <description></description>
> > > </property>
> > >
> > > <property>
> > >   <name>plugin.includes</name>
> > >   <value>protocol-file|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > > </property>
> > > <property>
> > >   <name>file.content.limit</name>
> > >   <value>-1</value>
> > > </property>
> > > </configuration>
> > > ```````````````````````````````````````````````````
> > >
> > > crawl-urlfilter.txt:
> > > ```````````````````````````````````````````````````
> > > # The url filter file used by the crawl command.
> > > # Better for intranet crawling.
> > > # Be sure to change MY.DOMAIN.NAME to your domain name.
> > > # Each non-comment, non-blank line contains a regular expression
> > > # prefixed by '+' or '-'. The first matching pattern in the file
> > > # determines whether a URL is included or ignored. If no pattern
> > > # matches, the URL is ignored.
> > > # skip file:, ftp:, & mailto: urls
> > > # -^(file|ftp|mailto):
> > > # skip http:, ftp:, & mailto: urls
> > > -^(http|ftp|mailto):
> > > # skip image and other suffixes we can't yet parse
> > > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> > > # skip URLs containing certain characters as probable queries, etc.
> > > -[?*!@=]
> > > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > > # -.*(/.+?)/.*?\1/.*?\1/
> > > # accept hosts in MY.DOMAIN.NAME
> > > # +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> > > # skip everything else
> > > # -.
> > > # get everything else
> > > +^file:///C:/MyData/*
> > > -.*
> > > ```````````````````````````````````````````````````
