Manu,

The only way I was able to figure out why Nutch was not crawling URLs that I expected it to crawl was by digging into the code and adding extra logging lines. I suggest you look at org.apache.nutch.fetcher.Fetcher.run() to get an idea of what it's doing. Also look at Fetcher.handleRedirect(). Put a whole bunch of extra logging lines in that file to figure out whether a filter or a normalizer is stripping out the URLs you want crawled.

You can also try disabling all normalizers by adding something like this to your nutch-site.xml file. Note that I stripped out just about everything; you might want to strip out only the normalizers. See the original settings in nutch-default.xml.
<property>
  <name>plugin.includes</name>
  <value>protocol-http|parse-(text|html|js)|scoring-opic</value>
</property>
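If you'd rather not edit plugin.includes blind, you can also check directly what your configured chain does to a seed URL. Below is a minimal sketch of a standalone checker, not anything that ships with Nutch; it assumes the 0.9-era classes org.apache.nutch.net.URLNormalizers and org.apache.nutch.net.URLFilters (double-check the exact signatures against your source tree). Compile it against the Nutch jars and run it with your conf/ directory on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilters;
import org.apache.nutch.net.URLNormalizers;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical helper (not part of Nutch): pushes one URL through the
// same normalizer and filter chains the crawl uses.
public class UrlCheck {
  public static void main(String[] args) throws Exception {
    String url = args.length > 0 ? args[0] : "file:///C:/MyData/";
    Configuration conf = NutchConfiguration.create();

    // Normalizers run first, as they would during inject/fetch.
    URLNormalizers normalizers =
        new URLNormalizers(conf, URLNormalizers.SCOPE_DEFAULT);
    String normalized = normalizers.normalize(url, URLNormalizers.SCOPE_DEFAULT);
    System.out.println("after normalizers: " + normalized);

    // Then the filters; null means some filter rejected the URL.
    String filtered = new URLFilters(conf).filter(normalized);
    System.out.println("after filters:     " + filtered);
  }
}

If "after normalizers" already prints file://C:/MyData/ (two slashes), a normalizer is mangling the URL before any filter sees it, which would match the symptom you describe below.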
On Thu, Sep 25, 2008 at 1:53 PM, Manu Warikoo <[EMAIL PROTECTED]> wrote:
>
> hi,
> Thanks for responding.
> Just tried the changes that you suggested; no change.
> The log files look exactly the same, except that now the dir ref comes up
> with only 2 /.
> Any other possible things?
> mw
>
> > Date: Fri, 26 Sep 2008 01:19:14 +0530
> > From: [EMAIL PROTECTED]
> > To: [email protected]
> > Subject: Re: FW: Indexing Files on Local File System
> >
> > hi,
> > You should change the url to file://C:/MyData/ and also in
> > crawl-urlfilter.txt change the file:// line to
> > +^file://C:/MyData/*
> >
> > On Thu, Sep 25, 2008 at 11:42 PM, Manu Warikoo <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi,
> > >
> > > I am running Nutch 0.9 and am attempting to use it to index files on my
> > > local file system, without much luck. I believe I have configured things
> > > correctly; however, no files are being indexed and no errors are being
> > > reported. Note that I have looked through the various posts on this
> > > topic on the mailing list and tried various variations on the
> > > configuration.
> > >
> > > I am providing details of my configuration and log files below. I would
> > > appreciate any insight people might have.
> > > Best,
> > > mw
> > >
> > > Details:
> > > OS: Windows Vista (note I have turned off Defender and the firewall)
> > > <command> bin/nutch crawl urls -dir crawl_results -depth 4 -topN 500 >& logs/crawl.log
> > >
> > > The urls file contains only:
> > > ```````````````````````````````````````````````````
> > > file:///C:/MyData/
> > > ```````````````````````````````````````````````````
> > >
> > > nutch-site.xml
> > > ```````````````````````````````````````````````````
> > > <property>
> > >   <name>http.agent.url</name>
> > >   <value></value>
> > >   <description>none</description>
> > > </property>
> > > <property>
> > >   <name>http.agent.email</name>
> > >   <value>none</value>
> > >   <description></description>
> > > </property>
> > > <property>
> > >   <name>plugin.includes</name>
> > >   <value>protocol-file|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > > </property>
> > > <property>
> > >   <name>file.content.limit</name>
> > >   <value>-1</value>
> > > </property>
> > > </configuration>
> > > ```````````````````````````````````````````````````
> > >
> > > crawl-urlfilter.txt
> > > ```````````````````````````````````````````````````
> > > # The url filter file used by the crawl command.
> > > # Better for intranet crawling.
> > > # Be sure to change MY.DOMAIN.NAME to your domain name.
> > > # Each non-comment, non-blank line contains a regular expression
> > > # prefixed by '+' or '-'. The first matching pattern in the file
> > > # determines whether a URL is included or ignored. If no pattern
> > > # matches, the URL is ignored.
> > >
> > > # skip file:, ftp:, & mailto: urls
> > > # -^(file|ftp|mailto):
> > >
> > > # skip http:, ftp:, & mailto: urls
> > > -^(http|ftp|mailto):
> > >
> > > # skip image and other suffixes we can't yet parse
> > > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> > >
> > > # skip URLs containing certain characters as probable queries, etc.
> > > -[?*!@=]
> > >
> > > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > > # -.*(/.+?)/.*?\1/.*?\1/
> > >
> > > # accept hosts in MY.DOMAIN.NAME
> > > # +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> > >
> > > # skip everything else
> > > # -.
> > >
> > > # get everything else
> > > +^file:///C:/MyData/*
> > > -.*
> > > ```````````````````````````````````````````````````
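P.S. The rules in the quoted crawl-urlfilter.txt can also be sanity-checked outside Nutch with plain java.util.regex. Treat this as an illustration only: the patterns are copied from the quoted file, and the first-match-wins behavior that file's comments describe is reimplemented here by hand, not taken from the urlfilter-regex plugin itself:

import java.util.regex.Pattern;

// Hypothetical standalone checker (not part of Nutch): replays the quoted
// crawl-urlfilter.txt rules in order; '+' accepts, '-' rejects, and the
// first matching pattern decides, per the comments in that file.
public class FilterRuleCheck {
  private static final String[][] RULES = {
    {"-", "^(http|ftp|mailto):"},
    {"-", "\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"},
    {"-", "[?*!@=]"},
    {"+", "^file:///C:/MyData/*"},
    {"-", ".*"},
  };

  public static void main(String[] args) {
    String url = args.length > 0 ? args[0] : "file:///C:/MyData/";
    for (String[] rule : RULES) {
      if (Pattern.compile(rule[1]).matcher(url).find()) {
        System.out.println(url + " -> "
            + ("+".equals(rule[0]) ? "accepted" : "rejected")
            + " by " + rule[0] + rule[1]);
        return;
      }
    }
    System.out.println(url + " -> no rule matched: ignored");
  }
}

Fed file:///C:/MyData/, it reports acceptance by +^file:///C:/MyData/*; fed the two-slash form file://C:/MyData/, it falls through to -.* and is rejected. That is exactly why it matters whether a normalizer is collapsing the slashes before the filters run.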
