hi,
Check this link for crawling local pages with Nutch:
<http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch>
Follow the steps on that site and see if that works for you.

On Fri, Sep 26, 2008 at 3:24 AM, Kevin MacDonald <[EMAIL PROTECTED]> wrote:

> Manu,
> The only way I was able to figure out why nutch was not crawling Urls that I
> was expecting it to crawl was by digging into the code and adding extra
> logging lines. I suggest you look at org.apache.nutch.fetcher.Fetcher.run()
> and get an idea what it's doing. Also, look at Fetcher.handleRedirect(). Put
> a whole bunch of extra logging lines in that file to figure out if either a
> filter or a Normalizer is stripping out Urls that you want crawled. You can
> also try disabling all Normalizers by adding something like this to your
> nutch-site.xml file. Note that I stripped out just about everything. You
> might only want to strip out the Normalizers. See the original settings in
> nutch-default.xml.
>
> <property>
>  <name>plugin.includes</name>
>   <value>protocol-http|parse-(text|html|js)|scoring-opic</value>
> </property>
>
>
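For the kind of check Kevin describes, it can also be quicker to feed a URL through the configured normalizers and filters outside the crawl. A minimal sketch, assuming Nutch 0.9 exposes URLFilters, URLNormalizers and NutchConfiguration the way later versions do (the CheckUrl class below is hypothetical, not part of Nutch or this thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilters;
import org.apache.nutch.net.URLNormalizers;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical helper: runs one URL through the configured normalizers
// and filters so you can see where it gets dropped.
public class CheckUrl {
  public static void main(String[] args) throws Exception {
    // Loads nutch-default.xml and nutch-site.xml, as the crawl does.
    Configuration conf = NutchConfiguration.create();
    String url = args.length > 0 ? args[0] : "file:///C:/MyData/";

    // Run the URL through the normalizers first...
    URLNormalizers normalizers =
        new URLNormalizers(conf, URLNormalizers.SCOPE_DEFAULT);
    String normalized = normalizers.normalize(url, URLNormalizers.SCOPE_DEFAULT);
    System.out.println("after normalizers: " + normalized);

    // ...then through the URL filters; null means some filter rejected it.
    URLFilters filters = new URLFilters(conf);
    System.out.println("after filters: " + filters.filter(normalized));
  }
}

Compiled against the Nutch and Hadoop jars and run with the seed URL as its argument, it prints null on the second line if a filter rejected the URL.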
> On Thu, Sep 25, 2008 at 1:53 PM, Manu Warikoo <[EMAIL PROTECTED]> wrote:
>
> > hi,
> > Thanks for responding.
> > Just tried the changes that you suggested, no change.
> > The log files look exactly the same except that now the dir ref comes up
> > with only 2 /.
> > Any other possible things?
> > mw
> >
> > > Date: Fri, 26 Sep 2008 01:19:14 +0530
> > > From: [EMAIL PROTECTED]
> > > To: [email protected]
> > > Subject: Re: FW: Indexing Files on Local File System
> > >
> > > hi,
> > > You should change the url to file://C:/MyData/ and also, in
> > > crawl-urlfilter.txt, change the file:// line to
> > > +^file://C:/MyData/*
> > >
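Spelled out, that suggestion amounts to two small edits (a sketch only, using the values from this thread; the name of the seed file inside the urls directory is whatever was created there):

seed file in the urls directory:
file://C:/MyData/

crawl-urlfilter.txt, in place of the existing file:// line:
+^file://C:/MyData/*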
> > > On Thu, Sep 25, 2008 at 11:42 PM, Manu Warikoo <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am running Nutch 0.9 and am attempting to use it to index files on my
> > > > local file system without much luck. I believe I have configured things
> > > > correctly, however, no files are being indexed and no errors being
> > > > reported. Note that I have looked thru the various posts on this topic
> > > > on the mailing list and tried various variations on the configuration.
> > > >
> > > > I am providing details of my configuration and log files below. I would
> > > > appreciate any insight people might have.
> > > > Best,
> > > > mw
> > > >
> > > > Details:
> > > > OS: Windows Vista (note I have turned off defender and firewall)
> > > > <command> bin/nutch crawl urls -dir crawl_results -depth 4 -topN 500 >& logs/crawl.log
> > > > urls file contains only
> > > > ```````````````````````````````````````````````````
> > > > file:///C:/MyData/
> > > > ```````````````````````````````````````````````````
> > > > Nutch-site.xml
> > > > `````````````````````````````````````
> > > > <property>
> > > >   <name>http.agent.url</name>
> > > >   <value></value>
> > > >   <description>none</description>
> > > > </property>
> > > > <property>
> > > >   <name>http.agent.email</name>
> > > >   <value>none</value>
> > > >   <description></description>
> > > > </property>
> > > >
> > > > <property>
> > > >   <name>plugin.includes</name>
> > > >   <value>protocol-file|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > > > </property>
> > > > <property>
> > > >   <name>file.content.limit</name>
> > > >   <value>-1</value>
> > > > </property>
> > > > </configuration>
> > > > ```````````````````````````````````````````````````
> > > > crawl-urlfilters.txt
> > > > ```````````````````````````````````````````````````
> > > > # The url filter file used by the crawl command.
> > > > # Better for intranet crawling.
> > > > # Be sure to change MY.DOMAIN.NAME to your domain name.
> > > >
> > > > # Each non-comment, non-blank line contains a regular expression
> > > > # prefixed by '+' or '-'. The first matching pattern in the file
> > > > # determines whether a URL is included or ignored. If no pattern
> > > > # matches, the URL is ignored.
> > > >
> > > > # skip file:, ftp:, & mailto: urls
> > > > # -^(file|ftp|mailto):
> > > >
> > > > # skip http:, ftp:, & mailto: urls
> > > > -^(http|ftp|mailto):
> > > >
> > > > # skip image and other suffixes we can't yet parse
> > > > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> > > >
> > > > # skip URLs containing certain characters as probable queries, etc.
> > > > [EMAIL PROTECTED]
> > > >
> > > > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > > > # -.*(/.+?)/.*?\1/.*?\1/
> > > >
> > > > # accept hosts in MY.DOMAIN.NAME
> > > > # +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> > > >
> > > > # skip everything else
> > > > # -.
> > > >
> > > > # get everything else
> > > > +^file:///C:/MyData/*
> > > > -.*
> > > > ```````````````````````````````````````````````````