Thanks very much, Jack. You solved my problem. I will make the necessary changes. If I run into difficulty again, I will not forget to wake you up. Thanks again.

On 11/30/05, Jack Tang <[EMAIL PROTECTED]> wrote:
> Hi
>
> I hope this helps
> http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
>
> /Jack
>
> On 11/30/05, Arun Kumar Sharma <[EMAIL PROTECTED]> wrote:
> > Nutch Geeks-
> >
> > I want to do local hard-disk crawling, and I want to know what I need
> > to do for this. I found this article helpful:
> > http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6
> >
> > But I need a little more clarification.
> >
> > 1. Can you send me the default configuration that I need in
> > crawl-urlfilter.txt for spidering local files? Please make the
> > necessary changes in the file content below.
> >
> > File content:
> >
> > # skip file:, ftp:, & mailto: urls
> > -^(http|ftp|mailto|https):
> >
> > # skip image and other suffixes we can't yet parse
> > -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
> >
> > # skip URLs containing certain characters as probable queries, etc.
> > [EMAIL PROTECTED]
> >
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-z0-9]*\.)*www.mysite.com/
> >
> > # skip everything else
> > -.
> >
> > After this I added a single entry to my nutch-site.xml file:
> >
> > <nutch-conf>
> > <property>
> > <name>plugin.includes</name>
> > <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
> > </property>
> > </nutch-conf>
> >
> > Is this correct? If not, what do I need to change?
> >
> > When I do this I get the following error:
> >
> > "051130 102544 SEVERE org.apache.nutch.plugin.PluginRuntimeException:
> > extension point: org.apache.nutch.searcher.QueryFilter does not exist.
> > java.lang.ExceptionInInitializerError"
> >
> > 2. In the case of local hard-disk crawling, what do I need to add to
> > urls.txt?
> >
> > 3. I want to crawl both PDF and MS Word files. How can I include the
> > plugins for that? What configuration is required for that in the
> > nutch-site.xml file?
> >
> > Answer awaited anxiously.
> >
> > Bill Goffe <[EMAIL PROTECTED]> wrote:
> > Arun -
> >
> > I suspect others will mention this too, but see
> > http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6
> >
> > - Bill
> >
> > > I want to crawl and index local system files. Is there any way to
> > > do this using Nutch? What do I need to do, and what configuration
> > > changes are required? I am very new to Nutch, so I need your help
> > > in this regard.
> > >
> > > Thanks in advance for a quick and good response.
> > >
> > > Regards,
> > >
> > > Arun Kumar Sharma (Tech Lead -Java/J2EE)
> > > Mob: +91.981.529.5761
> >
> > --
> > Bill Goffe [EMAIL PROTECTED]
> > Department of Economics      voice: (315) 312-3444
> > SUNY Oswego                  fax:   (315) 312-5444
> > 416 Mahar Hall
> > Oswego, NY 13126
> >
> > "He's better about shaving his legs than I am. The pressure's on me
> > to keep my legs smooth."
> > -- Sheryl Crow, on her boyfriend Lance Armstrong. "Crow's Armstrong
> > Song: 'Make 'Em Suffer,'" July 15, 2005, CNN.com
> >
> > Regards,
> >
> > Arun Kumar Sharma (Tech Lead -Java/J2EE)
> > Mob: +91.981.529.5761
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
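
For anyone finding this thread later: the crawl-urlfilter.txt question above comes down to how the rule list is evaluated. As far as I understand it, rules are read top to bottom, the first pattern that matches a URL decides its fate (`+` accepts, `-` rejects), and a URL that matches nothing is rejected. The Python sketch below only imitates that behaviour so a rule set can be tried out before a crawl. It is not Nutch code. The rule set, the `load_rules` and `accepts` helpers, and the `/home/arun/...` paths are all assumptions made up for illustration, not a configuration confirmed anywhere in this thread.

```python
import re

# Hypothetical rule set for a local-filesystem crawl (an assumption,
# not the configuration Jack's link recommends).
RULES = r"""
# skip http:, https:, ftp:, & mailto: urls
-^(http|https|ftp|mailto):

# skip image and other suffixes we can't parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|zip|exe|png|PNG)$

# accept anything served by the file: protocol
+^file:

# skip everything else
-.
"""


def load_rules(text):
    """Parse '+pattern' / '-pattern' lines into (allow, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in the URL."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # no rule matched: treat as rejected


if __name__ == "__main__":
    rules = load_rules(RULES)
    for url in ("file:/home/arun/docs/report.pdf",
                "file:/home/arun/docs/logo.png",
                "http://www.mysite.com/index.html"):
        print(url, "->", "accept" if accepts(rules, url) else "reject")
```

Because the first match wins, the order of the lines matters: the `+^file:` rule has to come before the final `-.` catch-all, and the rule that rejects `http:` must not also list `file`.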