Hi

I hope this helps
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
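
In case it helps to have something inline, here is a minimal sketch of the files for a file-only crawl. It is not a verified setup. It assumes the Nutch 0.7-era plugin names (nutch-extensionpoints, urlfilter-regex, parse-pdf, parse-msword), so please check them against the plugins directory of your install. The missing nutch-extensionpoints entry is only my guess at the cause of the QueryFilter error, and /path/to/docs is a placeholder.

crawl-urlfilter.txt (block the network protocols, accept file: URLs):

```
# skip http:, https:, ftp:, & mailto: urls
-^(http|https|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept local files (assumption: narrow this to your own directory if needed)
+^file:

# skip everything else
-.
```

nutch-site.xml (file protocol plus the PDF and MS Word parsers):

```xml
<nutch-conf>
  <property>
    <name>plugin.includes</name>
    <!-- assumed plugin names; verify against your plugins directory -->
    <value>nutch-extensionpoints|protocol-file|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
  </property>
</nutch-conf>
```

urls.txt would then hold one file: URL per line for each directory to start from, for example file:///path/to/docs/ (placeholder path).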

/Jack

On 11/30/05, Arun Kumar Sharma <[EMAIL PROTECTED]> wrote:
> Nutch Geeks-
>
>         I want to crawl my local hard disk and would like to know what I need
> to do for this. I found this article helpful:
>
> http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6
>
>   But I need a little more clarification:
>
>   1. Can you send me the default configuration that I need in
> crawl-urlfilter.txt for spidering local files? Please make the necessary
> changes to the file content below:
>
>     # skip http:, https:, ftp:, & mailto: urls
>     -^(http|ftp|mailto|https):
>
>     # skip image and other suffixes we can't yet parse
>     -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
>
>     # skip URLs containing certain characters as probable queries, etc.
>     -[?*!@=]
>
>     # accept hosts in MY.DOMAIN.NAME
>     +^http://([a-z0-9]*\.)*www.mysite.com/
>
>     # skip everything else
>     -.
>
>   After this I add a single entry to my nutch-site.xml file:
>
>   <nutch-conf>
>     <property>
>         <name>plugin.includes</name>
>         <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
>      </property>
>   </nutch-conf>
>
>     Is this correct? If not, what do I need to change?
>
>     When I do this I get the following error:
>
>     "051130 102544 SEVERE org.apache.nutch.plugin.PluginRuntimeException:  
> extension point: org.apache.nutch.searcher.QueryFilter does not exist.
>     java.lang.ExceptionInInitializerError"
>
>   2. For local hard-disk crawling, what do I need to add to urls.txt?
>
>   3. I want to crawl both PDF and MS Word files. How can I include the plugins
> for that? What configuration is required for that in the nutch-site.xml
> file?
>
>       Anxiously awaiting an answer.
>
> Bill Goffe <[EMAIL PROTECTED]> wrote:
> Arun -
>
> I suspect others will mention this too, but see
> http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6
>
>           - Bill
>
>
> >  I want to crawl and index local system files. Is there any way to do this
> > using Nutch? What do I need to do, and what configuration changes are
> > required? I am very new to Nutch, so I need your help in this regard.
> >         Thanks in advance for a quick and good response.
> >
> >
> > Regards,
> >
> > Arun Kumar Sharma (Tech Lead -Java/J2EE)
> > Mob: +91.981.529.5761
> >
> --
>          *------------------------------------------------------*
>          | Bill Goffe                 [EMAIL PROTECTED]          |
>          | Department of Economics    voice: (315) 312-3444     |
>          | SUNY Oswego                fax:   (315) 312-5444     |
>          | 416 Mahar Hall                  |
>          | Oswego, NY  13126                                    |
> *--------*------------------------------------------------------*-----------*
> | "He's better about shaving his legs than I am. The pressure's on me to    |
> | keep my legs smooth."                                                     |
> |  -- Sheryl Crow, on her boyfriend Lance Armstrong. "Crow's Armstrong      |
> |     Song: 'Make 'Em Suffer,'" July 15, 2005, CNN.com                      |
> *---------------------------------------------------------------------------*
>
>
>
>
>
> Regards,
>
> Arun Kumar Sharma (Tech Lead -Java/J2EE)
> Mob: +91.981.529.5761
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
