Thanks for your write-up! I have a question, though: how do you modify search.jsp and the cached servlet so that Word and PDF files can be viewed seamlessly when the user requests them?
On 4/1/06, Vertical Search <[EMAIL PROTECTED]> wrote:
>
> Nutchians,
> I have tried to document the sequence of steps needed to adopt nutch to
> crawl and search a local file system on a Windows machine.
> I have been able to do it successfully using nutch 0.8 Dev.
> The configuration is as follows:
> Inspiron 630m, Intel(R) Pentium(R) M Processor 760 (2 GHz / 2 MB cache /
> 533 MHz FSB), Genuine Windows XP Professional.
> If someone can review it, that would be very helpful.
>
> Crawling the local filesystem with nutch
> Platform: Microsoft Windows / nutch 0.8 Dev
> For a Linux version, please refer to
> http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
> That link did help me get it off the ground.
>
> I have been working on adopting nutch in a vertical domain. All of a
> sudden, I was asked to develop a proof of concept for adopting nutch to
> crawl and search the local file system. Initially I did face some problems,
> but some mail archives helped me proceed further. The intention here is to
> give an overview of the steps to crawl local file systems and search them
> through the browser.
>
> I downloaded the nutch nightly build, then:
>
> 1. Create an environment variable such as "NUTCH_HOME". (Not mandatory,
> but it helps.)
> 2. Extract the downloaded nightly build. (Don't build yet.)
> 3. Create a folder, e.g. c:/LocalSearch, and copy the following folders
> and libraries into it:
>    1. bin/
>    2. conf/
>    3. the *.job, *.jar and *.war files
>    4. urls/ (the URLs folder)
>    5. the plugins folder
> 4. Modify nutch-site.xml to point at the plugins folder.
> 5. Modify nutch-site.xml to set the plugin includes. An example is as
> follows:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
> <!-- Put site-specific property overrides in this file. -->
> <nutch-conf>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
>   </property>
>   <property>
>     <name>file.content.limit</name>
>     <value>-1</value>
>   </property>
> </nutch-conf>
>
> 6. Modify crawl-urlfilter.txt.
> Remember we have to crawl the local file system, so the entries have to be
> changed as follows:
>
> # skip http:, ftp:, & mailto: urls
> ##-^(file|ftp|mailto):
> -^(http|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>
> # accept anything else
> +.*
>
> 7. The urls folder.
> Create a file listing all the URLs to be crawled and save it under the
> urls folder. The directories should be in "file://" format. Example
> entries:
>
> file://c:/resumes/word
> file://c:/resumes/pdf
> #file:///data/readings/semanticweb/
>
> Nutch recognises that the third line does not contain a valid file URL and
> skips it.
>
> 8. Ignore the parent directories. As suggested in the Linux flavour of the
> local-fs crawl (the link above), I modified the code in
> org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File f).
>
> I changed the following line:
>
> this.content = list2html(f.listFiles(), path, "/".equals(path) ? false : true);
>
> to
>
> this.content = list2html(f.listFiles(), path, false);
>
> and recompiled.
>
> 9. Compile the changes.
> I just compiled the whole source code base; it did not take more than two
> minutes.
>
> 10. Crawl the file system.
> On my desktop I have a shortcut to "cygdrive"; double-click it, then:
> pwd
> cd ../../cygdrive/c/$NUTCH_HOME
>
> Execute:
> bin/nutch crawl urls -dir c:/localfs/database
>
> Voila, that is it. After 20 minutes the files were indexed, merged and all
> done.
>
> 11. Extracted the nutch-0.8-dev.war file to the <TOMCAT_HOME>/webapps/ROOT
> folder.
>
> Opened nutch-site.xml and added the following snippet to point at the
> search folder:
>
> <property>
>   <name>searcher.dir</name>
>   <value>c:/localfs/database</value>
>   <description>
>     Path to root of crawl. This directory is searched (in order) for
>     either the file search-servers.txt, containing a list of distributed
>     search servers, or the directory "index" containing merged indexes, or
>     the directory "segments" containing segment indexes.
>   </description>
> </property>
>
> 12. Searching locally was a bit slow, so I changed the hosts file to map
> the machine name to localhost. That sped up searching considerably.
>
> 13. Modified search.jsp and the cached servlet to view Word and PDF
> seamlessly, as the user demands.
>
> I hope this helps folks who are trying to adopt nutch for the local file
> system. Personally, I believe corporates should adopt nutch rather than
> buying a Google appliance :)
>
> --
> www.babatu.com
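
Regarding the question at the top (and step 13 in the quoted steps): the post only says the change was made, it does not show it. Below is a minimal, untested sketch of the general idea, assuming the Nutch 0.8-dev searcher API (NutchBean.getDetails / getContent / getParseData); the class name, the "Content-Type" metadata key, and the way the bean and hit are obtained are my own assumptions, so check them against your source tree. The gist is to stream the cached bytes with their original Content-Type instead of wrapping them in HTML, so the browser hands .doc to Word and .pdf to the PDF viewer.

// Hypothetical helper that a modified cached.jsp could call instead of
// rendering HTML. Assumes the Nutch 0.8-dev searcher API; untested sketch.
import java.io.IOException;
import java.io.OutputStream;
import javax.servlet.http.HttpServletResponse;

import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.NutchBean;

public class OriginalContentWriter {

  /** Streams the cached bytes of a hit with its original Content-Type. */
  public static void writeOriginal(NutchBean bean, Hit hit,
                                   HttpServletResponse response)
      throws IOException {
    HitDetails details = bean.getDetails(hit);
    byte[] bytes = bean.getContent(details);     // raw bytes stored in the segment

    // Assumption: the fetch-time Content-Type survives in the parse metadata.
    String contentType = bean.getParseData(details).getMeta("Content-Type");
    if (contentType == null) {
      contentType = "application/octet-stream";  // fall back to a generic download
    }

    // Send the bytes untouched; the browser then picks the matching viewer.
    response.setContentType(contentType);
    response.setContentLength(bytes.length);
    OutputStream out = response.getOutputStream();
    out.write(bytes);
    out.flush();
  }
}

A modified cached.jsp (or the link target generated in search.jsp) would then just pass its existing bean, hit, and response to writeOriginal(...) and return, instead of printing the HTML-ified cache page. Whether that matches what the original poster actually did in step 13, I can't say.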