Thanks for your write-up! I have a question, though: how do you modify search.jsp and the cached servlet so that Word and PDF files can be viewed seamlessly when the user requests them?
On 4/1/06, Vertical Search <[EMAIL PROTECTED]> wrote:
>
> Nutchians,
> I have tried to document the sequence of steps needed to adopt nutch to
> crawl and search a local file system on a Windows machine.
> I have been able to do it successfully using nutch 0.8 Dev.
> The configuration is as follows:
> Inspiron 630m, Intel(R) Pentium(R) M Processor 760 (2 GHz / 2 MB cache /
> 533 MHz FSB), Genuine Windows XP Professional.
> If someone can review it, that would be very helpful.
>
> Crawling the local filesystem with nutch
> Platform: Microsoft Windows / nutch 0.8 Dev
> For a Linux version, please refer to
> http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
> That link did help me get it off the ground.
>
> I have been working on adopting nutch in a vertical domain. All of a
> sudden, I was asked to develop a proof of concept for adopting nutch to
> crawl and search the local file system. Initially I did face some problems,
> but some mail archives helped me proceed further. The intention here is to
> give an overview of the steps to crawl local file systems and search them
> through the browser.
>
> I downloaded the nutch nightly build, then:
>
> 1. Create an environment variable such as "NUTCH_HOME". (Not mandatory,
> but it helps.)
> 2. Extract the downloaded nightly build. (Don't build yet.)
> 3. Create a folder, e.g. c:/LocalSearch, and copy the following folders
> and libraries into it:
>    1. bin/
>    2. conf/
>    3. the *.job, *.jar and *.war files
>    4. urls/ (the URLs folder)
>    5. the plugins folder
> 4. Modify nutch-site.xml to point at the plugins folder.
> 5. Modify nutch-site.xml to set the plugin includes. An example is as
> follows:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
> <!-- Put site-specific property overrides in this file. -->
> <nutch-conf>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
>   </property>
>   <property>
>     <name>file.content.limit</name>
>     <value>-1</value>
>   </property>
> </nutch-conf>
>
> 6. Modify crawl-urlfilter.txt.
> Remember we have to crawl the local file system, so the entries have to be
> changed as follows:
>
> # skip http:, ftp:, & mailto: urls
> ##-^(file|ftp|mailto):
> -^(http|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>
> # accept anything else
> +.*
>
> 7. The urls folder.
> Create a file listing all the URLs to be crawled and save it under the
> urls folder. The directories should be in "file://" format. Example
> entries:
>
> file://c:/resumes/word
> file://c:/resumes/pdf
> #file:///data/readings/semanticweb/
>
> Nutch recognises that the third line does not contain a valid file URL and
> skips it.
>
> 8. Ignore the parent directories. As suggested in the Linux flavour of the
> local-fs crawl (the link above), I modified the code in
> org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File f).
>
> I changed the following line:
>
> this.content = list2html(f.listFiles(), path, "/".equals(path) ? false : true);
>
> to
>
> this.content = list2html(f.listFiles(), path, false);
>
> and recompiled.
>
> 9. Compile the changes.
> I just compiled the whole source code base; it did not take more than two
> minutes.
>
> 10. Crawl the file system.
> On my desktop I have a shortcut to "cygdrive"; double-click it, then:
> pwd
> cd ../../cygdrive/c/$NUTCH_HOME
>
> Execute:
> bin/nutch crawl urls -dir c:/localfs/database
>
> Voila, that is it. After 20 minutes the files were indexed, merged and all
> done.
>
> 11. Extracted the nutch-0.8-dev.war file to the <TOMCAT_HOME>/webapps/ROOT
> folder.
>
> Opened nutch-site.xml and added the following snippet to point at the
> search folder:
>
> <property>
>   <name>searcher.dir</name>
>   <value>c:/localfs/database</value>
>   <description>
>     Path to root of crawl. This directory is searched (in order) for
>     either the file search-servers.txt, containing a list of distributed
>     search servers, or the directory "index" containing merged indexes, or
>     the directory "segments" containing segment indexes.
>   </description>
> </property>
>
> 12. Searching locally was a bit slow, so I changed the hosts file to map
> the machine name to localhost. That sped up searching considerably.
>
> 13. Modified search.jsp and the cached servlet to view Word and PDF
> seamlessly, as the user demands.
>
> I hope this helps folks who are trying to adopt nutch for the local file
> system. Personally, I believe corporates should adopt nutch rather than
> buying a Google appliance :)
>
> --
> www.babatu.com
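
Regarding the question at the top (and step 13 in the quoted steps): the post only says the change was made, it does not show it. Below is a minimal, untested sketch of the general idea, assuming the Nutch 0.8-dev searcher API (NutchBean.getDetails / getContent / getParseData); the class name, the "Content-Type" metadata key, and the way the bean and hit are obtained are my own assumptions, so check them against your source tree. The gist is to stream the cached bytes with their original Content-Type instead of wrapping them in HTML, so the browser hands .doc to Word and .pdf to the PDF viewer.

// Hypothetical helper that a modified cached.jsp could call instead of
// rendering HTML. Assumes the Nutch 0.8-dev searcher API; untested sketch.
import java.io.IOException;
import java.io.OutputStream;
import javax.servlet.http.HttpServletResponse;

import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.NutchBean;

public class OriginalContentWriter {

  /** Streams the cached bytes of a hit with its original Content-Type. */
  public static void writeOriginal(NutchBean bean, Hit hit,
                                   HttpServletResponse response)
      throws IOException {
    HitDetails details = bean.getDetails(hit);
    byte[] bytes = bean.getContent(details);     // raw bytes stored in the segment

    // Assumption: the fetch-time Content-Type survives in the parse metadata.
    String contentType = bean.getParseData(details).getMeta("Content-Type");
    if (contentType == null) {
      contentType = "application/octet-stream";  // fall back to a generic download
    }

    // Send the bytes untouched; the browser then picks the matching viewer.
    response.setContentType(contentType);
    response.setContentLength(bytes.length);
    OutputStream out = response.getOutputStream();
    out.write(bytes);
    out.flush();
  }
}

A modified cached.jsp (or the link target generated in search.jsp) would then just pass its existing bean, hit, and response to writeOriginal(...) and return, instead of printing the HTML-ified cache page. Whether that matches what the original poster actually did in step 13, I can't say.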