Hi Sudhendra Seshachala, thanks so much for your code. Yes, I would like it.
On 4/5/06, sudhendra seshachala <[EMAIL PROTECTED]> wrote:
>
> I just modified search.jsp. Basically, I set the content type based on the
> document type I was querying. The rest is handled by the protocol and the
> browser.
>
> I can send the code if you would like.
>
> Thanks
>
> kauu <[EMAIL PROTECTED]> wrote:
> Thanks for your idea!
> But I have a question: how do I modify search.jsp and the cached servlet so
> that Word and PDF documents are shown to the user seamlessly, on demand?
>
> On 4/1/06, Vertical Search wrote:
> >
> > Nutchians,
> > I have tried to document the sequence of steps needed to adapt Nutch to
> > crawl and search the local file system on a Windows machine. I have been
> > able to do it successfully using Nutch 0.8-dev.
> > The configuration is as follows:
> > Inspiron 630m, Intel(R) Pentium(R) M Processor 760 (2 GHz / 2 MB cache /
> > 533 MHz), Genuine Windows XP Professional.
> > If someone can review this, it would be very helpful.
> >
> > Crawling the local file system with Nutch
> > Platform: Microsoft Windows / Nutch 0.8-dev
> > For a Linux version, please refer to
> > http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
> > That link did help me get it off the ground.
> >
> > I have been working on adopting Nutch in a vertical domain. All of a
> > sudden, I was asked to develop a proof of concept for adapting Nutch to
> > crawl and search the local file system. Initially I did face some
> > problems, but some mail archives helped me move forward. The intention
> > here is to give an overview of the steps needed to crawl local file
> > systems and search them through the browser.
> >
> > I downloaded the Nutch nightly build and proceeded as follows:
> >
> > 1. Create an environment variable such as "NUTCH_HOME". (Not mandatory,
> >    but it helps.)
> >
> > 2. Extract the downloaded nightly build.
> >
> > 3. Create a folder, e.g. c:/LocalSearch, and copy the following folders
> >    and libraries into it:
> >    1. bin/
> >    2. conf/
> >    3. the *.job, *.jar and *.war files
> >    4. urls/
> >    5. the plugins folder
> >
> > 4. Modify nutch-site.xml to point at the plugins folder.
> >
> > 5. Modify nutch-site.xml to set the plugin includes and the file content
> >    limit. An example is as follows:
> >
> >    <property>
> >      <name>plugin.includes</name>
> >      <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
> >    </property>
> >
> >    <property>
> >      <name>file.content.limit</name>
> >      <value>-1</value>
> >    </property>
> >
> > 6. Modify crawl-urlfilter.txt.
> >    Remember that we are crawling the local file system, so the entries
> >    have to be changed as follows (a small standalone sketch for
> >    sanity-checking these patterns appears after the quoted thread below):
> >
> >    # skip http:, ftp:, and mailto: urls
> >    ## -^(file|ftp|mailto):
> >    -^(http|ftp|mailto):
> >
> >    # skip image and other suffixes we can't yet parse
> >    -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
> >
> >    # skip URLs containing certain characters as probable queries, etc.
> >    -[?*!@=]
> >
> >    # accept hosts in MY.DOMAIN.NAME
> >    # +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> >
> >    # accept anything else
> >    +.*
> >
> > 7. The urls folder.
> >    Create a file containing all the URLs to be crawled and save it under
> >    the urls folder. The directories should be given in "file://" format.
> >    Example entries:
> >
> >    file://c:/resumes/word
> >    file://c:/resumes/pdf
> >    #file:///data/readings/semanticweb/
> >
> >    Nutch recognises that the third line does not contain a valid file URL
> >    and skips it.
> >
> > 8. Ignoring the parent directories.
> >    As suggested by the link above (the Linux flavour of the local file
> >    system crawl), I modified the code in
> >    org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File f).
> >
> >    I changed the following line:
> >
> >    this.content = list2html(f.listFiles(), path, "/".equals(path) ? false : true);
> >
> >    to
> >
> >    // changed so that parent directories are ignored and the crawl stays
> >    // inside the seed directories (see step 8)
> >    this.content = list2html(f.listFiles(), path, false);
> >
> >    and recompiled.
> >
> > 9. Compile the changes. I just compiled the whole source code base; it
> >    did not take more than two minutes.
> >
> > 10. Crawl the file system.
> >     On my desktop I have a shortcut to "cygdrive"; double-click it, then:
> >
> >     pwd
> >     cd ../../cygdrive/c/$NUTCH_HOME
> >
> >     Execute:
> >
> >     bin/nutch crawl urls -dir c:/localfs/database
> >
> >     Voila, that is it. After 20 minutes the files were indexed, merged
> >     and all done.
> >
> > 11. Extract the nutch-0.8-dev.war file into the webapps/ROOT folder.
> >     Open nutch-site.xml and add the following snippet to point at the
> >     search folder:
> >
> >     <property>
> >       <name>searcher.dir</name>
> >       <value>c:/localfs/database</value>
> >       <description>
> >       Path to root of crawl. This directory is searched (in
> >       order) for either the file search-servers.txt, containing a list of
> >       distributed search servers, or the directory "index" containing
> >       merged indexes, or the directory "segments" containing segment
> >       indexes.
> >       </description>
> >     </property>
> >
> > 12. Searching locally was a bit slow, so I changed the hosts file to map
> >     the machine name to localhost. That sped up searching considerably.
> >
> > 13. Modify search.jsp and the cached servlet so that Word and PDF
> >     documents are served to the user seamlessly on demand. (A rough
> >     sketch of this idea appears after the quoted thread below.)
> >
> > I hope this helps folks who are trying to adopt Nutch for the local file
> > system. Personally, I believe corporates should adopt Nutch rather than
> > buying a Google appliance :)
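
While waiting for the actual code, my understanding of the change described in
Sudhendra's reply and in step 13 of the quoted guide is simply this: set the
HTTP Content-Type on the response according to the type of document being
returned, and let the protocol and the browser do the rest. Below is a minimal
sketch of that idea. It is not Sudhendra's code; the helper class name and the
extension-to-MIME mapping are made up purely for illustration.

    import javax.servlet.http.HttpServletResponse;

    // Hypothetical helper, called from search.jsp / the cached servlet
    // before the document bytes are written to the response.
    public final class ContentTypeUtil {

        private ContentTypeUtil() {}

        // Pick a MIME type from the document URL's extension and set it on
        // the response, so Word and PDF files open seamlessly in the browser.
        public static void setContentType(HttpServletResponse response, String url) {
            String lower = url.toLowerCase();
            if (lower.endsWith(".pdf")) {
                response.setContentType("application/pdf");
            } else if (lower.endsWith(".doc")) {
                response.setContentType("application/msword");
            } else {
                response.setContentType("text/html; charset=UTF-8");
            }
        }
    }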
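One more note for anyone reproducing step 6 of the quoted guide: it can be
handy to sanity-check the URL filter patterns outside of Nutch first. The
class below is a throwaway standalone sketch (not part of Nutch; the class
name is made up) that applies the same skip rule used in crawl-urlfilter.txt,
i.e. skip http:, ftp: and mailto: URLs while keeping file: URLs.

    import java.util.regex.Pattern;

    // Standalone sanity check for the skip pattern used in crawl-urlfilter.txt.
    public class FilterCheck {
        public static void main(String[] args) {
            // Same prefix rule as step 6: skip http:, ftp: and mailto: URLs.
            Pattern skip = Pattern.compile("^(http|ftp|mailto):");

            String[] urls = {
                "file://c:/resumes/word",
                "file://c:/resumes/pdf",
                "http://example.com/page.html"
            };
            for (String url : urls) {
                boolean skipped = skip.matcher(url).find();
                System.out.println(url + " -> " + (skipped ? "skipped" : "kept"));
            }
        }
    }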
>
> Sudhi Seshachala
> http://sudhilogs.blogspot.com/

--
www.babatu.com