Re: Crawling the local file system with Nutch - Document-
Hi Sudhendra Seshachala, thanks so much for your code. Yes, I want it.

On 4/5/06, Sudhendra Seshachala [EMAIL PROTECTED] wrote: I just modified search.jsp. Basically, I set the content type based on the document type I was querying; the rest is handled by the protocol and the browser. I can send the code if you would like. Thanks.
Re: Crawling the local file system with Nutch - Document-
I just modified search.jsp. Basically, I set the content type based on the document type I was querying; the rest is handled by the protocol and the browser. I can send the code if you would like. Thanks.
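Roughly, the idea looks like the sketch below in search.jsp (and the cached page). This is only illustrative, not the exact code I used -- in particular the "type" request parameter is an assumption; in practice you take the MIME type from wherever your page already gets the hit's details:

  <%
    // Sketch only: choose the HTTP Content-Type from the document's type,
    // defaulting to HTML, so Word and PDF hits open in their native viewers.
    String docType = request.getParameter("type");   // illustrative; substitute your own lookup
    String contentType = "text/html";
    if ("pdf".equalsIgnoreCase(docType)) {
      contentType = "application/pdf";
    } else if ("doc".equalsIgnoreCase(docType) || "msword".equalsIgnoreCase(docType)) {
      contentType = "application/msword";
    }
    response.setContentType(contentType);            // the protocol and browser handle the rest
  %>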
Sudhi Seshachala
http://sudhilogs.blogspot.com/
Re: Crawling the local file system with Nutch - Document-
Thanks for your idea! But I have a question: how do I modify search.jsp and the cached servlet to view Word and PDF as demanded by the user, seamlessly?

On 4/1/06, Vertical Search [EMAIL PROTECTED] wrote:

Nutchians,

I have tried to document the sequence of steps to adopt Nutch to crawl and search the local file system on a Windows machine. I have been able to do it successfully using Nutch 0.8-dev. The configuration is as follows: *Inspiron 630m, Intel(R) Pentium(R) M Processor 760 (2GHz/2MB Cache/533MHz), Genuine Windows XP Professional.* *If someone can review it, it will be very helpful.*

Crawling the local filesystem with Nutch
Platform: Microsoft / Nutch 0.8-dev

For a Linux version, please refer to http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch -- that link did help me get it off the ground. I have been working on adopting Nutch in a vertical domain, and all of a sudden I was asked to develop a proof of concept to adopt Nutch to crawl and search the local file system. Initially I did face some problems, but some mail archives did help me proceed further. The intention is to provide an overview of the steps to crawl local file systems and search them through the browser. I downloaded the Nutch nightly build, and then:

1. Create an environment variable such as NUTCH_HOME. (Not mandatory, but it helps.)

2. Extract the downloaded nightly build. Don't build yet.

3. Create a folder -- c:/LocalSearch -- and copy the following folders and libraries into it:
   1. bin/
   2. conf/
   3. *.job, *.jar and *.war files
   4. urls/ (the URLs folder)
   5. plugins/ folder

4. Modify the nutch-site.xml to include the plugin folder (see the plugin.folders note after step 5).

5. Modify the nutch-site.xml to include the plugin includes. An example is as follows:

   <?xml version="1.0"?>
   <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
   <!-- Put site-specific property overrides in this file. -->
   <nutch-conf>
     <property>
       <name>plugin.includes</name>
       <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
     </property>
     <property>
       <name>file.content.limit</name>
       <value>-1</value>
     </property>
   </nutch-conf>
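For step 4, the property that points Nutch at the copied plugins folder is plugin.folders. Something along these lines goes in the same nutch-site.xml -- the value below is just where the plugins folder ended up in step 3, so adjust the path if yours differs:

   <property>
     <name>plugin.folders</name>
     <value>c:/LocalSearch/plugins</value>
   </property>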
6. Modify crawl-urlfilter.txt. Remember we have to crawl the local file system, so the entries have to change as follows:

   # skip http:, ftp:, and mailto: urls
   ##-^(file|ftp|mailto):
   -^(http|ftp|mailto):

   # skip image and other suffixes we can't yet parse
   -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

   # skip URLs containing certain characters as probable queries, etc.
   -[?*!@=]

   # accept hosts in MY.DOMAIN.NAME
   #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

   # accept anything else
   +.*

7. The urls folder. Create a file listing all the URLs to be crawled and save it under the urls folder. The directories should be given as file:// URLs. Example entries:

   file://c:/resumes/word
   file:///c:/resumes/word
   file://c:/resumes/pdf
   file:///c:/resumes/pdf
   #file:///data/readings/semanticweb/

   As suggested by the linked article, Nutch recognises when a line does not contain a valid file URL and skips it.

8. Ignoring the parent directories. As suggested in the Linux flavor of the local filesystem crawl, I modified the code in org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File f). I changed the following line:

   this.content = list2html(f.listFiles(), path, "/".equals(path) ? false : true);

   to

   this.content = list2html(f.listFiles(), path, false);

   and recompiled.

9. Compile the changes. I just compiled the whole source code base; it did not take more than 2 minutes.

10. Crawling the file system. On my desktop I have a shortcut to the cygdrive; double-click it, pwd, and then:

   cd ../../cygdrive/c/$NUTCH_HOME
   bin/nutch crawl urls -dir c:/localfs/database

   Voila, that is it. After 20 minutes, the files were indexed, merged and all done.

11. Extracted the nutch-0.8-dev.war file to the TOMCAT_HOME/webapps/ROOT folder. Opened nutch-site.xml and added the following snippet to point at the search folder:

   <property>
     <name>searcher.dir</name>
     <value>c:/localfs/database</value>
     <description>
       Path to root of crawl. This directory is searched (in order) for either
       the file search-servers.txt, containing a list of distributed search
       servers, or the directory "index" containing merged indexes, or the
       directory "segments" containing segment indexes.
     </description>
   </property>

12. Searching locally was a bit slow, so I changed the hosts file to map the machine name to localhost. That sped up searching considerably.

13. Modified search.jsp and the cached servlet to view Word and PDF as demanded by the user, seamlessly.

I hope this helps folks who are trying to adopt Nutch for the local file system. Personally, I believe corporations should adopt Nutch rather than buying a Google appliance :)

--
www.babatu.com