Hi Sudhendra Seshachala, thanks so much for your code. Yes, I would like it.
On 4/5/06, sudhendra seshachala <[EMAIL PROTECTED]> wrote:
>
> I just modified search.jsp. Basically, I set the content type based on the
> document type I was querying. The rest is handled by the protocol and the
> browser.
>
> I can send the code if you would like.
>
> Thanks
>
> kauu <[EMAIL PROTECTED]> wrote:
> Thanks for your idea!
> But I have a question: how do I modify search.jsp and the cached servlet so
> that Word and PDF documents are shown to the user seamlessly, on demand?
>
> On 4/1/06, Vertical Search wrote:
> >
> > Nutchians,
> > I have tried to document the sequence of steps needed to adapt Nutch to
> > crawl and search the local file system on a Windows machine. I have been
> > able to do it successfully using Nutch 0.8-dev.
> > The configuration is as follows:
> > Inspiron 630m, Intel(R) Pentium(R) M Processor 760 (2 GHz / 2 MB cache /
> > 533 MHz), Genuine Windows XP Professional.
> > If someone can review this, it would be very helpful.
> >
> > Crawling the local file system with Nutch
> > Platform: Microsoft Windows / Nutch 0.8-dev
> > For a Linux version, please refer to
> > http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
> > That link did help me get it off the ground.
> >
> > I have been working on adopting Nutch in a vertical domain. All of a
> > sudden, I was asked to develop a proof of concept for adapting Nutch to
> > crawl and search the local file system. Initially I did face some
> > problems, but some mail archives helped me move forward. The intention
> > here is to give an overview of the steps needed to crawl local file
> > systems and search them through the browser.
> >
> > I downloaded the Nutch nightly build and proceeded as follows:
> >
> > 1. Create an environment variable such as "NUTCH_HOME". (Not mandatory,
> >    but it helps.)
> >
> > 2. Extract the downloaded nightly build.
> >
> > 3. Create a folder, e.g. c:/LocalSearch, and copy the following folders
> >    and libraries into it:
> >    1. bin/
> >    2. conf/
> >    3. the *.job, *.jar and *.war files
> >    4. urls/
> >    5. the plugins folder
> >
> > 4. Modify nutch-site.xml to point at the plugins folder.
> >
> > 5. Modify nutch-site.xml to set the plugin includes and the file content
> >    limit. An example is as follows:
> >
> >    <property>
> >      <name>plugin.includes</name>
> >      <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
> >    </property>
> >
> >    <property>
> >      <name>file.content.limit</name>
> >      <value>-1</value>
> >    </property>
> >
> > 6. Modify crawl-urlfilter.txt.
> >    Remember that we are crawling the local file system, so the entries
> >    have to be changed as follows (a small standalone sketch for
> >    sanity-checking these patterns appears after the quoted thread below):
> >
> >    # skip http:, ftp:, and mailto: urls
> >    ## -^(file|ftp|mailto):
> >    -^(http|ftp|mailto):
> >
> >    # skip image and other suffixes we can't yet parse
> >    -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
> >
> >    # skip URLs containing certain characters as probable queries, etc.
> >    -[?*!@=]
> >
> >    # accept hosts in MY.DOMAIN.NAME
> >    # +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> >
> >    # accept anything else
> >    +.*
> >
> > 7. The urls folder.
> >    Create a file containing all the URLs to be crawled and save it under
> >    the urls folder. The directories should be given in "file://" format.
> >    Example entries:
> >
> >    file://c:/resumes/word
> >    file://c:/resumes/pdf
> >    #file:///data/readings/semanticweb/
> >
> >    Nutch recognises that the third line does not contain a valid file URL
> >    and skips it.
> >
> > 8. Ignoring the parent directories.
> >    As suggested by the link above (the Linux flavour of the local file
> >    system crawl), I modified the code in
> >    org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File f).
> >
> >    I changed the following line:
> >
> >    this.content = list2html(f.listFiles(), path, "/".equals(path) ? false : true);
> >
> >    to
> >
> >    // changed so that parent directories are ignored and the crawl stays
> >    // inside the seed directories (see step 8)
> >    this.content = list2html(f.listFiles(), path, false);
> >
> >    and recompiled.
> >
> > 9. Compile the changes. I just compiled the whole source code base; it
> >    did not take more than two minutes.
> >
> > 10. Crawl the file system.
> >     On my desktop I have a shortcut to "cygdrive"; double-click it, then:
> >
> >     pwd
> >     cd ../../cygdrive/c/$NUTCH_HOME
> >
> >     Execute:
> >
> >     bin/nutch crawl urls -dir c:/localfs/database
> >
> >     Voila, that is it. After 20 minutes the files were indexed, merged
> >     and all done.
> >
> > 11. Extract the nutch-0.8-dev.war file into the webapps/ROOT folder.
> >     Open nutch-site.xml and add the following snippet to point at the
> >     search folder:
> >
> >     <property>
> >       <name>searcher.dir</name>
> >       <value>c:/localfs/database</value>
> >       <description>
> >       Path to root of crawl. This directory is searched (in
> >       order) for either the file search-servers.txt, containing a list of
> >       distributed search servers, or the directory "index" containing
> >       merged indexes, or the directory "segments" containing segment
> >       indexes.
> >       </description>
> >     </property>
> >
> > 12. Searching locally was a bit slow, so I changed the hosts file to map
> >     the machine name to localhost. That sped up searching considerably.
> >
> > 13. Modify search.jsp and the cached servlet so that Word and PDF
> >     documents are served to the user seamlessly on demand. (A rough
> >     sketch of this idea appears after the quoted thread below.)
> >
> > I hope this helps folks who are trying to adopt Nutch for the local file
> > system. Personally, I believe corporates should adopt Nutch rather than
> > buying a Google appliance :)
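
While waiting for the actual code, my understanding of the change described in
Sudhendra's reply and in step 13 of the quoted guide is simply this: set the
HTTP Content-Type on the response according to the type of document being
returned, and let the protocol and the browser do the rest. Below is a minimal
sketch of that idea. It is not Sudhendra's code; the helper class name and the
extension-to-MIME mapping are made up purely for illustration.

    import javax.servlet.http.HttpServletResponse;

    // Hypothetical helper, called from search.jsp / the cached servlet
    // before the document bytes are written to the response.
    public final class ContentTypeUtil {

        private ContentTypeUtil() {}

        // Pick a MIME type from the document URL's extension and set it on
        // the response, so Word and PDF files open seamlessly in the browser.
        public static void setContentType(HttpServletResponse response, String url) {
            String lower = url.toLowerCase();
            if (lower.endsWith(".pdf")) {
                response.setContentType("application/pdf");
            } else if (lower.endsWith(".doc")) {
                response.setContentType("application/msword");
            } else {
                response.setContentType("text/html; charset=UTF-8");
            }
        }
    }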
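One more note for anyone reproducing step 6 of the quoted guide: it can be
handy to sanity-check the URL filter patterns outside of Nutch first. The
class below is a throwaway standalone sketch (not part of Nutch; the class
name is made up) that applies the same skip rule used in crawl-urlfilter.txt,
i.e. skip http:, ftp: and mailto: URLs while keeping file: URLs.

    import java.util.regex.Pattern;

    // Standalone sanity check for the skip pattern used in crawl-urlfilter.txt.
    public class FilterCheck {
        public static void main(String[] args) {
            // Same prefix rule as step 6: skip http:, ftp: and mailto: URLs.
            Pattern skip = Pattern.compile("^(http|ftp|mailto):");

            String[] urls = {
                "file://c:/resumes/word",
                "file://c:/resumes/pdf",
                "http://example.com/page.html"
            };
            for (String url : urls) {
                boolean skipped = skip.matcher(url).find();
                System.out.println(url + " -> " + (skipped ? "skipped" : "kept"));
            }
        }
    }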
>
> Sudhi Seshachala
> http://sudhilogs.blogspot.com/

--
www.babatu.com