Re: Crawling the local file system with Nutch - Document-

2006-04-14 Thread kauu
Hi Sudhendra Seshachala,
Thank you so much for your code.
Yes, I would like it.


On 4/5/06, sudhendra seshachala [EMAIL PROTECTED] wrote:

 I just modified search.jsp. Basically, I set the content type based on the
 document type I was querying; the rest is handled by the protocol and the
 browser.

   I can send the code if you would like.

   Thanks


Re: Crawling the local file system with Nutch - Document-

2006-04-04 Thread sudhendra seshachala
I just modified search.jsp. Basically, I set the content type based on the
document type I was querying; the rest is handled by the protocol and the
browser.
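
Roughly, the idea is along these lines (a sketch only, not the exact code;
fetchCachedBytes()/fetchCachedType() stand in for however the cached bytes
and their MIME type are actually looked up from the crawl data):

import java.io.IOException;
import java.io.OutputStream;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class CachedContentServlet extends HttpServlet {

  protected void doGet(HttpServletRequest req, HttpServletResponse res)
      throws IOException {
    String url = req.getParameter("url");

    byte[] content = fetchCachedBytes(url);  // placeholder: load cached bytes
    String mimeType = fetchCachedType(url);  // placeholder: e.g. "application/pdf"
    if (mimeType == null || mimeType.length() == 0) {
      mimeType = "application/octet-stream"; // safe fallback
    }

    // The key part: tell the browser what the document really is, so Word
    // and Acrobat open it instead of the content being dumped as text.
    res.setContentType(mimeType);
    res.setContentLength(content.length);

    OutputStream out = res.getOutputStream();
    out.write(content);
    out.flush();
  }

  // placeholders only; the real code would read the cached document
  // and its stored content type from the crawl data
  private byte[] fetchCachedBytes(String url) { return new byte[0]; }
  private String fetchCachedType(String url) { return "application/octet-stream"; }
}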
   
  I can send the code if you would like.
   
  Thanks

kauu [EMAIL PROTECTED] wrote:
Thanks for your idea!
But I have a question:
how do I modify search.jsp and the cached servlet so that Word and PDF files
are displayed seamlessly, as the user requests?






  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   




Re: Crawling the local file system with Nutch - Document-

2006-03-31 Thread kauu
Thanks for your idea!
But I have a question:
how do I modify search.jsp and the cached servlet so that Word and PDF files
are displayed seamlessly, as the user requests?



On 4/1/06, Vertical Search [EMAIL PROTECTED] wrote:

 Nutchians,
 I have tried to document the sequence of steps needed to adapt Nutch to
 crawl and search the local file system on a Windows machine.
 I have been able to do it successfully using Nutch 0.8-dev.
 The configuration is as follows:
 *Inspiron 630m, Intel(R) Pentium(R) M Processor 760 (2GHz/2MB Cache/533MHz),
 Genuine Windows XP Professional*
 *If someone can review it, that would be very helpful.*

 Crawling the local filesystem with Nutch
 Platform: Microsoft Windows / Nutch 0.8-dev
 For a Linux version, please refer to
 http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
 That link helped me get it off the ground.

 I have been working on adopting Nutch in a vertical domain. All of a
 sudden, I was asked to develop a proof of concept for using Nutch to crawl
 and search the local file system.
 Initially I did face some problems, but some mail archives helped me
 proceed further.
 The intention is to provide an overview of the steps to crawl local file
 systems and search them through the browser.

 I downloaded the Nutch nightly build.
 1. Create an environment variable such as NUTCH_HOME. (Not mandatory, but
 it helps.)
 2. Extract the downloaded nightly build. Don't build yet.
 3. Create a folder -- c:/LocalSearch -- and copy the following folders and
 libraries into it:
 1. bin/
 2. conf/
 3. *.job, *.jar and *.war files
 4. urls/ (the URLs folder)
 5. the plugins folder
 4. Modify nutch-site.xml to include the plugins folder.
 5. Modify nutch-site.xml to set the plugin includes. An example is as
 follows (protocol-file is the plugin that lets Nutch fetch file: URLs, and
 setting file.content.limit to -1 removes the default content-size cap so
 larger documents are not truncated):

 <?xml version="1.0"?>
 <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
 <!-- Put site-specific property overrides in this file. -->
 <nutch-conf>
 <property>
 <name>plugin.includes</name>
 <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
 </property>
 <property>
 <name>file.content.limit</name> <value>-1</value>
 </property>
 </nutch-conf>

 6. Modify crawl-urlfilter.txt
 Remember we have to crawl the local file system. Hence we have to modify
 the entries as follows:

 #skip http:, ftp:, & mailto: urls
 ##-^(file|ftp|mailto):

 -^(http|ftp|mailto):

 #skip image and other suffixes we can't yet parse

 -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

 #skip URLs containing certain characters as probable queries, etc.
 -[?*!@=]

 #accept hosts in MY.DOMAIN.NAME
 #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

 #accept anything else
 +.*

 7. urls folder
 Create a file listing all the URLs to be crawled, with the URLs as below,
 and save the file under the urls folder.

 The directories should be in file:// format. Example entries were as
 follows:

 file://c:/resumes/word
 file://c:/resumes/pdf

 #file:///data/readings/semanticweb/

 Nutch recognises that the third line does not contain a valid file: URL and
 skips it.

 8. Ignoring the parent directories. As suggested in the Linux flavor of
 the local-fs crawl (the link above), I modified the code in
 org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(
 java.io.File f).

 I changed the following line:

 this.content = list2html(f.listFiles(), path, "/".equals(path) ? false :
 true);
 to

 this.content = list2html(f.listFiles(), path, false);
 and recompiled.
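
 A note on why that works (my reading of it, not something spelled out in
 the docs): the boolean passed to list2html appears to control whether a
 parent-directory ("..") entry is emitted in the generated listing, so
 forcing it to false keeps the crawl from walking back up above the seed
 directories.

 // with the flag hard-coded to false, the listing has no ".." link, so the
 // fetcher never discovers directories above the seed folders
 this.content = list2html(f.listFiles(), path, false);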

 9. Compile the changes. I just compiled the whole source code base; it did
 not take more than 2 minutes.

 10. Crawling the file system.
 On my desktop I have a shortcut to Cygwin; double-click it, then:
 pwd
 cd ../../cygdrive/c/$NUTCH_HOME

 Execute:
 bin/nutch crawl urls -dir c:/localfs/database

 Voila, that is it. After 20 minutes, the files were indexed and merged,
 and everything was done.

 11. Extracted the nutch-0.8-dev.war file to the TOMCAT_HOME/webapps/ROOT
 folder.

 Opened nutch-site.xml and added the following snippet to point at the
 search folder:
 <property>
   <name>searcher.dir</name>
   <value>c:/localfs/database</value>
   <description>
   Path to root of crawl.  This directory is searched (in
   order) for either the file search-servers.txt, containing a list of
   distributed search servers, or the directory "index" containing
   merged indexes, or the directory "segments" containing segment
   indexes.
   </description>
 </property>

 12. Searching locally was a bit slow, so I changed the hosts.ini file to
 map the machine name to localhost. That sped up searching considerably.

 13. Modified search.jsp and the cached servlet to display Word and PDF
 files seamlessly, as requested by the user.


 I hope this helps folks who are trying to adopt Nutch for local file
 system search.
 Personally, I believe corporations should adopt Nutch rather than buying a
 Google appliance :)




--
www.babatu.com