maintainability of nutch - building incremental index

Koe Black Thu, 29 Nov 2007 17:38:33 -0800

Hello All,
I have a question.

Our scenario is:
1. Run the initial crawl and build the index that the
application runs on.
2. Add new URLs to the webdb (not a problem).
3. Crawl only the new websites (from step 2) and add
the crawled results to the db and index created
initially. We want to do this programmatically, but
cannot find a good way to do it.


In a nutshell: once a user has created the Nutch index
by crawling the websites, how can they keep building
an incremental index on top of the existing index?

We cannot figure out how to do it efficiently.

We have found one solution: build a new index in a new
crawl directory, merge the crawl db directory created
during the initial crawl with the new crawl directory
into a third crawl directory, then change
WEB-INF/nutch-site.xml and restart the Tomcat server.
This is not a good solution because it requires a huge
amount of disk space (our index is 1 TB).

We also want to do it programmatically. We can use the
Lucene API to create a new document and put it into
the index directory, but we don't know how to get the
segments and merge them into the existing segments
directory. We believe the segments hold the details of
the index and are read during search.
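
For reference, the merge workaround described above might
look roughly like the sketch below. It only assembles the
command lines and prints them unless execute=True is
passed. The subcommand names (mergedb, mergelinkdb,
mergesegs, merge), their argument order, and the
crawldb/linkdb/segments/indexes layout are assumptions
based on the Nutch 0.9-era command-line tools, and the
directory names are hypothetical, so please check them
against the usage output of bin/nutch before relying on it.

```python
import subprocess
from typing import List


def merge_commands(old_crawl: str, new_crawl: str, merged_crawl: str,
                   nutch: str = "bin/nutch") -> List[List[str]]:
    """Build the command lines that merge two crawl directories into a third.

    Assumption: each crawl directory contains crawldb/, linkdb/, segments/
    and indexes/, as produced by the one-step "crawl" command.
    """
    return [
        # Merge the two crawl dbs into a new one.
        [nutch, "mergedb", f"{merged_crawl}/crawldb",
         f"{old_crawl}/crawldb", f"{new_crawl}/crawldb"],
        # Merge the two link dbs.
        [nutch, "mergelinkdb", f"{merged_crawl}/linkdb",
         f"{old_crawl}/linkdb", f"{new_crawl}/linkdb"],
        # Merge all segments from both crawls (assumed: -dir may be repeated).
        [nutch, "mergesegs", f"{merged_crawl}/segments",
         "-dir", f"{old_crawl}/segments", "-dir", f"{new_crawl}/segments"],
        # Merge the per-crawl Lucene indexes into a single index.
        [nutch, "merge", f"{merged_crawl}/index",
         f"{old_crawl}/indexes", f"{new_crawl}/indexes"],
    ]


def run(commands: List[List[str]], execute: bool = False) -> None:
    """Print each command; run it only when execute is True."""
    for cmd in commands:
        print(" ".join(cmd))
        if execute:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: prints the four commands without touching any data.
    run(merge_commands("crawl", "crawl_new", "crawl_merged"))
```

Every step writes a full copy into the merged directory,
which is exactly where the disk space problem comes from.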

Any ideas for our problem?

Thank you


      
