Greetings,

I've been working with Nutch for a while now, and can more-or-less make it do what I want, but certain things are still not clear to me.
In short, I am interested in crawling a portion of the web and adding my own custom fields to the documents that I retrieve. I use a small list of seed URLs to bootstrap the crawl, then restrict the crawl based on information I have in my own database. The value in my custom field also comes from my database. I've written a URL filter to direct the crawl as it leaves the bootstrap URL list. I also have an indexing filter to add my fields to each document fetched (a stripped-down sketch is at the end of this message), and query filters to allow me to search on the custom fields. The wiki examples were indispensable for getting me this far, and thanks go out to those contributors.

Now I'm at the point where I would like to add to my crawl with a new set of seed URLs. Using a variation on the recrawl script on the wiki, I can make this happen, but I am running into what is, for me, a showstopper issue. The custom fields I added to the documents of the first crawl are lost when the documents from the second crawl are added to the index. It appears this is because, rather than merging the new documents into the existing index, Nutch is creating an entirely new index and expecting my indexing filter to add the custom fields all over again. I could do this, but as my index grows to millions of pages, I suspect this approach will not scale. Eventually I will want to be doing many small crawls each day.

So my question: Is it possible to add documents to an index without having to recreate the entire index? I have not been able to find examples of this in the documentation, mailing lists, or wiki. If that's not possible, how do I preserve my custom fields, and how can recreating the index with every single document addition scale to millions of pages?

My apologies if this is already documented someplace; if so, I would appreciate a pointer to that documentation.

Thanks,
Charlie
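
P.S. In case it helps to see what I mean by "custom fields", below is a stripped-down sketch of the sort of indexing filter I'm describing. The class name, the field name, and the database lookup are just placeholders, and I'm writing against the Lucene-Document-based IndexingFilter interface from the wiki plugin example, so treat this as illustrative rather than my exact code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

// Illustrative sketch: adds one custom field whose value is looked up
// from our own database, keyed by the page URL.
public class CustomFieldIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public Document filter(Document doc, Parse parse, Text url,
                         CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // lookupCustomValue is a stand-in for the real database query.
    String value = lookupCustomValue(url.toString());
    if (value != null) {
      // Stored and un-tokenized so the value can be searched exactly
      // and returned in search results.
      doc.add(new Field("customfield", value, Field.Store.YES, Field.Index.UN_TOKENIZED));
    }
    return doc;
  }

  // Placeholder for the real lookup against our database.
  private String lookupCustomValue(String url) {
    return null;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

This filter works fine on the initial crawl; the problem is only that re-indexing after a second crawl means the lookup has to run again for every document already in the index.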
