Greetings,

I've been working with Nutch for a while now, and can more-or-less make it do what I want, but certain things are still not clear to me.
In short, I am interested in crawling a portion of the web and adding my own custom fields to the documents that I retrieve. I use a small list of seed URLs to bootstrap the crawl, then restrict the crawl based on information I have in my own database. The value in my custom field also comes from my database. I've written a URL filter to direct the crawl as it leaves the bootstrap URL list. I also have an indexing filter to add my fields to each document fetched (a stripped-down sketch is at the end of this message), and query filters to allow me to search on the custom fields. The wiki examples were indispensable for getting me this far, and thanks go out to those contributors.

Now I'm at the point where I would like to add to my crawl with a new set of seed URLs. Using a variation on the recrawl script on the wiki, I can make this happen, but I am running into what is, for me, a showstopper issue. The custom fields I added to the documents of the first crawl are lost when the documents from the second crawl are added to the index. It appears this is because, rather than merging the new documents into the existing index, Nutch is creating an entirely new index and expecting my indexing filter to add the custom fields all over again. I could do this, but as my index grows to millions of pages, I suspect this approach will not scale. Eventually I will want to be doing many small crawls each day.

So my question: Is it possible to add documents to an index without having to recreate the entire index? I have not been able to find examples of this in the documentation, mailing lists, or wiki. If that's not possible, how do I preserve my custom fields, and how can recreating the index with every single document addition scale to millions of pages?

My apologies if this is already documented someplace; if so, I would appreciate a pointer to that documentation.

Thanks,
Charlie
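
P.S. In case it helps to see what I mean by "custom fields", below is a stripped-down sketch of the sort of indexing filter I'm describing. The class name, the field name, and the database lookup are just placeholders, and I'm writing against the Lucene-Document-based IndexingFilter interface from the wiki plugin example, so treat this as illustrative rather than my exact code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

// Illustrative sketch: adds one custom field whose value is looked up
// from our own database, keyed by the page URL.
public class CustomFieldIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public Document filter(Document doc, Parse parse, Text url,
                         CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // lookupCustomValue is a stand-in for the real database query.
    String value = lookupCustomValue(url.toString());
    if (value != null) {
      // Stored and un-tokenized so the value can be searched exactly
      // and returned in search results.
      doc.add(new Field("customfield", value, Field.Store.YES, Field.Index.UN_TOKENIZED));
    }
    return doc;
  }

  // Placeholder for the real lookup against our database.
  private String lookupCustomValue(String url) {
    return null;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

This filter works fine on the initial crawl; the problem is only that re-indexing after a second crawl means the lookup has to run again for every document already in the index.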
