Thanks for that link (and a note to self: don't ask a question on the list just before going on vacation...)
Perhaps I don't understand the patch, but it seems to be meant only to avoid recrawling content that hasn't changed. It doesn't really address avoiding a rebuild of the entire index when I add a document; or does it? Does Nutch have the ability to add to an index without a complete rebuild, or is a complete rebuild required even if I add a single document?

Furthermore, even if I were to decide that a complete rebuild is acceptable, Nutch is still discarding my custom fields from all documents that are not being updated. Why is this happening?

I appreciate the help; thanks.

-Charlie

On 4/14/07, rubdabadub <[EMAIL PROTECTED]> wrote:
Hi Charlie:

On 4/14/07, c wanek <[EMAIL PROTECTED]> wrote:
> Greetings,
>
> Now I'm at the point where I would like to add to my crawl, with a new
> set of seed urls. Using a variation on the recrawl script on the wiki,
> I can make this happen, but I am running into what is, for me, a
> showstopper issue. The custom fields I added to the documents of the
> first crawl are lost when the documents from the second crawl are added
> to the index.

Nutch is all about writing once. All operations write once; this is how
MapReduce works. This is why incremental crawling is difficult. But :-)

http://issues.apache.org/jira/browse/NUTCH-61

Like you, many others want this to happen. And to the best of my knowledge
Andrzej Bialecki will be addressing the issue after the 0.9 release, which
is due any time now :-) So you might give NUTCH-61 a go, but NOTE that it
doesn't work with the current trunk.

Regards
raj
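A note on the disappearing custom fields: every time the Nutch indexer runs, it rebuilds each Lucene document from the segment data, so anything not produced by an IndexingFilter at index time is dropped. If the fields were written straight into the Lucene index by hand, a re-index loses them; adding them through an indexing plugin makes them come back on every run. Below is a rough sketch against what I believe is the 0.9-era plugin interface; the class name and the "mysite" field are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

// Hypothetical plugin: re-adds a custom field every time the indexer
// runs, so the field survives a re-index.
public class CustomFieldFilter implements IndexingFilter {

  private Configuration conf;

  public Document filter(Document doc, Parse parse, Text url,
                         CrawlDatum datum, Inlinks inlinks)
      throws IndexingException {
    // "mysite" is a made-up field name; derive the real value however
    // you like (e.g. from the URL or from parse metadata).
    doc.add(new Field("mysite", url.toString(),
                      Field.Store.YES, Field.Index.UN_TOKENIZED));
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}

The usual plugin.xml and plugin.includes wiring is omitted here.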

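On the complete-rebuild question: the index Nutch produces is an ordinary Lucene index, so at the Lucene level new documents can be merged into an existing index without re-indexing everything; as far as I know this is roughly what Nutch's own IndexMerger does. A minimal sketch against the Lucene 2.x API that ships with that era of Nutch, with made-up paths:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Merges a freshly built per-segment index into an existing index
// instead of rebuilding the whole thing. Paths are made up.
public class MergeNewSegment {
  public static void main(String[] args) throws Exception {
    // false = open the existing index for appending, don't recreate it
    IndexWriter writer =
        new IndexWriter("crawl/index", new StandardAnalyzer(), false);

    Directory fresh =
        FSDirectory.getDirectory("crawl/indexes/new-segment", false);
    writer.addIndexes(new Directory[] { fresh });
    writer.close();
  }
}

Note this only avoids re-indexing the old segments; documents that Nutch does re-index are still rebuilt from scratch, which is why attaching custom fields through an IndexingFilter (as sketched above) matters either way.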