Hi Charlie:

On 4/14/07, c wanek <[EMAIL PROTECTED]> wrote:
Greetings,

Now I'm at the point where I would like to add to my crawl with a new set
of seed URLs.  Using a variation on the recrawl script on the wiki, I can
make this happen, but I am running into what is, for me, a showstopper
issue.  The custom fields I added to the documents of the first crawl are
lost when the documents from the second crawl are added to the index.

Nutch is all about writing once: every operation writes its output once, because
that is how MapReduce works. This is why incremental crawling is difficult. But :-)
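To make the write-once point concrete, here is a toy sketch (plain Java, not Nutch code; the names indexDocument and myCustomField are made up for illustration). Each indexing pass emits a fresh document per URL, and the new document replaces the old one outright instead of being merged with it, so any field that isn't re-emitted at index time is lost:

```java
import java.util.HashMap;
import java.util.Map;

public class WriteOnceIndexSketch {

    // The "index": URL -> document fields.
    static Map<String, Map<String, String>> index = new HashMap<>();

    // Write-once semantics: the new document replaces the old one outright;
    // nothing from the previous version is carried over.
    static void indexDocument(String url, Map<String, String> fields) {
        index.put(url, fields);
    }

    public static void main(String[] args) {
        // First crawl: document carries a hand-added custom field.
        Map<String, String> first = new HashMap<>();
        first.put("content", "hello");
        first.put("myCustomField", "keep-me");
        indexDocument("http://example.com/", first);

        // Second crawl: the indexer regenerates the document from the new
        // segment; only fields produced during this pass are present.
        Map<String, String> second = new HashMap<>();
        second.put("content", "hello v2");
        indexDocument("http://example.com/", second);

        // The custom field is gone; it was never re-emitted.
        System.out.println(
            index.get("http://example.com/").containsKey("myCustomField"));
    }
}
```

The practical upshot is that custom fields need to be re-added at index time on every crawl (e.g. by an indexing plugin), rather than patched into the index after the fact.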

http://issues.apache.org/jira/browse/NUTCH-61

Like you, many others want this to happen. To the best of my knowledge,
Andrzej Bialecki will be addressing the issue after the 0.9 release, which is
due any time now :-)

So you might give NUTCH-61 a go, but NOTE that it doesn't work with the
current trunk.

Regards
raj
