Thanks for that link (and a note to self: don't ask a question on the list just before going on vacation...)
Perhaps I don't understand the patch, but it seems to be meant only to avoid recrawling content that hasn't changed. It doesn't really address avoiding a rebuild of the entire index when I add a document; or does it? Does Nutch have the ability to add to an index without a complete rebuild, or is a complete rebuild required even if I add a single document?

Furthermore, even if I were to decide that a complete rebuild is acceptable, Nutch is still discarding my custom fields from all documents that are not being updated. Why is this happening?

I appreciate the help; thanks.

-Charlie

On 4/14/07, rubdabadub <[EMAIL PROTECTED]> wrote:
Hi Charlie:

On 4/14/07, c wanek <[EMAIL PROTECTED]> wrote:
> Greetings,
>
> Now I'm at the point where I would like to add to my crawl, with a new
> set of seed urls. Using a variation on the recrawl script on the wiki,
> I can make this happen, but I am running into what is, for me, a
> showstopper issue. The custom fields I added to the documents of the
> first crawl are lost when the documents from the second crawl are added
> to the index.

Nutch is all about writing once. All operations write once; this is how
MapReduce works. This is why incremental crawling is difficult. But :-)

http://issues.apache.org/jira/browse/NUTCH-61

Like you, many others want this to happen. And to the best of my knowledge
Andrzej Bialecki will be addressing the issue after the 0.9 release, which
is due any time now :-) So you might give NUTCH-61 a go, but NOTE that it
doesn't work with the current trunk.

Regards
raj
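A note on the disappearing custom fields: every time the Nutch indexer runs, it rebuilds each Lucene document from the segment data, so anything not produced by an IndexingFilter at index time is dropped. If the fields were written straight into the Lucene index by hand, a re-index loses them; adding them through an indexing plugin makes them come back on every run. Below is a rough sketch against what I believe is the 0.9-era plugin interface; the class name and the "mysite" field are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

// Hypothetical plugin: re-adds a custom field every time the indexer
// runs, so the field survives a re-index.
public class CustomFieldFilter implements IndexingFilter {

  private Configuration conf;

  public Document filter(Document doc, Parse parse, Text url,
                         CrawlDatum datum, Inlinks inlinks)
      throws IndexingException {
    // "mysite" is a made-up field name; derive the real value however
    // you like (e.g. from the URL or from parse metadata).
    doc.add(new Field("mysite", url.toString(),
                      Field.Store.YES, Field.Index.UN_TOKENIZED));
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}

The usual plugin.xml and plugin.includes wiring is omitted here.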

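On the complete-rebuild question: the index Nutch produces is an ordinary Lucene index, so at the Lucene level new documents can be merged into an existing index without re-indexing everything; as far as I know this is roughly what Nutch's own IndexMerger does. A minimal sketch against the Lucene 2.x API that ships with that era of Nutch, with made-up paths:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Merges a freshly built per-segment index into an existing index
// instead of rebuilding the whole thing. Paths are made up.
public class MergeNewSegment {
  public static void main(String[] args) throws Exception {
    // false = open the existing index for appending, don't recreate it
    IndexWriter writer =
        new IndexWriter("crawl/index", new StandardAnalyzer(), false);

    Directory fresh =
        FSDirectory.getDirectory("crawl/indexes/new-segment", false);
    writer.addIndexes(new Directory[] { fresh });
    writer.close();
  }
}

Note this only avoids re-indexing the old segments; documents that Nutch does re-index are still rebuilt from scratch, which is why attaching custom fields through an IndexingFilter (as sketched above) matters either way.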