Re: [Nutch-general] incremental crawling

c wanek Wed, 18 Apr 2007 11:50:58 -0700

Thanks Raj,

First of all, here's some info I didn't include in the original question:
I'm using Nutch 0.9, and my attempt to add to my index is basically a
variation of the recrawl script at
http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03 :


inject new urls
fetch loop to "depth":
{
  generate
  fetch
  update db
}
merge segments
invert links
index
dedup
merge indexes

(I wind up with the merged index in 'index/merge-output', instead of
'index', but I thought perhaps I could deal with that weirdness when my
index has the stuff I want...)
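
For concreteness, the steps listed above could be driven roughly as in the sketch below. This is only an illustration, not the wiki script itself: the helper names (recrawl, newest_segment) and the directory names under the crawl directory are hypothetical, and the bin/nutch subcommands and argument orders are what I understand Nutch 0.9 to accept, so check them against the usage message each command prints before relying on this.

```python
import subprocess
from pathlib import Path


def newest_segment(segments_dir):
    """Return the newest segment directory (segment names are timestamps, so they sort)."""
    return max(p for p in Path(segments_dir).iterdir() if p.is_dir())


def recrawl(crawl_dir, seed_dir, depth, nutch="bin/nutch", run=subprocess.check_call):
    """Run inject, the fetch loop, then merge/invert/index/dedup/merge, in that order.

    All directory names below are assumptions chosen for this sketch.
    """
    crawl = Path(crawl_dir)
    crawldb = crawl / "crawldb"
    segments = crawl / "segments"
    merged = crawl / "merged-segments"   # hypothetical output of mergesegs
    linkdb = crawl / "linkdb"
    new_indexes = crawl / "new-indexes"  # hypothetical per-segment indexes
    index = crawl / "index"

    def nutch_cmd(*args):
        # Every step is a separate bin/nutch invocation.
        run([nutch, *map(str, args)])

    nutch_cmd("inject", crawldb, seed_dir)            # inject new urls
    for _ in range(depth):                            # fetch loop to "depth"
        nutch_cmd("generate", crawldb, segments)
        segment = newest_segment(segments)            # the segment just generated
        nutch_cmd("fetch", segment)
        nutch_cmd("updatedb", crawldb, segment)
    nutch_cmd("mergesegs", merged, "-dir", segments)  # merge segments
    nutch_cmd("invertlinks", linkdb, "-dir", merged)  # invert links
    merged_segments = sorted(p for p in merged.iterdir() if p.is_dir())
    nutch_cmd("index", new_indexes, crawldb, linkdb, *merged_segments)
    nutch_cmd("dedup", new_indexes)
    nutch_cmd("merge", index, new_indexes)            # merge indexes
```

Calling recrawl("crawl", "urls", depth=3) from the Nutch install directory would run the whole sequence. Passing a different run callable (for example one that only prints each command) gives a dry run, provided it also creates the directories the later steps list. Note that the sketch does not handle a generate step that produces no new segment.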



Now, perhaps I don't understand the patch you pointed me to, but it
seems that it is only meant to avoid recrawling content that hasn't
changed.  It doesn't really have to do with avoiding a rebuild of the entire
index if I add a document.  Or does it, and I just missed it?

Does Nutch have the ability to add to an index without a complete rebuild,
or is a complete rebuild required if I add even a single document?

Furthermore, even if I were to decide that the complete rebuild is
acceptable, Nutch is still discarding my custom fields from all documents
that are not being updated.  Why is this happening?

I appreciate the help; thanks.
-Charlie


On 4/14/07, rubdabadub <[EMAIL PROTECTED]> wrote:

Hi Charlie:

On 4/14/07, c wanek <[EMAIL PROTECTED]> wrote:
> Greetings,
>
> Now I'm at the point where I would like to add to my crawl, with a new set
> of seed urls.  Using a variation on the recrawl script on the wiki, I can
> make this happen, but I am running into what is, for me, a showstopper
> issue.  The custom fields I added to the documents of the first crawl are
> lost when the documents from the second crawl are added to the index.

Nutch is all about writing once. All operations write once; this is how
map-reduce works. This is why incremental crawling is difficult. But :-)

http://issues.apache.org/jira/browse/NUTCH-61

Like you, many others want this to happen. To the best of my knowledge,
Andrzej Bialecki will be addressing the issue after the 0.9 release, which
is due any time now :-)

So you might give it a go with NUTCH-61, but NOTE that it doesn't work with
the current trunk.

Regards
raj
