Thanks for that link (and a note to self: don't ask a question on the list
just before going on vacation...)
Perhaps I don't understand the patch, but it seems that it is only
meant to avoid recrawling content that hasn't changed. It doesn't really
have anything to do with avoiding a rebuild of the entire index when I add
a document, or does it?
Does Nutch have the ability to add to an index without a complete rebuild,
or is a complete rebuild required if I add even a single document?
Furthermore, even if I were to decide that the complete rebuild is
acceptable, Nutch is still discarding my custom fields from all documents
that are not being updated. Why is this happening?
I appreciate the help; thanks.
-Charlie
On 4/14/07, rubdabadub <[EMAIL PROTECTED]> wrote:
Hi Charlie:
On 4/14/07, c wanek <[EMAIL PROTECTED]> wrote:
> Greetings,
>
> Now I'm at the point where I would like to add to my crawl, with a new set
> of seed urls. Using a variation on the recrawl script on the wiki, I can
> make this happen, but I am running into what is, for me, a showstopper
> issue. The custom fields I added to the documents of the first crawl are
> lost when the documents from the second crawl are added to the index.
Nutch is all about writing once. All operations write once; this is how
map-reduce works. This is why incremental crawling is difficult. But :-)
http://issues.apache.org/jira/browse/NUTCH-61
Like you, many others want this to happen. And to the best of my knowledge,
Andrzej Bialecki will be addressing the issue after the 0.9 release, which
is due any time now :-)
So you might give it a go with NUTCH-61, but NOTE that it doesn't work with
the current trunk.
Regards
raj