Ben,
you can remove pages from the index (prune tool) and maybe generate
a segment that contains just the updated or new URLs, then merge
these back into the index.
However, this is a manual / shell-script kind of thing, and it takes
some time to do each day.
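
As a rough illustration of the bookkeeping behind that manual approach, here is a minimal Python sketch. It only splits the two URL lists Ben says he can produce into "what to prune" and "what to refetch"; it does not call Nutch. The function name `plan_update` and the example URLs are hypothetical, and how the resulting lists are handed to the prune tool or to segment generation is deliberately left out.

```python
def plan_update(changed, removed):
    """Split URL lists into (urls_to_prune, urls_to_refetch).

    Hypothetical helper for illustration; not part of Nutch.
    """
    changed = {u.strip() for u in changed if u.strip()}
    removed = {u.strip() for u in removed if u.strip()}
    # A URL listed as both changed and removed is treated as removed.
    to_fetch = sorted(changed - removed)
    # Assumption: stale copies of changed pages are pruned as well,
    # so the old version is gone before the new segment is merged in.
    to_prune = sorted(changed | removed)
    return to_prune, to_fetch


if __name__ == "__main__":
    # Example URLs are made up.
    prune, fetch = plan_update(
        ["http://intranet.example/a.html", "http://intranet.example/b.html"],
        ["http://intranet.example/c.html"],
    )
    print("prune:", prune)
    print("fetch:", fetch)
```

The two lists would then feed the prune step and the small "updated or new URLs" segment respectively.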
Try to improve the configuration of your system. Fetching 40 pages / sec
and indexing 1000+ pages / sec should be possible on a
"normal" box today.
Stefan
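
To put those rates next to Ben's numbers, a small back-of-the-envelope calculation, assuming the half million documents from his message and the sustained rates quoted above (real throughput will vary with politeness delays, network and hardware). The function name `hours_needed` is only illustrative.

```python
def hours_needed(pages, pages_per_sec):
    """Hours needed to process `pages` at a steady rate of `pages_per_sec`."""
    return pages / pages_per_sec / 3600


if __name__ == "__main__":
    total_pages = 500_000  # "about a half million documents"
    fetch_h = hours_needed(total_pages, 40)    # fetching at 40 pages / sec
    index_h = hours_needed(total_pages, 1000)  # indexing at 1000 pages / sec
    print(f"fetch: {fetch_h:.2f} h, index: {index_h * 60:.1f} min")
```

At these rates a full recrawl comes to roughly three and a half hours of fetching plus under ten minutes of indexing, well inside a 24-hour window.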

On 09.06.2006 at 01:30, Benjamin Higgins wrote:

> With my tests, I index ~60k documents.  This process takes several
> hours.  I plan on having about half a million documents indexed
> eventually, and I suspect it'll take more than 24 hours to recrawl
> and reindex with my hardware, so I'm concerned.
>
> I *know* which documents I want to reindex or remove.  It's going to
> be a very small subset compared to the whole group (I imagine around
> 1000 pages).  That's why I desperately want to be able to give Nutch
> a list of documents.
>
> Ben
>
> On 6/8/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>>
>> Just recrawl and reindex every day. That was the simple answer.
>> The more complex answer is that you need to write custom code that
>> deletes documents from your index and crawldb.
>> If you do not want to learn the internals of Nutch completely, just
>> recrawl and reindex. :)
>>
>> Stefan
>> On 06.06.2006 at 19:42, Benjamin Higgins wrote:
>>
>> > Hello,
>> >
>> > I'm trying to get Nutch suitable to use for our (extensive)
>> > intranet.  One problem I'm trying to solve is how best to tell
>> > Nutch to either reindex or remove a URL from the index.  I have a
>> > lot of pages that get changed, added and removed daily, and I'd
>> > prefer to have the changes reflected in Nutch's index immediately.
>> >
>> > I am able to generate a list of URLs that have changed or have
>> > been removed, so I definitely do not need to reindex everything; I
>> > just need a way to pass this list on to Nutch.
>> >
>> > How can I do this?
>> >
>> > Ben
>>
>>



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
