Ben,
you can remove pages from the index (prune tool) and maybe generate
a segment that contains just the updated or new URLs, then merge
these back into the index.
However, this is a manual / shell-script kind of thing and takes
some time to do each day.
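That daily shell script might look roughly like the sketch below. This is an assumption-laden outline, not a tested recipe: the tool names and flags (PruneIndexTool, its -queries file, the generate/fetch/updatedb/index/dedup/merge subcommands) are recalled from the 0.7-era command line and should be checked against `bin/nutch` usage on your version. The nutch() stub just prints each step so the plan can be inspected; replace its body with the real bin/nutch to execute it.

```shell
#!/bin/sh
# Dry-run stub: prints each step instead of running it.
# Replace the body with:  bin/nutch "$@"  to actually execute.
nutch() { echo "bin/nutch $*"; }

SEG=segments/20060609-updates   # hypothetical segment for the changed URLs

PLAN=$(
  # 1. prune the changed/removed pages out of the live index
  #    (query file format is an assumption -- one Lucene query per line)
  nutch org.apache.nutch.tools.PruneIndexTool index -queries prune-queries.txt
  # 2. fetch and index just the updated URLs into a fresh segment
  nutch generate db segments
  nutch fetch "$SEG"
  nutch updatedb db "$SEG"
  nutch index "$SEG"
  # 3. dedup and merge the new per-segment index back in
  nutch dedup segments dedup.tmp
  nutch merge index segments/*/index
)
echo "$PLAN"
```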
Try to improve the configuration of your system. 40 pages/sec
fetching and 1000+ pages/sec indexing should be possible on a
"normal" box today.
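Fetch throughput in particular is governed by a few properties that can be overridden in conf/nutch-site.xml. A hedged example follows: the property names are as they appear in nutch-default.xml, but the values are purely illustrative and assume an intranet you control (so a lower politeness delay is acceptable).

```xml
<!-- conf/nutch-site.xml overrides; values are illustrative only -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>40</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value>
</property>
```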
Stefan
On 09.06.2006 at 01:30, Benjamin Higgins wrote:
With my tests, I index ~60k documents. This process takes several
hours. I plan on having about half a million documents indexed
eventually, and I suspect it'll take more than 24 hours to recrawl
and reindex with my hardware, so I'm concerned.
I *know* which documents I want to reindex or remove. It's going
to be a
very small subset compared to the whole group (I imagine around 1000
pages). That's why I desperately want to be able to give Nutch a
list of
documents.
Ben
On 6/8/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
Just recrawl and reindex every day. That was the simple answer.
The more complex answer is that you need to write custom code that
deletes documents from your index and crawldb.
If you don't want to completely learn the internals of Nutch, just
recrawl and reindex. :)
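The simple answer fits in a nightly cron job; a minimal sketch, assuming the 0.7-style `bin/nutch crawl` command (the seed directory name, output directory pattern, and -depth value are all illustrative, and the stub only prints the command rather than running it):

```shell
#!/bin/sh
# Dry-run stub: prints the command; swap in the real bin/nutch to run it.
nutch() { echo "bin/nutch $*"; }

# Recrawl the whole intranet into a fresh, dated directory each night;
# "urls" is the assumed seed-list directory and -depth 3 is illustrative.
CRAWL_CMD=$(nutch crawl urls -dir "crawl.$(date +%Y%m%d)" -depth 3)
echo "$CRAWL_CMD"
```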
Stefan
On 06.06.2006 at 19:42, Benjamin Higgins wrote:
> Hello,
>
> I'm trying to get Nutch suitable for use on our (extensive)
> intranet. One
> problem I'm trying to solve is how best to tell Nutch to either
> reindex or
> remove a URL from the index. I have a lot of pages that get
> changed, added
> and removed daily, and I'd prefer to have the changes reflected in
> Nutch's
> index immediately.
>
> I am able to generate a list of URLs that have changed or have been
> removed, so I definitely do not need to reindex everything; I just
> need a way to pass this list on to Nutch.
>
> How can I do this?
>
> Ben