Ben,

you can remove pages from the index (with the prune tool), and maybe generate a segment that contains just the updated or new URLs and then merge it back into the index. However, this is a manual / shell-script kind of thing, and it takes some time to do each day.

Try to improve the configuration of your system. Fetching 40 pages/sec and indexing 1000+ pages/sec should be possible on a "normal" box today.

Stefan
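
A minimal sketch of the preparation step for the manual workflow described above, assuming you already have two plain lists of URLs (changed or new, and removed). It only plans the work: which index entries to prune and which URLs to fetch into a fresh segment. It does not call Nutch itself. The function names and the `url:"..."` query form are illustrative assumptions, not a confirmed Nutch interface, so check them against your index before feeding them to any tool.

```python
from typing import Iterable, List, Tuple


def plan_update(changed: Iterable[str],
                removed: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (prune_queries, refetch_urls) for one incremental update.

    Hypothetical helper: the query syntax below is an assumption about
    how the index can be addressed by URL, not a documented Nutch format.
    """
    changed_set = {u.strip() for u in changed if u.strip()}
    removed_set = {u.strip() for u in removed if u.strip()}

    # Stale copies must leave the index: removed pages and the old
    # versions of changed pages.
    to_prune = sorted(changed_set | removed_set)

    # Only pages that still exist are fetched again into a new segment.
    to_fetch = sorted(changed_set - removed_set)

    prune_queries = []
    for url in to_prune:
        # Escape backslashes and quotes so the URL stays one quoted phrase.
        escaped = url.replace("\\", "\\\\").replace('"', '\\"')
        prune_queries.append('url:"{}"'.format(escaped))
    return prune_queries, to_fetch


def estimate_hours(pages: int, pages_per_sec: float) -> float:
    """Time in hours to process `pages` at a steady `pages_per_sec`."""
    return pages / pages_per_sec / 3600.0
```

With the rates mentioned above, `estimate_hours(500000, 40)` comes to roughly 3.5 hours for a full fetch of half a million pages, while a 1000-page update at the same rate is about 25 seconds of fetching.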

On 09.06.2006 at 01:30, Benjamin Higgins wrote:

> With my tests, I index ~60k documents. This process takes several
> hours. I plan on having about half a million documents indexed
> eventually, and I suspect it'll take more than 24 hours to recrawl
> and reindex with my hardware, so I'm concerned.
>
> I *know* which documents I want to reindex or remove. It's going to
> be a very small subset compared to the whole group (I imagine around
> 1000 pages). That's why I desperately want to be able to give Nutch a
> list of documents.
>
> Ben
>
> On 6/8/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>>
>> Just recrawl and reindex every day. That was the simple answer.
>> The more complex answer is that you need to write custom code that
>> deletes documents from your index and crawldb.
>> If you do not want to learn the internals of Nutch completely, just
>> recrawl and reindex. :)
>>
>> Stefan
>>
>> On 06.06.2006 at 19:42, Benjamin Higgins wrote:
>>
>> > Hello,
>> >
>> > I'm trying to get Nutch suitable to use for our (extensive)
>> > intranet. One problem I'm trying to solve is how best to tell
>> > Nutch to either reindex or remove a URL from the index. I have a
>> > lot of pages that get changed, added and removed daily, and I'd
>> > prefer to have the changes reflected in Nutch's index immediately.
>> >
>> > I am able to generate a list of URLs that have changed or have
>> > been removed, so I definitely do not need to reindex everything,
>> > I just need a way to pass this list on to Nutch.
>> >
>> > How can I do this?
>> >
>> > Ben
>>

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
