Hi,

one thing just occurred to me, and I want to make sure (I don't have the code at hand): updatedb uses a different code path than the injector, right? Otherwise, with the change below, we might re-crawl URLs we have just fetched whenever new links pointing to them are found :-)

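To make the concern concrete, here is a toy sketch in plain Java. It is not Nutch code: `ToyPageDb`, `addIfNotPresent`, `addOrOverwrite` and `isDue` are made-up names, and it simply assumes that an overwriting add resets the next fetch date (as described for the injector change below), while a non-overwriting add leaves an existing entry alone.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for a page database; NOT Nutch's WebDB implementation. */
class ToyPageDb {
    // url -> day on which the page is next due for fetching (hypothetical representation)
    private final Map<String, Long> nextFetchDay = new HashMap<>();

    /** Insert only when the URL is unknown; an existing entry keeps its schedule. */
    void addIfNotPresent(String url, long day) {
        nextFetchDay.putIfAbsent(url, day);
    }

    /** Insert or overwrite; an existing entry gets the new schedule. */
    void addOrOverwrite(String url, long day) {
        nextFetchDay.put(url, day);
    }

    /** A known page is due when its scheduled day is today or earlier. */
    boolean isDue(String url, long today) {
        Long day = nextFetchDay.get(url);
        return day != null && day <= today;
    }

    public static void main(String[] args) {
        long today = 100;
        String url = "http://intranet/a.html";
        ToyPageDb db = new ToyPageDb();

        // Page was just fetched and is scheduled again in 30 days.
        db.addOrOverwrite(url, today + 30);

        // A link to the same page is discovered: a non-overwriting add keeps the schedule.
        db.addIfNotPresent(url, today);
        System.out.println(db.isDue(url, today)); // false

        // The same URL is injected through an overwriting add: it becomes due immediately.
        db.addOrOverwrite(url, today);
        System.out.println(db.isDue(url, today)); // true
    }
}
```

If the update step went through the overwriting path, every freshly fetched page that is linked from somewhere would look like the second case. That is why it matters that only the injector uses it.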
Regards,
Stefan

Howie Wang wrote:
> If you don't mind changing the source a little, I would change
> the org.apache.nutch.db.WebDBInjector.java file so that
> when you try to inject a url that is already there, it will update
> its next fetch date so that it will get fetched during the next
> crawl.
>
> In WebDBInjector.java in the addPage method, change:
>
> dbWriter.addPageIfNotPresent(page);
>
> to:
>
> dbWriter.addPageWithScore(page);
>
> Every day you can take your list of changed/deleted urls and do:
>
> bin/nutch inject mynutchdb/db -urlfile my_changed_urls.txt
>
> Then do your crawl as usual. The updated pages will be refetched.
> The deleted pages will attempt to be refetched, but will error out,
> and be removed from the index.
>
> You could also set your db.default.fetch.interval parameter to
> longer than 30 days if you are sure you know what pages are changing.
>
> Howie
>
>> With my tests, I index ~60k documents. This process takes several
>> hours. I plan on having about a half million documents indexed
>> eventually, and I suspect it'll take more than 24 hours to recrawl
>> and reindex with my hardware, so I'm concerned.
>>
>> I *know* which documents I want to reindex or remove. It's going to be a
>> very small subset compared to the whole group (I imagine around 1000
>> pages). That's why I desperately want to be able to give Nutch a list of
>> documents.
>>
>> Ben
>>
>> On 6/8/06, Stefan Groschupf wrote:
>>>
>>> Just recrawl and reindex every day. That was the simple answer.
>>> The more complex answer is you need to write custom code that
>>> deletes documents from your index and crawldb.
>>> If you do not want to completely learn the internals of Nutch, just
>>> recrawl and reindex. :)
>>>
>>> Stefan
>>>
>>> On 06.06.2006 at 19:42, Benjamin Higgins wrote:
>>>
>>> > Hello,
>>> >
>>> > I'm trying to get Nutch suitable to use for our (extensive)
>>> > intranet. One problem I'm trying to solve is how best to tell
>>> > Nutch to either reindex or remove a URL from the index. I have a
>>> > lot of pages that get changed, added and removed daily, and I'd
>>> > prefer to have the changes reflected in Nutch's index immediately.
>>> >
>>> > I am able to generate a list of URLs that have changed or have been
>>> > removed, so I definitely do not need to reindex everything, I just
>>> > need a way to pass this list on to Nutch.
>>> >
>>> > How can I do this?
>>> >
>>> > Ben

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
