They're separate pieces of code, so I think it should be OK: WebDBInjector.java and UpdateDatabaseTool.java each do their own webdb manipulation.
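For context, a rough sketch of the two call sites in play, assuming the
0.7-era org.apache.nutch.db API (the methods below are illustrative
stand-ins, not the actual Nutch implementations):

    import java.io.IOException;
    import org.apache.nutch.db.IWebDBWriter;
    import org.apache.nutch.db.Page;

    // Illustrative only: why the patched inject path and the updatedb
    // path don't step on each other.
    public class SeparatePaths {

        // Inject path after Howie's change: overwrite the stored entry,
        // resetting the next-fetch date so the URL is refetched on the
        // next generate/fetch cycle.
        static void inject(IWebDBWriter dbWriter, Page page) throws IOException {
            dbWriter.addPageWithScore(page);
        }

        // updatedb path for outlinks discovered during a fetch: add the
        // page only if it isn't already in the webdb, so URLs we just
        // fetched are left alone and not re-scheduled.
        static void recordOutlink(IWebDBWriter dbWriter, Page page) throws IOException {
            dbWriter.addPageIfNotPresent(page);
        }
    }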
Howie

> it just came to my mind, just to make sure (I don't have the code at
> hand): updatedb uses a different portion of the code, right? Otherwise
> we might re-crawl URLs we just fetched, because links are found to
> URLs we just fetched :-)
>
> Regards,
> Stefan
>
> Howie Wang wrote:
> > If you don't mind changing the source a little, I would change
> > the org.apache.nutch.db.WebDBInjector.java file so that
> > when you try to inject a URL that is already there, it will update
> > its next fetch date so that it will get fetched during the next
> > crawl.
> >
> > In WebDBInjector.java, in the addPage method, change:
> >
> >     dbWriter.addPageIfNotPresent(page);
> >
> > to:
> >
> >     dbWriter.addPageWithScore(page);
> >
> > Every day you can take your list of changed/deleted URLs and do:
> >
> >     bin/nutch inject mynutchdb/db -urlfile my_changed_urls.txt
> >
> > Then do your crawl as usual. The updated pages will be refetched.
> > The deleted pages will be refetched as well, but the fetch will
> > error out and they will be removed from the index.
> >
> > You could also set your db.default.fetch.interval parameter to
> > longer than 30 days if you are sure you know which pages are
> > changing.
> >
> > Howie
> >
> >> With my tests, I index ~60k documents. This process takes several
> >> hours. I plan on having about half a million documents indexed
> >> eventually, and I suspect it'll take more than 24 hours to recrawl
> >> and reindex with my hardware, so I'm concerned.
> >>
> >> I *know* which documents I want to reindex or remove. It's going
> >> to be a very small subset compared to the whole group (I imagine
> >> around 1000 pages). That's why I desperately want to be able to
> >> give Nutch a list of documents.
> >>
> >> Ben
> >>
> >> On 6/8/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> >>>
> >>> Just recrawl and reindex every day. That was the simple answer.
> >>> The more complex answer is that you need to write custom code
> >>> that deletes documents from your index and crawl db.
> >>> If you don't want to learn the complete internals of Nutch, just
> >>> recrawl and reindex. :)
> >>>
> >>> Stefan
> >>> Am 06.06.2006 um 19:42 schrieb Benjamin Higgins:
> >>>
> >>> > Hello,
> >>> >
> >>> > I'm trying to get Nutch suitable to use for our (extensive)
> >>> > intranet. One problem I'm trying to solve is how best to tell
> >>> > Nutch to either reindex or remove a URL from the index. I have
> >>> > a lot of pages that get changed, added and removed daily, and
> >>> > I'd prefer to have the changes reflected in Nutch's index
> >>> > immediately.
> >>> >
> >>> > I am able to generate a list of URLs that have changed or have
> >>> > been removed, so I definitely do not need to reindex
> >>> > everything; I just need a way to pass this list on to Nutch.
> >>> >
> >>> > How can I do this?
> >>> >
> >>> > Ben
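As a footnote to Stefan's "more complex answer" quoted above, a minimal
sketch of what such custom deletion code could look like, assuming the
0.7-era IWebDBWriter.deletePage(String) call (the class and method names
here are hypothetical). This only covers the webdb side; removing the
matching documents from the Lucene index has to be handled separately.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import org.apache.nutch.db.IWebDBWriter;

    // Hypothetical helper: drop every URL listed in a file from the webdb.
    public class DeleteUrls {
        public static void deleteAll(IWebDBWriter dbWriter, String urlFile)
                throws IOException {
            BufferedReader in = new BufferedReader(new FileReader(urlFile));
            try {
                String url;
                while ((url = in.readLine()) != null) {
                    url = url.trim();
                    if (url.length() > 0) {
                        dbWriter.deletePage(url);  // queue the page for removal
                    }
                }
            } finally {
                in.close();
                // The caller should close dbWriter; edits are flushed on close().
            }
        }
    }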
