Maybe I'll play around with it this weekend.

Howie
>How about making this a command-line option to inject? Could you create an
>improvement patch?
>
>Regards,
> Stefan
>
>Howie Wang wrote:
>>If you don't mind changing the source a little, I would change
>>the org.apache.nutch.db.WebDBInjector.java file so that
>>when you try to inject a url that is already there, it will update
>>its next fetch date so that it will get fetched during the next
>>crawl.
>>
>>In WebDBInjector.java in the addPage method, change:
>>
>>   dbWriter.addPageIfNotPresent(page);
>>
>>to:
>>
>>   dbWriter.addPageWithScore(page);
>>
>>Every day you can take your list of changed/deleted urls and do:
>>
>>   bin/nutch inject mynutchdb/db -urlfile my_changed_urls.txt
>>
>>Then do your crawl as usual. The updated pages will be refetched.
>>The deleted pages will attempt to be refetched, but will error out,
>>and be removed from the index.
>>
>>You could also set your db.default.fetch.interval parameter to
>>longer than 30 days if you are sure you know what pages are changing.
>>
>>Howie
>>
>>>With my tests, I index ~60k documents. This process takes several hours.
>>>I plan on having about half a million documents indexed eventually, and I
>>>suspect it'll take more than 24 hours to recrawl and reindex with my
>>>hardware, so I'm concerned.
>>>
>>>I *know* which documents I want to reindex or remove. It's going to be a
>>>very small subset compared to the whole group (I imagine around 1000
>>>pages). That's why I desperately want to be able to give Nutch a list of
>>>documents.
>>>
>>>Ben
>>>
>>>On 6/8/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>>>>
>>>>Just recrawl and reindex every day. That was the simple answer.
>>>>The more complex answer is that you need to write custom code that
>>>>deletes documents from your index and crawld.
>>>>If you do not want to completely learn the internals of Nutch, just
>>>>recrawl and reindex. :)
>>>>
>>>>Stefan
>>>>
>>>>On 06.06.2006 at 19:42, Benjamin Higgins wrote:
>>>>
>>>> > Hello,
>>>> >
>>>> > I'm trying to get Nutch suitable to use for our (extensive)
>>>> > intranet. One problem I'm trying to solve is how best to tell
>>>> > Nutch to either reindex or remove a URL from the index. I have a
>>>> > lot of pages that get changed, added and removed daily, and I'd
>>>> > prefer to have the changes reflected in Nutch's index immediately.
>>>> >
>>>> > I am able to generate a list of URLs that have changed or have been
>>>> > removed, so I definitely do not need to reindex everything, I just
>>>> > need a way to pass this list on to Nutch.
>>>> >
>>>> > How can I do this?
>>>> >
>>>> > Ben

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
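
For anyone reading the thread later, here is a rough sketch of the behaviour being discussed: Howie's one-line change (always write the page so it becomes due for fetching) versus the stock behaviour (skip URLs already present), selected by a command-line switch as Stefan suggests. It is a toy model only and does not use Nutch classes. The class InjectSketch, the map-backed "db", the method addPageOverwrite, and the flag name -overwrite are all assumptions made up for illustration; the thread does not name a flag, and the real WebDBInjector would need its own argument parsing.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: a hypothetical stand-in for the web database, NOT Nutch's WebDBWriter.
class InjectSketch {
    static final long THIRTY_DAYS_MS = 30L * 24 * 60 * 60 * 1000;

    // url -> time (epoch millis) at which the page is next due for fetching
    final Map<String, Long> nextFetch = new HashMap<>();

    // Models "add only if absent": an existing entry keeps its fetch date.
    void addPageIfNotPresent(String url, long now) {
        nextFetch.putIfAbsent(url, now);
    }

    // Models "always write": an existing entry becomes due immediately.
    void addPageOverwrite(String url, long now) {
        nextFetch.put(url, now);
    }

    // Hypothetical flag name; the thread does not specify one.
    static boolean wantsOverwrite(String[] args) {
        return Arrays.asList(args).contains("-overwrite");
    }

    // Inject a list of URLs using whichever behaviour the flag selects.
    void inject(List<String> urls, boolean overwrite, long now) {
        for (String url : urls) {
            if (overwrite) {
                addPageOverwrite(url, now);
            } else {
                addPageIfNotPresent(url, now);
            }
        }
    }

    // A page is due when it is known and its next fetch time has been reached.
    boolean isDue(String url, long now) {
        Long t = nextFetch.get(url);
        return t != null && t <= now;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        String url = "http://intranet.example/changed.html"; // placeholder URL
        InjectSketch db = new InjectSketch();
        // Pretend the page was fetched recently and is not due for 30 days.
        db.nextFetch.put(url, now + THIRTY_DAYS_MS);
        db.inject(Arrays.asList(url), wantsOverwrite(args), now);
        System.out.println("due for refetch: " + db.isDue(url, now));
    }
}
```

Keeping the default as "add only if absent" means existing inject runs would behave as before, and only a daily run over the changed/deleted URL list would opt in to the overwrite.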
