They're separate pieces of code, so I think it should be OK.
WebDBInjector.java and UpdateDatabaseTool.java each do their own
webdb manipulations.

Howie

>It just came to my mind, just to make sure (I don't have the code at
>hand): updatedb uses a different portion of the code, right? Otherwise
>we might re-crawl URLs we just fetched, because links are found that
>point to URLs we just fetched :-)
>
>
>Regards,
>  Stefan
>
>Howie Wang wrote:
> > If you don't mind changing the source a little, I would change
> > the org.apache.nutch.db.WebDBInjector.java file so that
> > when you try to inject a URL that is already there, it will update
> > its next fetch date so that it gets fetched during the next
> > crawl.
> >
> > In WebDBInjector.java in the addPage method, change:
> >
> >  dbWriter.addPageIfNotPresent(page);
> >
> > to:
> >
> >  dbWriter.addPageWithScore(page);
> >
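> > In context, the change is a one-line swap. A minimal sketch of the
> > relevant lines (the surrounding code here is approximate and may not
> > match your copy of the 0.7 source exactly):
> >
> >   // Inside WebDBInjector.addPage(...); sketch only, "score" stands
> >   // in for whatever score value the stock injector uses.
> >   Page page = new Page(url, score);
> >
> >   // Before: a URL already in the webdb is left untouched, so its
> >   // next-fetch date stays in the future:
> >   //dbWriter.addPageIfNotPresent(page);
> >
> >   // After: the page record is (re)written with a fresh score and a
> >   // reset next-fetch date, so the next generate/fetch cycle picks
> >   // the URL up again:
> >   dbWriter.addPageWithScore(page);
> >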
> > Every day you can take your list of changed/deleted URLs and do:
> >
> >    bin/nutch inject mynutchdb/db -urlfile my_changed_urls.txt
> >
> > Then do your crawl as usual. The updated pages will be refetched.
> > Nutch will also attempt to refetch the deleted pages; those fetches
> > will error out, and the pages will be removed from the index.
> >
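> > Putting it together, a daily cron job could look roughly like this
> > (a sketch assuming the Nutch 0.7 command layout and the "mynutchdb"
> > paths above; adjust to your installation):
> >
> >   #!/bin/sh
> >   # Re-inject the changed/deleted URLs, then run one crawl cycle.
> >   bin/nutch inject mynutchdb/db -urlfile my_changed_urls.txt
> >   bin/nutch generate mynutchdb/db mynutchdb/segments
> >   seg=`ls -d mynutchdb/segments/2* | tail -1`   # newest segment
> >   bin/nutch fetch $seg
> >   bin/nutch updatedb mynutchdb/db $seg
> >   bin/nutch index $seg
> >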
> > You could also set your db.default.fetch.interval parameter to
> > something longer than the default 30 days if you are sure you know
> > which pages are changing.
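> >
> > For example, in conf/nutch-site.xml (the interval is in days, and
> > 30 is the default; 60 here is just an illustration):
> >
> >   <property>
> >     <name>db.default.fetch.interval</name>
> >     <value>60</value>
> >   </property>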
> >
> > Howie
> >
> >> In my tests, I index ~60k documents, and the process takes several
> >> hours. I plan on having about half a million documents indexed
> >> eventually, and I suspect it'll take more than 24 hours to recrawl
> >> and reindex with my hardware, so I'm concerned.
> >>
> >> I *know* which documents I want to reindex or remove. It's going to
> >> be a very small subset compared to the whole group (I imagine around
> >> 1000 pages). That's why I desperately want to be able to give Nutch
> >> a list of documents.
> >>
> >> Ben
> >>
> >> On 6/8/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> >>>
> >>> Just recrawl and reindex every day. That is the simple answer.
> >>> The more complex answer is that you need to write custom code that
> >>> deletes documents from your index and crawl db.
> >>> If you don't want to learn the complete internals of Nutch, just
> >>> recrawl and reindex. :)
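> >>>
> >>> For the index half of that custom code, a hypothetical standalone
> >>> helper (not part of Nutch) could delete a list of URLs using the
> >>> Lucene 1.4 API that Nutch 0.7 bundles; "url" is the field name
> >>> Nutch indexes under, and removing the entries from the webdb
> >>> would still need separate handling:
> >>>
> >>>   import java.io.BufferedReader;
> >>>   import java.io.FileReader;
> >>>   import org.apache.lucene.index.IndexReader;
> >>>   import org.apache.lucene.index.Term;
> >>>
> >>>   public class DeleteUrls {
> >>>     public static void main(String[] args) throws Exception {
> >>>       // args[0] = index directory, args[1] = file of URLs
> >>>       IndexReader reader = IndexReader.open(args[0]);
> >>>       BufferedReader in = new BufferedReader(new FileReader(args[1]));
> >>>       for (String url = in.readLine(); url != null; url = in.readLine()) {
> >>>         reader.delete(new Term("url", url));  // delete by URL term
> >>>       }
> >>>       in.close();
> >>>       reader.close();  // flushes the deletions to disk
> >>>     }
> >>>   }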
> >>>
> >>> Stefan
> >>> On 06.06.2006, at 19:42, Benjamin Higgins wrote:
> >>>
> >>> > Hello,
> >>> >
> >>> > I'm trying to get Nutch suitable for use on our (extensive)
> >>> > intranet. One problem I'm trying to solve is how best to tell
> >>> > Nutch to either reindex or remove a URL from the index. I have a
> >>> > lot of pages that get changed, added and removed daily, and I'd
> >>> > prefer to have the changes reflected in Nutch's index immediately.
> >>> >
> >>> > I am able to generate a list of URLs that have changed or have
> >>> > been removed, so I definitely do not need to reindex everything;
> >>> > I just need a way to pass this list on to Nutch.
> >>> >
> >>> > How can I do this?
> >>> >
> >>> > Ben
