Re: [Nutch-general] Removing or reindexing a URL?

Benjamin Higgins Thu, 08 Jun 2006 18:22:19 -0700

Stefan, thank you.  I certainly do not mind writing a shell script or
changing some source.  This is all coming off of one box, so I do worry that
I'd not be able to fit a whole recrawl/reindex in one night once I expand
the crawl to all pages (most are dynamic/drawn from db, and the box is a
little older).


Howie, thanks for this suggestion.  I'm assuming that addPageIfNotPresent
simply checks first (to make sure the page isn't present), and then calls
addPageWithScore.

I'll try what Howie describes and if that doesn't work out I'll write a
script that prunes then injects.

Thanks, I really do appreciate it!

Ben

On 6/8/06, Howie Wang <[EMAIL PROTECTED]> wrote:


If you don't mind changing the source a little, I would change
the org.apache.nutch.db.WebDBInjector.java file so that
when you try to inject a url that is already there, it will update
its next fetch date so that it will get fetched during the next
crawl.

In WebDBInjector.java in the addPage method, change:

  dbWriter.addPageIfNotPresent(page);

to:

  dbWriter.addPageWithScore(page);

Every day you can take your list of changed/deleted urls and do:

    bin/nutch inject mynutchdb/db -urlfile my_changed_urls.txt

Then do your crawl as usual. The updated pages will be refetched.
Nutch will also try to refetch the deleted pages; those fetches will
error out, and the pages will be removed from the index.
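
A minimal sketch of how that daily step might be wrapped in a shell
script. Only the inject command line comes from the example above; the
NUTCH and NUTCH_DB variables, the two function names, and the input file
names changed.txt and deleted.txt are hypothetical and would need
adapting to your setup.

```shell
#!/bin/sh
# Sketch of the daily "inject changed URLs" step.
# Assumptions (not from the thread): NUTCH, NUTCH_DB and both function
# names are hypothetical; only the inject command line is taken from above.

NUTCH="${NUTCH:-bin/nutch}"           # path to the nutch launcher script
NUTCH_DB="${NUTCH_DB:-mynutchdb/db}"  # db directory used in the example above

# Merge one or more URL lists into a single file, dropping blank lines
# and duplicates so that each URL is injected only once.
# Usage: merge_url_lists OUTPUT_FILE INPUT_FILE...
merge_url_lists() {
    out="$1"; shift
    cat "$@" | grep -v '^[[:space:]]*$' | sort -u > "$out"
}

# Inject the URLs listed in the given file; do nothing if it is empty.
# Usage: inject_urls URL_FILE
inject_urls() {
    urlfile="$1"
    if [ ! -s "$urlfile" ]; then
        echo "no changed URLs, skipping inject"
        return 0
    fi
    "$NUTCH" inject "$NUTCH_DB" -urlfile "$urlfile"
}
```

With the functions loaded, the daily run would look something like:

    merge_url_lists my_changed_urls.txt changed.txt deleted.txt
    inject_urls my_changed_urls.txt

followed by the crawl as usual.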

You could also set your db.default.fetch.interval parameter to something
longer than 30 days if you are sure you know which pages are changing.

Howie

>With my tests, I index ~60k documents.  This process takes several hours.  I
>plan on having about a half million documents indexed eventually, and I
>suspect it'll take more than 24 hours to recrawl and reindex with my
>hardware, so I'm concerned.
>
>I *know* which documents I want to reindex or remove.  It's going to be a
>very small subset compared to the whole group (I imagine around 1000
>pages).  That's why I desperately want to be able to give Nutch a list of
>documents.
>
>Ben
>
>On 6/8/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>>
>>Just recrawl and reindex every day. That was the simple answer.
>>The more complex answer is that you need to write custom code that
>>deletes documents from your index and crawldb.
>>If you do not want to learn the internals of Nutch completely, just
>>recrawl and reindex. :)
>>
>>Stefan
>>On 06.06.2006 at 19:42, Benjamin Higgins wrote:
>>
>> > Hello,
>> >
>> > I'm trying to make Nutch suitable for our (extensive) intranet.  One
>> > problem I'm trying to solve is how best to tell Nutch to either reindex or
>> > remove a URL from the index.  I have a lot of pages that get changed, added
>> > and removed daily, and I'd prefer to have the changes reflected in Nutch's
>> > index immediately.
>> >
>> > I am able to generate a list of URLs that have changed or have been removed,
>> > so I definitely do not need to reindex everything; I just need a way to pass
>> > this list on to Nutch.
>> >
>> > How can I do this?
>> >
>> > Ben
>>
>>

