Re: [Nutch-general] Removing or reindexing a URL?

Stefan Neufeind Fri, 09 Jun 2006 01:11:13 -0700

Hi,

It just occurred to me, and I want to make sure (I don't have the code at
hand): updatedb uses a different code path, right? Otherwise we might
re-crawl URLs we just fetched, because links pointing to them are found
in the pages we fetched :-)



Regards,
 Stefan

Howie Wang wrote:
> If you don't mind changing the source a little, I would change
> the org.apache.nutch.db.WebDBInjector.java file so that
> when you try to inject a url that is already there, it will update
> its next fetch date so that it will get fetched during the next
> crawl.
> 
> In WebDBInjector.java in the addPage method, change:
> 
>  dbWriter.addPageIfNotPresent(page);
> 
> to:
> 
>  dbWriter.addPageWithScore(page);
> 
> Every day you can take your list of changed/deleted urls and do:
> 
>    bin/nutch inject mynutchdb/db -urlfile my_changed_urls.txt
> 
> Then do your crawl as usual. The updated pages will be refetched.
> Nutch will also try to refetch the deleted pages, but those fetches
> will error out and the pages will be removed from the index.
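
A minimal sketch of how the daily step quoted above could be scripted,
assuming the inject command line is exactly as Howie wrote it and that the
WebDBInjector change has already been made. The helper names
write_url_file and inject_command and the example URLs are hypothetical,
not part of Nutch; the script only builds the URL file and the command,
and running it is left commented out.

```python
import subprocess
from pathlib import Path


def write_url_file(urls, path):
    """Write unique, non-empty URLs to path, one per line, keeping order."""
    seen = set()
    unique = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            unique.append(url)
    Path(path).write_text("".join(u + "\n" for u in unique), encoding="utf-8")
    return unique


def inject_command(db_dir, url_file, nutch="bin/nutch"):
    """Build the inject command from the quoted message as an argument list."""
    return [nutch, "inject", str(db_dir), "-urlfile", str(url_file)]


if __name__ == "__main__":
    # Hypothetical input: in practice these lists come from whatever
    # system knows which intranet pages changed or were deleted today.
    changed = ["http://intranet.example/a.html", "http://intranet.example/b.html"]
    deleted = ["http://intranet.example/old.html"]

    write_url_file(changed + deleted, "my_changed_urls.txt")
    cmd = inject_command("mynutchdb/db", "my_changed_urls.txt")
    print(" ".join(cmd))

    # Assumption: run from the Nutch install directory. Uncomment to execute.
    # subprocess.run(cmd, check=True)
```

Changed and deleted URLs go into the same file because, per the message
above, both are handled by the refetch: changed pages are fetched again
and deleted pages fail and drop out of the index.
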
> 
> You could also set your db.default.fetch.interval parameter to
> longer than 30 days if you are sure you know what pages are changing.
> 
> Howie
> 
>> With my tests, I index ~60k documents.  This process takes several
>> hours.  I plan on having about half a million documents indexed
>> eventually, and I suspect it'll take more than 24 hours to recrawl and
>> reindex with my hardware, so I'm concerned.
>>
>> I *know* which documents I want to reindex or remove.  It's going to be a
>> very small subset compared to the whole group (I imagine around 1000
>> pages).  That's why I desperately want to be able to give Nutch a list of
>> documents.
>>
>> Ben
>>
>> On 6/8/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>>>
>>> Just recrawl and reindex every day. That was the simple answer.
>>> The more complex answer is that you need to write custom code that
>>> deletes documents from your index and crawldb.
>>> If you do not want to learn the internals of Nutch completely, just
>>> recrawl and reindex. :)
>>>
>>> Stefan
>>> On 06.06.2006 at 19:42, Benjamin Higgins wrote:
>>>
>>> > Hello,
>>> >
>>> > I'm trying to make Nutch suitable for use on our (extensive)
>>> > intranet.  One problem I'm trying to solve is how best to tell Nutch
>>> > to either reindex or remove a URL from the index.  I have a lot of
>>> > pages that get changed, added and removed daily, and I'd prefer to
>>> > have the changes reflected in Nutch's index immediately.
>>> >
>>> > I am able to generate a list of URLs that have changed or have been
>>> > removed, so I definitely do not need to reindex everything.  I just
>>> > need a way to pass this list on to Nutch.
>>> >
>>> > How can I do this?
>>> >
>>> > Ben
