Re: [Nutch-general] Removing or reindexing a URL?

Howie Wang Thu, 08 Jun 2006 21:29:23 -0700

Maybe I'll play around with it this weekend.

Howie



>How about making this a commandline-option to inject? Could you create an 
>improvement-patch?
>
>
>Regards,
>  Stefan
>
>Howie Wang wrote:
>>If you don't mind changing the source a little, I would change
>>the org.apache.nutch.db.WebDBInjector.java file so that
>>when you try to inject a url that is already there, it will update
>>its next fetch date so that it will get fetched during the next
>>crawl.
>>
>>In WebDBInjector.java in the addPage method, change:
>>
>>  dbWriter.addPageIfNotPresent(page);
>>
>>to:
>>
>>  dbWriter.addPageWithScore(page);
>>
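The effect of that one-line change can be pictured with a toy model. The sketch below is not Nutch code: `InjectSketch`, `addIfNotPresent`, `addOverwriting` and `isDue` are made-up names, and the map of URL to next fetch time is only an assumed stand-in for the page database. It shows why an add-if-not-present call leaves an already known URL alone, while an unconditional add resets its next fetch time so the page becomes due again.

```java
import java.util.HashMap;
import java.util.Map;

// Toy in-memory stand-in for the page database. This is NOT Nutch's WebDB;
// it only models the two add behaviours described in the message above.
class InjectSketch {
    // url -> next fetch time in epoch milliseconds (hypothetical layout)
    private final Map<String, Long> nextFetch = new HashMap<>();

    // Models add-if-not-present: an existing entry is left untouched.
    void addIfNotPresent(String url, long fetchTime) {
        nextFetch.putIfAbsent(url, fetchTime);
    }

    // Models an unconditional add: an existing entry is overwritten,
    // so its next fetch time is reset to the injected value.
    void addOverwriting(String url, long fetchTime) {
        nextFetch.put(url, fetchTime);
    }

    // A page is due when it is known and its next fetch time is not in the future.
    boolean isDue(String url, long now) {
        Long t = nextFetch.get(url);
        return t != null && t <= now;
    }

    public static void main(String[] args) {
        long now = 1_000_000L;
        long thirtyDays = 30L * 24 * 60 * 60 * 1000;
        String url = "http://intranet.example/a.html"; // hypothetical URL

        InjectSketch db = new InjectSketch();
        db.addIfNotPresent(url, now + thirtyDays); // already crawled, due in 30 days
        db.addIfNotPresent(url, now);              // re-inject is ignored
        System.out.println(db.isDue(url, now));    // prints "false"

        db.addOverwriting(url, now);               // re-inject overwrites the entry
        System.out.println(db.isDue(url, now));    // prints "true"
    }
}
```
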
>>Every day you can take your list of changed/deleted urls and do:
>>
>>    bin/nutch inject mynutchdb/db -urlfile my_changed_urls.txt
>>
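A minimal sketch of how the daily URL file might be produced, assuming the file passed to -urlfile is plain text with one URL per line. `UrlFileSketch`, `merge` and `write` are hypothetical helper names, not part of Nutch. The changed and deleted lists are merged, blank entries and duplicates are dropped, and the result is written out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Nutch.
class UrlFileSketch {
    // Merges both lists, trims whitespace, drops blank entries and
    // duplicates, and keeps first-seen order.
    static Set<String> merge(List<String> changed, List<String> deleted) {
        Set<String> urls = new LinkedHashSet<>();
        for (List<String> list : List.of(changed, deleted)) {
            for (String u : list) {
                String trimmed = u.trim();
                if (!trimmed.isEmpty()) {
                    urls.add(trimmed);
                }
            }
        }
        return urls;
    }

    // Writes one URL per line (assumed input format for the inject step).
    static void write(Path out, Set<String> urls) throws IOException {
        Files.write(out, urls);
    }
}
```

Writing the merged set to my_changed_urls.txt with `write` would give a file that can be passed to the inject command above.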
>>Then do your crawl as usual. The updated pages will be refetched.
>>Nutch will also try to refetch the deleted pages, but those fetches will
>>error out and the pages will be removed from the index.
>>
>>You could also set your db.default.fetch.interval parameter to
>>longer than 30 days if you are sure you know what pages are changing.
>>
>>Howie
>>
>>>With my tests, I index ~60k documents.  This process takes several hours.
>>>I plan on having about half a million documents indexed eventually, and I
>>>suspect it'll take more than 24 hours to recrawl and reindex with my
>>>hardware, so I'm concerned.
>>>
>>>I *know* which documents I want to reindex or remove.  It's going to be a
>>>very small subset compared to the whole group (I imagine around 1000
>>>pages).  That's why I desperately want to be able to give Nutch a list of
>>>documents.
>>>
>>>Ben
>>>
>>>On 6/8/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>>>>
>>>>Just recrawl and reindex every day. That was the simple answer.
>>>>The more complex answer is that you need to write custom code that
>>>>deletes documents from your index and crawldb.
>>>>If you do not want to learn the internals of Nutch completely, just
>>>>recrawl and reindex. :)
>>>>
>>>>Stefan
>>>>On 06.06.2006 at 19:42, Benjamin Higgins wrote:
>>>>
>>>> > Hello,
>>>> >
>>>> > I'm trying to make Nutch suitable for use on our (extensive) intranet.
>>>> > One problem I'm trying to solve is how best to tell Nutch to either
>>>> > reindex or remove a URL from the index.  I have a lot of pages that
>>>> > get changed, added and removed daily, and I'd prefer to have the
>>>> > changes reflected in Nutch's index immediately.
>>>> >
>>>> > I am able to generate a list of URLs that have changed or have been
>>>> > removed, so I definitely do not need to reindex everything; I just
>>>> > need a way to pass this list on to Nutch.
>>>> >
>>>> > How can I do this?
>>>> >
>>>> > Ben




_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
