Still do not clearly understand your plans, sorry. However, pages from the webdb are recrawled every 30 days (configurable in nutch-default.xml). The new folders are the so-called segments, and you can put them in the trash after 30 days. So what you can do is either never update your webdb with the fetched segment, which will not add new URLs, or alternatively use a URL filter.
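For example (assuming a Nutch 0.7 style setup; double-check the exact property and file names against your version), the refetch interval is a property in nutch-default.xml that you can override in nutch-site.xml:

    <property>
      <name>db.default.fetch.interval</name>
      <value>30</value>
      <description>Default number of days between re-fetches of a page.</description>
    </property>

And a URL filter is just a list of regex rules, e.g. in conf/regex-urlfilter.txt (or crawl-urlfilter.txt when you use the crawl command):

    # example rules; adjust the host to your sites
    -^http://www\.example\.com/do-not-recrawl/
    +.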
You will find a lot of posts in the mail archive regarding this issue.
Stefan
On 19.12.2005 at 15:18, Pushpesh Kr. Rajwanshi wrote:

Hi Stefan,

Thanks for the lightning-fast reply. I was amazed to see such a quick response;
I really appreciate it.

Actually, what I am really looking for is this: suppose I run a crawl over a few
sites, say 5, to some depth, say 2. What I want is that the next time I run a
crawl, it should reuse the webdb contents it populated the first time. (Assuming
a successful crawl. Yes, you are right, a crawl that suddenly broke down won't
work, as its data has lost integrity.)

As you said, we can run the tools provided by Nutch to do the step-by-step
commands needed to crawl, but isn't there some way I can reuse the existing
crawl data? Maybe it involves changing code, but that's OK. Just one more quick
question: why does every crawl need a new directory, and why isn't there an
option to at least reuse the webdb? Maybe I am asking something silly, but I am
clueless :-(

Or, as you said, maybe what I can do is explore the steps you mentioned and
get what I need.

Thanks again,
Pushpesh


On 12/19/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:

It is difficult to answer your question since the vocabulary used may be
wrong.
You can refetch pages, no problem. But you cannot continue a crashed
fetch process.
Nutch provides a tool that runs a set of steps: segment generation,
fetching, db updating, etc.
So maybe first try to run these steps manually instead of using the
crawl command.
Then you will probably already get an idea of where you can jump in to
grab the data you need.
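For reference, one manual round looks roughly like this (a sketch following the
Nutch 0.7 whole-web tutorial; run bin/nutch with no arguments to see the exact
tool names and flags in your release):

    bin/nutch admin db -create          # create a fresh webdb (only the first time)
    bin/nutch inject db -urlfile urls   # seed the webdb with your start URLs
    bin/nutch generate db segments      # generate a fetchlist into a new segment
    s=`ls -d segments/2* | tail -1`     # pick up the newest segment directory
    bin/nutch fetch $s                  # fetch the pages in that segment
    bin/nutch updatedb db $s            # fold the fetched links back into the webdb
    bin/nutch index $s                  # index the fetched segment

Repeating generate/fetch/updatedb goes one level deeper each round, and because
the webdb directory survives between rounds, this is also the point where you
could re-use it for a later crawl instead of starting from scratch.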

Stefan

On 19.12.2005 at 14:46, Pushpesh Kr. Rajwanshi wrote:

Hi,

I am crawling some sites using Nutch. My requirement is that when I run
a Nutch crawl, it should somehow be able to reuse the data in the webdb
populated in a previous crawl.

In other words, my question is: suppose my crawl is running and I cancel
it somewhere in the middle; is there some way I can resume the crawl?


I don't even know if I can do this at all, but if there is some way,
please throw some light on this.

TIA

Regards,
Pushpesh


