Still do not clearly understand your plans, sorry. However, pages from the webdb are recrawled every 30 days (configurable in nutch-default.xml). The new folders are the so-called segments, and you can put them in the trash after 30 days. So what you can do is either never update your webdb with the fetched segment, which will not add new URLs, or alternatively use a URL filter.
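For example (assuming a Nutch 0.7 style setup; double-check the exact property and file names against your version), the refetch interval is a property in nutch-default.xml that you can override in nutch-site.xml:

    <property>
      <name>db.default.fetch.interval</name>
      <value>30</value>
      <description>Default number of days between re-fetches of a page.</description>
    </property>

And a URL filter is just a list of regex rules, e.g. in conf/regex-urlfilter.txt (or crawl-urlfilter.txt when you use the crawl command):

    # example rules; adjust the host to your sites
    -^http://www\.example\.com/do-not-recrawl/
    +.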
You will find a lot of posts in the mail archive regarding this issue.
Stefan
On 19.12.2005 at 15:18, Pushpesh Kr. Rajwanshi wrote:

Hi Stefan,

Thanks for the lightning-fast reply. I was amazed to see such a quick response;
I really appreciate it.

Actually, what I am really looking for is this: suppose I run a crawl over a few
sites, say 5, to some depth, say 2. What I want is that the next time I run a
crawl, it should reuse the webdb contents it populated the first time. (Assuming
a successful crawl. Yes, you are right, a crawl that suddenly broke down won't
work, as its data has lost integrity.)

As you said, we can run the tools provided by Nutch to do the step-by-step
commands needed to crawl, but isn't there some way I can reuse the existing
crawl data? Maybe it involves changing code, but that's OK. Just one more quick
question: why does every crawl need a new directory, and why isn't there an
option to at least reuse the webdb? Maybe I am asking something silly, but I am
clueless :-(

Or, as you said, maybe what I can do is explore the steps you mentioned and
get what I need.

Thanks again,
Pushpesh


On 12/19/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:

It is difficult to answer your question since the vocabulary used may be
wrong.
You can refetch pages, no problem. But you cannot continue a crashed
fetch process.
Nutch provides a tool that runs a set of steps: segment generation,
fetching, db updating, etc.
So maybe first try to run these steps manually instead of using the
crawl command.
Then you will probably already get an idea of where you can jump in to
grab the data you need.
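For reference, one manual round looks roughly like this (a sketch following the
Nutch 0.7 whole-web tutorial; run bin/nutch with no arguments to see the exact
tool names and flags in your release):

    bin/nutch admin db -create          # create a fresh webdb (only the first time)
    bin/nutch inject db -urlfile urls   # seed the webdb with your start URLs
    bin/nutch generate db segments      # generate a fetchlist into a new segment
    s=`ls -d segments/2* | tail -1`     # pick up the newest segment directory
    bin/nutch fetch $s                  # fetch the pages in that segment
    bin/nutch updatedb db $s            # fold the fetched links back into the webdb
    bin/nutch index $s                  # index the fetched segment

Repeating generate/fetch/updatedb goes one level deeper each round, and because
the webdb directory survives between rounds, this is also the point where you
could re-use it for a later crawl instead of starting from scratch.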

Stefan

On 19.12.2005 at 14:46, Pushpesh Kr. Rajwanshi wrote:

Hi,

I am crawling some sites using Nutch. My requirement is that when I run
a Nutch crawl, it should somehow be able to reuse the data in the webdb
populated in a previous crawl.

In other words, my question is: suppose my crawl is running and I cancel
it somewhere in the middle; is there some way I can resume the crawl?


I don't even know if I can do this at all, but if there is some way,
please throw some light on this.

TIA

Regards,
Pushpesh


