About this "blocking" you can try to use the urlfilters, change the filter between each fetch/generate

+^http://www.abc.com

-^http://www.bbc.co.uk
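
A minimal sketch of that idea (assuming the regex URL-filter plugin is the
one in use and conf/regex-urlfilter.txt, or conf/crawl-urlfilter.txt for the
crawl command, is its configuration file; the host names are placeholders):

    # round 1: allow only abc.com, drop everything else
    +^http://www.abc.com
    -.

    # edit the same file before round 2: now allow only bbc.co.uk
    +^http://www.bbc.co.uk
    -.

As suggested above, editing the file between rounds then controls which URLs
are allowed into the following fetch/generate cycles.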


Pushpesh Kr. Rajwanshi wrote:

Oh, this is pretty good and exactly the kind of helpful material I wanted. Thanks,
Håvard, for this. It seems this will help me write the code for the things I need :-)

Thanks and Regards,
Pushpesh



On 12/19/05, "Håvard W. Kongsgård" <[EMAIL PROTECTED]> wrote:
Try using the whole-web fetching method instead of the crawl method.

http://lucene.apache.org/nutch/tutorial.html#Whole-web+Crawling

http://wiki.media-style.com/display/nutchDocu/quick+tutorial
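
The rough shape of that method, looping against one long-lived webdb (a
sketch from memory of the 0.7-era whole-web tutorial linked above; the exact
flags and the urls.txt seed file are assumptions, so treat the tutorial as
authoritative):

    bin/nutch admin db -create             # create the webdb once
    bin/nutch inject db -urlfile urls.txt  # seed it with start URLs

    # repeat this round as often as you like, against the same db:
    bin/nutch generate db segments         # write a new fetch list (segment)
    s=`ls -d segments/2* | tail -1`        # pick up the newest segment
    bin/nutch fetch $s                     # fetch it
    bin/nutch updatedb db $s               # fold the results back into the webdb

Because the webdb survives each round, you can keep generating, fetching and
updating against it; that is the reuse the one-shot crawl command does not
give you.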


Pushpesh Kr. Rajwanshi wrote:

Hi Stefan,

Thanks for the lightning-fast reply. I was amazed to see such a quick
response; I really appreciate it.

Actually, what I am really looking for is this: suppose I run a crawl over
some sites, say 5, to some depth, say 2. What I want is that the next time I
run a crawl, it reuses the webdb contents populated the first time.
(Assuming a successful crawl. Yes, you are right that a crawl that breaks
down suddenly won't work, as the integrity of its data is lost.)

As you said, we can run the tools provided by Nutch to do the step-by-step
commands needed to crawl, but isn't there some way I can reuse the existing
crawl data? Maybe it involves changing code, but that's OK. Just one more
quick question: why does every crawl need a new directory, and why isn't
there an option to at least reuse the webdb? Maybe I am asking something
silly, but I am clueless :-(

Or, as you said, maybe what I can do is explore the steps you mentioned and
get what I need.

Thanks again,
Pushpesh


On 12/19/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:


It is difficult to answer your question, since the vocabulary used may be
wrong. You can refetch pages, no problem, but you cannot continue a crashed
fetch process.
Nutch provides a tool that runs a set of steps: segment generation,
fetching, db updating, etc.
So maybe first try running these steps manually instead of using the crawl
command. Then you will probably already get an idea of where you can jump
in to grab the data you need.
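
For orientation, the one-shot command roughly bundles those steps (a sketch
of the idea, not an exact transcript of what the tool runs; the directory
names are placeholders):

    bin/nutch crawl urls -dir crawl-dir -depth 2
    # is approximately: create and seed the webdb, then "depth" rounds of
    #   generate -> fetch -> updatedb,
    # followed by indexing, all inside the fresh crawl-dir it creates.

Running generate/fetch/updatedb yourself against an existing db is where you
can "jump in" and keep reusing the same webdb.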

Stefan

On 19.12.2005, at 14:46, Pushpesh Kr. Rajwanshi wrote:



Hi,

I am crawling some sites using Nutch. My requirement is that when I run a
Nutch crawl, it should somehow be able to reuse the data in the webdb
populated by a previous crawl.

In other words, my question is: suppose my crawl is running and I cancel it
somewhere in the middle, is there some way I can resume the crawl?


I don't even know if I can do this at all; if there is some way, please
throw some light on this.

TIA

Regards,
Pushpesh


