Hmmm... actually my requirement is a bit more complex than it seems, so URL filters alone probably won't do. I am not filtering URLs based only on the domain name; within a domain I also want to discard some URLs, and since those don't follow a pattern, I can't use URL filters. Otherwise URL filters would have done a great job.
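(One way around that limitation, as a sketch rather than anything from the thread: keep the irregular URLs in an explicit blocklist file and filter the fetch list against it before injecting. The file names `blocklist.txt` and `urls.txt` here are hypothetical.)

```shell
# Hypothetical blocklist of same-domain URLs that follow no regex pattern.
cat > blocklist.txt <<'EOF'
http://www.abc.com/private/report.html
http://www.abc.com/tmp/old-page.html
EOF

# Hypothetical candidate URL list.
cat > urls.txt <<'EOF'
http://www.abc.com/index.html
http://www.abc.com/private/report.html
http://www.abc.com/news/today.html
http://www.abc.com/tmp/old-page.html
EOF

# -F fixed strings, -x whole-line match, -v invert, -f read patterns from file:
# keep only URLs that are NOT on the blocklist.
grep -Fxvf blocklist.txt urls.txt > filtered.txt
cat filtered.txt
# prints the two URLs that are not on the blocklist
```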
Thanks anyway,
Pushpesh

On 12/19/05, "Håvard W. Kongsgård" <[EMAIL PROTECTED]> wrote:
> About this "blocking": you can try to use the urlfilters and change the
> filter between each fetch/generate:
>
> +^http://www.abc.com
> -^http://www.bbc.co.uk
>
> Pushpesh Kr. Rajwanshi wrote:
> >Oh, this is pretty good and just the helpful material I wanted. Thanks,
> >Havard, for this. It seems like this will help me write the code for
> >the stuff I need :-)
> >
> >Thanks and Regards,
> >Pushpesh
> >
> >On 12/19/05, "Håvard W. Kongsgård" <[EMAIL PROTECTED]> wrote:
> >>Try using the whole-web fetching method instead of the crawl method.
> >>
> >>http://lucene.apache.org/nutch/tutorial.html#Whole-web+Crawling
> >>http://wiki.media-style.com/display/nutchDocu/quick+tutorial
> >>
> >>Pushpesh Kr. Rajwanshi wrote:
> >>>Hi Stefan,
> >>>
> >>>Thanks for the lightning-fast reply. I was amazed to see such a quick
> >>>response; I really appreciate it.
> >>>
> >>>Actually, what I am really looking for is this: suppose I run a crawl
> >>>for some sites, say 5, to some depth, say 2. What I want is that the
> >>>next time I run a crawl, it should reuse the webdb contents it
> >>>populated the first time. (Assuming a successful crawl; yes, you are
> >>>right that a crawl which broke down suddenly won't work, as it has
> >>>lost its data integrity.)
> >>>
> >>>As you said, we can run the tools provided by Nutch to do the
> >>>step-by-step commands needed to crawl, but isn't there some way I can
> >>>reuse the existing crawl data? Maybe it involves changing code, but
> >>>that's OK. Just one more quick question: why does every crawl need a
> >>>new directory, and isn't there an option to at least reuse the webdb?
> >>>Maybe I am asking something silly, but I am clueless :-(
> >>>
> >>>Or, as you said, maybe what I can do is explore the steps you
> >>>mentioned and get what I need.
> >>>
> >>>Thanks again,
> >>>Pushpesh
> >>>
> >>>On 12/19/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> >>>>It is difficult to answer your question since the vocabulary used
> >>>>may be wrong. You can refetch pages, no problem, but you cannot
> >>>>continue a crashed fetch process.
> >>>>Nutch provides a tool that runs a set of steps: segment generation,
> >>>>fetching, db updating, etc.
> >>>>So first try to run these steps manually instead of using the crawl
> >>>>command. Then you may already get an idea of where you can jump in
> >>>>to grab the data you need.
> >>>>
> >>>>Stefan
> >>>>
> >>>>On 19.12.2005 at 14:46, Pushpesh Kr. Rajwanshi wrote:
> >>>>
> >>>>>Hi,
> >>>>>
> >>>>>I am crawling some sites using Nutch. My requirement is that when
> >>>>>I run a Nutch crawl, it should somehow be able to reuse the data
> >>>>>in the webdb populated in the previous crawl.
> >>>>>
> >>>>>In other words, my question is: suppose my crawl is running and I
> >>>>>cancel it somewhere in the middle; is there some way I can resume
> >>>>>the crawl?
> >>>>>
> >>>>>I don't know if I can do this at all, but if there is some way,
> >>>>>please throw some light on this.
> >>>>>
> >>>>>TIA
> >>>>>
> >>>>>Regards,
> >>>>>Pushpesh
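(For reference, the `+`/`-` rules Håvard quotes above are the format of Nutch's regex-based URL filter configuration; a sketch of such a filter file, using the hosts from the thread and assuming the usual one-prefixed-regex-per-line format in which rules are checked in order:)

```
# Sketch of a Nutch regex urlfilter file (hosts taken from the thread above).
# Accept everything under www.abc.com
+^http://www.abc.com
# Drop everything under www.bbc.co.uk
-^http://www.bbc.co.uk
# Reject anything else
-.
```

(Håvard's trick is to edit this file between generate/fetch cycles, so different rounds of the whole-web crawl accept or block different hosts.)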