Hmmm... actually my requirement is a bit more complex than it seems, so URL filters alone probably won't do. I am not filtering URLs based only on the domain name; within a domain I also want to discard some URLs, and since those don't follow a pattern, I can't use URL filters. Otherwise URL filters would have done a great job.
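(One way around that limitation, as a sketch rather than anything from the thread: keep the irregular URLs in an explicit blocklist file and filter the fetch list against it before injecting. The file names `blocklist.txt` and `urls.txt` here are hypothetical.)

```shell
# Hypothetical blocklist of same-domain URLs that follow no regex pattern.
cat > blocklist.txt <<'EOF'
http://www.abc.com/private/report.html
http://www.abc.com/tmp/old-page.html
EOF

# Hypothetical candidate URL list.
cat > urls.txt <<'EOF'
http://www.abc.com/index.html
http://www.abc.com/private/report.html
http://www.abc.com/news/today.html
http://www.abc.com/tmp/old-page.html
EOF

# -F fixed strings, -x whole-line match, -v invert, -f read patterns from file:
# keep only URLs that are NOT on the blocklist.
grep -Fxvf blocklist.txt urls.txt > filtered.txt
cat filtered.txt
# prints the two URLs that are not on the blocklist
```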
Thanks anyway,
Pushpesh

On 12/19/05, "Håvard W. Kongsgård" <[EMAIL PROTECTED]> wrote:
> About this "blocking": you can try to use the urlfilters and change the
> filter between each fetch/generate:
>
> +^http://www.abc.com
> -^http://www.bbc.co.uk
>
> Pushpesh Kr. Rajwanshi wrote:
> >Oh, this is pretty good and just the helpful material I wanted. Thanks,
> >Havard, for this. It seems like this will help me write the code for
> >the stuff I need :-)
> >
> >Thanks and Regards,
> >Pushpesh
> >
> >On 12/19/05, "Håvard W. Kongsgård" <[EMAIL PROTECTED]> wrote:
> >>Try using the whole-web fetching method instead of the crawl method.
> >>
> >>http://lucene.apache.org/nutch/tutorial.html#Whole-web+Crawling
> >>http://wiki.media-style.com/display/nutchDocu/quick+tutorial
> >>
> >>Pushpesh Kr. Rajwanshi wrote:
> >>>Hi Stefan,
> >>>
> >>>Thanks for the lightning-fast reply. I was amazed to see such a quick
> >>>response; I really appreciate it.
> >>>
> >>>Actually, what I am really looking for is this: suppose I run a crawl
> >>>for some sites, say 5, to some depth, say 2. What I want is that the
> >>>next time I run a crawl, it should reuse the webdb contents it
> >>>populated the first time. (Assuming a successful crawl; yes, you are
> >>>right that a crawl which broke down suddenly won't work, as it has
> >>>lost its data integrity.)
> >>>
> >>>As you said, we can run the tools provided by Nutch to do the
> >>>step-by-step commands needed to crawl, but isn't there some way I can
> >>>reuse the existing crawl data? Maybe it involves changing code, but
> >>>that's OK. Just one more quick question: why does every crawl need a
> >>>new directory, and isn't there an option to at least reuse the webdb?
> >>>Maybe I am asking something silly, but I am clueless :-(
> >>>
> >>>Or, as you said, maybe what I can do is explore the steps you
> >>>mentioned and get what I need.
> >>>
> >>>Thanks again,
> >>>Pushpesh
> >>>
> >>>On 12/19/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> >>>>It is difficult to answer your question since the vocabulary used
> >>>>may be wrong. You can refetch pages, no problem, but you cannot
> >>>>continue a crashed fetch process.
> >>>>Nutch provides a tool that runs a set of steps: segment generation,
> >>>>fetching, db updating, etc.
> >>>>So first try to run these steps manually instead of using the crawl
> >>>>command. Then you may already get an idea of where you can jump in
> >>>>to grab the data you need.
> >>>>
> >>>>Stefan
> >>>>
> >>>>On 19.12.2005 at 14:46, Pushpesh Kr. Rajwanshi wrote:
> >>>>
> >>>>>Hi,
> >>>>>
> >>>>>I am crawling some sites using Nutch. My requirement is that when
> >>>>>I run a Nutch crawl, it should somehow be able to reuse the data
> >>>>>in the webdb populated in the previous crawl.
> >>>>>
> >>>>>In other words, my question is: suppose my crawl is running and I
> >>>>>cancel it somewhere in the middle; is there some way I can resume
> >>>>>the crawl?
> >>>>>
> >>>>>I don't know if I can do this at all, but if there is some way,
> >>>>>please throw some light on this.
> >>>>>
> >>>>>TIA
> >>>>>
> >>>>>Regards,
> >>>>>Pushpesh
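(For reference, the `+`/`-` rules Håvard quotes above are the format of Nutch's regex-based URL filter configuration; a sketch of such a filter file, using the hosts from the thread and assuming the usual one-prefixed-regex-per-line format in which rules are checked in order:)

```
# Sketch of a Nutch regex urlfilter file (hosts taken from the thread above).
# Accept everything under www.abc.com
+^http://www.abc.com
# Drop everything under www.bbc.co.uk
-^http://www.bbc.co.uk
# Reject anything else
-.
```

(Håvard's trick is to edit this file between generate/fetch cycles, so different rounds of the whole-web crawl accept or block different hosts.)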