Pushpesh,

We extended Nutch with a whitelist filter; you might find it useful. Check
the comments from Matt Kangas here:
http://issues.apache.org/jira/browse/NUTCH-87;jsessionid=6F6AD5423357184CF57B51B003201C49?page=all

--Flo
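Until that patch is applied, such a whitelist filter is easy to sketch as a
URLFilter plugin. The class below is only an illustration, not the code from
NUTCH-87: the class name, the whitelist file name, and the loading logic are
assumptions. Only the URLFilter contract (return the URL to keep it, null to
drop it) is taken from Nutch, and depending on your Nutch version the
interface may expect extra configuration methods.

// Minimal whitelist filter sketch -- illustrative only, not the NUTCH-87
// patch. Assumes the org.apache.nutch.net.URLFilter extension point:
// filter() returns the URL string to accept it, or null to reject it.
package org.example.nutch.filters;   // hypothetical package

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.nutch.net.URLFilter;

public class WhitelistURLFilter implements URLFilter {

  // URL prefixes to accept, loaded once; the file name is an assumption.
  private final List<String> prefixes = new ArrayList<String>();

  public WhitelistURLFilter() throws IOException {
    BufferedReader in = new BufferedReader(new FileReader("whitelist-urlfilter.txt"));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        line = line.trim();
        if (line.length() > 0 && !line.startsWith("#")) {
          prefixes.add(line);
        }
      }
    } finally {
      in.close();
    }
  }

  // Accept a URL only if it starts with one of the whitelisted prefixes;
  // everything else is dropped, so no "-" rules are needed.
  public String filter(String urlString) {
    for (String prefix : prefixes) {
      if (urlString.startsWith(prefix)) {
        return urlString;
      }
    }
    return null;
  }
}

You would still have to wrap it as a plugin (a plugin.xml plus an entry in
the plugin.includes property), the same way the existing urlfilter-regex
plugin is wired up.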
Pushpesh Kr. Rajwanshi wrote:

> Hmmm... actually my requirement is a bit more complex than it seems, so
> url filters alone probably won't do. I am not filtering urls based only
> on a domain name; within a domain I want to discard some urls, and since
> they don't follow a pattern I can't use url filters. Otherwise url
> filters would have done a great job.
>
> Thanks anyway,
> Pushpesh
>
> On 12/19/05, "Håvard W. Kongsgård" <[EMAIL PROTECTED]> wrote:
>
>> About this "blocking": you can try to use the urlfilters and change the
>> filter between each fetch/generate:
>>
>> +^http://www.abc.com
>>
>> -^http://www.bbc.co.uk
>>
>> Pushpesh Kr. Rajwanshi wrote:
>>
>>> Oh, this is pretty good and quite helpful material I wanted. Thanks
>>> Havard for this. Seems like this will help me write the code for the
>>> stuff I need :-)
>>>
>>> Thanks and Regards,
>>> Pushpesh
>>>
>>> On 12/19/05, "Håvard W. Kongsgård" <[EMAIL PROTECTED]> wrote:
>>>
>>>> Try using the whole-web fetching method instead of the crawl method.
>>>>
>>>> http://lucene.apache.org/nutch/tutorial.html#Whole-web+Crawling
>>>>
>>>> http://wiki.media-style.com/display/nutchDocu/quick+tutorial
>>>>
>>>> Pushpesh Kr. Rajwanshi wrote:
>>>>
>>>>> Hi Stefan,
>>>>>
>>>>> Thanks for the lightning-fast reply. I was amazed to see such a
>>>>> quick response, really appreciate it.
>>>>>
>>>>> Actually, what I am really looking for is this: suppose I run a
>>>>> crawl for some sites, say 5, and for some depth, say 2. What I want
>>>>> is that the next time I run a crawl, it should reuse the webdb
>>>>> contents it populated the first time. (Assuming a successful crawl.
>>>>> Yes, you are right that a suddenly broken-down crawl won't work, as
>>>>> it has lost the integrity of its data.)
>>>>>
>>>>> As you said, we can run the tools provided by nutch to do the
>>>>> step-by-step commands needed to crawl, but isn't there some way I
>>>>> can reuse the existing crawl data? Maybe it involves changing code,
>>>>> but that's ok. Just one more quick question: why does every crawl
>>>>> need a new directory, and why isn't there an option to at least
>>>>> reuse the webdb? Maybe I am asking something silly, but I am
>>>>> clueless :-(
>>>>>
>>>>> Or, as you said, maybe what I can do is explore the steps you
>>>>> mentioned and get what I need.
>>>>>
>>>>> Thanks again,
>>>>> Pushpesh
>>>>>
>>>>> On 12/19/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> It is difficult to answer your question since the vocabulary used
>>>>>> may be wrong. You can refetch pages, no problem, but you cannot
>>>>>> continue a crashed fetch process.
>>>>>>
>>>>>> Nutch provides a tool that runs a set of steps: segment generation,
>>>>>> fetching, db updating, etc. So first try to run these steps
>>>>>> manually instead of using the crawl command. Then you will already
>>>>>> get an idea of where you can jump in to grab the data you need.
>>>>>>
>>>>>> Stefan
>>>>>>
>>>>>> On 19.12.2005, at 14:46, Pushpesh Kr. Rajwanshi wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am crawling some sites using nutch.
>>>>>>> My requirement is: when I run a nutch crawl, it should somehow be
>>>>>>> able to reuse the data in the webdb populated in a previous crawl.
>>>>>>>
>>>>>>> In other words, my question is: suppose my crawl is running and I
>>>>>>> cancel it somewhere in the middle, is there some way I can resume
>>>>>>> the crawl?
>>>>>>>
>>>>>>> I don't even know if I can do this at all, but if there is some
>>>>>>> way, please throw some light on this.
>>>>>>>
>>>>>>> TIA
>>>>>>>
>>>>>>> Regards,
>>>>>>> Pushpesh
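For what it's worth, the manual cycle Stefan refers to looks roughly like
this with the 0.7-era command-line tools. The exact arguments differ between
Nutch versions, so treat it as a sketch and check bin/nutch and the
whole-web tutorial linked above. The point is that the db directory is
created once and then reused by every later generate/fetch/updatedb round,
which is exactly the webdb reuse being asked about:

# create the webdb once; every later round reuses this same "db" directory
bin/nutch admin db -create
bin/nutch inject db -urlfile urls   # seed the webdb with your start URLs

# one fetch round -- repeat for each additional "depth" level
bin/nutch generate db segments      # pick URLs due for fetching into a new segment
s=`ls -d segments/2* | tail -1`     # the segment directory generate just created
bin/nutch fetch $s                  # fetch the pages in that segment
bin/nutch updatedb db $s            # fold fetched pages and new links back into the webdb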