Pushpesh,

We extended Nutch with a whitelist filter and you might find it useful.
Check the comments from Matt Kangas here:
http://issues.apache.org/jira/browse/NUTCH-87?page=all
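
If patching in that plugin is more than you need, a similar effect can often be
approximated with the stock regex URL filter: list the hosts you want to keep and
end with a catch-all reject. A minimal sketch (the file is conf/crawl-urlfilter.txt
or conf/regex-urlfilter.txt depending on how you run Nutch, and example.com /
example.org are just placeholder hosts):

  # accept pages only from the whitelisted hosts; the first matching rule wins
  +^http://([a-z0-9]*\.)*example.com/
  +^http://([a-z0-9]*\.)*example.org/
  # reject everything that did not match above
  -.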

--Flo

Pushpesh Kr. Rajwanshi wrote:

>hmmm... actually my requirement is a bit more complex than it seems, so URL
>filters alone probably won't do. I am not filtering URLs based only on some
>domain name; within a domain I want to discard some URLs, and since they
>don't actually follow a pattern I can't use URL filters. Otherwise URL
>filters would have done a great job.
>
>Thanks anyway
>Pushpesh
>
>
>On 12/19/05, "Håvard W. Kongsgård" <[EMAIL PROTECTED]> wrote:
>
>>About this "blocking": you can try to use the URL filters and change the
>>filter between each fetch/generate:
>>
>>+^http://www.abc.com
>>
>>-^http://www.bbc.co.uk
>>
>>
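
Concretely, with the stock urlfilter-regex plugin that amounts to keeping one
filter file per round and copying it into place before generating the next
fetchlist. A rough sketch only, assuming the whole-web style of running things
described further down (the round1/round2 file names are made up, and the target
may be conf/regex-urlfilter.txt or conf/crawl-urlfilter.txt depending on your setup):

  # round 1: only fetch www.abc.com
  cp round1-urlfilter.txt conf/regex-urlfilter.txt   # e.g. "+^http://www.abc.com" then "-."
  # ...run one generate/fetch/updatedb round (see the whole-web sketch below)...

  # round 2: swap in a filter that drops www.bbc.co.uk, then run the next round
  cp round2-urlfilter.txt conf/regex-urlfilter.txt   # e.g. "-^http://www.bbc.co.uk" then "+."
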
>>Pushpesh Kr. Rajwanshi wrote:
>>
>>>Oh, this is pretty good and quite helpful material; just what I wanted. Thanks
>>>Håvard for this. Seems like this will help me write the code for the stuff I need :-)
>>>
>>>Thanks and Regards,
>>>Pushpesh
>>>
>>>On 12/19/05, "Håvard W. Kongsgård" <[EMAIL PROTECTED]> wrote:
>>>
>>>>Try using the whole-web fetching method instead of the crawl method.
>>>>
>>>>http://lucene.apache.org/nutch/tutorial.html#Whole-web+Crawling
>>>>
>>>>http://wiki.media-style.com/display/nutchDocu/quick+tutorial
>>>>
>>>>
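
For reference, the whole-web method from those links replaces the single crawl
command with a few smaller commands, which is what makes it possible to keep one
webdb across runs. A rough sketch of the 0.7-era sequence (seeds.txt is a made-up
name for a flat file of start URLs, db/ and segments/ are just the conventional
directories; check "bin/nutch" with no arguments for the exact tool names and
flags in your version):

  # one-time setup: create an empty webdb and seed it with start URLs
  bin/nutch admin db -create
  bin/nutch inject db -urlfile seeds.txt

  # one fetch round; repeat as many times as you want against the same db
  bin/nutch generate db segments
  s=`ls -d segments/2* | tail -1`    # pick up the segment that was just generated
  bin/nutch fetch $s
  bin/nutch updatedb db $s
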
>>>>Pushpesh Kr. Rajwanshi wrote:
>>>>
>>>>>Hi Stefan,
>>>>>
>>>>>Thanks for the lightning-fast reply. I was amazed to see such a quick
>>>>>response; I really appreciate it.
>>>>>
>>>>>Actually, what I am really looking for is this: suppose I run a crawl for
>>>>>some sites, say 5, and for some depth, say 2. Then what I want is that the
>>>>>next time I run a crawl, it should reuse the webdb contents it populated the
>>>>>first time. (Assuming a successful crawl. Yes, you are right, a suddenly
>>>>>broken-down crawl won't work, as it has lost the integrity of its data.)
>>>>>
>>>>>As you said, we can run the tools provided by Nutch to do the step-by-step
>>>>>commands needed to crawl, but isn't there some way I can reuse the existing
>>>>>crawl data? Maybe it involves changing code, but that's OK. Just one more
>>>>>quick question: why does every crawl need a new directory, and isn't there
>>>>>an option to at least reuse the webdb? Maybe I am asking something silly,
>>>>>but I am clueless :-(
>>>>>
>>>>>Or, as you said, maybe what I can do is explore the steps you mentioned and
>>>>>get what I need.
>>>>>
>>>>>Thanks again,
>>>>>Pushpesh
>>>>>
>>>>>
>>>>>On 12/19/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>>It is difficult to answer your question since the vocabulary used may be
>>>>>>wrong.
>>>>>>You can refetch pages, no problem. But you cannot continue a crashed
>>>>>>fetch process.
>>>>>>Nutch provides a tool that runs a set of steps: segment generation,
>>>>>>fetching, db updating, etc.
>>>>>>So maybe first try to run these steps manually instead of using the
>>>>>>crawl command.
>>>>>>Then you will already get an idea of where you can jump in to grab the
>>>>>>data you need.
>>>>>>
>>>>>>Stefan
>>>>>>
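
Tying that back to the original question (same assumptions as the whole-web
sketch above, and the segment directory name below is made up): the webdb is not
touched while pages are being fetched, so a crawl that dies or is cancelled in
the middle of fetching leaves the db usable. The closest thing to "resuming" is
to throw away the half-fetched segment and generate a fresh fetchlist from the
same db:

  # the interrupted segment is not trustworthy; discard it
  rm -rf segments/20051219103000
  # the webdb in db/ is still intact, so start another round against it
  bin/nutch generate db segments
  # ...then fetch and updatedb as in a normal round
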
>>>>>>On 19.12.2005 at 14:46, Pushpesh Kr. Rajwanshi wrote:
>>>>>>
>>>>>>>Hi,
>>>>>>>
>>>>>>>I am crawling some sites using Nutch. My requirement is that when I run a
>>>>>>>Nutch crawl, it should somehow be able to reuse the data in the webdb
>>>>>>>populated by the previous crawl.
>>>>>>>
>>>>>>>In other words, my question is: suppose my crawl is running and I cancel it
>>>>>>>somewhere in the middle; is there some way I can resume the crawl?
>>>>>>>
>>>>>>>I don't even know if I can do this at all, but if there is some way, please
>>>>>>>throw some light on this.
>>>>>>>
>>>>>>>TIA
>>>>>>>
>>>>>>>Regards,
>>>>>>>Pushpesh
