What I called a 'smart refresh algorithm' (though I realise that may not be the right term...) is the ability to schedule and adjust the crawling refresh period of each page depending on how often its content changes. You specify a range: a minimum and a maximum crawl refresh interval. If the content of a page never changes, its refresh period tends toward the maximum; if the content has changed every time you fetch it again, it tends toward the minimum; and if it changes only sometimes, it moves around within that range.
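To make it concrete, here is a rough sketch of the kind of policy I mean (just illustrative Python, not an existing Scrapy feature; the class, method, and parameter names are made up, and the halving/doubling rule is only one possible way to move within the range; in a real crawl the state would also need to be persisted, much like DeltaFetch stores its fingerprints):

import hashlib
import time

class AdaptiveRefreshTracker:
    """Track a per-URL revisit interval that adapts to how often the content changes."""

    def __init__(self, min_interval=3600, max_interval=7 * 24 * 3600):
        self.min_interval = min_interval   # revisit at least this often (seconds)
        self.max_interval = max_interval   # never wait longer than this (seconds)
        self.state = {}                    # url -> (content_hash, interval, next_fetch_time)

    def record_fetch(self, url, body):
        """Call after each fetch; body is bytes (e.g. response.body)."""
        digest = hashlib.sha1(body).hexdigest()
        old = self.state.get(url)
        if old is None:
            interval = self.min_interval                        # new page: start at the minimum
        elif old[0] != digest:
            interval = max(self.min_interval, old[1] / 2)       # content changed: revisit sooner
        else:
            interval = min(self.max_interval, old[1] * 2)       # unchanged: back off toward the maximum
        self.state[url] = (digest, interval, time.time() + interval)

    def is_due(self, url):
        """True if the URL was never seen or its scheduled refresh time has passed."""
        entry = self.state.get(url)
        return entry is None or time.time() >= entry[2]

In a spider this could be wired into the scheduling logic or a middleware: skip requests for URLs where is_due() returns False, and call record_fetch() from the response callback.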
Is there something similar? It would be very strange if it didn't exist, because I just can't imagine crawling a big site without this functionality (and without a good refresh rate for the pages, of course). If I understand DeltaFetch correctly, it is a way to avoid re-crawling pages that have already been fetched.

On Wednesday, July 16, 2014 11:01:04 AM UTC+2, Paul Tremberth wrote:
>
> Hi Frédéric,
>
> what do you mean by "smart refresh crawling"?
> scrapylib has the DeltaFetch spider middleware
>
> https://github.com/scrapinghub/scrapylib/blob/master/scrapylib/deltafetch.py
>
> Paul.
>
> On Wednesday, July 16, 2014 10:15:11 AM UTC+2, Magikmeuh wrote:
>>
>> Hello everyone,
>>
>> Does Scrapy have a smart refresh crawling algorithm?
>>
>> I don't see any trace of it in the documentation or in this Google group.
>>
>> Has someone already implemented it?
>>
>> Thanks
>>
>> --
>> Frédéric Passaniti