What I called a 'smart refresh algorithm' (though I realise that may not be the right term...) is the ability to schedule and adjust the crawling refresh period of each page depending on how often its content changes. You specify a range: a minimum and a maximum crawl refresh interval. If the content of a page never changes, its refresh period tends toward the maximum; if the content has changed every time you fetch it again, it tends toward the minimum; and if it changes only sometimes, it moves around within that range.
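To make it concrete, here is a rough sketch of the kind of policy I mean (just illustrative Python, not an existing Scrapy feature; the class, method, and parameter names are made up, and the halving/doubling rule is only one possible way to move within the range; in a real crawl the state would also need to be persisted, much like DeltaFetch stores its fingerprints):

import hashlib
import time

class AdaptiveRefreshTracker:
    """Track a per-URL revisit interval that adapts to how often the content changes."""

    def __init__(self, min_interval=3600, max_interval=7 * 24 * 3600):
        self.min_interval = min_interval   # revisit at least this often (seconds)
        self.max_interval = max_interval   # never wait longer than this (seconds)
        self.state = {}                    # url -> (content_hash, interval, next_fetch_time)

    def record_fetch(self, url, body):
        """Call after each fetch; body is bytes (e.g. response.body)."""
        digest = hashlib.sha1(body).hexdigest()
        old = self.state.get(url)
        if old is None:
            interval = self.min_interval                        # new page: start at the minimum
        elif old[0] != digest:
            interval = max(self.min_interval, old[1] / 2)       # content changed: revisit sooner
        else:
            interval = min(self.max_interval, old[1] * 2)       # unchanged: back off toward the maximum
        self.state[url] = (digest, interval, time.time() + interval)

    def is_due(self, url):
        """True if the URL was never seen or its scheduled refresh time has passed."""
        entry = self.state.get(url)
        return entry is None or time.time() >= entry[2]

In a spider this could be wired into the scheduling logic or a middleware: skip requests for URLs where is_due() returns False, and call record_fetch() from the response callback.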
Is there something similar? It would be very strange if it didn't exist, because I just can't imagine crawling a big site without this functionality (and without a good refresh rate for the pages, of course). If I understand DeltaFetch correctly, it is a way to avoid re-crawling pages that have already been fetched.

On Wednesday, July 16, 2014 11:01:04 AM UTC+2, Paul Tremberth wrote:
>
> Hi Frédéric,
>
> what do you mean by "smart refresh crawling"?
> scrapylib has the DeltaFetch spider middleware
>
> https://github.com/scrapinghub/scrapylib/blob/master/scrapylib/deltafetch.py
>
> Paul.
>
> On Wednesday, July 16, 2014 10:15:11 AM UTC+2, Magikmeuh wrote:
>>
>> Hello everyone,
>>
>> Does Scrapy have a smart refresh crawling algorithm?
>>
>> I don't see any trace of it in the documentation or in this Google group.
>>
>> Has someone already implemented it?
>>
>> Thanks
>>
>> --
>> Frédéric Passaniti