In their URLs they seem to identify articles by a number matching: /(\d+)-[^/]*$|-(\d+)$

Provided you use this identifier when you store articles in a database, you can write a spider middleware that queries the database to determine whether you already have the article, and allow the request if and only if you don't. To speed up rejection of already-seen articles, you can cache all the article identifiers from the previous day when the spider opens (in open_spider()). For the complement, deciding which articles to approve for scraping, think of a workaround: for example, their identifier is probably generated by a sequence, so sort by it and don't look further back than a few days before the current session.
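A minimal sketch of that idea, without the Scrapy plumbing. The class and method names here are hypothetical; in a real spider middleware you would populate the cache in open_spider() from your database and raise scrapy.exceptions.IgnoreRequest (or return nothing from process_spider_output) instead of returning a boolean:

```python
import re

# Pattern from the post: article URLs end in either
# ".../<digits>-slug" or ".../slug-<digits>".
ARTICLE_ID_RE = re.compile(r"/(\d+)-[^/]*$|-(\d+)$")

def article_id(url):
    """Return the numeric article identifier from a URL, or None."""
    m = ARTICLE_ID_RE.search(url)
    if not m:
        return None
    # Exactly one of the two groups matched.
    return m.group(1) or m.group(2)

class SeenArticlesMiddleware:
    """Sketch of a dedupe check for a spider middleware.

    `known_ids` stands in for the identifiers you would load from
    the database (e.g. everything stored since yesterday)."""

    def __init__(self, known_ids=None):
        self.known_ids = set(known_ids or ())

    def allow(self, url):
        """True iff this article has not been scraped before."""
        aid = article_id(url)
        if aid is None:
            return True  # not an article URL; let it through
        return aid not in self.known_ids
```

Keeping the cache as an in-memory set makes the rejection path a single lookup per request, so the database is only hit once, at spider startup.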
Also, look at http://www.bbc.com/news/10628494 ; you can parse the article date from the feed. Depending on the site, you may miss some articles if erroneous dates are stated.

On Saturday, 10 May 2014 19:30:03 UTC+3, jai ven wrote:
> Hello,
>
> I need guidance of any form. What I'm trying to do is to scrape a news
> website such as bbc.co.uk for their daily updates only. Is there a way to
> do that in Scrapy without having to crawl the whole website?
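The feed-date idea can be sketched with the standard library alone. This assumes the common RSS 2.0 layout (item/link/pubDate elements); adjust the element names for the feed you actually crawl, and feed the resulting links to your spider as start URLs:

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
import xml.etree.ElementTree as ET

def fresh_links(rss_xml, max_age=timedelta(days=1), now=None):
    """Return links of RSS items whose pubDate is within max_age.

    Items with a missing or unparsable date are skipped, which is
    one way a site stating erroneous dates can cost you articles."""
    now = now or datetime.now(timezone.utc)
    links = []
    for item in ET.fromstring(rss_xml).iter("item"):
        pub = item.findtext("pubDate")
        link = item.findtext("link")
        if not pub or not link:
            continue
        when = parsedate_to_datetime(pub)  # RFC 2822 dates, as in RSS
        if now - when <= max_age:
            links.append(link)
    return links
```

Crawling only the handful of links the feed advertises each day is far cheaper than re-crawling the whole site, which is the point of the original question.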