In their URLs they seem to identify articles by a number matching: /(\d+)-[^/]*$|-(\d+)$

Provided you use this identifier when you store articles in a database, you can write a spider middleware that queries the database to determine whether you already have the article, and allow the request if and only if you don't. To speed up rejection of already-seen articles, you can cache all the article identifiers from the previous day when the spider opens (in open_spider()). For the complement, deciding which articles to approve for scraping, think of a workaround: for example, their identifier is probably generated by a sequence, so sort by it and don't look further back than a few days before the current session.
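A minimal sketch of that idea, without the Scrapy plumbing. The class and method names here are hypothetical; in a real spider middleware you would populate the cache in open_spider() from your database and raise scrapy.exceptions.IgnoreRequest (or return nothing from process_spider_output) instead of returning a boolean:

```python
import re

# Pattern from the post: article URLs end in either
# ".../<digits>-slug" or ".../slug-<digits>".
ARTICLE_ID_RE = re.compile(r"/(\d+)-[^/]*$|-(\d+)$")

def article_id(url):
    """Return the numeric article identifier from a URL, or None."""
    m = ARTICLE_ID_RE.search(url)
    if not m:
        return None
    # Exactly one of the two groups matched.
    return m.group(1) or m.group(2)

class SeenArticlesMiddleware:
    """Sketch of a dedupe check for a spider middleware.

    `known_ids` stands in for the identifiers you would load from
    the database (e.g. everything stored since yesterday)."""

    def __init__(self, known_ids=None):
        self.known_ids = set(known_ids or ())

    def allow(self, url):
        """True iff this article has not been scraped before."""
        aid = article_id(url)
        if aid is None:
            return True  # not an article URL; let it through
        return aid not in self.known_ids
```

Keeping the cache as an in-memory set makes the rejection path a single lookup per request, so the database is only hit once, at spider startup.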
Also, look at http://www.bbc.com/news/10628494 ; you can parse the article date from the feed. Depending on the site, you may miss some articles if erroneous dates are stated.

On Saturday, 10 May 2014 19:30:03 UTC+3, jai ven wrote:
> Hello,
>
> I need guidance of any form. What I'm trying to do is to scrape a news
> website such as bbc.co.uk for their daily updates only. Is there a way to
> do that in Scrapy without having to crawl the whole website?
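The feed-date idea can be sketched with the standard library alone. This assumes the common RSS 2.0 layout (item/link/pubDate elements); adjust the element names for the feed you actually crawl, and feed the resulting links to your spider as start URLs:

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
import xml.etree.ElementTree as ET

def fresh_links(rss_xml, max_age=timedelta(days=1), now=None):
    """Return links of RSS items whose pubDate is within max_age.

    Items with a missing or unparsable date are skipped, which is
    one way a site stating erroneous dates can cost you articles."""
    now = now or datetime.now(timezone.utc)
    links = []
    for item in ET.fromstring(rss_xml).iter("item"):
        pub = item.findtext("pubDate")
        link = item.findtext("link")
        if not pub or not link:
            continue
        when = parsedate_to_datetime(pub)  # RFC 2822 dates, as in RSS
        if now - when <= max_age:
            links.append(link)
    return links
```

Crawling only the handful of links the feed advertises each day is far cheaper than re-crawling the whole site, which is the point of the original question.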