Hi Peter,

I've answered your question on SO, hope it helps :)

On Thu, Jun 4, 2015 at 10:41 AM, Peter Benson <[email protected]> wrote:
> Hi all,
>
> My original Stack Overflow question can be found here:
> http://stackoverflow.com/questions/29883132/stop-scrapy-crawling-the-same-urls
>
> I've written a very basic CrawlSpider but it keeps crawling the same URLs
> - I was under the impression that Scrapy had a built-in dupe-filter that
> stopped it doing this? Have I accidentally written something that is
> overriding this?
>
> So you don't have to go to the link - *my spider*;
>
> class LsbuSpider(CrawlSpider):
>     name = "lsbu6"
>     allowed_domains = ["lsbu.ac.uk"]
>     start_urls = [
>         "http://www.lsbu.ac.uk"]
>     rules = [
>         Rule(SgmlLinkExtractor(allow=['lsbu.ac.uk/business-and-partners/.+']),
>              callback='parse_item', follow=True),]
>
>     def parse_item(self, response):
>         join = Join()
>         sel = Selector(response)
>         bits = sel.xpath('//*')
>         scraped_bits = []
>         for bit in bits:
>             scraped_bit = LsbuItem()
>             scraped_bit['title'] = scraped_bit.xpath('//title/text()').extract()
>             scraped_bit['desc'] = join(bit.xpath('//*[@id="main_content_main_column"]//text()').extract()).strip()
>             scraped_bits.append(scraped_bit)
>
>         return scraped_bits
>
>
> My *settings.py* file
>
> BOT_NAME = 'lsbu6'
> DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
> DUPEFILTER_DEBUG = True
> SPIDER_MODULES = ['lsbu.spiders']
> NEWSPIDER_MODULE = 'lsbu.spiders'
>
>
> Many thanks in advance
> Peter
>
> --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
