Hi Mustafa, see my answer on StackOverflow: http://stackoverflow.com/a/24897046/2572383

The issue is with the 'catinfo\.asp\?brw' deny pattern: LinkExtractor matches its allow/deny patterns against the canonicalized URL, and canonicalization sorts the query arguments, so 'brw' does not necessarily sit right after the '?' anymore.
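You can check this yourself with w3lib (a quick sketch; the 'act' query parameter is a made-up example, not taken from the actual site):

    import re
    from w3lib.url import canonicalize_url

    # A link as it might appear on the page ('act' is a hypothetical
    # second query parameter, added for illustration only).
    url = "http://www.nautilusconcept.com/catinfo.asp?brw=1&act=list"

    # LinkExtractor canonicalizes each link before applying the
    # allow/deny patterns, and canonicalize_url() sorts the query
    # arguments alphabetically, so 'brw' moves away from the '?'.
    canonical = canonicalize_url(url)
    print(canonical)
    # http://www.nautilusconcept.com/catinfo.asp?act=list&brw=1

    print(re.search(r'catinfo\.asp\?brw', canonical))    # None -> link NOT denied
    print(re.search(r'catinfo\.asp\?.*brw', canonical))  # match -> link denied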
If you use 'catinfo\.asp\?.*brw' you should get what you want, since the '.*' lets the pattern match wherever 'brw' ends up in the sorted query string.

Hope this helps.
Paul.

On Tuesday, July 22, 2014 9:48:21 PM UTC+2, Mustafa Hastürk wrote:
>
> I have written the following spider to crawl nautilusconcept.com
> <http://www.nautilusconcept.com/>. The site's category structure is quite
> bad, so I had to apply a rule that parses every link with a callback; I
> then decide which URLs to actually parse with an if statement inside the
> parse_item method. Even so, the spider doesn't obey my deny rules and
> still tries to crawl links containing (?brw...).
>
> Here is my spider:
>
> from scrapy.contrib.linkextractors import LinkExtractor
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from scrapy.selector import Selector
> from vitrinbot.items import ProductItem
> from vitrinbot.base import utils
> import hashlib
>
> removeCurrency = utils.removeCurrency
> getCurrency = utils.getCurrency
>
> class NautilusSpider(CrawlSpider):
>     name = 'nautilus'
>     allowed_domains = ['nautilusconcept.com']
>     start_urls = ['http://www.nautilusconcept.com/']
>     xml_filename = 'nautilus-%d.xml'
>     xpaths = {
>         'category': '//tr[@class="KategoriYazdirTabloTr"]//a/text()',
>         'title': '//h1[@class="UrunBilgisiUrunAdi"]/text()',
>         'price': '//hemenalfiyat/text()',
>         'images': '//td[@class="UrunBilgisiUrunResimSlaytTd"]//div/a/@href',
>         'description': '//td[@class="UrunBilgisiUrunBilgiIcerikTd"]//*/text()',
>         'currency': '//*[@id="UrunBilgisiUrunFiyatiDiv"]/text()',
>         'check_page': '//div[@class="ayrinti"]',
>     }
>
>     rules = (
>         Rule(
>             LinkExtractor(
>                 allow=('com/[\w_]+',),
>                 deny=('asp$',
>                       'login\.asp',
>                       'hakkimizda\.asp',
>                       'musteri_hizmetleri\.asp',
>                       'iletisim_formu\.asp',
>                       'yardim\.asp',
>                       'sepet\.asp',
>                       'catinfo\.asp\?brw',
>                 ),
>             ),
>             callback='parse_item',
>             follow=True
>         ),
>     )
>
>     def parse_item(self, response):
>         i = ProductItem()
>         sl = Selector(response=response)
>
>         if not sl.xpath(self.xpaths['check_page']):
>             return i
>
>         i['id'] = hashlib.md5(response.url.encode('utf-8')).hexdigest()
>         i['url'] = response.url
>         i['category'] = " > ".join(sl.xpath(self.xpaths['category']).extract()[1:-1])
>         i['title'] = sl.xpath(self.xpaths['title']).extract()[0].strip()
>         i['special_price'] = i['price'] = sl.xpath(self.xpaths['price']).extract()[0].strip().replace(',', '.')
>
>         images = []
>         for img in sl.xpath(self.xpaths['images']).extract():
>             images.append("http://www.nautilusconcept.com/" + img)
>         i['images'] = images
>
>         i['description'] = (" ".
> ...