Hi Mustafa, see my answer on StackOverflow: http://stackoverflow.com/a/24897046/2572383

The issue is with the 'catinfo\.asp\?brw' deny pattern: LinkExtractor matches its allow/deny patterns against the canonicalized URL, and canonicalization sorts the query arguments, so 'brw' does not necessarily sit right after the '?' anymore.
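You can check this yourself with w3lib (a quick sketch; the 'act' query parameter is a made-up example, not taken from the actual site):

    import re
    from w3lib.url import canonicalize_url

    # A link as it might appear on the page ('act' is a hypothetical
    # second query parameter, added for illustration only).
    url = "http://www.nautilusconcept.com/catinfo.asp?brw=1&act=list"

    # LinkExtractor canonicalizes each link before applying the
    # allow/deny patterns, and canonicalize_url() sorts the query
    # arguments alphabetically, so 'brw' moves away from the '?'.
    canonical = canonicalize_url(url)
    print(canonical)
    # http://www.nautilusconcept.com/catinfo.asp?act=list&brw=1

    print(re.search(r'catinfo\.asp\?brw', canonical))    # None -> link NOT denied
    print(re.search(r'catinfo\.asp\?.*brw', canonical))  # match -> link denied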
If you use 'catinfo\.asp\?.*brw' you should get what you want, since the '.*' lets the pattern match wherever 'brw' ends up in the sorted query string.

Hope this helps.
Paul.

On Tuesday, July 22, 2014 9:48:21 PM UTC+2, Mustafa Hastürk wrote:
>
> I have written the following spider to crawl nautilusconcept.com
> <http://www.nautilusconcept.com/>. The site's category structure is quite
> bad, so I had to apply a rule that parses every link with a callback; I
> then decide which URLs to actually parse with an if statement inside the
> parse_item method. Even so, the spider doesn't obey my deny rules and
> still tries to crawl links containing (?brw...).
>
> Here is my spider:
>
> from scrapy.contrib.linkextractors import LinkExtractor
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from scrapy.selector import Selector
> from vitrinbot.items import ProductItem
> from vitrinbot.base import utils
> import hashlib
>
> removeCurrency = utils.removeCurrency
> getCurrency = utils.getCurrency
>
> class NautilusSpider(CrawlSpider):
>     name = 'nautilus'
>     allowed_domains = ['nautilusconcept.com']
>     start_urls = ['http://www.nautilusconcept.com/']
>     xml_filename = 'nautilus-%d.xml'
>     xpaths = {
>         'category': '//tr[@class="KategoriYazdirTabloTr"]//a/text()',
>         'title': '//h1[@class="UrunBilgisiUrunAdi"]/text()',
>         'price': '//hemenalfiyat/text()',
>         'images': '//td[@class="UrunBilgisiUrunResimSlaytTd"]//div/a/@href',
>         'description': '//td[@class="UrunBilgisiUrunBilgiIcerikTd"]//*/text()',
>         'currency': '//*[@id="UrunBilgisiUrunFiyatiDiv"]/text()',
>         'check_page': '//div[@class="ayrinti"]',
>     }
>
>     rules = (
>         Rule(
>             LinkExtractor(
>                 allow=('com/[\w_]+',),
>                 deny=('asp$',
>                       'login\.asp',
>                       'hakkimizda\.asp',
>                       'musteri_hizmetleri\.asp',
>                       'iletisim_formu\.asp',
>                       'yardim\.asp',
>                       'sepet\.asp',
>                       'catinfo\.asp\?brw',
>                 ),
>             ),
>             callback='parse_item',
>             follow=True
>         ),
>     )
>
>     def parse_item(self, response):
>         i = ProductItem()
>         sl = Selector(response=response)
>
>         if not sl.xpath(self.xpaths['check_page']):
>             return i
>
>         i['id'] = hashlib.md5(response.url.encode('utf-8')).hexdigest()
>         i['url'] = response.url
>         i['category'] = " > ".join(sl.xpath(self.xpaths['category']).extract()[1:-1])
>         i['title'] = sl.xpath(self.xpaths['title']).extract()[0].strip()
>         i['special_price'] = i['price'] = sl.xpath(self.xpaths['price']).extract()[0].strip().replace(',', '.')
>
>         images = []
>         for img in sl.xpath(self.xpaths['images']).extract():
>             images.append("http://www.nautilusconcept.com/" + img)
>         i['images'] = images
>
>         i['description'] = (" ".
> ...