I'm having trouble getting an item pipeline to correctly drop items that contain these strings: jpg, png.
The spider itself is really straightforward. I have a long list of start URLs, and just 2 item fields: url (which is response.url) and email. For some reason I'm getting image file names that contain an '@' sign, for example: im...@24x24.png or something like that.

Here's the line where I scrape only emails:

    emails = response.xpath('//text()').re(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")

Is there a better way of grabbing only emails? I'm not good with regex at all.

Here are my settings:

    ITEM_PIPELINES = {
        'get_emails.pipelines.GetEmailsPipeline': 300,
    }

And here is the pipeline:

    from scrapy.exceptions import DropItem

    class GetEmailsPipeline(object):
        def process_item(self, item, spider):
            if item:
                if '.jpg' or '.png' in item['email']:
                    raise DropItem("Found image in %s" % item)
                else:
                    return item

And here's the parse function, which is basically the whole spider:

    def parse(self, response):
        l = ItemLoader(item=GetEmailsItem(), response=response)
        l.default_output_processor = MapCompose(lambda v: v.strip(), replace_escape_chars)
        emails = response.xpath('//text()').re(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")
        l.add_value('email', emails)
        l.add_value('url', response.url)
        return l.load_item()

With the pipeline enabled, Scrapy reports finding an image and drops the item for every single URL it visits, which is not the case when I run it without the pipeline. Any help is appreciated.
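A note on the likely cause: Python parses `'.jpg' or '.png' in item['email']` as `'.jpg' or ('.png' in item['email'])`, and a non-empty string such as `'.jpg'` is always truthy, so the condition holds for every item. The following is a minimal sketch of one possible fix, not a definitive implementation. It assumes that `item['email']` is a list of strings (which is what an ItemLoader with a `MapCompose` output processor would normally produce). The names `IMAGE_SUFFIXES` and `looks_like_image` are hypothetical helpers introduced here for illustration, and the `DropItem` stand-in exists only so the sketch runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Stand-in so this sketch can run without Scrapy installed."""

# Assumed list of suffixes to reject; extend as needed (hypothetical name).
IMAGE_SUFFIXES = ('.jpg', '.jpeg', '.png', '.gif')


def looks_like_image(value):
    """Return True if a regex match ends like an image file name."""
    return value.lower().endswith(IMAGE_SUFFIXES)


class GetEmailsPipeline(object):
    def process_item(self, item, spider):
        # Assumption: 'email' holds a list of strings; tolerate a bare string.
        emails = item.get('email') or []
        if isinstance(emails, str):
            emails = [emails]

        # Test each match separately instead of testing the whole list.
        kept = [e for e in emails if not looks_like_image(e)]
        if not kept:
            raise DropItem("No real email addresses in %s" % item)

        item['email'] = kept
        return item
```

This sketch strips the image-like matches and drops the item only when no address remains. If the whole item should be dropped whenever any match looks like an image, `any(looks_like_image(e) for e in emails)` would be the condition to raise on instead. Checking the suffix with `endswith` rather than `in` is also a design choice: it avoids rejecting an address whose local part merely contains `.png`.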