I’m having trouble getting an item pipeline to correctly drop items whose 
email field contains these strings: jpg, png.

The spider itself is really straightforward. I have a long list of start 
URLs and just 2 item fields: url (which is response.url) and email.

For some reason I’m getting image filenames that contain an ‘@’ sign, for 
example: im...@24x24.png or something like that…

Here’s the line where I scrape only emails:

emails = response.xpath('//text()').re(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")

Is there any better way of grabbing only emails? I’m not good with regex at 
all.
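One possible approach, sketched under assumptions: the pattern above also matches strings such as icon@24x24.png because "24x24" looks like a domain and "png" looks like a TLD. Rather than making the regex stricter, the matches can be filtered afterwards by suffix. The names EMAIL_RE, IMAGE_EXTENSIONS and extract_emails are hypothetical, and the list of extensions is only a guess at what needs excluding.

```python
import re

# Pattern copied unchanged from the post.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")

# Hypothetical list of suffixes to treat as image filenames, not addresses.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def extract_emails(text):
    """Return email-like matches from text, skipping image filenames."""
    matches = EMAIL_RE.findall(text)
    # str.endswith accepts a tuple, so every extension is checked in one call.
    return [m for m in matches if not m.lower().endswith(IMAGE_EXTENSIONS)]
```

In the spider, the same list comprehension could be applied to the list returned by .re(...) before it is passed to add_value.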

Here are my settings:

ITEM_PIPELINES = {
    'get_emails.pipelines.GetEmailsPipeline': 300,
}

And here is the pipeline:

from scrapy.exceptions import DropItem
class GetEmailsPipeline(object):
    def process_item(self, item, spider):
        if item:
            if '.jpg' or '.png' in item['email']:
                raise DropItem("Found image in %s" % item)
        else:
            return item
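A likely cause, with a hedged sketch of a fix: Python parses '.jpg' or '.png' in item['email'] as '.jpg' or ('.png' in item['email']), and the non-empty string '.jpg' is always truthy, so the condition is true for every item. Two further points: if the email field is a list (as ItemLoader output usually is), "in" tests for an equal element rather than a substring, and return item sits in the else branch of "if item", so a non-empty item that is not dropped is never returned. The version below assumes the email field is either a list of strings or a single string; IMAGE_EXTENSIONS is a hypothetical name, and the try/except fallback exists only so the sketch runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in for illustration only, so the sketch runs without Scrapy.
    class DropItem(Exception):
        pass

# Hypothetical list of substrings that mark a match as an image filename.
IMAGE_EXTENSIONS = (".jpg", ".png")


class GetEmailsPipeline(object):
    def process_item(self, item, spider):
        # The field may be a list (ItemLoader output) or a single string.
        emails = item.get("email") or []
        if isinstance(emails, str):
            emails = [emails]
        # Test each extension separately against each address.
        if any(ext in email.lower()
               for email in emails
               for ext in IMAGE_EXTENSIONS):
            raise DropItem("Found image in %s" % item)
        # Return the item on every path that does not drop it.
        return item
```

This keeps the original behaviour of dropping the whole item when any address looks like an image; filtering out only the offending entries would be an alternative design.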

And here’s the parse function, which is basically the whole spider:

def parse(self, response):
        l = ItemLoader(item=GetEmailsItem(), response=response)
        l.default_output_processor = MapCompose(lambda v: v.strip(), replace_escape_chars)

        emails = response.xpath('//text()').re(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")

        l.add_value('email', emails)
        l.add_value('url', response.url)

        return l.load_item()

The output I’m getting is that for every single URL Scrapy visits, it 
finds an image and drops the item, which is not the case when I run it 
without the pipeline.

Any help is appreciated.
