Hello,

I'm trying to log crawled paths to the meta attribute *req_path*:

import scrapy
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["www.iana.org"]
    start_urls = ['http://www.iana.org/']
    request_path_css = dict(
        main_menu = r'#home-panel-domains > h2',
        domain_names = r'#main_right > p',
    )


    def links(self, response, restrict_css=None):
        lex = LinkExtractor(
            allow_domains=self.allowed_domains,
            restrict_css=restrict_css)
        return lex.extract_links(response)


    def requests(self, response, css, cb, append=True):
        links = self.links(response, css)
        for link in links:
            request = scrapy.Request(
                url=link.url,
                callback=cb)
            if append:
                request.meta['req_path'] = response.meta['req_path']
                request.meta['req_path'].append(dict(txt=link.text, url=link.url))
            else:
                request.meta['req_path'] = [dict(txt=link.text, url=link.url)]
            yield request


    def parse(self, response):
        #self.logger.warn('## Request path: %s', response.meta['req_path'])
        css = self.request_path_css['main_menu']
        return self.requests(response, css, self.domain_names, False)


    def domain_names(self, response):
        #self.logger.warn('## Request path: %s', response.meta['req_path'])
        css = self.request_path_css['domain_names']
        return self.requests(response, css, self.domain_names_parser)


    def domain_names_parser(self, response):
        self.logger.warn('## Request path: %s', response.meta['req_path'])


Output:

$ scrapy crawl -L WARN example
2017-02-13 11:06:37 [example] WARNING: ## Request path: [{'url': 
'http://www.iana.org/domains', 'txt': 'Domain Names'}, {'url': 
'http://www.iana.org/domains/root', 'txt': 'The DNS Root Zone'}, {'url': 
'http://www.iana.org/domains/int', 'txt': '.INT'}, {'url': 
'http://www.iana.org/domains/arpa', 'txt': '.ARPA'}, {'url': 
'http://www.iana.org/domains/idn-tables', 'txt': 'IDN Practices 
Repository'}, {'url': 'http://www.iana.org/dnssec', 'txt': 'Root Key 
Signing Key'}, {'url': 'http://www.iana.org/domains/special', 'txt': 
'Special Purpose Domains'}]
[... the same list is logged five more times, once for each of the six domain_names_parser callbacks ...]


This is not what I expected: each request should carry only its own link as the last entry in *response.meta['req_path']*, yet all the URLs extracted from the last page have somehow found their way into every list.

In other words, the expected output is something like:

[{'url': 'http://www.iana.org/domains', 'txt': 'Domain Names'}, {'url': 'http://www.iana.org/domains/root', 'txt': 'The DNS Root Zone'}]
[{'url': 'http://www.iana.org/domains', 'txt': 'Domain Names'}, {'url': 'http://www.iana.org/domains/int', 'txt': '.INT'}]
[{'url': 'http://www.iana.org/domains', 'txt': 'Domain Names'}, {'url': 'http://www.iana.org/domains/arpa', 'txt': '.ARPA'}]
[{'url': 'http://www.iana.org/domains', 'txt': 'Domain Names'}, {'url': 'http://www.iana.org/domains/idn-tables', 'txt': 'IDN Practices Repository'}]
[{'url': 'http://www.iana.org/domains', 'txt': 'Domain Names'}, {'url': 'http://www.iana.org/dnssec', 'txt': 'Root Key Signing Key'}]
[{'url': 'http://www.iana.org/domains', 'txt': 'Domain Names'}, {'url': 'http://www.iana.org/domains/special', 'txt': 'Special Purpose Domains'}]
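
For what it's worth, here is a stripped-down, Scrapy-free sketch of how I pass *req_path* along in the append branch, in case the problem is in my plain Python rather than in Scrapy. Plain dicts stand in for requests and the link names are shortened, so this is only an analogy of my code, not the spider itself:

```python
# Parent "response" carrying the path built so far, as in my append branch.
parent_meta = {'req_path': [dict(txt='Domain Names',
                                 url='http://www.iana.org/domains')]}

child_requests = []
for name in ['root', 'int', 'arpa']:
    meta = {}
    # Mirrors: request.meta['req_path'] = response.meta['req_path']
    meta['req_path'] = parent_meta['req_path']
    # Mirrors: request.meta['req_path'].append(...)
    meta['req_path'].append(dict(txt=name,
                                 url='http://www.iana.org/domains/' + name))
    child_requests.append(meta)

# Every child's req_path is the same 4-element list, not three 2-element lists.
print(child_requests[0]['req_path'])
```

Running this shows the same pile-up I see in Scrapy, so perhaps the assignment is sharing one list between requests rather than giving each its own copy.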


I realize I don't quite understand how Scrapy works here, as I believe my Python logic itself is correct.


Please advise.


Narunas

-- 
You received this message because you are subscribed to the Google Groups "scrapy-users" group.