I need to scrape posts that are listed in a sitemap.xml file. That
sitemap.xml file points to other sitemap files, and those secondary sitemap
files contain the actual post URLs. So far I have set sitemap_urls to one of
the secondary sitemap files and used sitemap_rules to match the post URLs.
How can I use SitemapSpider to crawl the primary sitemap file, follow the
secondary sitemaps, and scrape the posts?

My spider is below. It works fine with one of the sitemaps that the main
sitemap file points to.

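# Scrapy >= 1.0 import path; older releases use scrapy.contrib.spiders.
from scrapy.spiders import SitemapSpider

# PostItem comes from the project's items module (not shown here).


class MySpider(SitemapSpider):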
    name = "example"
    allowed_domains = ['www.example.com']

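    # One of the secondary sitemaps that the main sitemap file points to.
    sitemap_urls = ["http://sitemaps.example.com/post-sitemap1.xml"]
    # Raw string so the regex backslashes are not treated as string escapes.
    sitemap_rules = [(r'\d{4}/\d{2}/\d{2}/\w+', 'parse_post')]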

    def parse_post(self, response):
        item = PostItem()
        item['url'] = response.url
        return item
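
To make the layout concrete, here is a rough sketch of the traversal I would
like the spider to perform, written with the standard library only and no
Scrapy. The two XML snippets, the page-sitemap.xml entry, and the post URLs
in them are made up for illustration; only post-sitemap1.xml and the regex
come from my spider above. Treat it as a description of the desired
behaviour, not as a working crawler.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical primary sitemap: an index that only lists other sitemaps.
INDEX_XML = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://sitemaps.example.com/post-sitemap1.xml</loc></sitemap>
  <sitemap><loc>http://sitemaps.example.com/page-sitemap.xml</loc></sitemap>
</sitemapindex>"""

# Hypothetical secondary sitemap: the file that lists the actual posts.
POST_SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/2015/03/14/some_post</loc></url>
  <url><loc>http://www.example.com/about</loc></url>
</urlset>"""

# Same pattern as in sitemap_rules above.
POST_RE = re.compile(r"\d{4}/\d{2}/\d{2}/\w+")


def locs(xml_text, parent_tag):
    """Return the <loc> values found under each <parent_tag> element."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.findall(f"sm:{parent_tag}/sm:loc", NS)]


# Step 1: the primary file yields the secondary sitemap URLs.
child_sitemaps = locs(INDEX_XML, "sitemap")

# Step 2: a secondary file yields page URLs; keep only those matching the rule.
post_urls = [u for u in locs(POST_SITEMAP_XML, "url") if POST_RE.search(u)]

print(child_sitemaps)
print(post_urls)
```

In other words, I want step 1 to start from the primary sitemap rather than
from a single secondary file, and step 2 to run for every secondary sitemap
that contains posts.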
