I need to scrape posts from a sitemap.xml file. The sitemap.xml file points to other sitemap files, and those secondary sitemap files contain the actual post URLs. So far I have used sitemap_urls = "secondary-sitemap-file" and sitemap_rules for the rule matching the post URLs. How can I use a SitemapSpider that crawls the primary sitemap file, follows the secondary sitemaps, and scrapes the posts?
My spider is below. It works fine with one of the sitemaps that the main sitemap file points to.

    from scrapy.spiders import SitemapSpider  # scrapy.contrib.spiders in older Scrapy releases

    class MySpider(SitemapSpider):
        name = "example"
        allowed_domains = ['www.example.com']
        sitemap_urls = ["http://sitemaps.example.com/post-sitemap1.xml"]
        # Send URLs that look like YYYY/MM/DD/slug to parse_post
        sitemap_rules = [(r'\d{4}/\d{2}/\d{2}/\w+', 'parse_post')]

        def parse_post(self, response):
            item = PostItem()  # the project's own Item class (definition not shown)
            item['url'] = response.url
            return item
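
As far as the Scrapy documentation describes it, SitemapSpider already follows sitemap index files on its own, so sitemap_urls can point at the primary sitemap, and the sitemap_follow attribute (a list of regexes) can limit which secondary sitemaps get followed. The sketch below is not Scrapy code; it only illustrates that index -> secondary sitemap -> rule flow with the standard library so the logic can be checked in isolation. The primary sitemap URL, the FAKE_SITE fixture, and the function name collect_post_urls are all assumptions made up for the illustration.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
LOC_TAG = "{%s}loc" % SITEMAP_NS

# Hypothetical in-memory stand-in for the HTTP responses (not real data)
FAKE_SITE = {
    "http://sitemaps.example.com/sitemap.xml": (
        '<sitemapindex xmlns="%s">'
        "<sitemap><loc>http://sitemaps.example.com/post-sitemap1.xml</loc></sitemap>"
        "<sitemap><loc>http://sitemaps.example.com/page-sitemap.xml</loc></sitemap>"
        "</sitemapindex>" % SITEMAP_NS
    ),
    "http://sitemaps.example.com/post-sitemap1.xml": (
        '<urlset xmlns="%s">'
        "<url><loc>http://www.example.com/2014/05/12/hello_world</loc></url>"
        "<url><loc>http://www.example.com/about</loc></url>"
        "</urlset>" % SITEMAP_NS
    ),
    "http://sitemaps.example.com/page-sitemap.xml": (
        '<urlset xmlns="%s">'
        "<url><loc>http://www.example.com/contact</loc></url>"
        "</urlset>" % SITEMAP_NS
    ),
}


def collect_post_urls(start_url, fetch, follow_pattern, rule_pattern):
    """Walk a sitemap index and return the page URLs matching rule_pattern.

    fetch is any callable mapping a sitemap URL to its XML text.
    follow_pattern plays the role of sitemap_follow, rule_pattern the
    role of the regex in sitemap_rules.
    """
    follow_re = re.compile(follow_pattern)
    rule_re = re.compile(rule_pattern)
    pending = [start_url]
    seen = set()
    posts = []
    while pending:
        url = pending.pop(0)
        if url in seen:
            continue
        seen.add(url)
        root = ET.fromstring(fetch(url))
        locs = [el.text.strip() for el in root.iter(LOC_TAG) if el.text]
        if root.tag.endswith("}sitemapindex"):
            # Index file: queue only the secondary sitemaps we want to follow
            pending.extend(loc for loc in locs if follow_re.search(loc))
        else:
            # Regular sitemap: keep only URLs that match the post rule
            posts.extend(loc for loc in locs if rule_re.search(loc))
    return posts


if __name__ == "__main__":
    found = collect_post_urls(
        "http://sitemaps.example.com/sitemap.xml",
        FAKE_SITE.__getitem__,
        r"post-sitemap",
        r"\d{4}/\d{2}/\d{2}/\w+",
    )
    print(found)
```

In the spider itself, the equivalent change would presumably be to set sitemap_urls to the primary sitemap and, if only the post sitemaps matter, add something like sitemap_follow = ['post-sitemap'] while keeping the existing sitemap_rules.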