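If you want all categories in "tag", remove the "take-first" predicate [1] from the XPath and use extract() rather than only the first result; you can then join the resulting list with commas. If you want to ignore all markup between the two comment tags, you are probably better off doing that in Python rather than XPath. Also, CrawlSpider has moved out of "contrib" in current Scrapy (it is now in scrapy.spiders), and the same goes for the link extractors (scrapy.linkextractors).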
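As a rough sketch of both points (all categories, and dropping the markup between the comment tags), something like the following could work once you have the extracted strings. It uses only the standard library. The helper names strip_between_comments and join_categories are made up for illustration, and the two marker strings are assumed to match the sample HTML in the question exactly.

```python
import re

# Assumed marker strings, copied from the sample HTML in the question.
START = "<!--start comment-->"
END = "<!--end comment-->"

# Non-greedy match of everything from the start marker to the end marker,
# including newlines (re.DOTALL).
_BETWEEN = re.compile(re.escape(START) + r".*?" + re.escape(END), re.DOTALL)


def strip_between_comments(html):
    """Drop the markup between the two markers but keep the markers."""
    return _BETWEEN.sub(lambda match: START + "\n" + END, html)


def join_categories(texts):
    """Join a list of strings (e.g. from .extract()) with commas."""
    return ",".join(t.strip() for t in texts if t.strip())


sample = """<div class="entry clearfix">
<div class="sample1">
<p>Hello</p>
</div>
<!--start comment-->
<div class="sample2">
<p>World</p>
</div>
<!--end comment-->
</div>"""

cleaned = strip_between_comments(sample)
tags = join_categories(["Category1", "Category2", "Category3"])
print(cleaned)
print(tags)
```

In parse_blogpost you would pass the list returned by .extract() on the category XPath (without the [1]) to join_categories, and the extracted article HTML string to strip_between_comments. A regex is fine for fixed markers like these, but it will not cope with nested or missing markers.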

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from isbullshit.items import IsBullshitItem


class IsBullshitSpider(CrawlSpider):
    name = 'isbullshit'
    start_urls = ['http://sample.com']
    rules = (
        # No callback, so these pagination links are followed by default.
        Rule(LinkExtractor(allow=r'page/\d+')),
        Rule(LinkExtractor(allow=r'\w+'), callback='parse_blogpost'),
    )

    def parse_blogpost(self, response):
        item = IsBullshitItem()
        item['title'] = response.xpath('//h2[@class="post-title entry-title"]/text()').extract_first()
        # No [1] predicate and extract() instead of extract_first(): all categories, comma-separated.
        item['tag'] = ','.join(response.xpath('//ul[@class="post-categories"]/li/a/text()').extract())
        # Still contains the markup between the comment tags; strip that in Python afterwards.
        item['article_html'] = response.xpath("//div[@class='entry clearfix']").extract_first()
        return item

On Wednesday, December 9, 2015 at 3:24:18 AM UTC-7, VR Tech wrote:
>
> Below is a sample piece of HTML code that I want to scrape with scrapy.
>
> <body><h2 class="post-title entry-title">Sample Header</h2>
> <div class="entry clearfix">
> <div class="sample1">
> <p>Hello</p>
> </div>
> <!--start comment-->
> <div class="sample2">
> <p>World</p>
> </div>
> <!--end comment-->
> </div><ul class="post-categories"><li><a href="123.html">Category1</a></li><li><a href="456.html">Category2</a></li><li><a href="789.html">Category3</a></li></ul></body>
>
> Right now I am using the below working scrapy code:
>
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
> from scrapy.selector import HtmlXPathSelector
> from isbullshit.items import IsBullshitItem
>
> class IsBullshitSpider(CrawlSpider):
>     name = 'isbullshit'
>     start_urls = ['http://sample.com']
>     rules = [Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True),
>              Rule(SgmlLinkExtractor(allow=[r'\w+']), callback='parse_blogpost')]
>
>     def parse_blogpost(self, response):
>         hxs = HtmlXPathSelector(response)
>         item = IsBullshitItem()
>         item['title'] = hxs.select('//h2[@class="post-title entry-title"]/text()').extract()[0]
>         item['tag'] = hxs.select('//ul[@class="post-categories"]/li[1]/a/text()').extract()[0]
>         item['article_html'] = hxs.select("//div[@class='entry clearfix']").extract()[0]
>         return item
>
> It gives me the following xml output:
>
> <?xml version="1.0" encoding="utf-8"?><items>
> <item>
>
> <article_html>
> <div class="entry clearfix">
> <div class="sample1">
> <p>Hello</p>
> </div>
> <!--start comment-->
> <div class="sample2">
> <p>World</p>
> </div>
> <!--end comment-->
> </div>
> </article_html>
>
> <tag>
> Category1
> </tag>
>
> <title>
> Sample Header
> </title>
>
> </item></items>
>
> I want to know how to achieve the following output:
>
> <?xml version="1.0" encoding="utf-8"?><items>
> <item>
>
> <article_html>
> <div class="entry clearfix">
> <div class="sample1">
> <p>Hello</p>
> </div>
> <!--start comment-->
> <!--end comment-->
> </div>
> </article_html>
>
> <tag>
> Category1,Category2,Category3
> </tag>
>
> <title>
> Sample Header
> </title>
>
> </item></items>
>
> Note: The number of categories depends on the post. In the above example,
> there are 3 categories. There could be more or less.
>
> Help would be much appreciated. Cheers.