Below is a sample piece of HTML code that I want to scrape with scrapy.


<body>
  <h2 class="post-title entry-title">Sample Header</h2>
  <div class="entry clearfix">
    <div class="sample1">
      <p>Hello</p>
    </div>
    <!--start comment-->
    <div class="sample2">
      <p>World</p>
    </div>
    <!--end comment-->
  </div>
  <ul class="post-categories">
    <li><a href="123.html">Category1</a></li>
    <li><a href="456.html">Category2</a></li>
    <li><a href="789.html">Category3</a></li>
  </ul>
</body>


Right now I am using the following Scrapy code, which works:


from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from isbullshit.items import IsBullshitItem


class IsBullshitSpider(CrawlSpider):
    name = 'isbullshit'
    start_urls = ['http://sample.com']
    rules = [
        Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True),
        Rule(SgmlLinkExtractor(allow=[r'\w+']), callback='parse_blogpost'),
    ]

    def parse_blogpost(self, response):
        hxs = HtmlXPathSelector(response)
        item = IsBullshitItem()
        item['title'] = hxs.select('//h2[@class="post-title entry-title"]/text()').extract()[0]
        item['tag'] = hxs.select('//ul[@class="post-categories"]/li[1]/a/text()').extract()[0]
        item['article_html'] = hxs.select("//div[@class='entry clearfix']").extract()[0]
        return item


It gives me the following XML output:


<?xml version="1.0" encoding="utf-8"?>
<items>
  <item>
    <article_html>
      <div class="entry clearfix">
        <div class="sample1">
          <p>Hello</p>
        </div>
        <!--start comment-->
        <div class="sample2">
          <p>World</p>
        </div>
        <!--end comment-->
      </div>
    </article_html>
    <tag>Category1</tag>
    <title>Sample Header</title>
  </item>
</items>


I want to know how to achieve the following output:


<?xml version="1.0" encoding="utf-8"?>
<items>
  <item>
    <article_html>
      <div class="entry clearfix">
        <div class="sample1">
          <p>Hello</p>
        </div>
        <!--start comment-->
        <!--end comment-->
      </div>
    </article_html>
    <tag>Category1,Category2,Category3</tag>
    <title>Sample Header</title>
  </item>
</items>


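Note: The number of categories depends on the post. The example above has 
3 categories, but a post could have more or fewer. In other words, the tag 
field should contain every category joined by commas, and article_html 
should drop whatever sits between the start and end comments.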
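
To make the two requested changes concrete, here is a rough sketch of one possible approach using only the Python standard library. It assumes the marker comments are always spelled exactly `<!--start comment-->` and `<!--end comment-->`, and that post-processing the extracted strings is acceptable. The helper names `strip_between_comments` and `join_categories` are hypothetical, not part of Scrapy.

```python
import re

# Matches from the start marker through the end marker, across newlines.
# Non-greedy, so each start/end pair is handled separately.
COMMENT_BLOCK = re.compile(r'<!--start comment-->.*?<!--end comment-->', re.DOTALL)


def strip_between_comments(html):
    """Drop the markup between the two marker comments, keeping the comments."""
    return COMMENT_BLOCK.sub('<!--start comment--> <!--end comment-->', html)


def join_categories(names):
    """Join category names with commas, skipping empty or whitespace-only entries."""
    return ','.join(name.strip() for name in names if name.strip())


# Hypothetical use inside parse_blogpost (an assumption, not run here):
#   names = hxs.select('//ul[@class="post-categories"]/li/a/text()').extract()
#   item['tag'] = join_categories(names)
#   raw = hxs.select("//div[@class='entry clearfix']").extract()[0]
#   item['article_html'] = strip_between_comments(raw)

sample = ('<div class="entry clearfix"> <div class="sample1"> <p>Hello</p> </div> '
          '<!--start comment--> <div class="sample2"> <p>World</p> </div> '
          '<!--end comment--> </div>')
print(strip_between_comments(sample))
print(join_categories(['Category1', 'Category2', 'Category3']))
```

The key difference from the current spider would be selecting `li/a/text()` rather than `li[1]/a/text()`, so that `extract()` returns every category instead of only the first. A regular expression is used for the comment block only because the markers are fixed strings; if the markup varies, this would not be robust.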

Help would be much appreciated. Cheers.
