Re: Code to remove text from Scrapy output

Steven Almeroth Sun, 17 Jan 2016 12:48:06 -0800

If you want to get all categories in "tag", you can remove the "take-first" 
predicate, [1], from the XPath and join the extracted strings.  If you want 
to ignore all markup between two (comment) tags, then you might want to do 
that with Python, not XPath.  Also, CrawlSpider was moved out of "contrib" 
in Scrapy; same with the link extractors.


from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from isbullshit.items import IsBullshitItem

class IsBullshitSpider(CrawlSpider):
    name = 'isbullshit'
    start_urls = ['http://sample.com']
    rules = (
        Rule(LinkExtractor(allow=r'page/\d+')),
        Rule(LinkExtractor(allow=r'\w+'), callback='parse_blogpost'),
    )

    def parse_blogpost(self, response):
        item = IsBullshitItem()
        item['title'] = response.xpath(
            '//h2[@class="post-title entry-title"]/text()').extract_first()
        # No [1] predicate on li, so every category matches; join them all.
        item['tag'] = ','.join(response.xpath(
            '//ul[@class="post-categories"]/li/a/text()').extract())
        item['article_html'] = response.xpath(
            "//div[@class='entry clearfix']").extract_first()

        return item
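
For the part about dropping the markup between the two comments, here is a 
rough sketch in plain Python using a regular expression.  It assumes the 
markers are literally <!--start comment--> and <!--end comment--> as in your 
sample, and the helper name strip_between_markers is just something I made 
up for illustration, not anything from Scrapy.

```python
import re

# Matches the start marker, everything up to the nearest end marker
# (non-greedy, across newlines), and the end marker itself.
# Assumption: the markers appear exactly as in the sample HTML.
BETWEEN_MARKERS = re.compile(
    r'(<!--start comment-->).*?(<!--end comment-->)', re.DOTALL)


def strip_between_markers(html):
    """Remove the markup between the two comments, keeping the comments."""
    return BETWEEN_MARKERS.sub(r'\1\2', html)
```

In parse_blogpost you could then wrap the extracted string, e.g. 
item['article_html'] = strip_between_markers(extracted_html), after checking 
that the extraction did not return None.  A regex is fine for fixed markers 
like these, but it is not a general HTML parser.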


On Wednesday, December 9, 2015 at 3:24:18 AM UTC-7, VR Tech wrote:
>
> Below is a sample piece of HTML code that I want to scrape with scrapy.
>
>
> <body><h2 class="post-title entry-title">Sample Header</h2>
>     <div class="entry clearfix">
>         <div class="sample1">
>             <p>Hello</p>
>         </div>
>         <!--start comment-->
>         <div class="sample2">
>             <p>World</p>
>         </div>
>         <!--end comment-->
>     </div><ul class="post-categories"><li><a 
> href="123.html">Category1</a></li><li><a 
> href="456.html">Category2</a></li><li><a 
> href="789.html">Category3</a></li></ul></body>
>
>
>
> Right now I am using the below working scrapy code:
>
>
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
> from scrapy.selector import HtmlXPathSelector
> from isbullshit.items import IsBullshitItem
>
> class IsBullshitSpider(CrawlSpider):
>     name = 'isbullshit'
>     start_urls = ['http://sample.com']
>     rules = [Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True), 
>         Rule(SgmlLinkExtractor(allow=[r'\w+']), callback='parse_blogpost')]
>
>     def parse_blogpost(self, response):
>         hxs = HtmlXPathSelector(response)
>         item = IsBullshitItem()
>         item['title'] = hxs.select(
>             '//h2[@class="post-title entry-title"]/text()').extract()[0]
>         item['tag'] = hxs.select(
>             '//ul[@class="post-categories"]/li[1]/a/text()').extract()[0]
>         item['article_html'] = hxs.select(
>             "//div[@class='entry clearfix']").extract()[0]
>         return item
>
>
>
> It gives me the following xml output:
>
>
> <?xml version="1.0" encoding="utf-8"?><items>
>     <item>
>
>         <article_html>
>         <div class="entry clearfix">
>         <div class="sample1">
>             <p>Hello</p>
>         </div>
>         <!--start comment-->
>         <div class="sample2">
>             <p>World</p>
>         </div>
>         <!--end comment-->
>         </div>      
>         </article_html>
>
>         <tag>
>         Category1
>         </tag>
>
>         <title>
>         Sample Header
>         </title>
>
>     </item></items>
>
>
>
> I want to know how to achieve the following output:
>
>
> <?xml version="1.0" encoding="utf-8"?><items>
>     <item>
>
>         <article_html>
>         <div class="entry clearfix">
>         <div class="sample1">
>             <p>Hello</p>
>         </div>
>         <!--start comment-->
>         <!--end comment-->
>         </div>      
>         </article_html>
>
>         <tag>
>         Category1,Category2,Category3
>         </tag>
>
>         <title>
>         Sample Header
>         </title>
>
>     </item></items>
>
>
> Note: The number of categories depends on the post. In the above example, 
> there are 3 categories. There could be more or less.
>
> Help would be much appreciated. Cheers.
>
