Re: Delete node childs

Anto Sun, 03 May 2015 13:49:21 -0700

Hi

I have tried the following but I get errors:


    #description = response.xpath('//div[@id="highlighted"]')
    #description = response.xpath('//div[@id="highlighted"]').extract()
    description = response.xpath('//div[@id="highlighted"]')[0]
    #description = response.xpath('//div[@id="highlighted"]')[0].extract()

    #parser = HTMLParser(encoding='utf-8', recover=True)
    #tree = et.parse(StringIO(description), parser)
    tree = et.parse( StringIO(description) )
    #tree = et.parse( description )

    for element in tree.xpath('//*[@class="bullets"]'):
        element.getparent().remove(element)

    print et.tostring(tree, pretty_print=True, xml_declaration=True)
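
A likely cause of the errors above: response.xpath(...)[0] is a Scrapy Selector object rather than a string of markup, so et.parse() never sees the HTML itself. In addition, et.parse() without an HTML parser expects well-formed XML, which a bare <br> is not. A minimal sketch of one way around this, assuming the fragment has first been turned into a string with .extract(); the sample markup below is a shortened, made-up stand-in for the real page:

```python
import lxml.html

# Stand-in for response.xpath('//div[@id="highlighted"]').extract()[0];
# the markup is a shortened, assumed sample of the real page.
fragment = (
    '<div id="highlighted">'
    '<div class="bullets"><ul><li>text 1</li></ul></div>'
    '<p>Other text</p><br><h3>title</h3>'
    '</div>'
)

# fromstring() takes a plain string and tolerates HTML such as <br>.
root = lxml.html.fromstring(fragment)

# Relative XPath (leading dot) so only descendants of this div match.
for element in root.xpath('.//*[@class="bullets"]'):
    element.getparent().remove(element)

# Serialize the cleaned div back into a single string.
cleaned = lxml.html.tostring(root, encoding='unicode')
print(cleaned)
```

Note that remove() also discards any text that directly follows the removed element; lxml.html elements offer drop_tree() for cases where that trailing text should be kept.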

But it does work when I try the following:

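    # Imports assumed here (Python 2 with lxml); they were not in the original post
    from StringIO import StringIO
    from lxml import etree as et
    from lxml.etree import HTMLParser

    # Parse the whole response body with a lenient HTML parser
    parser = HTMLParser(encoding='utf-8', recover=True)
    tree = et.parse(StringIO(response.body), parser)

    # Remove the bullets block from inside the highlighted div
    for element in tree.xpath('//*[@id="highlighted"]/*[@class="bullets"]'):
        element.getparent().remove(element)

    # Serialize only the highlighted div; without this XPath the whole body is returned
    print et.tostring(tree.xpath('//div[@id="highlighted"]')[0],
                      pretty_print=True, xml_declaration=True)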
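
The original question (quoted below) also asked to drop the first p and the last h3, and to get the result as one field. A minimal sketch of how the same approach might be extended, written so that it also runs on Python 3; the helper name clean_highlighted and the sample body are hypothetical, and the markup is a shortened stand-in for the real page:

```python
from io import BytesIO
from lxml import etree as et

# Shortened, assumed stand-in for response.body (bytes in Scrapy).
body = b"""<html><body>
<div id="highlighted">
  <div class="bullets"><p class="headLine">text</p></div>
  <p>Other text</p>
  <br>
  <h3>title</h3>
  <p>description</p>
  <h3>title 4</h3>
</div>
</body></html>"""


def clean_highlighted(html_bytes):
    """Return the highlighted div as one string, minus unwanted children."""
    parser = et.HTMLParser(encoding='utf-8', recover=True)
    tree = et.parse(BytesIO(html_bytes), parser)
    div = tree.xpath('//div[@id="highlighted"]')[0]

    # Direct children to drop: the bullets block, the first <p>, the last <h3>.
    unwanted = (
        div.xpath('./*[@class="bullets"]')
        + div.xpath('./p[1]')
        + div.xpath('./h3[last()]')
    )
    for element in unwanted:
        # remove() also drops the whitespace that follows each element.
        div.remove(element)

    # method='html' keeps void tags such as <br> in HTML form.
    return et.tostring(div, method='html', encoding='unicode')


result = clean_highlighted(body)
print(result)
```

Collecting the unwanted nodes before removing any of them keeps the positional XPath tests (p[1], h3[last()]) tied to the original document order.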

Thanks

2015-05-03 21:47 GMT+02:00 Anto <[email protected]>:

> Hello:
>
> I am trying to capture data from a website that does not have a fixed
> structure, so I would rather not write an XPath for each part. I have spent
> hours trying to delete child nodes from the element captured by an XPath,
> but I can't; the only way that occurs to me is regular expressions...
>
> The website has the following HTML:
>
> <div id="highlighted">\n
>     <div class="bullets">\n
>         <p class="headLine"><span>text</span></p>\n
>         <ul>
>             <li>text 1</li>
>             <li>text 2</li>
>             <li>text 3</li>
>             <li>text 4</li>
>         </ul>
>     </div>\n \n\n
>     <p>\n Other text</p>\n
>     <br>
>     <h3>title</h3>\n\n
>     <p>description</p>\n\n
>     <p>\xa0</p>\n\n
>     <h3>title 2</h3>\n\n
>     <ul>
>         <li>optional</li>
>     </ul>
>     <p>
>         <br>Text</p>\n\n
>     <ul>
>         <li>option 1</li>
>         <li>option 2</li>
>         <li>option 3</li>
>     </ul>
>     <h3>title 3</h3>\n\n
>     <p>text</p>\n\n
>     <ul>
>         <li>text</li>
>     </ul>
>     <p>text</p>\n\n
>     <h3>title 4</h3>
> </div>
>
> I have tested the following options:
>
> response.xpath('//*[@id="highlighted"][not(@class="bullets")]').extract()
> # It returns the div's HTML without any changes.
>
> response.xpath('//*[@id="highlighted"]/*[not(@class="bullets")]').extract()
> # It drops the div with class bullets, but returns everything as an array
> # (because of the * selector). I need the content in one field, for use on
> # the web.
>
> I also ran several more tests, but they returned no values...
>
> Is it not possible to get the content of a div while excluding some of its
> children? Possibly I'm trying to do something impossible, believing it can
> be done without writing Python code, and it has to be done using regular
> expressions.
>
> I need to remove the div with class bullets, the first p node and the last
> h3 node. For now I hope someone can tell me whether this is feasible via
> selectors alone or whether I have to write code; meanwhile I'll look at
> using regular expressions in Python (I'm new to this language). Thank you.
>
> Regards
>
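
On the two XPath attempts quoted above: a predicate in square brackets tests the node it is attached to, so in the first expression not(@class="bullets") is checked against the highlighted div itself and the div comes back whole. In the second, the predicate filters the direct children, which is why one result per child is returned. A small sketch of that difference using lxml directly, with a shortened, made-up sample of the markup:

```python
import lxml.html

# Shortened, assumed sample of the page markup.
root = lxml.html.fromstring(
    '<div id="highlighted">'
    '<div class="bullets"><ul><li>text 1</li></ul></div>'
    '<p>Other text</p><h3>title</h3>'
    '</div>'
)

# The predicate applies to the div itself: it has no class="bullets",
# so the whole div (children included) is selected.
whole = root.xpath('//*[@id="highlighted"][not(@class="bullets")]')

# Here the predicate filters the direct children, one result per child.
children = root.xpath('//*[@id="highlighted"]/*[not(@class="bullets")]')

# Serialize each child and join the pieces into a single field.
joined = ''.join(
    lxml.html.tostring(child, encoding='unicode') for child in children
)
print(joined)
```

With a Scrapy response, the same joining step could presumably be written as ''.join(response.xpath(...).extract()). Text placed directly inside the div, outside any child element, is not selected by the * step and would be lost this way.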

-- 
You received this message because you are subscribed to the Google Groups 
"scrapy-users" group.
