Hi,
I have tried the following, but I get errors:
#description = response.xpath('//div[@id="highlighted"]')
#description = response.xpath('//div[@id="highlighted"]').extract()
description = response.xpath('//div[@id="highlighted"]')[0]
#description = response.xpath('//div[@id="highlighted"]')[0].extract()
#parser = HTMLParser(encoding='utf-8', recover=True)
#tree = et.parse(StringIO(description), parser)
tree = et.parse( StringIO(description) )
#tree = et.parse( description )
for element in tree.xpath('//*[@class="bullets"]'):
    element.getparent().remove(element)
print et.tostring(tree, pretty_print=True, xml_declaration=True)
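A likely cause of the errors above: response.xpath(...)[0] is a Scrapy Selector object rather than an HTML string, and a bare et.parse uses the XML parser, which probably rejects unclosed tags such as <br>. Below is a minimal sketch of the same idea using lxml.html instead. The fragment string is a made-up stand-in for what response.xpath('//div[@id="highlighted"]')[0].extract() would return, so treat it as an assumption, not as the real page.

```python
import lxml.html

# Hypothetical stand-in for:
#   response.xpath('//div[@id="highlighted"]')[0].extract()
fragment = (
    '<div id="highlighted">'
    '<div class="bullets"><p class="headLine"><span>text</span></p></div>'
    '<p>Other text</p><h3>title</h3>'
    '</div>'
)

# fromstring() takes a plain string and uses the lenient HTML parser
root = lxml.html.fromstring(fragment)

# Materialize the matches first so the tree is not modified while iterating
for element in list(root.xpath('.//*[@class="bullets"]')):
    element.getparent().remove(element)

# Serialize the cleaned div back to a single text string
cleaned = lxml.html.tostring(root, encoding='unicode')
print(cleaned)
```

Note that getparent().remove() also discards any text that directly follows the removed element (its tail), which may or may not matter for this page.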
But it does work when I try the following:
# Imports assumed from the names used below (Python 2)
from StringIO import StringIO
from lxml import etree as et
from lxml.etree import HTMLParser

parser = HTMLParser(encoding='utf-8', recover=True)
tree = et.parse(StringIO(response.body), parser)
for element in tree.xpath('//*[@id="highlighted"]/*[@class="bullets"]'):
    element.getparent().remove(element)
# Print only the highlighted div, now without the bullets block
print et.tostring(tree.xpath('//div[@id="highlighted"]')[0],
                  pretty_print=True, xml_declaration=True)
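For the full requirement in my original question (drop the bullets div, the first p and the last h3, and get everything back in one field), the same approach could be wrapped up roughly as follows. This is only a sketch: clean_highlighted is a hypothetical helper name, not something provided by Scrapy or lxml, and it assumes the three nodes to remove are direct children of the highlighted div.

```python
import lxml.html


def clean_highlighted(page_html):
    """Return the #highlighted div as one HTML string, minus unwanted children.

    Hypothetical helper; assumes the page contains a div with id="highlighted".
    """
    doc = lxml.html.fromstring(page_html)
    div = doc.xpath('//div[@id="highlighted"]')[0]
    unwanted = (
        div.xpath('./*[@class="bullets"]')  # the bullets block
        + div.xpath('./p[1]')               # first direct <p> child
        + div.xpath('./h3[last()]')         # last direct <h3> child
    )
    for element in unwanted:
        # drop_tree() removes the element but keeps the text that follows it
        element.drop_tree()
    return lxml.html.tostring(div, encoding='unicode')


# Made-up sample page, much shorter than the real one
page = (
    '<html><body><div id="highlighted">'
    '<div class="bullets"><p class="headLine">head</p></div>'
    '<p>Other text</p><h3>title</h3><p>description</p>'
    '<h3>title 4</h3></div></body></html>'
)
result = clean_highlighted(page)
print(result)
```

In a spider, the page HTML would be passed in instead of the sample string (for example response.body, as used above), and the returned string could be stored in a single item field.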
Thanks
2015-05-03 21:47 GMT+02:00 Anto <[email protected]>:
> Hello:
>
> I am trying to capture data from a website that does not have a fixed
> structure, so I do not think I can use a separate xpath for each part. I have
> spent hours trying to delete child nodes of the captured xpath, but I can't;
> the only thing that occurs to me is to do it via regular expressions...
>
> The website has the following html:
>
> <div id="highlighted">\n
> <div class="bullets">\n
> <p class="headLine"><span>text</span></p>\n
> <ul>
> <li>text 1</li>
> <li>text 2</li>
> <li>text 3</li>
> <li>text 4</li>
> </ul>
> </div>\n \n\n
> <p>\n Other text</p>\n
> <br>
> <h3>title</h3>\n\n
> <p>description</p>\n\n
> <p>\xa0</p>\n\n
> <h3>title 2</h3>\n\n
> <ul>
> <li>optional</li>
> </ul>
> <p>
> <br>Text</p>\n\n
> <ul>
> <li>option 1</li>
> <li>option 2</li>
> <li>option 3</li>
> </ul>
> <h3>title 3</h3>\n\n
> <p>text</p>\n\n
> <ul>
> <li>text</li>
> </ul>
> <p>text</p>\n\n
> <h3>title 4</h3>
> </div>
>
> I have tested with the following options:
>
> response.xpath('//*[@id="highlighted"][not(@class="bullets")]').extract()
> # It returns the html div without making any changes.
>
> response.xpath('//*[@id="highlighted"]/*[not(@class="bullets")]').extract()
> # This drops the div with class bullets, but it returns everything as a
> list (one item per child matched by *). I need the content in a single field,
> for use on the web.
>
> And several more tests, but they do not return any values...
>
> Is it not possible to get the content of a div while excluding some of its
> children? Possibly I'm trying to do something impossible, believing that
> it can be done without writing Python code, when it actually has to be done
> using regular expressions.
>
> I need to remove the div with class bullets, the first p node and the last
> h3 node. In the meantime, I hope someone can tell me whether this is feasible
> with xpath alone or whether I have to write code; otherwise I'll work out how
> to do it using regular expressions in Python (I'm new to this language).
> Thank you.
>
> Regards
>