On 12/05/2022 09:25, Adrian Bool wrote:
More XML fun in the morning!
Almost there. Two questions remain:
1. Is there a way to tell lxml _not_ to add <html><body> and
</body></html> around the header when inserting it right after <body>?
This is what I currently get:
<body>
<html><body>[header here]</body></html><h1>My title</h1>
Here's the code:
======
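import lxml.etree as et  # "et" is lxml.etree, per the error message below

body = root.find("body")
# find() returns None when nothing matches; len(body) only counts children.
if body is None or len(body) == 0:
    raise Exception("<body> not found or empty.")
body_content = body[0]

# Adrian Bool:
# Note, we can't use HTML parser for the content as it is not a full,
# well formed HTML file.
# !!!! Also, this file needs to be encapsulated within a single XML
# element, e.g. a <div>

# BAD with malformed HTML: the XML parser rejects unquoted attribute values
#content_tree = et.parse("block1.html")
#<table border=0 width=100%>
#lxml.etree.XMLSyntaxError: AttValue: " or ' expected

# OK: the HTML parser recovers from malformed markup
parser = et.HTMLParser(recover=True)
content_tree = et.parse("block1.html", parser)
content_root = content_tree.getroot()  # this is the <html> element, not the header
body.insert(index=0, element=content_root)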
======
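
One possible way around the extra <html><body> wrapper, sketched here under assumptions rather than as a confirmed answer: et.parse() with an HTMLParser always builds a full document, so getroot() is the <html> element. lxml.html.fragments_fromstring() returns only the top-level nodes of the snippet instead. The page and header strings below are hypothetical stand-ins for the real file and for block1.html.

```python
import lxml.html

# Hypothetical stand-ins for the real page and for the contents of block1.html.
page = lxml.html.document_fromstring(
    "<html><body><h1>My title</h1></body></html>"
)
header_markup = (
    '<div id="header"><table border=0 width=100%>'
    "<tr><td>blah</td></tr></table></div>"
)

body = page.find("body")
if body is None:
    raise ValueError("<body> not found.")

# fragments_fromstring() returns the snippet's top-level nodes without the
# <html>/<body> wrapper. A leading text node, if any, comes back as a plain
# string, so it is filtered out here (an assumption that it can be dropped).
fragments = [
    node
    for node in lxml.html.fragments_fromstring(header_markup)
    if not isinstance(node, str)
]

# Insert the fragments at the start of <body>, keeping their order.
for offset, node in enumerate(fragments):
    body.insert(offset, node)

result = lxml.html.tostring(page, encoding="unicode")
print(result)
```

With the file on disk, the same call could be fed the text read from block1.html instead of the inline string.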
2. I need to add a header and a footer to each HTML file. The header opens
a <div class="entrycontent"> whose closing </div> is in the footer. Will
lxml complain that the header on its own is malformed XML/HTML (per your
comment above)?
block1.html:
<!-- header -->
<div id="header">
<table border=0 width=100%>
<tr>
<td>blah</td>
</tr>
</table>
</div>
<div class="entrycontent">
block2.html:
<!-- footer -->
<script>
blah
</script>
<gcse:search></gcse:search>
</div>
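
A small experiment that may help with question 2, again only a sketch with hypothetical stand-in strings: the recovering HTML parser does not raise on the unclosed <div>, it closes it by itself, so the header's "entrycontent" div ends up empty and never wraps the page content. An alternative that avoids splitting one element across two files is to build the wrapper in the tree and move the existing <body> children into it.

```python
import lxml.html

# Hypothetical header shaped like block1.html: the last <div> is never closed.
header_markup = '<div id="header">blah</div><div class="entrycontent">'

fragments = lxml.html.fragments_fromstring(header_markup)
# The parser closes the dangling <div> on its own, leaving it empty.
closed = [lxml.html.tostring(node, encoding="unicode") for node in fragments]
print(closed)

# Alternative: create the wrapper element in the tree and move the content in.
page = lxml.html.document_fromstring(
    "<html><body><h1>My title</h1><p>text</p></body></html>"
)
body = page.find("body")
wrapper = body.makeelement("div", {"class": "entrycontent"})

# Carry over any text that sits directly inside <body> before its first child.
wrapper.text, body.text = body.text, None
# append() moves an element, so iterate over a copy of the child list.
for child in list(body):
    wrapper.append(child)
body.append(wrapper)

wrapped = lxml.html.tostring(body, encoding="unicode")
print(wrapped)
```

The header and footer blocks could then be inserted before and after the wrapper as complete, self-contained fragments.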