On 12/05/2022 09:25, Adrian Bool wrote:
More XML fun in the morning!

Almost there

1. Is there a way to tell lxml _not_ to add <html><body> and </body></html> when inserting the header right after <body>?

<body>
<html><body>[header here]</body></html><h1>My title</h1>

Here's the code:
======
body = root.find("body")
if len(body) == 0:
    raise Exception("<body> not found.")

body_content = body[0]

#Adrian Bool:
# Note, we can't use HTML parser for the content as it is not a full, well formed HTML file. # !!!! Also, this file needs to be encapuslated within a single XML element, e.g. a <div>

#BAD with malformed HTML
#content_tree = et.parse("block1.html")
#<table border=0 width=100%>
#lxml.etree.XMLSyntaxError: AttValue: " or ' expected

#OK
parser = et.HTMLParser(recover=True)
content_tree = et.parse("block1.html",parser)
content_root = content_tree.getroot()

body.insert(index=0, element=content_root)
======

2. I need to add a header and a footer in each HTML file, and the </div> is actually located in the footer: Will lxml complain if it's missing in the header, ie. it's malformed XML/HTML (per your comment above)?

block1.html:
<!-- header -->
<div id="header">
    <table border=0 width=100%>
        <tr>
            <td>blah</td>
        </tr>
    </table>
</div>
<div class="entrycontent">

block2.html:
<!-- footer -->
    <script>
    blah
    </script>
    <gcse:search></gcse:search>
</div>
_______________________________________________
lxml - The Python XML Toolkit mailing list -- lxml@python.org
To unsubscribe send an email to lxml-le...@python.org
https://mail.python.org/mailman3/lists/lxml.python.org/
Member address: arch...@mail-archive.com

Reply via email to