[lxml] Re: Adding HTML inside XML

Adrian Bool Thu, 18 Aug 2022 21:53:37 -0700

Hi Karl,

You're not parsing content_text as XML or HTML, so lxml treats it as plain 
text that merely looks like XML. Text assigned to an element's .text has to 
be escaped before it can be included within XML.
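
To make the difference concrete, here is a minimal sketch (not from the 
original code; the names as_text and as_tree are just illustrative). The 
fallback import is only there so the snippet also runs without lxml 
installed; the standard library ElementTree behaves the same for these 
particular calls.

```python
try:
    import lxml.etree as etree
except ImportError:
    # Assumption: the stdlib module mirrors lxml for Element/XML/tostring.
    import xml.etree.ElementTree as etree

content_text = '<p>line one</p><p>line two</p>'

# Assigning to .text stores a plain string, so the markup characters
# are escaped when the element is serialized.
as_text = etree.Element('en-note')
as_text.text = content_text
print(etree.tostring(as_text).decode('utf8'))

# Parsing the wrapped string builds real <p> child elements instead,
# so nothing needs to be escaped on output.
as_tree = etree.XML(f'<en-note>{content_text}</en-note>')
print(etree.tostring(as_tree).decode('utf8'))
print(len(as_tree))  # number of child elements of <en-note>
```

The first print shows the tags as &lt;p&gt;...; the second shows them as 
real markup.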


The following:

import lxml.etree as etree

content_text = '<p>line one</p><p>line two</p>'

# Parse the wrapped fragment so the <p> tags become real child elements.
en_note_el = etree.XML(f'<en-note>{content_text}</en-note>')
en_note_doctype = '<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">'
en_note_str = etree.tostring(en_note_el, encoding='UTF-8', method="xml",
                             xml_declaration=True, pretty_print=False,
                             standalone=False, doctype=en_note_doctype)

# Embed the serialized note as a CDATA section inside <content>.
content_el = etree.Element('content')
content_el.text = etree.CDATA(en_note_str)

print(etree.tostring(content_el).decode('utf8'))


Produces the output:

<content><![CDATA[<?xml version='1.0' encoding='UTF-8' standalone='no'?>
<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">
<en-note><p>line one</p><p>line two</p></en-note>]]></content>

Which I would expect is what you're after?
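
If you want to check the result programmatically, a round trip is one 
option. This is only a sketch under assumptions: wrap_note is a hypothetical 
helper name, and it assumes the fragment is well-formed XML (etree.XML will 
raise on unclosed HTML tags such as a bare <br>).

```python
import lxml.etree as etree

DOCTYPE = '<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">'

def wrap_note(content_text):
    """Hypothetical helper: build <content> holding the note as CDATA."""
    # Parse the fragment inside an <en-note> wrapper (must be well-formed).
    en_note_el = etree.XML(f'<en-note>{content_text}</en-note>')
    en_note_bytes = etree.tostring(en_note_el, encoding='UTF-8',
                                   xml_declaration=True, standalone=False,
                                   doctype=DOCTYPE)
    content_el = etree.Element('content')
    content_el.text = etree.CDATA(en_note_bytes.decode('utf-8'))
    return content_el

content_el = wrap_note('<p>line one</p><p>line two</p>')

# Round trip: re-parse the outer document, then parse the CDATA payload.
outer = etree.fromstring(etree.tostring(content_el))
# Encode first: lxml rejects str input that carries an encoding declaration.
inner = etree.fromstring(outer.text.encode('utf-8'))
print(inner.tag, [p.text for p in inner])
```

The DTD is not fetched here, since lxml does not load external DTDs unless 
the parser is configured to do so.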

Cheers,

aid


> On 18 Aug 2022, at 15:57, k...@cs.stanford.edu wrote:
> 
> Hello, I need to add some HTML inside XML. The result should look like this:
> 
> <content>
>     <![CDATA[<?xml version="1.0" encoding="UTF-8" standalone="no"?>
>     <!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd"><en-note><p>line one</p><p>line two</p></en-note>]]>
> </content>
> 
> the code i'm using is this:
>    # read html from file - result is :
>    content_text = '<p>line one</p><p>line two</p>'
> 
>    en_note_el = etree.Element('en-note')
>    en_note_el.text = content_text
>    en_note_doctype = '<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">'
>    en_note_str = etree.tostring(en_note_el, encoding='UTF-8', method="xml",
>                                 xml_declaration=True, pretty_print=False,
>                                 standalone=False, doctype=en_note_doctype)
> 
>    content_el = etree.SubElement(note_el, 'content')
>    content_el.text = etree.CDATA(en_note_str)
> ==
> 
> This works, except the included HTML in the text element of en-note is 
> escaped. Can you help me figure how to not have it be escaped? The contents 
> inside the <en-note> tags are supposed to be valid HTML, but without any 
> <html> or <body> sections, and there isn't really a root element.
> _______________________________________________
> lxml - The Python XML Toolkit mailing list -- lxml@python.org
> To unsubscribe send an email to lxml-le...@python.org
> https://mail.python.org/mailman3/lists/lxml.python.org/
> Member address: a...@logic.org.uk
