On 12/01/2022 20:49, Dieter Maurer wrote:
.......
when run I see this
$ python tmp/tlp.py
using tostring
xxml=b'<a attr="&mysym; &lt; &amp; &gt; &#33;">aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA</a>'
ET.tostring(tree)=b'<a attr="&mysym; &lt; &amp; &gt; &#33;">aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA</a>'
using attributes
tree.text='aaaaa &mysym; < & > ! AAAAA'
tree.getchildren()=[]
tree.tail=None
Apparently, the `resolve_entities=False` was not effective: otherwise,
your tree content should have more structure (especially some
entity reference children).
Except that ET.tostring knows not to expand the entities, so in some circumstances resolve_entities=False
does work.
I expected that the tree would contain the parsed (unexpanded) values, but reading tree.text/tail/attrib directly
returns the expanded values instead. This isn't a criticism; it actually makes my life a bit easier. If I had wanted
the unexpanded values in attrib/text/tail, it would be more of a problem.
`&#<value>` is not an entity reference but a character reference.
It may rightfully be treated differently from entity references.
I understand the difference, but lxml (and perhaps libxml2) doesn't provide a way to turn off character reference
expansion. This makes using lxml for source transformation a bit harder since the original text is not preserved.
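
One possible workaround, sketched here with the standard library only and not something lxml or libxml2 offers: escape the '&' of every reference before parsing, so the parser sees the references as literal text, then reverse that step after serializing. The helper names protect_refs and unprotect_refs are hypothetical, and the sketch assumes the source has no CDATA sections, comments or processing instructions containing text that looks like a reference.

```python
import re
import xml.etree.ElementTree as ET

# Body of a reference: a character reference (#33 or #x21) or an entity name.
# This pattern is an approximation of the XML Name production, not the full rule.
_REF = r"(#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z_:][\w.:-]*)"


def protect_refs(source: str) -> str:
    """Escape the '&' of every reference so the parser keeps it as literal text."""
    return re.sub("&" + _REF + ";", r"&amp;\1;", source)


def unprotect_refs(serialized: str) -> str:
    """Reverse protect_refs on serialized output."""
    return re.sub("&amp;" + _REF + ";", r"&\1;", serialized)


xxml = ('<a attr="&mysym; &lt; &amp; &gt; &#33;">'
        'aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA</a>')

tree = ET.fromstring(protect_refs(xxml))
print(f"{tree.text=}")           # text now holds the references unexpanded
print(f"{tree.get('attr')=}")    # same for the attribute value

result = unprotect_refs(ET.tostring(tree, encoding="unicode"))
print(f"{result=}")
```

With this approach tree.text and the attribute hold the raw source spelling, so even '&lt;' stays unexpanded. Any text a transformation adds that happens to look like a reference will be emitted as one.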