On 12/01/2022 20:49, Dieter Maurer wrote:
.......

when run I see this

$ python tmp/tlp.py
using tostring
xxml=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33; AAAAA</a>'
ET.tostring(tree)=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33; AAAAA</a>'

using attributes
tree.text='aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA'
tree.getchildren()=[]
tree.tail=None

Apparently, the `resolve_entities=False` was not effective: otherwise,
your tree content should have more structure (especially some
entity reference children).

Except that the tree evidently knows not to expand the entities when it is serialized with ET.tostring, so in some circumstances resolve_entities=False does take effect.

I expected the tree to hold the parsed but unexpanded values, yet reading tree.text, tree.tail and tree.attrib directly doesn't give that result. This isn't a criticism; it actually makes my life a bit easier. Had I wanted the unexpanded values in attrib/text/tail, it would be more of a problem.


`&#<value>` is not an entity reference but a character reference.
It may rightfully be treated differently from entity references.

I understand the difference, but lxml (and perhaps libxml2) doesn't provide a way to turn off character reference expansion. That makes lxml a bit harder to use for source transformation, since the original text is not preserved.
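
One possible workaround, sketched here with the standard library's xml.etree.ElementTree rather than lxml, is to hide every reference from the parser and put it back after serializing. This is not an lxml feature: protect_refs and restore_refs are hypothetical helper names, and the sample input is only modelled on the xxml value above. It ignores CDATA sections and comments, and any text added during the transformation that happens to look like a reference would be emitted as one.

```python
import re
import xml.etree.ElementTree as ET

# Body of a reference: decimal or hex character reference, or an entity name.
_NAME = r"(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z_:][\w.:-]*)"
_REF = re.compile("&" + _NAME + ";")
_PROTECTED = re.compile("&amp;" + _NAME + ";")


def protect_refs(xml_text):
    """Escape the leading '&' of each reference so the parser sees plain text."""
    return _REF.sub(r"&amp;\1;", xml_text)


def restore_refs(xml_text):
    """Undo protect_refs on serialized output."""
    return _PROTECTED.sub(r"&\1;", xml_text)


# Assumed sample input, modelled on the xxml value in this thread.
source = ('<a attr="&mysym; &lt; &amp; &gt; &#33;">'
          'aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA</a>')

root = ET.fromstring(protect_refs(source))

# The tree now carries the references as literal, unexpanded text.
print(repr(root.text))
print(repr(root.get("attr")))

# Serializing and restoring gives back the original markup.
round_tripped = restore_refs(ET.tostring(root, encoding="unicode"))
print(round_tripped == source)
```

The price is that the text in the tree is the raw source form, so any code that inspects or edits it has to work with `&lt;` rather than `<`.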
