I noticed that the text_content() method of lxml.html elements returns a 
_ElementUnicodeResult, i.e. a 'smart' string.

However, its getparent() returns None, its attrname is None, and is_tail, 
is_text and is_attribute are all False. This is the case even if the element 
contains a single text node, because the XPath "string()" used in 
text_content()'s implementation never returns an existing text node, but 
always a new string.
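
A minimal sketch of what I mean, assuming lxml is installed. It calls the 
XPath "string()" directly instead of text_content(), so it shows the default 
'smart' string behaviour of the expression itself; the sample markup and the 
variable names are only for illustration.

```python
from lxml import etree, html

elem = html.fromstring("<p>hello</p>")

# The expression the post says text_content() is built on,
# compiled with the default smart_strings=True.
collect = etree.XPath("string()")
s = collect(elem)

print(type(s) is str)                        # False: a str subclass
print(s.getparent(), s.attrname)             # None None
print(s.is_text, s.is_tail, s.is_attribute)  # False False False

# For comparison, selecting the text node itself keeps its origin.
t = elem.xpath("text()")[0]
print(t.getparent() is elem, t.is_text)      # True True
```

So the 'smart' attributes exist on the "string()" result, but none of them 
carries any information.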

Wouldn't it make more sense for text_content() to return a normal str?
For example, by passing smart_strings=False to the etree.XPath() call that 
creates _collect_string_content.
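
Roughly what that would look like, as a sketch rather than a patch against 
lxml.html: plain_collect is a hypothetical stand-in for 
_collect_string_content, and the sample markup is made up.

```python
from lxml import etree, html

# Hypothetical stand-in for lxml.html's _collect_string_content,
# compiled with smart strings switched off.
plain_collect = etree.XPath("string()", smart_strings=False)

elem = html.fromstring("<div>Hello <b>world</b>!</div>")
text = plain_collect(elem)

print(type(text) is str)          # True
print(text)                       # Hello world!
print(hasattr(text, "getparent")) # False
```

The string value is unchanged; only the wrapper type and its unused 
attributes go away.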

I am not aware of any real issues caused by text_content() returning a 'smart' 
string -- for example, I don't think it can cause any memory leaks, because it 
doesn't seem to have a reference to the original document. But it still seems 
unexpected and perhaps unintentional.

In theory this might be a breaking change, if anyone expects 
elem.text_content().getparent() to exist and return None. But 
https://lxml.de/lxmlhtml.html doesn't mention that text_content() returns a 
'smart' string. 'Smart' strings are only documented at 
https://lxml.de/xpathxslt.html. Given that lxml 6.0.0 is in the works, now 
seems like a good time to suggest this change.

Thanks for reading, and thank you for all your work on lxml.

Tomi
_______________________________________________
lxml - The Python XML Toolkit mailing list -- lxml@python.org
