New submission from Damien Baty <damien.b...@gmail.com>:
When rewriting image tags, repoze.bitblt removes the doctype of any (X)HTML
content (cf. attached test). It should not.
I have found a fix for XHTML content (cf. attached patch) that changes how the
content is parsed. However, the bug persists for HTML content (when 'try_xhtml'
is not enabled). I tried the same technique as for XHTML (using
lxml.etree.parse() instead of lxml.html.document_fromstring()), but the
transformed content then always includes a doctype. Perhaps we could remove it
when it was not present in the original content, but that starts to get more
complicated than it should be... (I admit that I did not dig very deep into lxml...)
In a nutshell, the attached patch will keep the doctype for XHTML content. For
HTML content, the current (bogus) behaviour is kept (and the doctype is
removed). Malthe (or anyone who uses this package), if you do not object, I'll
commit the patch.
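For the HTML case, the "remove it when it was not present in the original
content" idea could be sketched roughly as below. This is only a sketch under
assumptions, not what the attached patch does: find_doctype() and
sync_doctype() are hypothetical helper names (they exist neither in
repoze.bitblt nor in lxml), the doctype is detected with a simple regular
expression rather than through lxml, and an XML declaration, a comment before
the doctype, or an internal DTD subset is not handled.

```python
import re

# Hypothetical helpers, for illustration only.
# Matches a doctype declaration at the very start of the markup
# (leading whitespace allowed). Internal DTD subsets are not supported.
_DOCTYPE_RE = re.compile(r'\s*(<!DOCTYPE[^>]*>)', re.IGNORECASE)


def find_doctype(markup):
    """Return the leading doctype declaration of ``markup``, or None."""
    match = _DOCTYPE_RE.match(markup)
    return match.group(1) if match else None


def sync_doctype(original, transformed):
    """Give ``transformed`` the same doctype (or lack of one) as ``original``."""
    wanted = find_doctype(original)
    present = find_doctype(transformed)
    if present == wanted:
        return transformed
    if present is not None:
        # Drop the doctype that the serializer added (or changed).
        transformed = transformed.replace(present, '', 1).lstrip()
    if wanted is not None:
        # Put back the doctype that the original content carried.
        transformed = wanted + '\n' + transformed
    return transformed


if __name__ == '__main__':
    original = '<html><body><img src="a.png"></body></html>'
    serialized = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">\n' + original
    # The original had no doctype, so the one added on output is removed.
    print(sync_doctype(original, serialized))
```

Working on the serialized string keeps the parsing code untouched, at the cost
of a fragile regular expression; a cleaner variant would probably inspect the
parsed tree instead.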
----------
assignedto: dbaty
files: transform.py.patch
messages: 291
nosy: dbaty
priority: bug
status: unread
title: repoze.bitblt removes doctype
__________________________________
Repoze Bugs <b...@bugs.repoze.org>
<http://bugs.repoze.org/issue103>
__________________________________
Index: transform.py
===================================================================
--- transform.py (revision 6960)
+++ transform.py (working copy)
@@ -1,4 +1,6 @@
+import lxml.etree
 import lxml.html
+from StringIO import StringIO
 import urlparse
 
 try:
@@ -17,7 +19,7 @@
     if try_xhtml:
         try:
             parser = lxml.html.XHTMLParser(resolve_entities=False)
-            root = lxml.html.document_fromstring(body, parser=parser)
+            root = lxml.etree.parse(StringIO(body), parser)
             isxml = True
         except lxml.etree.XMLSyntaxError, e:
             root = lxml.html.document_fromstring(body)