New submission from Damien Baty <>:

When rewriting image tags, repoze.bitblt removes the doctype of any (X)HTML 
content (cf. attached test). It should not.

I have found a fix for XHTML content (cf. attached patch) by changing how the 
content is parsed. However, the bug persists for HTML content (when 'try_xhtml' 
is not enforced). I tried the same technique as for XHTML (using 
lxml.etree.parse() instead of lxml.html.document_fromstring()), but the 
output then always includes a doctype, even when the original had none. We 
could perhaps remove it in that case, but that starts to be more complicated 
than it should be... (I admit that I did not dig too deep into lxml...)
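
For illustration, the "remove it when it was not present in the original 
content" idea could look roughly like the sketch below. The helpers 
find_doctype() and restore_doctype() are hypothetical names (they exist 
neither in repoze.bitblt nor in lxml), and the regular expression assumes a 
simple leading doctype without an internal subset.

```python
import re

# Hypothetical helpers for illustration only; not part of repoze.bitblt or lxml.
# Matches an optional XML declaration followed by a doctype at the very start
# of the content. Assumption: the doctype has no internal subset ("[...]").
_DOCTYPE_RE = re.compile(
    r'\s*(?:<\?xml[^>]*\?>\s*)?(<!DOCTYPE[^>]*>)', re.IGNORECASE)


def find_doctype(body):
    """Return the leading doctype declaration of ``body``, or None."""
    match = _DOCTYPE_RE.match(body)
    return match.group(1) if match else None


def restore_doctype(original, serialized):
    """Give ``serialized`` a doctype only if ``original`` had one."""
    wanted = find_doctype(original)
    current = find_doctype(serialized)
    if wanted is None and current is not None:
        # The serializer added a doctype that the input did not have: drop it.
        return serialized.replace(current, '', 1).lstrip()
    if wanted is not None and current is None:
        # The serializer dropped the doctype: put the original one back.
        # (Simplification: this ignores a leading XML declaration.)
        return wanted + '\n' + serialized
    return serialized
```

The rewritten body would then go through restore_doctype(body, new_body) 
before being returned, whichever parser was used.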

In a nutshell, the attached patch will keep the doctype for XHTML content. For 
HTML content, the current (bogus) behaviour is kept (and the doctype is 
removed). Malthe (or anyone who uses this package), if you do not object, I'll 
commit the patch.

assignedto: dbaty
messages: 291
nosy: dbaty
priority: bug
status: unread
title: repoze.bitblt removes doctype

Repoze Bugs <>
---	(revision 6960)
+++	(working copy)
@@ -1,4 +1,6 @@
+import lxml.etree
 import lxml.html
+from StringIO import StringIO
 import urlparse
@@ -17,7 +19,7 @@
     if try_xhtml:
         try:
             parser = lxml.html.XHTMLParser(resolve_entities=False)
-            root = lxml.html.document_fromstring(body, parser=parser)
+            root = lxml.etree.parse(StringIO(body), parser)
             isxml = True
         except lxml.etree.XMLSyntaxError, e:
             root = lxml.html.document_fromstring(body)
Repoze-dev mailing list
