Package: python-beautifulsoup
Version: 3.0.4-1

Hi,
An example page that triggers this bug is http://www.vupp.cz/czvupp/
(URL found via dmoz; it seems to be a food research institute).
Also, it only happens when I use
  convertEntities=BeautifulSoup.HTML_ENTITIES

The problem seems to be here:
 print UnicodeDammit(b,smartQuotesTo=None).unicode
prints "None", which is then fed to the SGML parser in the _feed method.
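
A rough sketch of how a None like this can arise, assuming the detection
simply tries a list of candidate encodings and gives up when none of them
works. The function name and the sample bytes are made up for illustration;
this is not the code from BeautifulSoup.py, and I have not verified that the
page above takes exactly this path.

```python
def first_successful_decode(data, candidate_encodings):
    """Hypothetical stand-in for encoding detection.

    Returns the decoded text for the first candidate encoding that works,
    or None when every candidate fails.
    """
    for encoding in candidate_encodings:
        if not encoding:
            # Skip unset candidates (e.g. no encoding was passed in).
            continue
        try:
            return data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: try the next candidate.
            pass
    return None


# Bytes that are valid in neither ASCII nor UTF-8.
result = first_successful_decode(b"\xff\xfe\xfd", [None, "ascii", "utf-8"])
print(result)
```

A caller that passes the result on without checking it would hand None to
the parser instead of a string.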

Maybe (!) it could be fixed this way:
--- BeautifulSoup.py    2007-04-10 21:39:11.000000000 +0200
+++ /tmp/BeautifulSoup.py       2007-12-29 19:58:15.000000000 +0100
@@ -958,7 +958,7 @@
             dammit = UnicodeDammit\
                      (markup, [self.fromEncoding, inDocumentEncoding],
                       smartQuotesTo=self.smartQuotesTo)
-            markup = dammit.unicode
+            markup = dammit.unicode or markup
             self.originalEncoding = dammit.originalEncoding
         if markup:
             if self.markupMassage:

But I'm not entirely sure what dammit is supposed to do, so this may
not be the proper way of fixing it. It also leaves
originalEncoding=None.
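
To illustrate what the one-line change does in isolation, here is a small
sketch. FailedDammit, WorkingDammit and pick_markup are hypothetical
stand-ins invented for this example; only the attribute names unicode and
originalEncoding and the expression "dammit.unicode or markup" are taken
from the patch above.

```python
class FailedDammit(object):
    """Hypothetical result object for a failed detection."""
    unicode = None
    originalEncoding = None


class WorkingDammit(object):
    """Hypothetical result object for a successful detection."""
    unicode = u"<p>ok</p>"
    originalEncoding = "utf-8"


def pick_markup(markup, dammit):
    # Same expression as the patched line: keep the original input
    # whenever the detection produced nothing usable.
    return dammit.unicode or markup


raw = b"<p>\xff\xfe\xfd</p>"

kept = pick_markup(raw, FailedDammit())        # falls back to the raw bytes
converted = pick_markup(raw, WorkingDammit())  # uses the converted text
```

With the fallback the parser receives the undecoded input rather than None,
but nothing records which encoding that input is in, which is why
originalEncoding stays None.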

--- System information. ---
Architecture: i386
Kernel:       Linux 2.6.23.9

Debian Release: lenny/sid
  500 unstable        www.debian-multimedia.org 
  500 unstable        ftp.de.debian.org 
    1 experimental    ftp.de.debian.org 

--- Package information. ---
Depends             (Version) | Installed
=============================-+-===========
python               (>= 2.2) | 2.4.4-6
python-support       (>= 0.2) | 0.7.5

best regards,
Erich Schubert
-- 
    erich@(vitavonni.de|debian.org)    --    GPG Key ID: 4B3A135C    (o_
   To understand recursion you first need to understand recursion.   //\
         Everything changes as soon as you change yourself.          V_/_


