The tweets are mainly in Dutch, English and Japanese.

You say that "before the data is parsed by htmllib.HTMLParser it must be
unicode", but your modification actually turns the string into a
UTF-8-encoded 8-bit string, not unicode. What's more, a "print type(s)"
in unescape() reveals that *without* the modification the type of the
string passed in *is* unicode.

I'm pretty sure that what happens is this:

* unescape() invokes HTMLParser.save_bgn(), which initialises 
HTMLParser.savedata to an empty 8-bit string
* unescape() invokes HTMLParser.feed (inherited from SGMLParser) with a unicode 
string (m["text"], verified with a "print type(s)")
* the string is concatenated to SGMLParser.rawdata, which started out as an 
empty 8-bit string but now becomes unicode
* feed() invokes goahead()
* goahead() searches rawdata for HTML tags and invokes handle_data() 
(implemented by HTMLParser) for the text parts in between
* handle_data() concatenates the unicode string to savedata, which started out 
as an empty 8-bit string but now becomes unicode
* when goahead() encounters an entity tag, it invokes handle_entityref()
* handle_entityref() invokes convert_entityref() to convert the tag name to the 
corresponding character, using the entitydefs table for the lookup. HTMLParser 
has imported entitydefs from htmlentitydefs.py; that table maps each entity 
name to the corresponding character as an 8-bit string in the latin-1 encoding, 
or to a character reference if the character cannot be represented in latin-1
* handle_entityref() then invokes handle_data() to append the character 
referenced by the entity tag, passing in the latin-1-encoded 8-bit string it 
got from convert_entityref()
* handle_data() does this:

self.savedata = self.savedata + data

At this point savedata is unicode, but data is an 8-bit string. Python
therefore has to convert the 8-bit string to unicode in order to be able
to append it, and it uses the "default encoding" for this. On my system the
default encoding at this point appears to be utf8 (this is borne out by
the error message). The utf8 codec tries to interpret the latin-1-encoded
character as utf8 and (correctly) fails.
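The failure can be reproduced in isolation. The sketch below uses Python 3's 
bytes/str types to stand in for Python 2's 8-bit/unicode strings (in Python 3 
the entitydefs table holds the character itself rather than latin-1 bytes, so 
we encode it explicitly to get the byte the traceback complains about):

```python
import html.entities

# "aacute" maps to the character U+00E1; its latin-1 encoding is the
# single byte 0xe1 -- the byte named in the UnicodeDecodeError.
char = html.entities.entitydefs["aacute"]
latin1_bytes = char.encode("latin-1")

# Decoding that latin-1 byte as utf8 fails just as described above:
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)
```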

The questions that need answering at this point are:

* Why is the default encoding utf8? Could it have to do with my locale setting 
(which is en_US.utf8)?
* Interestingly, according to the Python documentation the regular default 
encoding is ascii, which would also fail, so why doesn't everyone have this 
problem?
* HTMLParser doesn't work correctly when: 1) the default encoding is not 
latin-1, 2) you offer it unicode strings and 3) the strings contain entity 
tags. My fix remedies this. Is this not a bug which needs fixing?
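The kind of remedy point 3 suggests can be sketched as follows (append_entity 
is a hypothetical helper for illustration, not the actual htmllib code): decode 
the latin-1 entity bytes explicitly before concatenating, so Python never falls 
back to the process-wide default encoding:

```python
def append_entity(savedata, entity_value):
    # entity_value comes from an entitydefs-style table; in the
    # Python 2 table it is a latin-1-encoded 8-bit string. Decode it
    # explicitly instead of letting an implicit coercion guess with
    # the default encoding (ascii or, here, utf8), which fails.
    if isinstance(entity_value, bytes):
        entity_value = entity_value.decode("latin-1")
    return savedata + entity_value

print(append_entity(u"caf", b"\xe9"))
```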

I'm reverting to my original fix. It's the only one so far which
results in no error messages at all (at least for Twitter). As much as I
would like to, I don't have the time to learn Python and become a
Gwibber developer and unicode expert to get this bug fixed, especially
since I don't think the problem is actually in Gwibber itself.
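For what it's worth, Python 3 later dropped htmllib/sgmllib entirely; its 
html.unescape() performs this entity substitution on unicode strings directly, 
with no implicit encoding step to go wrong (shown as context, not as a fix for 
the Python 2 stack in question):

```python
import html

# Entity references are replaced in-place on a str, no bytes involved.
print(html.unescape("caf&eacute; &amp; more"))
```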

It would be great if an Ubuntu & Python expert could look at my
reasoning above and see if it holds water and if sgmllib.py and
htmllib.py need to be fixed.

-- 
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 0: 
unexpected end of data
https://bugs.launchpad.net/bugs/605543
