Brian D <brianden...@gmail.com> writes:

> In an HTML page that I'm scraping using urllib2, a \xc2\xa0
> bytestring appears.
>
> The page's charset = utf-8, and the Chrome browser I'm using displays
> the characters as a space.
>
> The page requires authentication:
> https://www.nolaready.info/myalertlog.php
>
> When I try to concatenate strings containing the bytestring, Python
> chokes because it refuses to coerce the bytestring into ascii.
>
> wfile.write('|'.join(valueList))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
> 163: ordinal not in range(128)
>
> In searching for help with this issue, I've learned that the
> bytestring *might* represent a non-breaking space.

It in fact does.

> When I scrape the page using urllib2, however, the characters print
> as   in a Windows command prompt (though I wouldn't be surprised if
> this is some erroneous attempt by the antiquated command window to
> handle something it doesn't understand).

Yes, it's trying to interpret that as two cp1252 (or whatever) characters instead of one non-breaking space.

> If I use IDLE to attempt to decode the single byte referenced in the
> error message, and convert it into UTF-8, another error message is
> generated:
>
> >>> weird = unicode('\xc2', 'utf-8')
>
> Traceback (most recent call last):
>   File "<pyshell#72>", line 1, in <module>
>     weird = unicode('\xc2', 'utf-8')
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 0:
> unexpected end of data

Which is to be expected, as you ripped a two-byte UTF-8 sequence in half.

> If I attempt to decode the full bytestring, I don't obtain a
> human-readable string (expecting, perhaps, a non-breaking space):

You obtain a non-breaking space. What do you think it should look like in your terminal? It looks like ... nothing. Because it looks like a space.

> >>> weird = unicode('\xc2\xa0', 'utf-8')
> >>> par = ' - '.join(['This is', weird])
> >>> par
> u'This is - \xa0'
>
> I suspect that the bytestring isn't UTF-8, but what is it? Latin1?

No, it is UTF-8.

> >>> weirder = unicode('\xc2\xa0', 'latin1')
> >>> weirder
> u'\xc2\xa0'
> >>> 'This just gets ' + weirder
> u'This just gets \xc2\xa0'
>
> Or is it a Microsoft bytestring?

This is not weird, this is the Python interpreter giving you the representation of a unicode object when you do not print, so you can see what it looks like. And because you wrongly decoded it as latin1, it's garbage anyway.

> >>> weirder = unicode('\xc2\xa0', 'mbcs')
> >>> 'This just gets ' + weirder
> u'This just gets \xc2\xa0'
>
> None of these codecs seem to work.

UTF-8 worked just fine.
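
To make the difference between the codecs visible, here is a small sketch. It is written in Python 3 syntax, which is an assumption on my part: the session quoted above is Python 2, where unicode(raw, 'utf-8') plays the role of raw.decode('utf-8'). The variable names are made up for illustration.

```python
import unicodedata

# The two bytes from the scraped page.
raw = b"\xc2\xa0"

# Decoded as UTF-8, the two bytes form a single character, U+00A0.
nbsp = raw.decode("utf-8")
print(len(nbsp), unicodedata.name(nbsp))

# Decoded as Latin-1, every byte becomes its own character, so the
# result is two characters and no longer the text the page meant.
wrong = raw.decode("latin-1")
print(len(wrong), [unicodedata.name(c) for c in wrong])

# One byte of the pair on its own is not valid UTF-8.
half_error = None
try:
    raw[:1].decode("utf-8")
except UnicodeDecodeError as exc:
    half_error = exc.reason
print(half_error)
```

Asking for the character name instead of printing the character sidesteps the question of what the terminal does with it.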

> Back to the original purpose, as I'm scraping the page, I'm storing
> the field/value pair in a dictionary with each iteration through table
> elements on the page. This is all fine, until a value is found that
> contains the offending bytestring. I have attempted to coerce all
> value strings into an encoding, but Python doesn't seem to like that
> when the string is already Unicode:
>
> valuesDict[fieldString] = unicode(value, 'UTF-8')
> TypeError: decoding Unicode is not supported
>
> The solution I've arrived at is to specify the encoding for value
> strings both when reading and writing value strings.
>
> for k, v in valuesDict.iteritems():
>     valuePair = ':'.join([k, v.encode('UTF-8')])
> [snip] ...
> wfile.write('|'.join(valueList))
>
> I'm not sure I have a question, but does this sound familiar to any
> Unicode experts out there?
>
> How should I handle these odd bytestring values? Am I doing it
> correctly, or what could I improve?

The overall solution is to decode the page, or the parts of it you need, from whatever encoding it is delivered in. You mentioned that the page is delivered in UTF-8, so you should use whatever gives you that information to decode the returned body.

Diez
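
A rough sketch of that decode-once, encode-once approach follows. It uses Python 3 syntax, which is an assumption (the thread is Python 2), and the helper names, the sample Content-Type header, and the sample bytes are all invented for illustration; they are not taken from the page in question.

```python
from email.message import Message


def charset_of(content_type, default="utf-8"):
    """Pull the charset parameter out of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


def join_pairs(values):
    """Join field/value pairs while everything is still text."""
    return "|".join("{}:{}".format(k, v) for k, v in values.items())


# Hypothetical response data standing in for the scraped page.
content_type = "text/html; charset=UTF-8"
body = b"ok\xc2\xa0now"

# Decode once, right where the bytes come in.
charset = charset_of(content_type)
text = body.decode(charset)

# Work with text only: no byte strings mixed into the join.
line = join_pairs({"status": text, "zone": "A"})

# Encode once, right where the bytes go out (for example a file
# opened in binary mode).
encoded = line.encode("utf-8")
```

Keeping every value as text between those two points is what avoids the implicit ascii coercion that raised the original UnicodeDecodeError.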