On Sep 7, 11:01 am, Brian D <brianden...@gmail.com> wrote:
> In an HTML page that I'm scraping using urllib2, a \xc2\xa0
> bytestring appears.
>
> The page's charset = utf-8, and the Chrome browser I'm using displays
> the characters as a space.
>
> The page requires authentication: https://www.nolaready.info/myalertlog.php
>
> When I try to concatenate strings containing the bytestring, Python
> chokes because it refuses to coerce the bytestring into ascii.
>
> wfile.write('|'.join(valueList))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
> 163: ordinal not in range(128)
>
> In searching for help with this issue, I've learned that the
> bytestring *might* represent a non-breaking space.
>
> When I scrape the page using urllib2, however, the characters print
> as   in a Windows command prompt (though I wouldn't be surprised if
> this is some erroneous attempt by the antiquated command window to
> handle something it doesn't understand).
>
> If I use IDLE to attempt to decode the single byte referenced in the
> error message, and convert it into UTF-8, another error message is
> generated:
>
> >>> weird = unicode('\xc2', 'utf-8')
>
> Traceback (most recent call last):
>   File "<pyshell#72>", line 1, in <module>
>     weird = unicode('\xc2', 'utf-8')
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 0:
> unexpected end of data
>
> If I attempt to decode the full bytestring, I don't obtain a
> human-readable string (expecting, perhaps, a non-breaking space):
>
> >>> weird = unicode('\xc2\xa0', 'utf-8')
> >>> par = ' - '.join(['This is', weird])
> >>> par
>
> u'This is - \xa0'
>
> I suspect that the bytestring isn't UTF-8, but what is it? Latin1?
>
> >>> weirder = unicode('\xc2\xa0', 'latin1')
> >>> weirder
> u'\xc2\xa0'
> >>> 'This just gets ' + weirder
>
> u'This just gets \xc2\xa0'
>
> Or is it a Microsoft bytestring?
>
> >>> weirder = unicode('\xc2\xa0', 'mbcs')
> >>> 'This just gets ' + weirder
>
> u'This just gets \xc2\xa0'
>
> None of these codecs seem to work.
>
> Back to the original purpose, as I'm scraping the page, I'm storing
> the field/value pair in a dictionary with each iteration through table
> elements on the page. This is all fine, until a value is found that
> contains the offending bytestring. I have attempted to coerce all
> value strings into an encoding, but Python doesn't seem to like that
> when the string is already Unicode:
>
> valuesDict[fieldString] = unicode(value, 'UTF-8')
> TypeError: decoding Unicode is not supported
>
> The solution I've arrived at is to specify the encoding for value
> strings both when reading and writing value strings.
>
> for k, v in valuesDict.iteritems():
>     valuePair = ':'.join([k, v.encode('UTF-8')])
>     [snip] ...
> wfile.write('|'.join(valueList))
>
> I'm not sure I have a question, but does this sound familiar to any
> Unicode experts out there?
>
> How should I handle these odd bytestring values? Am I doing it
> correctly, or what could I improve?
>
> Thanks!

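Since it's UTF-8, the place to look is one of the pages that describes
how UTF-8 is decoded. As it turns out, the two bytes \xc2\xa0 decode to
the single code point U+00A0, which is indeed a non-breaking space. So
your unicode('\xc2\xa0', 'utf-8') call was correct: u'\xa0' is just the
repr of that character. Decoding '\xc2' alone fails because it is only
the lead byte of a two-byte sequence.

This is probably as good as any page:
http://en.wikipedia.org/wiki/UTF-8

John Roth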
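
To make the decoding concrete, here is a minimal sketch, not something from the original script. The helper names decode_two_byte_utf8 and join_fields are made up for illustration, and replacing the non-breaking space with an ordinary space is an assumption about what the output file should contain. It is written so that it should run on both Python 2.7 and Python 3.3+.

```python
def decode_two_byte_utf8(lead, trail):
    """Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) to a code point.

    Hypothetical helper for illustration; in real code just call .decode().
    """
    # Valid two-byte lead bytes are 0xC2..0xDF; trail bytes are 10xxxxxx.
    if not (0xC2 <= lead <= 0xDF) or (trail & 0xC0) != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Keep the low 5 bits of the lead byte and the low 6 bits of the trail.
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


# 0xC2 = 110 00010 -> payload 00010
# 0xA0 = 10 100000 -> payload 100000
# 00010 100000 = 0xA0, i.e. U+00A0 NO-BREAK SPACE
code_point = decode_two_byte_utf8(0xC2, 0xA0)

# The built-in codec gives the same answer.
text = b'\xc2\xa0'.decode('utf-8')


def join_fields(values, encoding='utf-8'):
    """Join a mix of byte strings and unicode strings into encoded bytes.

    Hypothetical helper: decode once on the way in, work in unicode,
    encode once on the way out.
    """
    decoded = [v.decode(encoding) if isinstance(v, bytes) else v
               for v in values]
    joined = u'|'.join(decoded)
    # Assumption: the non-breaking spaces are meant as plain spaces.
    joined = joined.replace(u'\xa0', u' ')
    return joined.encode(encoding)
```

The point of join_fields is that byte strings and unicode strings are never mixed in a join; every value is unicode by the time it is joined, so Python has no reason to fall back on the ascii codec.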