On Apr 1, 8:13 pm, Chris Rebert <c...@rebertia.com> wrote:
> On Thu, Apr 1, 2010 at 4:38 AM, Mister Yu <eryan...@gmail.com> wrote:
> > On Apr 1, 7:22 pm, Chris Rebert <c...@rebertia.com> wrote:
> >> 2010/4/1 Mister Yu <eryan...@gmail.com>:
> >> > hi experts,
> >> >
> >> > i m new to python, i m writing crawlers to extract data from some
> >> > chinese websites, and i run into a encoding problem.
> >> >
> >> > i have a unicode object, which looks like this u'\xd6\xd0\xce\xc4'
> >> > which is encoded in "gb2312",
> <snip>
> > hi, thanks for the tips.
> >
> > but i m still not very sure how to convert a unicode object **
> > u'\xd6\xd0\xce\xc4 ** back to "中文" the string it supposed to be?
>
> Ah, my apologies! I overlooked something (sorry, it's early in the
> morning where I am).
> What you have there is ***really*** screwy. It's the 2 Chinese
> characters, encoded in gb2312, and then somehow cast *directly* into a
> 'unicode' string (which ought never to be done).
>
> In answer to your original question (after some experimentation):
> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\xd6\xd0\xce\xc4'])
> unicode_string = gb2312_bytes.decode('gb2312')
> utf8_bytes = unicode_string.encode('utf-8') #as you wanted
>
> If possible, I'd look at the code that's giving you that funky
> "string" in the first place and see if it can be fixed to give you
> either a proper bytestring or proper unicode string rather than the
> bastardized mess you're currently having to deal with.
>
> Apologies again and Cheers,
> Chris
> --http://blog.rebertia.com

Hi Chris,

Thanks for the great tips! It works like a charm.

I'm using the Scrapy project (http://doc.scrapy.org/intro/tutorial.html)
to write my crawler. When it extracts data with XPath, it puts the
Chinese characters directly into the unicode object.

Thanks again, Chris, and have a good April Fools' Day.

Cheers,
Yu
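
For anyone who finds this thread later, here is a minimal sketch of the same repair written for Python 3, where the `chr(ord(c))` join from Chris's snippet no longer produces a byte string. It assumes, as in the case above, that every character in the broken string has a code point below 256 and that the underlying bytes really are GB2312. The function name `repair_mojibake` is made up for illustration and is not part of Scrapy or the standard library.

```python
def repair_mojibake(text, real_encoding="gb2312"):
    """Recover text whose encoded bytes were stored one byte per code point.

    Hypothetical helper for illustration only.
    """
    # Latin-1 maps code points 0-255 to the identical byte values, so
    # encoding with it gives back the original raw bytes. It raises
    # UnicodeEncodeError if the text contains anything above U+00FF.
    raw_bytes = text.encode("latin-1")
    # Decode those bytes with the encoding they were actually written in.
    return raw_bytes.decode(real_encoding)


broken = "\xd6\xd0\xce\xc4"          # the string from the original question
fixed = repair_mojibake(broken)      # the two Chinese characters
utf8_bytes = fixed.encode("utf-8")   # UTF-8 bytes, as originally requested

print(fixed == "\u4e2d\u6587")
```

This is only a workaround. As Chris says, the better fix is to make the code that produces the string decode the page with the right encoding in the first place.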