Re: converting html escape sequences to unicode characters

Kent Johnson Thu, 09 Dec 2004 17:30:05 -0800

harrelson wrote:

> I have a list of about 2500 html escape sequences (decimal) that I need
> to convert to utf-8.  Stuff like:
>
> &#48708;
> &#54665;
> &#44592;
> &#47196;
> &#48372;
> &#45244;
> &#44144;
> &#50640;
> &#50836;
> &#45236;
> &#47732;
> &#44552;
> &#51060;
> &#50620;
> &#47560;
> &#51648;
> &#51104;
>
> Anyone know what the decimal is representing?  It doesn't seem to
> equate to a unicode codepoint...


In well-formed HTML (!) these are numeric character references, so each
number should be the decimal value of a Unicode code point. See
http://www.w3.org/TR/html4/charset.html#h-5.3.1

These characters appear to be Hangul Syllables:
http://www.unicode.org/charts/PDF/UAC00.pdf
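
To check that claim by hand, here is a small sketch (Python 3 syntax; the
helper name describe is made up for illustration). It assumes the
reference has exactly the form &#NNNN; and uses the Hangul Syllables
block range U+AC00..U+D7A3:

```python
# Hangul Syllables block boundaries in the Unicode code charts
HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3

def describe(ref):
    """Parse a decimal reference such as '&#48708;'.

    Returns (decimal value, 'U+XXXX' notation, whether the value
    lies inside the Hangul Syllables block).
    """
    num = int(ref[2:-1])  # strip the leading '&#' and the trailing ';'
    return num, "U+%04X" % num, HANGUL_FIRST <= num <= HANGUL_LAST

print(describe("&#48708;"))  # (48708, 'U+BE44', True)
```

So the decimal number is the code point, just not written in the hex form
the code charts use.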

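import unicodedata

# The decimal values from the list above
nums = [48708, 54665, 44592, 47196, 48372, 45244, 44144, 50640, 50836,
        45236, 47732, 44552, 51060, 50620, 47560, 51648, 51104]

for num in nums:
    print num, unicodedata.name(unichr(num), 'Unknown')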

=>
48708 HANGUL SYLLABLE BI
54665 HANGUL SYLLABLE HAENG
44592 HANGUL SYLLABLE GI
47196 HANGUL SYLLABLE RO
48372 HANGUL SYLLABLE BO
45244 HANGUL SYLLABLE NAEL
44144 HANGUL SYLLABLE GEO
50640 HANGUL SYLLABLE E
50836 HANGUL SYLLABLE YO
45236 HANGUL SYLLABLE NAE
47732 HANGUL SYLLABLE MYEON
44552 HANGUL SYLLABLE GEUM
51060 HANGUL SYLLABLE I
50620 HANGUL SYLLABLE EOL
47560 HANGUL SYLLABLE MA
51648 HANGUL SYLLABLE JI
51104 HANGUL SYLLABLE JAM
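
For the actual conversion to utf-8, something along these lines should
work. This is only a sketch in Python 3 syntax (where chr replaces
unichr); the function name refs_to_text is made up, and the regex handles
decimal references only, not hex (&#x...;) or named entities:

```python
import html
import re

def refs_to_text(s):
    """Replace decimal references like &#48708; with the characters they name."""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), s)

refs = "&#48708;&#54665;&#44592;"
text = refs_to_text(refs)     # a str holding three Hangul syllables
utf8 = text.encode("utf-8")   # the same text as UTF-8 bytes, 3 bytes per syllable

print(text)
print(utf8)

# The standard library's html.unescape decodes these references as well
assert html.unescape(refs) == text
```

html.unescape is probably the simpler choice if the input may also contain
hex or named references.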

Kent
