On Thu, 2008-11-27 at 21:31 -0800, Scott McKellar wrote:
> A few days ago I submitted some experimental code for formatting UTF-8
> characters into JSON.  It was a drop-in replacement for the
> buffer_append_uescape function, and produced almost identical results.
>
> Now I have a new version of that code.  Since the first version is not
> in the repository trunk, I am attaching a full file rather than a patch.
> The associated header file doesn't need to change.

Very cool, Scott. I'll drop it into a local build and give it a run in
the current OpenSRF / Evergreen trunk environment.

> This new version differs in the following way:
>
> When it encounters a code point too big to fit into 16 bits (after
> stripping out the packaging bits), it formats it into a surrogate pair
> of four hex digits each, rather than a single set of five or six hex
> digits.
>
> In addition, this new version no longer uses buffer_fadd() to format
> hex values.
>
> The code for constructing surrogate pairs is a slightly simplified version
> of a code snippet found at:
>
>     http://www.unicode.org/faq/utf_bom.html
>
> The code snippet seems to come from a pretty authoritative source, and
> my modifications were minimal, consisting mostly of collecting a couple
> of constant expressions into constant values.
>
> In the case of the G clef character (U+1D11E), I verified that my code
> translates it to the correct surrogate pair ("\uD834\uDD1E").
>
> Unfortunately that's the only character for which I know both the code
> point and the corresponding surrogate pair.  My Google fu has failed me.
> If someone can provide a sample of code points and the corresponding
> surrogate pairs, I can do some more testing to make sure that I'm getting
> the right answers.

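For reference, the arithmetic behind a surrogate pair is small enough to
sketch in Python. This is only a sketch of the formula as given in the
Unicode FAQ linked above, not the attached C code; the function names
to_surrogate_pair and json_escape are made up for illustration, and the
lowercase hex output is an arbitrary choice.

```python
# Sketch of the UTF-16 surrogate pair arithmetic (hypothetical names).
LEAD_OFFSET = 0xD800 - (0x10000 >> 10)


def to_surrogate_pair(code_point):
    """Return (lead, trail) for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % code_point)
    lead = LEAD_OFFSET + (code_point >> 10)   # top bits select the lead unit
    trail = 0xDC00 + (code_point & 0x3FF)     # low 10 bits go in the trail unit
    return lead, trail


def json_escape(code_point):
    """Format the pair as two JSON \\uXXXX escapes in lowercase hex."""
    return "\\u%04x\\u%04x" % to_surrogate_pair(code_point)


print(json_escape(0x1D11E))  # G clef
```

For U+1D11E this prints \ud834\udd1e, which agrees with the pair quoted
above apart from letter case.
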
I started generating some examples for you using Python; maybe the
attached script will be helpful to you in generating other ranges, but
here's a snippet of what the script generates for the Ancient Greek
Numbers range (http://www.utf8-chartable.de/unicode-utf8-table.pl gives
lots of alternate representations):

65856 "\ud800\udd40" GREEK ACROPHONIC ATTIC ONE QUARTER
65857 "\ud800\udd41" GREEK ACROPHONIC ATTIC ONE HALF
65858 "\ud800\udd42" GREEK ACROPHONIC ATTIC ONE DRACHMA
65859 "\ud800\udd43" GREEK ACROPHONIC ATTIC FIVE
65860 "\ud800\udd44" GREEK ACROPHONIC ATTIC FIFTY
65861 "\ud800\udd45" GREEK ACROPHONIC ATTIC FIVE HUNDRED
65862 "\ud800\udd46" GREEK ACROPHONIC ATTIC FIVE THOUSAND
65863 "\ud800\udd47" GREEK ACROPHONIC ATTIC FIFTY THOUSAND
65864 "\ud800\udd48" GREEK ACROPHONIC ATTIC FIVE TALENTS
65865 "\ud800\udd49" GREEK ACROPHONIC ATTIC TEN TALENTS
65866 "\ud800\udd4a" GREEK ACROPHONIC ATTIC FIFTY TALENTS
65867 "\ud800\udd4b" GREEK ACROPHONIC ATTIC ONE HUNDRED TALENTS
65868 "\ud800\udd4c" GREEK ACROPHONIC ATTIC FIVE HUNDRED TALENTS
65869 "\ud800\udd4d" GREEK ACROPHONIC ATTIC ONE THOUSAND TALENTS
65870 "\ud800\udd4e" GREEK ACROPHONIC ATTIC FIVE THOUSAND TALENTS
65871 "\ud800\udd4f" GREEK ACROPHONIC ATTIC FIVE STATERS
65872 "\ud800\udd50" GREEK ACROPHONIC ATTIC TEN STATERS
65873 "\ud800\udd51" GREEK ACROPHONIC ATTIC FIFTY STATERS
65874 "\ud800\udd52" GREEK ACROPHONIC ATTIC ONE HUNDRED STATERS
65875 "\ud800\udd53" GREEK ACROPHONIC ATTIC FIVE HUNDRED STATERS
65876 "\ud800\udd54" GREEK ACROPHONIC ATTIC ONE THOUSAND STATERS
65877 "\ud800\udd55" GREEK ACROPHONIC ATTIC TEN THOUSAND STATERS
65878 "\ud800\udd56" GREEK ACROPHONIC ATTIC FIFTY THOUSAND STATERS
65879 "\ud800\udd57" GREEK ACROPHONIC ATTIC TEN MNAS
65880 "\ud800\udd58" GREEK ACROPHONIC HERAEUM ONE PLETHRON
65881 "\ud800\udd59" GREEK ACROPHONIC THESPIAN ONE
65882 "\ud800\udd5a" GREEK ACROPHONIC HERMIONIAN ONE
65883 "\ud800\udd5b" GREEK ACROPHONIC EPIDAUREAN TWO
65884 "\ud800\udd5c" GREEK ACROPHONIC THESPIAN TWO
65885 "\ud800\udd5d" GREEK ACROPHONIC CYRENAIC TWO DRACHMAS
65886 "\ud800\udd5e" GREEK ACROPHONIC EPIDAUREAN TWO DRACHMAS
65887 "\ud800\udd5f" GREEK ACROPHONIC TROEZENIAN FIVE
65888 "\ud800\udd60" GREEK ACROPHONIC TROEZENIAN TEN
65889 "\ud800\udd61" GREEK ACROPHONIC TROEZENIAN TEN ALTERNATE FORM
65890 "\ud800\udd62" GREEK ACROPHONIC HERMIONIAN TEN
65891 "\ud800\udd63" GREEK ACROPHONIC MESSENIAN TEN
65892 "\ud800\udd64" GREEK ACROPHONIC THESPIAN TEN
65893 "\ud800\udd65" GREEK ACROPHONIC THESPIAN THIRTY
65894 "\ud800\udd66" GREEK ACROPHONIC TROEZENIAN FIFTY
65895 "\ud800\udd67" GREEK ACROPHONIC TROEZENIAN FIFTY ALTERNATE FORM
65896 "\ud800\udd68" GREEK ACROPHONIC HERMIONIAN FIFTY
65897 "\ud800\udd69" GREEK ACROPHONIC THESPIAN FIFTY
65898 "\ud800\udd6a" GREEK ACROPHONIC THESPIAN ONE HUNDRED
65899 "\ud800\udd6b" GREEK ACROPHONIC THESPIAN THREE HUNDRED
65900 "\ud800\udd6c" GREEK ACROPHONIC EPIDAUREAN FIVE HUNDRED
65901 "\ud800\udd6d" GREEK ACROPHONIC TROEZENIAN FIVE HUNDRED
65902 "\ud800\udd6e" GREEK ACROPHONIC THESPIAN FIVE HUNDRED
65903 "\ud800\udd6f" GREEK ACROPHONIC CARYSTIAN FIVE HUNDRED
65904 "\ud800\udd70" GREEK ACROPHONIC NAXIAN FIVE HUNDRED
65905 "\ud800\udd71" GREEK ACROPHONIC THESPIAN ONE THOUSAND
65906 "\ud800\udd72" GREEK ACROPHONIC THESPIAN FIVE THOUSAND
65907 "\ud800\udd73" GREEK ACROPHONIC DELPHIC FIVE MNAS
65908 "\ud800\udd74" GREEK ACROPHONIC STRATIAN FIFTY MNAS
65909 "\ud800\udd75" GREEK ONE HALF SIGN
65910 "\ud800\udd76" GREEK ONE HALF SIGN ALTERNATE FORM
65911 "\ud800\udd77" GREEK TWO THIRDS SIGN
65912 "\ud800\udd78" GREEK THREE QUARTERS SIGN
65913 "\ud800\udd79" GREEK YEAR SIGN
65914 "\ud800\udd7a" GREEK TALENT SIGN
65915 "\ud800\udd7b" GREEK DRACHMA SIGN
65916 "\ud800\udd7c" GREEK OBOL SIGN
65917 "\ud800\udd7d" GREEK TWO OBOLS SIGN
65918 "\ud800\udd7e" GREEK THREE OBOLS SIGN
65919 "\ud800\udd7f" GREEK FOUR OBOLS SIGN
65920 "\ud800\udd80" GREEK FIVE OBOLS SIGN
65921 "\ud800\udd81" GREEK METRETES SIGN
65922 "\ud800\udd82" GREEK KYATHOS BASE SIGN
65923 "\ud800\udd83" GREEK LITRA SIGN
65924 "\ud800\udd84" GREEK OUNKIA SIGN
65925 "\ud800\udd85" GREEK XESTES SIGN
65926 "\ud800\udd86" GREEK ARTABE SIGN
65927 "\ud800\udd87" GREEK AROURA SIGN
65928 "\ud800\udd88" GREEK GRAMMA SIGN
65929 "\ud800\udd89" GREEK TRYBLION BASE SIGN
65930 "\ud800\udd8a" GREEK ZERO SIGN

Also, as a sanity check, I threw in a chunk of the musical symbols range:

119060 "\ud834\udd14" MUSICAL SYMBOL BRACE
119061 "\ud834\udd15" MUSICAL SYMBOL BRACKET
119062 "\ud834\udd16" MUSICAL SYMBOL ONE-LINE STAFF
119063 "\ud834\udd17" MUSICAL SYMBOL TWO-LINE STAFF
119064 "\ud834\udd18" MUSICAL SYMBOL THREE-LINE STAFF
119065 "\ud834\udd19" MUSICAL SYMBOL FOUR-LINE STAFF
119066 "\ud834\udd1a" MUSICAL SYMBOL FIVE-LINE STAFF
119067 "\ud834\udd1b" MUSICAL SYMBOL SIX-LINE STAFF
119068 "\ud834\udd1c" MUSICAL SYMBOL SIX-STRING FRETBOARD
119069 "\ud834\udd1d" MUSICAL SYMBOL FOUR-STRING FRETBOARD
119070 "\ud834\udd1e" MUSICAL SYMBOL G CLEF
119071 "\ud834\udd1f" MUSICAL SYMBOL G CLEF OTTAVA ALTA
119072 "\ud834\udd20" MUSICAL SYMBOL G CLEF OTTAVA BASSA
119073 "\ud834\udd21" MUSICAL SYMBOL C CLEF
119074 "\ud834\udd22" MUSICAL SYMBOL F CLEF
119075 "\ud834\udd23" MUSICAL SYMBOL F CLEF OTTAVA ALTA
119076 "\ud834\udd24" MUSICAL SYMBOL F CLEF OTTAVA BASSA
119077 "\ud834\udd25" MUSICAL SYMBOL DRUM CLEF-1
119078 "\ud834\udd26" MUSICAL SYMBOL DRUM CLEF-2
119079 "\ud834\udd27"

The "G CLEF" matches up, so it looks trustworthy to me.

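A listing like this can also be checked mechanically, by decoding each
escaped string and comparing the result with the code point in the first
column. The sketch below assumes Python 3 and its standard json module;
the helper name check_line is hypothetical, and the sample lines are
copied from the listings above.

```python
import json

# A few lines copied from the listings: code point, JSON string, name.
SAMPLE = r'''
65856 "\ud800\udd40" GREEK ACROPHONIC ATTIC ONE QUARTER
65930 "\ud800\udd8a" GREEK ZERO SIGN
119070 "\ud834\udd1e" MUSICAL SYMBOL G CLEF
'''


def check_line(line):
    """Return True if the escaped pair decodes back to the listed code point."""
    fields = line.split(None, 2)       # the name field may be missing
    number, escaped = fields[0], fields[1]
    decoded = json.loads(escaped)      # json joins the pair into one character
    return len(decoded) == 1 and ord(decoded) == int(number)


results = [check_line(line) for line in SAMPLE.strip().splitlines()]
print(results)
```

The same check could be pointed at output captured from the C code, as
long as each line keeps the "code point, quoted string" layout.
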
#!/usr/bin/env python
"""
Generate surrogate pairs for a range of non-BMP Unicode chars

There's probably a better way to do this than using simplejson, but oh well...
"""

import simplejson
import unicodedata

start = 65856  # start of ancient greek numbers
end = 66111    # end of the range to generate (exclusive)

ancient_greek_numbers = range(start, end)

for i in ancient_greek_numbers:
    # unichr() needs a wide (UCS-4) Python 2 build for code points above 0xFFFF
    grk = unichr(i)
    print("%d %s %s" % (i, simplejson.dumps(grk), unicodedata.name(grk, '')))
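
On Python 3 the same listing can likely be produced without simplejson:
unichr() is gone there (chr() covers the full range), and the standard
json module escapes non-ASCII characters by default. A rough equivalent,
assuming Python 3; the function name surrogate_table is hypothetical and
not part of the script above.

```python
import json
import unicodedata


def surrogate_table(start, end):
    """Yield 'code point, JSON escape, name' lines for range(start, end)."""
    for i in range(start, end):
        ch = chr(i)
        # json.dumps escapes non-ASCII by default (ensure_ascii=True), so a
        # non-BMP character comes out as a surrogate pair of \uXXXX escapes.
        yield "%d %s %s" % (i, json.dumps(ch), unicodedata.name(ch, ''))


# Three code points from the musical symbols range, starting at the G clef.
for line in surrogate_table(119070, 119073):
    print(line)
```

Wrapping the loop in a generator keeps the range as a parameter, so other
blocks can be listed without editing constants.
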