This is good news! While the C changes aren't dead simple, they are not bad. I think they could be slightly simplified. One thing that would make it much easier to read the diff is if you create it without whitespace changes. So like this:
    svn diff -x -w

As for the Tcl changes, I think we can include those now in Pd-devel, as long as they work OK with unchanged C code. Then once the new Tcl GUI is included, we can refactor the C side with changes like this.

One other thing: it seems that ASCII chars are handled differently than UTF-8 chars in g_rtext.c. I think you could instead use wcswidth(), mbstowcs(), or other UTF-8 functions as described in the UTF-8 FAQ:

http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod

.hc

On Feb 17, 2009, at 5:53 PM, Bryan Jurish wrote:

> morning Hans, morning list,
>
> So I've tried to get the pd-devel 0.41.4 branch to use UTF-8 across the
> board. The Tk side was easy (as Hans predicted); really just a call to
> {fconfigure} in ::pd_connect::configure_socket. I also set the output
> encoding to UTF-8 on Tk's stdout and stderr, for debugging purposes;
> it's probably wisest to leave those encodings at the default (user's
> current locale LC_CTYPE) for a release-like version.
>
> The C side is much hairier. I think I've got things basically working
> (at least for message boxes and comments), but it has so far required
> changes in:
>
> FILE: g_editor.c
> + changed handling of <Key> events as passed to the C side to generate
>   UTF-8 symbol-strings rather than single-byte stringlets.
>
> + currently use sprintf("%C") to get the UTF-8 string for the codepoint
>   passed from Tk; a safer (and not too hard) way would be to pass the
>   actual UTF-8 string from Tk and just copy that: this would avoid the
>   m_pd.c hacks forcing LC_CTYPE=en_US.UTF-8 (see below). Another option
>   would be actually just writing (or borrowing) the code to generate
>   UTF-8 strings from Unicode codepoints. It's pretty simple stuff; I've
>   still got the guts of it somewhere (only written for latin-1 so far,
>   but the principle is the same for all codepoints).
>
> FILE: m_pd.c
> + added calls to setlocale() to set LC_CTYPE to en_US.UTF-8; this is an
>   ugly stinky nasty hack to get sprintf("%C") to output a UTF-8 encoded
>   string from a Unicode codepoint int, as called by canvas_key() in
>   g_editor.c
>
> FILE: g_rtext.c
> + added an 'else if' clause in rtext_key() to handle Unicode codepoints
>   as values of the 'keynum' parameter. should also be safe for any
>   8-bit fixed-width encoding.
>
> FILE: pd.tk
> + set system encoding, also output encoding for stdout, stderr to UTF-8
>
> Attached is a screenshot and a test patch. UTF-8 input from the
> keyboard works with the test patch, and gets carried through properly
> to the .pd file (and back on load).
>
> I'd like to get symbol atoms working too (haven't tried yet), but there
> are still some nasty buglets with comments and message boxes, mostly
> that editing any multibyte characters is very tricky: looks like the Tk
> point (cursor) and selection are expressed in characters, and Pd's C
> side is still thinking in bytes, though I'm totally ignorant of where
> or how that can be changed. A non-critical buglet with the same cause
> (probably) is that the C side is computing the required width for
> message boxes based on byte lengths, not character lengths, so message
> boxes containing multibyte characters look too wide. I could live with
> that, but the editing thing is a real pain...
>
> I've attached a diff of my changes against
> branches/pd-devel/0.41.4/src (please excuse commented-out debugging
> code), in case anyone wants to try this stuff out. Since it's not
> working, I'm reluctant to check anything into the pd-devel/0.41.4
> branch yet -- should I branch again for a work in progress, or do we
> just pass diffs around for now?
>
> marmosets,
>     Bryan
>
> On 2009-02-12 06:24:44, Hans-Christoph Steiner appears to have written:
>> On Feb 11, 2009, at 6:34 AM, Bryan Jurish wrote:
>>> On 2009-02-11 03:04:34, Hans-Christoph Steiner appears to
>>>> This is something that I would really like to have working
>>>> properly in Pd-devel. Tcl/Tk is natively UTF-8, so it seems that
>>>> we should support UTF-8 in Pd. Anyone feel like trying to fix it?
>
> --
> Bryan Jurish                  "There is *always* one more bug."
>                               -Lubarsky's Law of Cybernetic Entomology
>
> <test-utf8.pd><test-utf8.png>
>
> Index: m_pd.c
> ===================================================================
> --- m_pd.c (revision 10779)
> +++ m_pd.c (working copy)
> @@ -295,6 +295,18 @@
>  void glob_init(void);
>  void garray_init(void);
>
> +/*--BEGIN moo--*/
> +#include <locale.h>
> +void locale_init(void) {
> +  setlocale(LC_ALL,"");
> +  setlocale(LC_NUMERIC,"C");
> +  setlocale(LC_CTYPE,"en_US.UTF-8");
> +  /*
> +  printf("moo: locale=%s\n", setlocale(LC_ALL,NULL));
> +  printf("moo: LC_CTYPE=%s\n", setlocale(LC_CTYPE,NULL));
> +  */
> +}
> +
>  void pd_init(void)
>  {
>      mess_init();
> @@ -302,5 +314,5 @@
>      conf_init();
>      glob_init();
>      garray_init();
> +    locale_init(); /*-- moo --*/
>  }
> -
> Index: g_editor.c
> ===================================================================
> --- g_editor.c (revision 10779)
> +++ g_editor.c (working copy)
> @@ -1468,9 +1468,16 @@
>          gotkeysym = av[1].a_w.w_symbol;
>      else if (av[1].a_type == A_FLOAT)
>      {
> +        /*-- moo: old
>          char buf[3];
> -        sprintf(buf, "%c", (int)(av[1].a_w.w_float));
> +        sprintf(buf, "%c", (int)(av[1].a_w.w_float));
>          gotkeysym = gensym(buf);
> +        --*/
> +        char buf[8];
> +        sprintf(buf, "%C", (int)(av[1].a_w.w_float));
> +        /*printf("moo: charcode %%d=%d, %%c=%c, %%C=%C\n", (int)(av[1].a_w.w_float), (int)(av[1].a_w.w_float), (int)(av[1].a_w.w_float));*/
> +        /*printf("moo: buf='%s'\n", buf);*/
> +        gotkeysym = gensym(buf);
>      }
>      else gotkeysym = gensym("?");
>      fflag = (av[0].a_type == A_FLOAT ? av[0].a_w.w_float : 0);
> Index: pd_connect.tcl
> ===================================================================
> --- pd_connect.tcl (revision 10779)
> +++ pd_connect.tcl (working copy)
> @@ -11,6 +11,10 @@
>
>  proc ::pd_connect::configure_socket {sock} {
>      fconfigure $sock -blocking 0 -buffering line
> +##--moo
> +    fconfigure $sock -encoding utf-8
> +#    puts "moo: fconfigure socket -encoding = [fconfigure $sock -encoding]"
> +##--/moo
>      fileevent $sock readable {::pd_connect::pd_readsocket ""}
>  }
>
> @@ -50,6 +54,11 @@
>  proc ::pd_connect::pdsend {message} {
>      variable pd_socket
>      append message \;
> +##--moo
> +#    if {[lindex $message 1] != {motion}} {
> +#      puts "moo: pdsend enc={[fconfigure $pd_socket -encoding]} msg={$message}"
> +#    }
> +##--/moo
>      if {[catch {puts $pd_socket $message} errorname]} {
>          puts stderr "pdsend errorname: >>$errorname<<"
>          error "Not connected to 'pd' process"
> @@ -64,6 +73,9 @@
>          exit
>      }
>      append cmd_from_pd [read $pd_socket]
> +##--moo
> +#    puts "moo: pd_readsocket enc={[fconfigure $pd_socket -encoding]} cmd_from_pd={$cmd_from_pd}"
> +##--/moo
>      if {[catch {uplevel #0 $cmd_from_pd} errorname]} {
>          global errorInfo
>          puts stderr "errorname: >>$errorname<<"
> Index: pd.tk
> ===================================================================
> --- pd.tk (revision 10779)
> +++ pd.tk (working copy)
> @@ -152,6 +152,15 @@
>      #    [string range \
>      #         [registry get {HKEY_CURRENT_USER\Control Panel\International} sLanguage] 0 1] ]
>      #}
> +
> +##--moo
> +    encoding system utf-8
> +    fconfigure stderr -encoding utf-8
> +    fconfigure stdout -encoding utf-8
> +    puts "moo: encoding system = [encoding system]"
> +    puts "moo: encoding stderr = [fconfigure stderr -encoding]"
> +    puts "moo: encoding stdout = [fconfigure stdout -encoding]"
> +##--/moo
>  }
>
>  # ------------------------------------------------------------------------------
> Index: g_rtext.c
> ===================================================================
> --- g_rtext.c (revision 10779)
> +++ g_rtext.c (working copy)
> @@ -447,8 +447,13 @@
>
>  /* at Guenter's suggestion, use 'n>31' to test wither a character might
>  be printable in whatever 8-bit character set we find ourselves. */
> +/*-- moo: ... but test with "<" rather than "!=" in order to accomodate unicode codepoints for n
> +     (which we get since Tk is sending the "%A" substitution for bind <Key>",
> +     effectively reducing the coverage of this clause to 7 bits; case n>127
> +     is covered by the next clause.
> +  --*/
>
> -        if (n == '\n' || (n > 31 && n != 127))
> +        if (n == '\n' || (n > 31 /*&& n != 127*/ && n < 127)) /*-- moo --*/
>          {
>              newsize = x->x_bufsize+1;
>              x->x_buf = resizebytes(x->x_buf, x->x_bufsize, newsize);
> @@ -457,7 +462,21 @@
>              x->x_buf[x->x_selstart] = n;
>              x->x_bufsize = newsize;
>              x->x_selstart = x->x_selstart + 1;
> +        }
> +        /*--moo: check for 8-bit or unicode codepoints, assuming "keysym" is a correctly encoded (UTF-8) string--*/
> +        else if (n > 127) {
> +            int clen = strlen(keysym->s_name);
> +            newsize = x->x_bufsize + clen;
> +            x->x_buf = resizebytes(x->x_buf, x->x_bufsize, newsize);
> +            for (i = x->x_bufsize; i > x->x_selstart; i--)
> +                x->x_buf[i] = x->x_buf[i-1];
> +            x->x_bufsize = newsize;
> +            /*-- insert keysym->s_name, rather than decoding the unicode value here --*/
> +            //strncpy(x->x_buf+x->x_selstart, keysym->s_name, clen);
> +            strcpy(x->x_buf+x->x_selstart, keysym->s_name);
> +            x->x_selstart = x->x_selstart + clen;
>          }
> +        /*--/moo--*/
>          x->x_selend = x->x_selstart;
>          x->x_glist->gl_editor->e_textdirty = 1;
>      }

----------------------------------------------------------------------------

'You people have such restrictive dress for women,' she said, hobbling
away in three inch heels and panty hose to finish out another
pink-collar temp pool day.
- "Hijab Scene #2", by Mohja Kahf
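
The quoted message mentions, as an alternative to sprintf("%C") plus the setlocale() hack, "just writing (or borrowing) the code to generate UTF-8 strings from Unicode codepoints". A minimal sketch of what that could look like follows. The function name utf8_encode() and its interface are hypothetical, not part of Pd or of the posted patch; it assumes the caller supplies a buffer of at least 5 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of Pd): encode one Unicode codepoint as a
 * NUL-terminated UTF-8 string in buf, which must hold at least 5 bytes.
 * Returns the number of bytes written, not counting the NUL, or 0 if cp
 * is a surrogate or lies beyond U+10FFFF. */
static size_t utf8_encode(uint32_t cp, char *buf)
{
    size_t n;

    assert(buf != NULL);
    if (cp < 0x80) {                       /* 7 bits: plain ASCII */
        buf[0] = (char)cp;
        n = 1;
    } else if (cp < 0x800) {               /* 11 bits: 2 bytes */
        buf[0] = (char)(0xC0 | (cp >> 6));
        buf[1] = (char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {             /* 16 bits: 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            buf[0] = '\0';                 /* surrogates are not characters */
            return 0;
        }
        buf[0] = (char)(0xE0 | (cp >> 12));
        buf[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (char)(0x80 | (cp & 0x3F));
        n = 3;
    } else if (cp <= 0x10FFFF) {           /* 21 bits: 4 bytes */
        buf[0] = (char)(0xF0 | (cp >> 18));
        buf[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (char)(0x80 | (cp & 0x3F));
        n = 4;
    } else {
        buf[0] = '\0';                     /* outside the Unicode range */
        return 0;
    }
    buf[n] = '\0';
    return n;
}
```

With something like this, the A_FLOAT branch in canvas_key() could presumably fill its buffer without depending on LC_CTYPE, though whether that fits the rest of the key handling has not been checked here.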

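The editing problem described in the quoted message (Tk expressing the cursor and selection in characters while the C side counts bytes) comes down to converting between character indices and byte offsets in a UTF-8 buffer. The sketch below shows one locale-independent way to do that. Both function names are hypothetical and not part of Pd, and both assume the buffer holds valid UTF-8. Note that a character count is still not a display width: wide or combining characters would need something like wcswidth().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers (not part of Pd); both assume s holds valid UTF-8. */

/* Count characters in the first nbytes of s. Every byte that is not a
 * continuation byte (bit pattern 10xxxxxx) starts a new character. */
static size_t utf8_charcount(const char *s, size_t nbytes)
{
    size_t i, count = 0;

    assert(s != NULL);
    for (i = 0; i < nbytes; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            count++;
    return count;
}

/* Convert a character index (as a GUI would report it) to a byte offset
 * into s. Returns nbytes if char_index is at or past the end. */
static size_t utf8_byte_offset(const char *s, size_t nbytes, size_t char_index)
{
    size_t i = 0;

    assert(s != NULL);
    while (i < nbytes && char_index > 0) {
        i++;                               /* step over the lead byte */
        while (i < nbytes && ((unsigned char)s[i] & 0xC0) == 0x80)
            i++;                           /* skip its continuation bytes */
        char_index--;
    }
    return i;
}
```

For the 6-byte buffer holding "héllo", utf8_charcount() gives 5, and character index 2 (the first "l") maps to byte offset 3.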