On Feb 12, 2009, at 4:40 AM, Bryan Jurish wrote:

> moin Hans, moin all,
>
> On 2009-02-12 06:24:44, Hans-Christoph Steiner <[email protected]>
> appears to have written:
>> On Feb 11, 2009, at 6:34 AM, Bryan Jurish wrote:
>>> for me, pd *does* display utf-8
>>> strings correctly in message boxes (tested with umlauts äöü, as
>>> well as Greek πδ
>>
>> Hmm, I am not sure that UTF-8 really is well supported.  Some chars
>> get thru, but many don't.  Here's an example.  I typed these chars
>> in a UTF-8 text editor as a png and a pd patch.  Not quite the same.
>
> ... I'm not really sure what (if anything) we can conclude from this.
> Maybe the text editor is making UTF-8 out of the keyboard input?  The
> Pd patch itself is most certainly not UTF-8 encoded, which makes me
> suspect that either (a) Pd is dropping non-printing shift bytes
> (IOhannes has pointed out similar goofiness in t_binbuf, but I thought
> it was only restricted to NUL bytes) or (b) Tk isn't receiving UTF-8
> character codes at all (whether this is Tk's fault or a system
> configuration issue is another question).  At least the latter should
> be testable with a few quick wish hacks...

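One quick way to test the "not UTF-8 encoded" suspicion, and to see why
the extra bytes matter: a minimal sketch in Python (not Pd or Tk code;
the helper name is_valid_utf8 is made up for illustration).  It encodes
the same three umlauts both ways and checks whether each byte string
decodes as strict UTF-8.

```python
# Minimal sketch: compare the byte sequences of the same text in Latin-1
# and UTF-8, and test whether raw bytes decode as strict UTF-8.
# is_valid_utf8 is a hypothetical helper, not part of Pd or Tcl/Tk.


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The umlauts a, o, u with diaeresis, written as escapes so the example
# does not depend on the encoding of this source file.
text = "\u00e4\u00f6\u00fc"

latin1_bytes = text.encode("latin-1")  # one byte per character
utf8_bytes = text.encode("utf-8")      # two bytes per character here

# Character count, Latin-1 byte count, UTF-8 byte count.
print(len(text), len(latin1_bytes), len(utf8_bytes))  # 3 3 6

# Latin-1 bytes for these characters are not valid UTF-8; UTF-8 bytes are.
print(is_valid_utf8(latin1_bytes), is_valid_utf8(utf8_bytes))  # False True
```

To check an actual patch, read the file in binary mode and pass the
bytes to the helper.  A pure-ASCII patch passes too, since ASCII is a
subset of UTF-8, so this only tells the two encodings apart once
non-ASCII characters are present.
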
Pd does seem to measure the bytes of the string, counting the UTF-8
shift bytes as chars.  For example, in barf-both.pd, the message box of
the utf-8 example is much longer than the text inside, while with the
latin1, it is the correct size.

I don't know if you have followed Pd-devel 0.41.4 at all, but I have
gotten to the point where the GUI is 100% Tcl/Tk, so playing with this
stuff should be a lot easier.  Check out the branch if you would like
to try things.

>>> Setting LC_CTYPE=en_US.UTF-8 and re-loading "unibarf.pd" got me an
>>> odd error message from Pd though:
>>>
>>> Pd: buffer space wasn't sufficient for long GUI string
>>> (repeated 3 times)
>>
>> I am guessing that the above error comes from the fact that Pd is
>> written for latin1 where every char is always 1 byte, so sending
>> UTF-8 could confuse things, since UTF-8 can have multi-byte chars.
>
> Kinda; but why is it only the presence of *latin-1* message boxes that
> cause complaints about "long GUI strings" (try deleting the utf-8
> message box & reloading: the error disappears).  I think an error is
> certainly justified in this case (we're feeding a latin-1 encoded
> message box to a Pd using a UTF-8 locale); I was just surprised by the
> form the error took ;-)

I think that Tcl/Tk tries to guess the encoding of the data coming in
from the network socket, then translates it to UTF-8 and back.  Some of
the weirdness we are seeing could be related to that.  In Pd-devel,
it's much clearer, so it would be straightforward to play with this
encoding translation stuff, and perhaps turn it off.  Ideally we could
have UTF-8 coming from Pd so that Tk doesn't need to do any
translation.  That could speed up things like array/graph redrawing.

>>> I don't know for sure, but I suspect one problem might be in the
>>> interpretation of user input
>>
>> I don't know about the pd side, but Tcl/Tk is all UTF-8 natively, so
>> that is no problem.
>
> Hmm... not sure what you mean by "natively" here...
> I mean, Perl uses
> UTF-8 as its "native" string encoding, but you can still manipulate
> byte strings, read & write files etc in other encodings too.

Yes, same idea.  Internally, Tcl/Tk is using UTF-8, but it can freely
translate between other encodings.

> If we're
> talking about user input and the Pd GUI, I think the main issue is how
> keyboard input is captured by Tk and passed on to Pd.  If the keyboard
> input is being grabbed by Tk bind()ing KeyPress events, then maybe we
> just need to edit that bind() call... looks like the KeyPress relevant
> "%"-substitutions are (from the Tk bind() manpage):
>
> %k - The keycode field from the event. Valid only for KeyPress and
> KeyRelease events.
>
> %A - Substitutes the UNICODE character corresponding to the event, or
> the empty string if the event does not correspond to a UNICODE
> character (e.g. the shift key was pressed). XmbLookupString (or
> XLookupString when input method support is turned off) does all the
> work of translating from the event to a UNICODE character. Valid only
> for KeyPress and KeyRelease events.
>
> %K - The keysym corresponding to the event, substituted as a textual
> string. Valid only for KeyPress and KeyRelease events.
>
> %N - The keysym corresponding to the event, substituted as a decimal
> number. Valid only for KeyPress and KeyRelease events.
>
> ... so if we're lucky, we can just replace "%k" with "%A" and all will
> be good... except for file I/O, which will likely still be done at a
> raw byte level.  At this point, all "pure" latin-1 patches will
> proceed to break (maybe just display problems, maybe more serious).
> If we say we're going whole-hog utf-8, we can say that it's the user's
> problem to recode any such files (e.g. with iconv or recode; I'm happy
> to help out with a few scripts); otherwise we might want to do
> something paranoid and try to guess a patch's encoding when it's
> loaded.
> Or we use
> locale-dependent functions, but that makes sharing patches harder
> between people using different locales.  Or we use the XML-style
> solution and just save the encoding to use in the patch header ;-)

Yeah, this would be a good thing to rewrite.  The canvas_key code is
definitely in need of refactoring anyway.  Pd has never really
supported latin1 or any encoding besides ASCII, so I think we should
just aim to make everything UTF-8, then make conversion utilities like
you mentioned.

>>> bash$ export LC_CTYPE=en_DK.UTF-8
>>> bash$ pd uselocale.pd barf-both.pd  ##-- latin-1 displays incorrectly
>>>
>>> bash$ export LC_CTYPE=en_DK.ISO-8859-1
>>> bash$ pd uselocale.pd barf-both.pd  ##-- all displays ok
>>>
>>> If it turns out to work well, we can of course make a trivial
>>> "dummy" external out of it for use with "-lib" ...
>>
>> Hmm, I tried this on Mac OS X and it didn't seem to make a
>> difference.  Perhaps it's a platform issue, though on this level,
>> Mac OS X is very much BSD, so I think it should work.
>
> The locale strategy also depends on what locales your system has
> installed.  Here (linux/debian), I can see which locales are
> installed with:
>
> bash$ locale -a
>
> ... I would expect goofiness trying to use "en_DK.UTF-8" if it's not
> been installed ...

I was using en_US.UTF-8.  It seems to me that there is an extra dash in
your locale.  On Mac OS X, 'locale -a' tells me: en_US.ISO8859-1.  On
debian/stable, it tells me en_US.iso88591.  Does every system have a
different name for latin1?  Arg....

I tried a bunch of variations of the locale and LANG and LC_CTYPE on
Mac OS X, but I couldn't get barf-both.pd to look different.

.hc

>
>
> marmosets,
> Bryan
>
> --
> Bryan Jurish                "There is *always* one more bug."
> [email protected]           -Lubarsky's Law of Cybernetic Entomology

----------------------------------------------------------------------------

As we enjoy great advantages from inventions of others, we should be
glad of an opportunity to serve others by any invention of ours; and
this we should do freely and generously.  - Benjamin Franklin

_______________________________________________
[email protected] mailing list
UNSUBSCRIBE and account-management -> http://lists.puredata.info/listinfo/pd-list
