Re: [PD] locales for Pd WAS: japanese encoded chars in PD

Hans-Christoph Steiner Thu, 12 Feb 2009 18:17:45 -0800

On Thu, 12 Feb 2009, Bryan Jurish wrote:

morning all,


On 2009-02-12 20:22:22, Hans-Christoph Steiner <[email protected]> appears to
have written:

On 2009-02-12 06:24:44, Hans-Christoph Steiner <[email protected]> appears to
have written:

On Feb 11, 2009, at 6:34 AM, Bryan Jurish wrote:

for me, pd *does* display utf-8
strings correctly in message boxes (tested with umlauts äöü, as well as
Greek πδ


Hmm, I am not sure that UTF-8 really is well supported.  Some chars get
thru, but many don't.  Here's an example.  I typed these chars in a
UTF-8 text editor (attached as a png) and in a pd patch.  Not quite the same.


... I'm not really sure what (if anything) we can conclude from this.
Maybe the text editor is making UTF-8 out of the keyboard input?  The Pd
patch itself is most certainly not UTF-8 encoded, which makes me suspect
that either (a) Pd is dropping non-printing shift bytes (IOhannes has
pointed out similar goofiness in t_binbuf, but I thought it was only
restricted to NUL bytes) or (b) Tk isn't receiving UTF-8 character codes
at all (whether this is Tk's fault or a system configuration issue is
another question).  At least the latter should be testable with a few
quick wish hacks...


Pd does seem to measure the bytes of the string, measuring the UTF-8
shift bytes as chars.  For example, in barf-both.pd, the message box of
the utf-8 example is much longer than the text inside, while with the
latin1, it is the correct size.
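
To make the byte-counting point concrete, here is a small Python sketch
(not Pd code; the helper name byte_width is made up for illustration).
It assumes only that the box width is derived from the number of bytes
rather than the number of characters, which is what the observation
above suggests:

```python
# Compare character count with encoded byte count for the same text.
# "\u00e4\u00f6\u00fc" is the string "äöü" written with escapes.
text = "\u00e4\u00f6\u00fc"

utf8_bytes = text.encode("utf-8")      # each of these chars needs 2 bytes
latin1_bytes = text.encode("latin-1")  # each char needs exactly 1 byte


def byte_width(data: bytes) -> int:
    """Width a byte-counting layout would use: one column per byte.

    Hypothetical helper, standing in for whatever Pd really does.
    """
    return len(data)


print("characters:       ", len(text))
print("utf-8 byte width: ", byte_width(utf8_bytes))
print("latin-1 byte width:", byte_width(latin1_bytes))
```

With UTF-8 the byte width is twice the character count for these three
umlauts, which would match a message box that is drawn too long; with
latin-1 the two numbers agree.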


yup.

I don't know if you have followed Pd-devel 0.41.4 at all, but I have
gotten to the point where the GUI is 100% Tcl/Tk so playing with this
stuff should be a lot easier.  Check out the branch, if you would like
to try things.


soon.

Setting LC_CTYPE=en_US.UTF-8 and re-loading "unibarf.pd" got me an odd
error message from Pd though:

Pd: buffer space wasn't sufficient for long GUI string
(repeated 3 times)


I am guessing that the above error comes from the fact that Pd is
written for latin1 where every char is always 1 byte, so sending UTF-8
could confuse things, since UTF-8 can have multi-byte chars.


Kinda; but why is it only the presence of *latin-1* message boxes that
cause complaints about "long GUI strings" (try deleting the utf-8
message box & reloading: the error disappears).  I think an error is
certainly justified in this case (we're feeding a latin-1 encoded
message box to a Pd using a UTF-8 locale); I was just surprised by the
form the error took ;-)
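
A possible reason why only the latin-1 box trips things up under a UTF-8
locale: latin-1 bytes for accented chars are usually not valid UTF-8,
whereas any byte sequence at all is "valid" latin-1.  The Python sketch
below shows just that asymmetry; it says nothing about where Pd's "long
GUI string" message actually comes from, and try_decode is a made-up
helper:

```python
# "\u00e4\u00f6\u00fc" is "äöü" written with escapes.
text = "\u00e4\u00f6\u00fc"
latin1_data = text.encode("latin-1")  # b'\xe4\xf6\xfc'
utf8_data = text.encode("utf-8")      # b'\xc3\xa4\xc3\xb6\xc3\xbc'


def try_decode(data: bytes, encoding: str):
    """Return the decoded string, or None if the bytes are invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# latin-1 bytes read as UTF-8: rejected, 0xe4 starts a multi-byte
# sequence but is not followed by continuation bytes.
print(try_decode(latin1_data, "utf-8"))

# UTF-8 bytes read as latin-1: never an error, but every 2-byte char
# turns into two wrong chars (mojibake).
print(repr(try_decode(utf8_data, "latin-1")))
```

So a consumer expecting UTF-8 can detect latin-1 input as broken, while a
consumer expecting latin-1 silently shows garbage for UTF-8 input.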


I think that Tcl/Tk tries to guess the locale of the data coming in from
the network socket, then translate it to UTF-8 and back.  Some of the
weirdness we are seeing could be related to that.  In Pd-devel, it's much
clearer, so it would be straightforward to play with this encoding
translation stuff, and perhaps turn it off.  Ideally we could have UTF-8
coming from Pd so that Tk doesn't need to do any translation.  That
could speed up things like array/graph redrawing.


Are we certain that Tk is actually translating at all, and not just
using some 8-bit default like latin-1 when it finds non-UTF-8 input?  I
ask because that's what Perl does by default, a behavior which continues
to give me headaches.  In Perl, each string has its own internal "utf8"
flag which tells you whether Perl is currently thinking of that string
as a raw byte-string in some unknown encoding or as a "native" (utf8)
character string... I assume Tcl/Tk does something similar, but don't
know how to test for this property there.

Here's the doc that I read on this topic, but it probably doesn't have
the level of detail that you require:


http://tcl.tk/man/tcl8.5/TclCmd/fconfigure.htm#M8

As for Tk hacking for Pd, a big part of the pd-devel effort is to make
the Tk GUI code readable, and even extendable! Feel free to hit me with
questions, either here, or I am in #dataflow quite a bit these days.

.hc

I don't know for sure, but I suspect one problem might be in the
interpretation of user input


I don't know about the pd side, but Tcl/Tk is all UTF-8 natively, so
that is no problem.


Hmm... not sure what you mean by "natively" here... I mean, Perl uses
UTF-8 as its "native" string encoding, but you can still manipulate byte
strings, read & write files etc in other encodings too.


Yes, same idea.  Internally, Tcl/Tk is using UTF-8, but it can freely
translate between other encodings.


see above.

If we're
talking about user input and the Pd GUI, I think the main issue is how
keyboard input is captured by Tk and passed on to Pd.  If the keyboard
input is being grabbed by Tk bind()ing KeyPress events, then maybe we
just need to edit that bind() call... looks like the KeyPress relevant
"%"-substitutions are (from the Tk bind() manpage):

[snip]

... I'm curious enough to try these out now... just have to dust off my
long unused Tcl/Tk skills a bit ;-)

... so if we're lucky, we can just replace "%k" with "%A" and all will
be good... except for file I/O, which will likely still be done at a raw
byte level.  At this point, all "pure" latin-1 patches will proceed to
break (maybe just display problems, maybe more serious).  If we say
we're going whole-hog utf-8, we can say that it's the user's problem to
recode any such files (e.g. with iconv or recode; I'm happy to help out
with a few scripts); otherwise we might want to do something paranoid
and try to guess a patch's encoding when it's loaded.  Or we use
locale-dependent functions, but that makes sharing patches harder
between people using different locales.  Or we use the XML-style
solution and just save the encoding to use in the patch header ;-)
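
For what the "paranoid" guessing plus a recode utility could look like,
here is a minimal Python sketch.  It is an assumption-laden illustration,
not a proposal for Pd's loader: the function names are invented, the
sample patch text is made up, and the only legacy encoding considered is
latin-1.

```python
def guess_and_decode(raw: bytes):
    """Return (text, encoding_name) for the raw bytes of a patch file.

    Strict UTF-8 is tried first; anything that fails is assumed to be
    latin-1, which can decode every possible byte sequence.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode("latin-1"), "latin-1"


def recode_to_utf8(raw: bytes) -> bytes:
    """Recode patch bytes to UTF-8, leaving valid UTF-8 input unchanged."""
    text, _ = guess_and_decode(raw)
    return text.encode("utf-8")


# Made-up patch content containing "äöü" (written with escapes).
patch = "#N canvas 0 0 450 300 10;\n#X msg 10 10 \u00e4\u00f6\u00fc;\n"
legacy = patch.encode("latin-1")

print(guess_and_decode(legacy)[1])
print(recode_to_utf8(legacy) == patch.encode("utf-8"))
```

The guess can be wrong: a latin-1 file whose high bytes happen to form
valid UTF-8 sequences would be taken for UTF-8.  That is rare in real
text, but it is the argument for storing the encoding explicitly.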


Yeah, this would be a good thing to rewrite.  The canvas_key code is
definitely in need of refactoring anyway.  Pd has never really supported
latin1 or any encoding besides ASCII, so I think we should just aim to
make everything UTF-8, then make conversion utilities like you mentioned.


I'll have a look, but always in the past I've been scared off whenever
I've tried to look deeper into Pd's Tk side.

bash$ export LC_CTYPE=en_DK.UTF-8
bash$ pd uselocale.pd barf-both.pd   ##-- latin-1 displays incorrectly

bash$ export LC_CTYPE=en_DK.ISO-8859-1
bash$ pd uselocale.pd barf-both.pd   ##-- all displays ok

If it turns out to work well, we can of course make a trivial "dummy"
external out of it for use with "-lib" ...


Hmm, I tried this on Mac OS X and it didn't seem to make a difference.
Perhaps it's a platform issue, though on this level, Mac OS X is very
much BSD, so I think it should work.


The locale strategy also depends on what locales your system has
installed.  Here (linux/debian), I can see which locales are installed
with:

  bash$ locale -a

... I would expect goofiness trying to use "en_DK.UTF-8" if it's not
been installed ...


I was using en_US.UTF-8.  It seems to me that there is an extra dash in
your locale.  On Mac OS X, 'locale -a' tells me: en_US.ISO8859-1  On
debian/stable, it tells me en_US.iso88591.  Does every system have
different names for the latin1?  Arg....  I tried a bunch of variations
of the locale and LANG and LC_CTYPE on Mac OS X, but I couldn't get the
barf-both.pd to look different.


curiouser and curiouser.  I think on debian both "iso88591" and
"ISO-8859-1" should work as charmaps.  Similarly, both "utf8" and "UTF-8"
ought to work.  The locale(1) manpage says:

 FILES
   /usr/share/i18n/SUPPORTED
       List of supported values (and their associated encoding) for
       the locale name.  This representation is recommended over
       --all-locales one, due being the system wide supported values.

... /usr/share/i18n/SUPPORTED (and /etc/locale.gen) includes for example
"ISO-8859-1", but not "iso88591".  `locale -a` on the other hand outputs
"iso88591" but not "ISO-8859-1".  I'm not sure whether the relevant
standard (ISO/IEC 9945 aka POSIX?) says anything about the form that
charmap names have to take.  Looking at
http://faqs.cs.uu.nl/na-dir/internationalization/iso-8859-1-charset.html,
I find:

 "Currently, each system vendor has his own set of locale names, which
makes portability a bit problematic."

Bummer.
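
If a script ever has to compare locale names across systems, one
workaround is to compare charmap names only after stripping punctuation
and case.  The Python sketch below does that; the function names are
invented, and the rule itself is an assumption for illustration, not
something taken from POSIX or from any particular libc:

```python
def canonical_charmap(name: str) -> str:
    """Lowercase a charmap name and drop everything but letters/digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def same_locale(a: str, b: str) -> bool:
    """Compare two 'lang_TERRITORY.charmap' names, charmap spelling aside."""
    lang_a, _, charmap_a = a.partition(".")
    lang_b, _, charmap_b = b.partition(".")
    return (lang_a == lang_b
            and canonical_charmap(charmap_a) == canonical_charmap(charmap_b))


# The spellings seen in this thread.
for name in ("ISO-8859-1", "ISO8859-1", "iso88591", "UTF-8", "utf8"):
    print(name, "->", canonical_charmap(name))

print(same_locale("en_US.ISO8859-1", "en_US.iso88591"))
print(same_locale("en_US.UTF-8", "en_DK.utf8"))
```

This only helps with matching names in a script; whether setlocale()
accepts a given spelling is still up to the system.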

marmosets,
        Bryan
--
Bryan Jurish                           "There is *always* one more bug."
[email protected]      -Lubarsky's Law of Cybernetic Entomology


        zen
           \
            \
             \

_______________________________________________
[email protected] mailing list
UNSUBSCRIBE and account-management -> 
http://lists.puredata.info/listinfo/pd-list
