On Feb 19, 2009, at 4:13 PM, Bryan Jurish wrote:

> moin Hans, moin list,
>
> On 2009-02-19 18:43:49, Hans-Christoph Steiner <[email protected]>
> appears to have written:
>>
>> This is good news!  While the C changes aren't dead simple, they are
>> not bad.  I think they could be slightly simplified.  One thing that
>> would make it much easier to read the diff is if you create it
>> without whitespace changes.  So like this:
>>
>> svn diff -x -w
>
> oops, sorry... duly noted for future diffs ... I also set my emacs'
> tcl-indent-width to 8 ... sorry sorry sorry ...
>
>> As for the Tcl changes, I think we can include those now in Pd-devel,
>> as long as they can work ok with unchanged C code.
>
> Done.
>
>> Then once the new Tcl GUI is included we can refactor the C side of
>> things with things like this.
>
>> One other thing, it seems that the ASCII chars are handled differently
>> than the UTF-8 chars in g_rtext.c, I think you could use instead
>> wcswidth(), mbstowcs() or other UTF-8 functions as described in the
>> UTF-8 FAQ
>>
>> http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod
>
> Certainly, but (A) we already have the UTF-8 byte string in keysym,
> and we need to append that whole string to the buffer anyways, and (B)
> using wcswidth() & co requires forcing the locale to have a UTF-8
> LC_CTYPE.  I know I did this in m_pd.c, but I think that was a HACK
> and that using locale functions here is the Wrong Way To Do It,
> because it's dangerous, unportable, and slow (warning: rant follows):
>
> __dangerous__: setting the locale is global for all threads of a
> process; in forcing the locale, we could conceivably mess with desired
> behavior elsewhere (e.g. in externals).
>
> __unportable__: we don't even know if all users' machines *have* a
> UTF-8 locale installed, and even if they do, we don't know what it's
> called.  If we don't force the encoding, we're stuck with either "C"
> (i.e. ASCII; what we've got now in Pd-vanilla), or whatever the user
> is currently employing (after setlocale(LC_ALL,"")), which makes
> patches' appearance dependent on the user's encoding (e.g. what we've
> got now in Pd-vanilla), and doesn't even work in the case of
> variable-length encodings such as UTF-8.
>
> __slow__: many locale-based conversion functions are known to be
> pretty darned slow.  if we assume we're always dealing with (valid)
> UTF-8, we can speed things up considerably.  going straight to wchar_t
> is another option, but would require many more changes on the C side,
> likely break the C API, and wouldn't solve the locale-dependency of
> patches' appearances, which I think is a really good argument for
> UTF-8.

Isn't it pretty safe to assume these days that UTF-8 is supported?  One
thing I just found out is that Windows uses a 2-byte char natively
(UCS-2?), I think Mac OS X uses UTF-8 natively.  I think that most
Linux tools should work with UTF-8 too, especially since it can work as
ASCII.  So you think we can have full UTF-8 support without using those
functions?

> (rant finished now, sorry)
>
> That said, a faster implementation would probably result from mixing
> (something like) wcswidth() and strncpy(...,keysym).  Functions like
> wcswidth() and mbstowcs() are pretty easy to cook up if we assume
> wchar_t is UCS-4 and the multibyte encoding is UTF-8.

It seems to me that wcswidth() would be used for measuring the length
of the text for display in boxes.  I suppose strlen() could still be
used for allocating and freeing memory, but I think that we should aim
for clean code.  If you think the current way in your diff is the best,
that's fine by me.

> There are a number of libraries and code snippets floating about in
> the net making just such assumptions.  In this context: are there any
> licensing restrictions on code included in pd-devel?  So far, I've
> found one useful-looking (.c,.h) pair in the public domain, as well as
> some LGPL code from gnulib, which could be linked in statically.
> There's also code from the Unicode Consortium themselves, but it's
> pretty monstrous (read "pedantic") and limited to string-to-string
> conversions.

Well, Pd-vanilla is BSD licensed, and Pd-extended is GPL'ed.  For this
stage of Pd-devel, it would be good to keep it to something that can be
BSD licensed.

.hc

>
>
> marmosets,
> Bryan
>
>> On Feb 17, 2009, at 5:53 PM, Bryan Jurish wrote:
>>
>>> So I've tried to get the pd-devel 0.41.4 branch to use UTF-8 across
>>> the board.  The TK side was easy (as Hans predicted);
> [snip]
>>> The C side is much hairier.
> [snip]
>
> --
> Bryan Jurish "There is *always* one more bug."
> [email protected] -Lubarsky's Law of Cybernetic Entomology

----------------------------------------------------------------------------

Access to computers should be unlimited and total.  - the hacker ethic

_______________________________________________
[email protected] mailing list
UNSUBSCRIBE and account-management -> http://lists.puredata.info/listinfo/pd-list
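
The remark in the thread that functions like wcswidth() and mbstowcs()
are easy to cook up once the input is known to be UTF-8 can be sketched
roughly as follows. This is only an illustration, not code from the
patch under discussion or from Pd. The names u8_charcount() and
u8_decode() are hypothetical, the input is assumed to be a
NUL-terminated string, and the decoder does not reject overlong forms
or surrogates. Neither function calls setlocale() or depends on
LC_CTYPE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of Pd): count the code points in a
 * NUL-terminated string that is ASSUMED to be valid UTF-8.  Every byte
 * that is not a continuation byte (10xxxxxx) starts a new character. */
size_t u8_charcount(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Hypothetical helper: decode the code point starting at s into *out.
 * Returns the number of bytes consumed (1..4), or 0 if the lead byte is
 * invalid or the sequence is truncated.  Assumption: overlong forms and
 * surrogates are NOT rejected, so this is not a full validator. */
size_t u8_decode(const char *s, uint32_t *out)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len, i;
    uint32_t cp;

    assert(s != NULL && out != NULL);
    if (p[0] < 0x80)                { cp = p[0];        len = 1; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; len = 4; }
    else return 0;  /* stray continuation byte or invalid lead byte */

    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;  /* truncated or malformed sequence */
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    *out = cp;
    return len;
}
```

For the string "h\xC3\xA9llo" (h, e-acute, l, l, o), strlen() reports 6
bytes, which is the number that matters for allocating and freeing
memory, while u8_charcount() reports 5 characters. A character count is
still not a display width: wcswidth() also accounts for double-width and
zero-width characters, which this sketch ignores.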
