Re: [PD-dev] UTF-8 for pd-devel (again)

Miller Puckette Tue, 19 Jan 2010 21:32:20 -0800

I'm into the idea but want to see the GUI rewrite resolved first :)

M



On Tue, Jan 19, 2010 at 10:56:28PM -0500, Hans-Christoph Steiner wrote:
> 
> Miller, how about the UTF-8 patch?
> 
> .hc
> 
> On Jan 19, 2010, at 10:15 PM, Miller Puckette wrote:
> 
> >127 is 'delete' -- ascii all right, but not 'printable'.
> >
> >cheers
> >Miller
> >
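For readers following the 127 question: a minimal sketch (not part of
Bryan's patch) of the 7-bit printable test under discussion.  The helper
name is_printable_ascii is made up for illustration; the condition is the
one from the patch line quoted further down.

```c
/* Hypothetical helper, not from the patch: mirrors the condition
   n == '\n' || (n > 31 && n < 127) discussed in this thread. */
static int is_printable_ascii(int n)
{
    /* 0..31 are control characters, 127 is DEL, and anything above
       127 is outside 7-bit ASCII altogether. */
    return n == '\n' || (n > 31 && n < 127);
}
```

With this, 'A' and 126 ('~') pass, while 127 (DEL) and anything beyond
7 bits, such as 0xE4, do not; the patch handles n > 127 in a separate
UTF-8 branch.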
> >On Tue, Jan 19, 2010 at 09:37:08PM -0500, Hans-Christoph Steiner wrote:
> >>
> >>Looks good to me. One comment, shouldn't this be n<128?  127 is an
> >>ASCII char, AFAIK.
> >>
> >>+        if (n == '\n' || (n > 31 && n < 127))
> >>
> >>It looks worth checking to me, hopefully we can get Miller and others
> >>to weigh in on it.
> >>
> >>.hc
> >>
> >>On Jan 19, 2010, at 4:16 PM, Bryan Jurish wrote:
> >>
> >>>morning all,
> >>>
> >>>attached is a UTF-8 support patch against
> >>>branches/pd-gui-rewrite/0.43 revision 13051 (HEAD as of an hour or so
> >>>ago).  most of the bulk is new files (s_utf8.c, s_utf8.h), most other
> >>>changes are in g_rtext.c.  It's not too monstrous, and I've tested it
> >>>again here briefly with some utf-8 test patches (see other attachment),
> >>>and things appear to be working as expected.  if desired, I can check
> >>>this in; otherwise feel free to do it for me ;-)
> >>>
> >>>2 annoying things here during testing (I don't see how my patches could
> >>>have caused this, but you never know):
> >>>
> >>>(1) all loaded patch windows appear at +0+0 (upper left corner), which
> >>>with my wm (windowmaker) means the title bar is off the screen, and I
> >>>have to resort to keyboard shortcuts to get them mouse-draggable, which
> >>>is a major pain in the wazoo: is this a known bug?
> >>>
> >>>(2) I can't figure out how to get at the properties dialog for number,
> >>>number2, or any other gui-atom objects: should these be working already?
> >>>
> >>>marmosets,
> >>>   Bryan
> >>>
> >>>On 2010-01-18 23:09:34, Hans-Christoph Steiner <[email protected]>
> >>>appears to have written:
> >>>>
> >>>>Awesome!  If it's big and complicated, I say post it to the list first,
> >>>>if not too bad, then just commit.
> >>>>
> >>>>.hc
> >>>>
> >>>>On Jan 18, 2010, at 4:47 AM, Bryan Jurish wrote:
> >>>>
> >>>>>moin Hans, moin list,
> >>>>>
> >>>>>I think perhaps I never actually did post the cleaned-up patch anywhere
> >>>>>(bad programmer, no biscuit);  I guess I'll check out
> >>>>>branches/pd-gui-rewrite/0.43 and try patching my changes in; then I can
> >>>>>either commit or just post the (updated) patch.  Hopefully no major
> >>>>>additional changes will be required, so it ought to go pretty fast.
> >>>>>
> >>>>>marmosets,
> >>>>> Bryan
> >>>>>
> >>>>>On 2010-01-17 22:57:33, Hans-Christoph Steiner <[email protected]>
> >>>>>appears to have written:
> >>>>>>
> >>>>>>Hey Bryan,
> >>>>>>
> >>>>>>I'd like to try to get your UTF-8 code into pd-gui-rewrite.  You
> >>>>>>mention in this posting back in May that you had the whole thing
> >>>>>>working.  I couldn't find the diff/patch for this.  Is it posted
> >>>>>>anywhere?  Do you want to try to check it in yourself directly to the
> >>>>>>pd-gui-rewrite/0.43 branch?
> >>>>>>
> >>>>>>.hc
> >>>>>>
> >>>>>>
> >>>>>>On Mar 20, 2009, at 6:16 PM, Bryan Jurish wrote:
> >>>>>>
> >>>>>>>morning all,
> >>>>>>>
> >>>>>>>Of course I never really like to see my code wither away in the bit
> >>>>>>>bucket, but I personally don't have any pressing need for UTF-8
> >>>>>>>symbols, comments, etc. in Pd -- I'm a native English speaker, after
> >>>>>>>all ;-)
> >>>>>>>
> >>>>>>>Also, my changes are by no means the only way to do it (or even the
> >>>>>>>best way); we could gain a little speed by slapping on some more
> >>>>>>>buffers (mostly and possibly only in rtext_senditup()), but since this
> >>>>>>>seems to affect only GUI/editing stuff, I think we can live with a
> >>>>>>>smidgeon of additional cpu time ... after all, it's all O(n) anyways.
> >>>>>>>
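The O(n) cost mentioned above comes from converting between character
indices and byte offsets in a UTF-8 buffer.  A rough sketch of both
directions, assuming valid, NUL-terminated UTF-8; the helper names are
hypothetical and this is not the code from the patch:

```c
#include <stddef.h>

/* A byte starts a new character unless it is a UTF-8 continuation
   byte, i.e. unless its top two bits are 10. */
static int is_lead_byte(unsigned char c)
{
    return (c & 0xC0) != 0x80;
}

/* Hypothetical helper: byte offset of the charnum-th character.
   Linear in the number of bytes scanned. */
static size_t char_to_byte_offset(const char *s, size_t charnum)
{
    size_t offs = 0;
    while (charnum > 0 && s[offs] != '\0') {
        offs++;  /* step past the lead byte */
        while (s[offs] != '\0' && !is_lead_byte((unsigned char)s[offs]))
            offs++;  /* skip continuation bytes */
        charnum--;
    }
    return offs;
}

/* Hypothetical helper: number of characters that start within the
   first nbytes bytes.  Also linear. */
static size_t byte_to_char_offset(const char *s, size_t nbytes)
{
    size_t count = 0, offs = 0;
    while (offs < nbytes && s[offs] != '\0') {
        if (is_lead_byte((unsigned char)s[offs]))
            count++;
        offs++;
    }
    return count;
}
```

For the 4-byte string "a\xC3\xA4z" (a, a-umlaut, z), character index 2
maps to byte offset 3, and the first 3 bytes hold 2 characters.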
> >>>>>>>Really I just wanted to see how easy (or difficult) it would be to get
> >>>>>>>Pd to use UTF-8 as its internal encoding... turned out to be harder
> >>>>>>>than I had thought, but (ever so slightly) easier than I had
> >>>>>>>feared :-/
> >>>>>>>
> >>>>>>>marmosets,
> >>>>>>>Bryan
> >>>>>>>
> >>>>>>>On 2009-03-20 18:39:06, Hans-Christoph Steiner <[email protected]>
> >>>>>>>appears to have written:
> >>>>>>>>
> >>>>>>>>I wonder what the best approach is to getting it included.  I also
> >>>>>>>>think it's a very valuable contribution.  I think we need to first
> >>>>>>>>get the Tcl/Tk only changes done, since that was the mandate of the
> >>>>>>>>pd-devel 0.41 effort.  Then once Miller has accepted those changes,
> >>>>>>>>we can start with the C modifications there.  So how to proceed
> >>>>>>>>next, I think, is based on how eager you are, Bryan, to get this
> >>>>>>>>into a regular build.
> >>>>>>>>
> >>>>>>>>One option is making a pd-devel-utf8 branch, another is posting these
> >>>>>>>>patches to the patch tracker and waiting for Miller to make his next
> >>>>>>>>update with the Pd-devel Tcl-Tk code.
> >>>>>>>>
> >>>>>>>>Maybe we can get Miller to chime in on this topic.
> >>>>>>>>
> >>>>>>>>.hc
> >>>>>>>>
> >>>>>>>>On Mar 13, 2009, at 12:00 AM, dmotd wrote:
> >>>>>>>>
> >>>>>>>>>hey bryan,
> >>>>>>>>>
> >>>>>>>>>just a quick note of appreciation for getting this one out.. i hope
> >>>>>>>>>it gets picked up in miller's build soon.. a very useful and
> >>>>>>>>>necessary modification.
> >>>>>>>>>
> >>>>>>>>>well done!
> >>>>>>>>>
> >>>>>>>>>dmotd
> >>>>>>>>>
> >>>>>>>>>On Thursday 12 March 2009 08:07:50 Bryan Jurish wrote:
> >>>>>>>>>>moin folks,
> >>>>>>>>>>
> >>>>>>>>>>I believe I've finally got pd-devel 0.41-4 using UTF-8 across the
> >>>>>>>>>>board.  So far, I've tested message boxes & comments (g_rtext), as
> >>>>>>>>>>well as symbol atoms, and all seems good.  I think we can still
> >>>>>>>>>>expect goofiness if someone names an abstraction using a multibyte
> >>>>>>>>>>character when the filesystem isn't UTF-8 encoded (raw 8-bit works
> >>>>>>>>>>for me here too), but I really don't want to open that particular
> >>>>>>>>>>can of worms.
> >>>>>>>>>>
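One way to picture the filesystem goofiness: the same name has different
bytes in an 8-bit encoding and in UTF-8.  The sketch below assumes the
8-bit encoding is ISO-8859-1 (Latin-1), which the original post does not
say; latin1_to_utf8 is a hypothetical helper, not something in Pd.

```c
#include <stddef.h>

/* Hypothetical helper: convert a NUL-terminated ISO-8859-1 string to
   UTF-8.  Assumes dest has room for twice the source length plus one
   byte.  Returns the number of bytes written, excluding the NUL. */
static size_t latin1_to_utf8(char *dest, const char *src)
{
    size_t n = 0;
    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        if (c < 0x80) {
            dest[n++] = (char)c;                    /* ASCII: unchanged */
        } else {
            /* Latin-1 byte values equal code points U+0080..U+00FF,
               which always take two bytes in UTF-8. */
            dest[n++] = (char)(0xC0 | (c >> 6));    /* lead byte */
            dest[n++] = (char)(0x80 | (c & 0x3F));  /* continuation */
        }
    }
    dest[n] = '\0';
    return n;
}
```

Converting "J\xE4ger" (5 bytes in Latin-1) yields 6 bytes, with 0xE4
becoming 0xC3 0xA4, so a byte-wise comparison of the two names fails.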
> >>>>>>>>>>So I guess I have 2 questions:
> >>>>>>>>>>
> >>>>>>>>>>(1) what should I call the generic UTF-8 source files? (see my other
> >>>>>>>>>>post)
> >>>>>>>>>>
> >>>>>>>>>>(2) shall I commit these changes to pd-devel/0.41-4, or somewhere
> >>>>>>>>>>else, or just post a diff (ca. 33k, ought to be easier to read now;
> >>>>>>>>>>I've tried to follow the indentation conventions of the source files
> >>>>>>>>>>I modified)?
> >>>>>>>>>>
> >>>>>>>>>>marmosets,
> >>>>>>>>>>Bryan
> >>>>>>>
> >>>>>>>-- 
> >>>>>>>Bryan Jurish                           "There is *always* one more bug."
> >>>>>>>[email protected]      -Lubarsky's Law of Cybernetic Entomology
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>----------------------------------------------------------------------------
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>The arc of history bends towards justice.     - Dr. Martin Luther King, Jr.
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>-- 
> >>>>>***************************************************
> >>>>>
> >>>>>Bryan Jurish
> >>>>>Deutsches Textarchiv
> >>>>>Berlin-Brandenburgische Akademie der Wissenschaften
> >>>>>
> >>>>>Jägerstr. 22/23
> >>>>>10117 Berlin
> >>>>>
> >>>>>Tel.:      +49 (0)30 20370 539
> >>>>>E-Mail:    [email protected]
> >>>>>
> >>>>>***************************************************
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>>----------------------------------------------------------------------------
> >>>>
> >>>>
> >>>>As we enjoy great advantages from inventions of others, we should be
> >>>>glad of an opportunity to serve others by any invention of ours; and
> >>>>this we should do freely and generously.         - Benjamin Franklin
> >>>>
> >>>>
> >>>>
> >>>
> >>>-- 
> >>>Bryan Jurish                       "There is *always* one more bug."
> >>>[email protected]       -Lubarsky's Law of Cybernetic Entomology
> >>>Index: src/Makefile.am
> >>>===================================================================
> >>>--- src/Makefile.am        (revision 13051)
> >>>+++ src/Makefile.am        (working copy)
> >>>@@ -24,6 +24,7 @@
> >>>  m_conf.c m_glob.c m_sched.c \
> >>>  s_main.c s_inter.c s_file.c s_print.c \
> >>>  s_loader.c s_path.c s_entry.c s_audio.c s_midi.c \
> >>>+    s_utf8.c \
> >>>  d_ugen.c d_ctl.c d_arithmetic.c d_osc.c d_filter.c d_dac.c d_misc.c \
> >>>  d_math.c d_fft.c d_array.c d_global.c \
> >>>  d_delay.c d_resample.c \
> >>>Index: src/g_editor.c
> >>>===================================================================
> >>>--- src/g_editor.c (revision 13051)
> >>>+++ src/g_editor.c (working copy)
> >>>@@ -9,6 +9,7 @@
> >>>#include "s_stuff.h"
> >>>#include "g_canvas.h"
> >>>#include <string.h>
> >>>+#include "s_utf8.h" /*-- moo --*/
> >>>
> >>>void glist_readfrombinbuf(t_glist *x, t_binbuf *b, char *filename,
> >>>  int selectem);
> >>>@@ -1666,8 +1667,9 @@
> >>>      gotkeysym = av[1].a_w.w_symbol;
> >>>  else if (av[1].a_type == A_FLOAT)
> >>>  {
> >>>-        char buf[3];
> >>>-        sprintf(buf, "%c", (int)(av[1].a_w.w_float));
> >>>+        /*-- moo: assume keynum is a Unicode codepoint; encode as UTF-8 --*/
> >>>+        char buf[UTF8_MAXBYTES1];
> >>>+        u8_wc_toutf8_nul(buf, (UCS4)(av[1].a_w.w_float));
> >>>      gotkeysym = gensym(buf);
> >>>  }
> >>>  else gotkeysym = gensym("?");
> >>>Index: src/s_utf8.c
> >>>===================================================================
> >>>--- src/s_utf8.c   (revision 0)
> >>>+++ src/s_utf8.c   (revision 0)
> >>>@@ -0,0 +1,280 @@
> >>>+/*
> >>>+  Basic UTF-8 manipulation routines
> >>>+  by Jeff Bezanson
> >>>+  placed in the public domain Fall 2005
> >>>+
> >>>+  This code is designed to provide the utilities you need to manipulate
> >>>+  UTF-8 as an internal string encoding. These functions do not perform the
> >>>+  error checking normally needed when handling UTF-8 data, so if you happen
> >>>+  to be from the Unicode Consortium you will want to flay me alive.
> >>>+  I do this because error checking can be performed at the boundaries (I/O),
> >>>+  with these routines reserved for higher performance on data known to be
> >>>+  valid.
> >>>+
> >>>+  modified by Bryan Jurish (moo) March 2009
> >>>+  + removed some unneeded functions (escapes, printf etc), added others
> >>>+*/
> >>>+#include <stdlib.h>
> >>>+#include <stdio.h>
> >>>+#include <string.h>
> >>>+#include <stdarg.h>
> >>>+#ifdef WIN32
> >>>+#include <malloc.h>
> >>>+#else
> >>>+#include <alloca.h>
> >>>+#endif
> >>>+
> >>>+#include "s_utf8.h"
> >>>+
> >>>+static const u_int32_t offsetsFromUTF8[6] = {
> >>>+    0x00000000UL, 0x00003080UL, 0x000E2080UL,
> >>>+    0x03C82080UL, 0xFA082080UL, 0x82082080UL
> >>>+};
> >>>+
> >>>+static const char trailingBytesForUTF8[256] = {
> >>>+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  
> >>>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> >>>+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  
> >>>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> >>>+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  
> >>>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> >>>+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  
> >>>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> >>>+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  
> >>>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> >>>+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  
> >>>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> >>>+    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,  
> >>>1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
> >>>+    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,  
> >>>3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
> >>>+};
> >>>+
> >>>+
> >>>+/* returns length of next utf-8 sequence */
> >>>+int u8_seqlen(char *s)
> >>>+{
> >>>+    return trailingBytesForUTF8[(unsigned int)(unsigned char)s[0]] + 1;
> >>>+}
> >>>+
> >>>+/* conversions without error checking
> >>>+   only works for valid UTF-8, i.e. no 5- or 6-byte sequences
> >>>+   srcsz = source size in bytes, or -1 if 0-terminated
> >>>+   sz = dest size in # of wide characters
> >>>+
> >>>+   returns # characters converted
> >>>+   dest will always be L'\0'-terminated, even if there isn't enough room
> >>>+   for all the characters.
> >>>+   if sz = srcsz+1 (i.e. 4*srcsz+4 bytes), there will always be enough space.
> >>>+*/
> >>>+int u8_toucs(u_int32_t *dest, int sz, char *src, int srcsz)
> >>>+{
> >>>+    u_int32_t ch;
> >>>+    char *src_end = src + srcsz;
> >>>+    int nb;
> >>>+    int i=0;
> >>>+
> >>>+    while (i < sz-1) {
> >>>+        nb = trailingBytesForUTF8[(unsigned char)*src];
> >>>+        if (srcsz == -1) {
> >>>+            if (*src == 0)
> >>>+                goto done_toucs;
> >>>+        }
> >>>+        else {
> >>>+            if (src + nb >= src_end)
> >>>+                goto done_toucs;
> >>>+        }
> >>>+        ch = 0;
> >>>+        switch (nb) {
> >>>+            /* these fall through deliberately */
> >>>+#if UTF8_SUPPORT_FULL_UCS4
> >>>+        case 5: ch += (unsigned char)*src++; ch <<= 6;
> >>>+        case 4: ch += (unsigned char)*src++; ch <<= 6;
> >>>+#endif
> >>>+        case 3: ch += (unsigned char)*src++; ch <<= 6;
> >>>+        case 2: ch += (unsigned char)*src++; ch <<= 6;
> >>>+        case 1: ch += (unsigned char)*src++; ch <<= 6;
> >>>+        case 0: ch += (unsigned char)*src++;
> >>>+        }
> >>>+        ch -= offsetsFromUTF8[nb];
> >>>+        dest[i++] = ch;
> >>>+    }
> >>>+ done_toucs:
> >>>+    dest[i] = 0;
> >>>+    return i;
> >>>+}
> >>>+
> >>>+/* srcsz = number of source characters, or -1 if 0-terminated
> >>>+   sz = size of dest buffer in bytes
> >>>+
> >>>+   returns # characters converted
> >>>+   dest will only be '\0'-terminated if there is enough space. this is
> >>>+   for consistency; imagine there are 2 bytes of space left, but the next
> >>>+   character requires 3 bytes. in this case we could NUL-terminate, but in
> >>>+   general we can't when there's insufficient space. therefore this function
> >>>+   only NUL-terminates if all the characters fit, and there's space for
> >>>+   the NUL as well.
> >>>+   the destination string will never be bigger than the source string.
> >>>+*/
> >>>+int u8_toutf8(char *dest, int sz, u_int32_t *src, int srcsz)
> >>>+{
> >>>+    u_int32_t ch;
> >>>+    int i = 0;
> >>>+    char *dest_end = dest + sz;
> >>>+
> >>>+    while (srcsz<0 ? src[i]!=0 : i < srcsz) {
> >>>+        ch = src[i];
> >>>+        if (ch < 0x80) {
> >>>+            if (dest >= dest_end)
> >>>+                return i;
> >>>+            *dest++ = (char)ch;
> >>>+        }
> >>>+        else if (ch < 0x800) {
> >>>+            if (dest >= dest_end-1)
> >>>+                return i;
> >>>+            *dest++ = (ch>>6) | 0xC0;
> >>>+            *dest++ = (ch & 0x3F) | 0x80;
> >>>+        }
> >>>+        else if (ch < 0x10000) {
> >>>+            if (dest >= dest_end-2)
> >>>+                return i;
> >>>+            *dest++ = (ch>>12) | 0xE0;
> >>>+            *dest++ = ((ch>>6) & 0x3F) | 0x80;
> >>>+            *dest++ = (ch & 0x3F) | 0x80;
> >>>+        }
> >>>+        else if (ch < 0x110000) {
> >>>+            if (dest >= dest_end-3)
> >>>+                return i;
> >>>+            *dest++ = (ch>>18) | 0xF0;
> >>>+            *dest++ = ((ch>>12) & 0x3F) | 0x80;
> >>>+            *dest++ = ((ch>>6) & 0x3F) | 0x80;
> >>>+            *dest++ = (ch & 0x3F) | 0x80;
> >>>+        }
> >>>+        i++;
> >>>+    }
> >>>+    if (dest < dest_end)
> >>>+        *dest = '\0';
> >>>+    return i;
> >>>+}
> >>>+
> >>>+/* moo: get byte length of character number, or 0 if not supported */
> >>>+int u8_wc_nbytes(u_int32_t ch)
> >>>+{
> >>>+  if (ch < 0x80) return 1;
> >>>+  if (ch < 0x800) return 2;
> >>>+  if (ch < 0x10000) return 3;
> >>>+  if (ch < 0x200000) return 4;
> >>>+#if UTF8_SUPPORT_FULL_UCS4
> >>>+  /*-- moo: support full UCS-4 range? --*/
> >>>+  if (ch < 0x4000000) return 5;
> >>>+  if (ch < 0x7fffffffUL) return 6;
> >>>+#endif
> >>>+  return 0; /*-- bad input --*/
> >>>+}
> >>>+
> >>>+int u8_wc_toutf8(char *dest, u_int32_t ch)
> >>>+{
> >>>+    if (ch < 0x80) {
> >>>+        dest[0] = (char)ch;
> >>>+        return 1;
> >>>+    }
> >>>+    if (ch < 0x800) {
> >>>+        dest[0] = (ch>>6) | 0xC0;
> >>>+        dest[1] = (ch & 0x3F) | 0x80;
> >>>+        return 2;
> >>>+    }
> >>>+    if (ch < 0x10000) {
> >>>+        dest[0] = (ch>>12) | 0xE0;
> >>>+        dest[1] = ((ch>>6) & 0x3F) | 0x80;
> >>>+        dest[2] = (ch & 0x3F) | 0x80;
> >>>+        return 3;
> >>>+    }
> >>>+    if (ch < 0x110000) {
> >>>+        dest[0] = (ch>>18) | 0xF0;
> >>>+        dest[1] = ((ch>>12) & 0x3F) | 0x80;
> >>>+        dest[2] = ((ch>>6) & 0x3F) | 0x80;
> >>>+        dest[3] = (ch & 0x3F) | 0x80;
> >>>+        return 4;
> >>>+    }
> >>>+    return 0;
> >>>+}
> >>>+
> >>>+/*-- moo --*/
> >>>+int u8_wc_toutf8_nul(char *dest, u_int32_t ch)
> >>>+{
> >>>+  int sz = u8_wc_toutf8(dest,ch);
> >>>+  dest[sz] = '\0';
> >>>+  return sz;
> >>>+}
> >>>+
> >>>+/* charnum => byte offset */
> >>>+int u8_offset(char *str, int charnum)
> >>>+{
> >>>+    int offs=0;
> >>>+
> >>>+    while (charnum > 0 && str[offs]) {
> >>>+        (void)(isutf(str[++offs]) || isutf(str[++offs]) ||
> >>>+               isutf(str[++offs]) || ++offs);
> >>>+        charnum--;
> >>>+    }
> >>>+    return offs;
> >>>+}
> >>>+
> >>>+/* byte offset => charnum */
> >>>+int u8_charnum(char *s, int offset)
> >>>+{
> >>>+    int charnum = 0, offs=0;
> >>>+
> >>>+    while (offs < offset && s[offs]) {
> >>>+        (void)(isutf(s[++offs]) || isutf(s[++offs]) ||
> >>>+               isutf(s[++offs]) || ++offs);
> >>>+        charnum++;
> >>>+    }
> >>>+    return charnum;
> >>>+}
> >>>+
> >>>+/* reads the next utf-8 sequence out of a string, updating an index */
> >>>+u_int32_t u8_nextchar(char *s, int *i)
> >>>+{
> >>>+    u_int32_t ch = 0;
> >>>+    int sz = 0;
> >>>+
> >>>+    do {
> >>>+        ch <<= 6;
> >>>+        ch += (unsigned char)s[(*i)++];
> >>>+        sz++;
> >>>+    } while (s[*i] && !isutf(s[*i]));
> >>>+    ch -= offsetsFromUTF8[sz-1];
> >>>+
> >>>+    return ch;
> >>>+}
> >>>+
> >>>+/* number of characters */
> >>>+int u8_strlen(char *s)
> >>>+{
> >>>+    int count = 0;
> >>>+    int i = 0;
> >>>+
> >>>+    while (u8_nextchar(s, &i) != 0)
> >>>+        count++;
> >>>+
> >>>+    return count;
> >>>+}
> >>>+
> >>>+void u8_inc(char *s, int *i)
> >>>+{
> >>>+    (void)(isutf(s[++(*i)]) || isutf(s[++(*i)]) ||
> >>>+           isutf(s[++(*i)]) || ++(*i));
> >>>+}
> >>>+
> >>>+void u8_dec(char *s, int *i)
> >>>+{
> >>>+    (void)(isutf(s[--(*i)]) || isutf(s[--(*i)]) ||
> >>>+           isutf(s[--(*i)]) || --(*i));
> >>>+}
> >>>+
> >>>+/*-- moo --*/
> >>>+void u8_inc_ptr(char **sp)
> >>>+{
> >>>+  (void)(isutf(*(++(*sp))) || isutf(*(++(*sp))) ||
> >>>+   isutf(*(++(*sp))) || ++(*sp));
> >>>+}
> >>>+
> >>>+/*-- moo --*/
> >>>+void u8_dec_ptr(char **sp)
> >>>+{
> >>>+  (void)(isutf(*(--(*sp))) || isutf(*(--(*sp))) ||
> >>>+   isutf(*(--(*sp))) || --(*sp));
> >>>+}
> >>>Index: src/g_rtext.c
> >>>===================================================================
> >>>--- src/g_rtext.c  (revision 13051)
> >>>+++ src/g_rtext.c  (working copy)
> >>>@@ -13,6 +13,7 @@
> >>>#include "m_pd.h"
> >>>#include "s_stuff.h"
> >>>#include "g_canvas.h"
> >>>+#include "s_utf8.h"
> >>>
> >>>
> >>>#define LMARGIN 2
> >>>@@ -32,10 +33,10 @@
> >>>
> >>>struct _rtext
> >>>{
> >>>-    char *x_buf;
> >>>-    int x_bufsize;
> >>>-    int x_selstart;
> >>>-    int x_selend;
> >>>+    char *x_buf;    /*-- raw byte string, assumed UTF-8 encoded (moo) --*/
> >>>+    int x_bufsize;  /*-- byte length --*/
> >>>+    int x_selstart; /*-- byte offset --*/
> >>>+    int x_selend;   /*-- byte offset --*/
> >>>  int x_active;
> >>>  int x_dragfrom;
> >>>  int x_height;
> >>>@@ -119,6 +120,15 @@
> >>>
> >>>/* LATER deal with tcl-significant characters */
> >>>
> >>>+/* firstone(), lastone()
> >>>+ *  + returns byte offset of (first|last) occurrence of 'c' in 's[0..n-1]', or
> >>>+ *    -1 if none was found
> >>>+ *  + 's' is a raw byte string
> >>>+ *  + 'c' is a byte value
> >>>+ *  + 'n' is the length (in bytes) of the prefix of 's' to be searched.
> >>>+ *  + we could make these functions work on logical characters in utf8 strings,
> >>>+ *    but we don't really need to...
> >>>+ */
> >>>static int firstone(char *s, int c, int n)
> >>>{
> >>>  char *s2 = s + n;
> >>>@@ -155,6 +165,16 @@
> >>>  of the entire text in pixels.
> >>>  */
> >>>
> >>>+   /*-- moo:
> >>>+    * + some variables from the original version have been renamed
> >>>+    * + variables with a "_b" suffix are raw byte strings, lengths, or offsets
> >>>+    * + variables with a "_c" suffix are logical character lengths or offsets
> >>>+    *   (assuming valid UTF-8 encoded byte string in x->x_buf)
> >>>+    * + a fair amount of O(n) computations required to convert between raw byte
> >>>+    *   offsets (needed by the C side) and logical character offsets (needed by
> >>>+    *   the GUI)
> >>>+    */
> >>>+
> >>>  /* LATER get this and sys_vgui to work together properly,
> >>>      breaking up messages as needed.  As of now, there's
> >>>      a limit of 1950 characters, imposed by sys_vgui(). */
> >>>@@ -171,14 +191,16 @@
> >>>{
> >>>  t_float dispx, dispy;
> >>>  char smallbuf[200], *tempbuf;
> >>>-    int outchars = 0, nlines = 0, ncolumns = 0,
> >>>+    int outchars_b = 0, nlines = 0, ncolumns = 0,
> >>>      pixwide, pixhigh, font, fontwidth, fontheight, findx, findy;
> >>>  int reportedindex = 0;
> >>>  t_canvas *canvas = glist_getcanvas(x->x_glist);
> >>>-    int widthspec = x->x_text->te_width;
> >>>-    int widthlimit = (widthspec ? widthspec : BOXWIDTH);
> >>>-    int inindex = 0;
> >>>-    int selstart = 0, selend = 0;
> >>>+    int widthspec_c = x->x_text->te_width;
> >>>+    int widthlimit_c = (widthspec_c ? widthspec_c : BOXWIDTH);
> >>>+    int inindex_b = 0;
> >>>+    int inindex_c = 0;
> >>>+    int selstart_b = 0, selend_b = 0;
> >>>+    int x_bufsize_c = u8_charnum(x->x_buf, x->x_bufsize);
> >>>      /* if we're a GOP (the new, "goprect" style) borrow the font size
> >>>      from the inside to preserve the spacing */
> >>>  if (pd_class(&x->x_text->te_pd) == canvas_class &&
> >>>@@ -193,65 +215,76 @@
> >>>  if (x->x_bufsize >= 100)
> >>>       tempbuf = (char *)t_getbytes(2 * x->x_bufsize + 1);
> >>>  else tempbuf = smallbuf;
> >>>-    while (x->x_bufsize - inindex > 0)
> >>>+    while (x_bufsize_c - inindex_c > 0)
> >>>  {
> >>>-        int inchars = x->x_bufsize - inindex;
> >>>-        int maxindex = (inchars > widthlimit ? widthlimit : inchars);
> >>>+        int inchars_b  = x->x_bufsize - inindex_b;
> >>>+        int inchars_c  = x_bufsize_c  - inindex_c;
> >>>+        int maxindex_c = (inchars_c > widthlimit_c ? widthlimit_c : inchars_c);
> >>>+        int maxindex_b = u8_offset(x->x_buf + inindex_b, maxindex_c);
> >>>      int eatchar = 1;
> >>>-        int foundit = firstone(x->x_buf + inindex, '\n', maxindex);
> >>>-        if (foundit < 0)
> >>>+        int foundit_b  = firstone(x->x_buf + inindex_b, '\n', maxindex_b);
> >>>+        int foundit_c;
> >>>+        if (foundit_b < 0)
> >>>      {
> >>>-            if (inchars > widthlimit)
> >>>+            if (inchars_c > widthlimit_c)
> >>>          {
> >>>-                foundit = lastone(x->x_buf + inindex, ' ', maxindex);
> >>>-                if (foundit < 0)
> >>>+                foundit_b = lastone(x->x_buf + inindex_b, ' ', maxindex_b);
> >>>+                if (foundit_b < 0)
> >>>              {
> >>>-                    foundit = maxindex;
> >>>+                    foundit_b = maxindex_b;
> >>>+                    foundit_c = maxindex_c;
> >>>                  eatchar = 0;
> >>>              }
> >>>+                else
> >>>+                    foundit_c = u8_charnum(x->x_buf + inindex_b, foundit_b);
> >>>          }
> >>>          else
> >>>          {
> >>>-                foundit = inchars;
> >>>+                foundit_b = inchars_b;
> >>>+                foundit_c = inchars_c;
> >>>              eatchar = 0;
> >>>          }
> >>>      }
> >>>+        else
> >>>+            foundit_c = u8_charnum(x->x_buf + inindex_b, foundit_b);
> >>>+
> >>>      if (nlines == findy)
> >>>      {
> >>>          int actualx = (findx < 0 ? 0 :
> >>>-                (findx > foundit ? foundit : findx));
> >>>-            *indexp = inindex + actualx;
> >>>+                (findx > foundit_c ? foundit_c : findx));
> >>>+            *indexp = inindex_b + u8_offset(x->x_buf + inindex_b, actualx);
> >>>          reportedindex = 1;
> >>>      }
> >>>-        strncpy(tempbuf+outchars, x->x_buf + inindex, foundit);
> >>>-        if (x->x_selstart >= inindex &&
> >>>-            x->x_selstart <= inindex + foundit + eatchar)
> >>>-                selstart = x->x_selstart + outchars - inindex;
> >>>-        if (x->x_selend >= inindex &&
> >>>-            x->x_selend <= inindex + foundit + eatchar)
> >>>-                selend = x->x_selend + outchars - inindex;
> >>>-        outchars += foundit;
> >>>-        inindex += (foundit + eatchar);
> >>>-        if (inindex < x->x_bufsize)
> >>>-            tempbuf[outchars++] = '\n';
> >>>-        if (foundit > ncolumns)
> >>>-            ncolumns = foundit;
> >>>+        strncpy(tempbuf+outchars_b, x->x_buf + inindex_b, foundit_b);
> >>>+        if (x->x_selstart >= inindex_b &&
> >>>+            x->x_selstart <= inindex_b + foundit_b + eatchar)
> >>>+                selstart_b = x->x_selstart + outchars_b - inindex_b;
> >>>+        if (x->x_selend >= inindex_b &&
> >>>+            x->x_selend <= inindex_b + foundit_b + eatchar)
> >>>+                selend_b = x->x_selend + outchars_b - inindex_b;
> >>>+        outchars_b += foundit_b;
> >>>+        inindex_b += (foundit_b + eatchar);
> >>>+        inindex_c += (foundit_c + eatchar);
> >>>+        if (inindex_b < x->x_bufsize)
> >>>+            tempbuf[outchars_b++] = '\n';
> >>>+        if (foundit_c > ncolumns)
> >>>+            ncolumns = foundit_c;
> >>>      nlines++;
> >>>  }
> >>>  if (!reportedindex)
> >>>-        *indexp = outchars;
> >>>+        *indexp = outchars_b;
> >>>  dispx = text_xpix(x->x_text, x->x_glist);
> >>>  dispy = text_ypix(x->x_text, x->x_glist);
> >>>  if (nlines < 1) nlines = 1;
> >>>-    if (!widthspec)
> >>>+    if (!widthspec_c)
> >>>  {
> >>>      while (ncolumns < 3)
> >>>      {
> >>>-            tempbuf[outchars++] = ' ';
> >>>+            tempbuf[outchars_b++] = ' ';
> >>>          ncolumns++;
> >>>      }
> >>>  }
> >>>-    else ncolumns = widthspec;
> >>>+    else ncolumns = widthspec_c;
> >>>  pixwide = ncolumns * fontwidth + (LMARGIN + RMARGIN);
> >>>  pixhigh = nlines * fontheight + (TMARGIN + BMARGIN);
> >>>
> >>>@@ -259,31 +292,32 @@
> >>>      sys_vgui("pdtk_text_new .x%lx.c {%s %s text} %f %f {%.*s} %d %s\n",
> >>>          canvas, x->x_tag, rtext_gettype(x)->s_name,
> >>>          dispx + LMARGIN, dispy + TMARGIN,
> >>>-            outchars, tempbuf, sys_hostfontsize(font),
> >>>+            outchars_b, tempbuf, sys_hostfontsize(font),
> >>>          (glist_isselected(x->x_glist,
> >>>              &x->x_glist->gl_gobj)? "blue" : "black"));
> >>>  else if (action == SEND_UPDATE)
> >>>  {
> >>>      sys_vgui("pdtk_text_set .x%lx.c %s {%.*s}\n",
> >>>-            canvas, x->x_tag, outchars, tempbuf);
> >>>+            canvas, x->x_tag, outchars_b, tempbuf);
> >>>      if (pixwide != x->x_drawnwidth || pixhigh != x->x_drawnheight)
> >>>          text_drawborder(x->x_text, x->x_glist, x->x_tag,
> >>>              pixwide, pixhigh, 0);
> >>>      if (x->x_active)
> >>>      {
> >>>-            if (selend > selstart)
> >>>+            if (selend_b > selstart_b)
> >>>          {
> >>>              sys_vgui(".x%lx.c select from %s %d\n", canvas,
> >>>-                    x->x_tag, selstart);
> >>>+                    x->x_tag, u8_charnum(x->x_buf, selstart_b));
> >>>              sys_vgui(".x%lx.c select to %s %d\n", canvas,
> >>>-                    x->x_tag, selend + (sys_oldtclversion ? 0 : -1));
> >>>+                    x->x_tag, u8_charnum(x->x_buf, selend_b)
> >>>+                        + (sys_oldtclversion ? 0 : -1));
> >>>              sys_vgui(".x%lx.c focus \"\"\n", canvas);
> >>>          }
> >>>          else
> >>>          {
> >>>              sys_vgui(".x%lx.c select clear\n", canvas);
> >>>              sys_vgui(".x%lx.c icursor %s %d\n", canvas, x->x_tag,
> >>>-                    selstart);
> >>>+                    u8_charnum(x->x_buf, selstart_b));
> >>>              sys_vgui(".x%lx.c focus %s\n", canvas, x->x_tag);
> >>>          }
> >>>      }
> >>>@@ -448,12 +482,12 @@
> >>>              ....
> >>>          } */
> >>>          if (x->x_selstart && (x->x_selstart == x->x_selend))
> >>>-                x->x_selstart--;
> >>>+                u8_dec(x->x_buf, &x->x_selstart);
> >>>      }
> >>>      else if (n == 127)      /* delete */
> >>>      {
> >>>          if (x->x_selend < x->x_bufsize && (x->x_selstart == x->x_selend))
> >>>-                x->x_selend++;
> >>>+                u8_inc(x->x_buf, &x->x_selend);
> >>>      }
> >>>
> >>>      ndel = x->x_selend - x->x_selstart;
> >>>@@ -466,7 +500,13 @@
> >>>/* at Guenter's suggestion, use 'n>31' to test wither a character might
> >>>be printable in whatever 8-bit character set we find ourselves. */
> >>>
> >>>-        if (n == '\n' || (n > 31 && n != 127))
> >>>+/*-- moo:
> >>>+  ... but test with "<" rather than "!=" in order to accommodate unicode
> >>>+  codepoints for n (which we get since Tk is sending the "%A" substitution
> >>>+  for bind <Key>), effectively reducing the coverage of this clause to 7
> >>>+  bits.  Case n>127 is covered by the next clause.
> >>>+*/
> >>>+        if (n == '\n' || (n > 31 && n < 127))
> >>>      {
> >>>          newsize = x->x_bufsize+1;
> >>>          x->x_buf = resizebytes(x->x_buf, x->x_bufsize, newsize);
> >>>@@ -476,20 +516,39 @@
> >>>          x->x_bufsize = newsize;
> >>>          x->x_selstart = x->x_selstart + 1;
> >>>      }
> >>>+  /*--moo: check for unicode codepoints beyond 7-bit ASCII --*/
> >>>+  else if (n > 127)
> >>>+        {
> >>>+            int ch_nbytes = u8_wc_nbytes(n);
> >>>+            newsize = x->x_bufsize + ch_nbytes;
> >>>+            x->x_buf = resizebytes(x->x_buf, x->x_bufsize, newsize);
> >>>+            for (i = newsize - 1; i >= x->x_selstart + ch_nbytes; i--)
> >>>+                x->x_buf[i] = x->x_buf[i - ch_nbytes];
> >>>+            x->x_bufsize = newsize;
> >>>+            /*-- moo: assume canvas_key() has encoded keysym as UTF-8 */
> >>>+            strncpy(x->x_buf+x->x_selstart, keysym->s_name, ch_nbytes);
> >>>+            x->x_selstart = x->x_selstart + ch_nbytes;
> >>>+        }
> >>>      x->x_selend = x->x_selstart;
> >>>      x->x_glist->gl_editor->e_textdirty = 1;
> >>>  }
> >>>  else if (!strcmp(keysym->s_name, "Right"))
> >>>  {
> >>>      if (x->x_selend == x->x_selstart && x->x_selstart < x->x_bufsize)
> >>>-            x->x_selend = x->x_selstart = x->x_selstart + 1;
> >>>+        {
> >>>+            u8_inc(x->x_buf, &x->x_selstart);
> >>>+            x->x_selend = x->x_selstart;
> >>>+        }
> >>>      else
> >>>          x->x_selstart = x->x_selend;
> >>>  }
> >>>  else if (!strcmp(keysym->s_name, "Left"))
> >>>  {
> >>>      if (x->x_selend == x->x_selstart && x->x_selstart > 0)
> >>>-            x->x_selend = x->x_selstart = x->x_selstart - 1;
> >>>+        {
> >>>+            u8_dec(x->x_buf, &x->x_selstart);
> >>>+            x->x_selend = x->x_selstart;
> >>>+        }
> >>>      else
> >>>          x->x_selend = x->x_selstart;
> >>>  }
> >>>@@ -497,18 +556,18 @@
> >>>  else if (!strcmp(keysym->s_name, "Up"))
> >>>  {
> >>>      if (x->x_selstart)
> >>>-            x->x_selstart--;
> >>>+            u8_dec(x->x_buf, &x->x_selstart);
> >>>      while (x->x_selstart > 0 && x->x_buf[x->x_selstart] != '\n')
> >>>-            x->x_selstart--;
> >>>+            u8_dec(x->x_buf, &x->x_selstart);
> >>>      x->x_selend = x->x_selstart;
> >>>  }
> >>>  else if (!strcmp(keysym->s_name, "Down"))
> >>>  {
> >>>      while (x->x_selend < x->x_bufsize &&
> >>>          x->x_buf[x->x_selend] != '\n')
> >>>-            x->x_selend++;
> >>>+            u8_inc(x->x_buf, &x->x_selend);
> >>>      if (x->x_selend < x->x_bufsize)
> >>>-            x->x_selend++;
> >>>+            u8_inc(x->x_buf, &x->x_selend);
> >>>      x->x_selstart = x->x_selend;
> >>>  }
> >>>  rtext_senditup(x, SEND_UPDATE, &w, &h, &indx);
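
A note on the n > 127 branch above: inserting a multi-byte character means the text after the cursor has to move right by the full byte length of the character, not by one byte. Below is a minimal standalone sketch of that step. It is not code from the patch: buf_insert_utf8() is a hypothetical helper, and it uses plain realloc()/memmove() on a heap buffer instead of Pd's resizebytes() and the x_buf/x_bufsize fields.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of the patch: insert the nbytes-long
 * UTF-8 sequence `seq` into the heap buffer *bufp (current length *sizep,
 * not NUL-terminated) at byte offset `pos`.
 * Returns the new cursor offset, or -1 if realloc() fails, in which case
 * the buffer is left unchanged. */
static int buf_insert_utf8(char **bufp, int *sizep, int pos,
                           const char *seq, int nbytes)
{
    char *nb;

    assert(bufp && sizep && seq);
    assert(pos >= 0 && pos <= *sizep && nbytes > 0);

    nb = realloc(*bufp, (size_t)(*sizep + nbytes));
    if (!nb)
        return -1;

    /* Open a gap of nbytes at pos by moving the whole tail right. */
    memmove(nb + pos + nbytes, nb + pos, (size_t)(*sizep - pos));
    /* Copy the encoded character into the gap. */
    memcpy(nb + pos, seq, (size_t)nbytes);

    *bufp = nb;
    *sizep += nbytes;
    return pos + nbytes;
}
```

The return value plays the role of the updated x_selstart: the cursor ends up just past the inserted character.
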
> >>>Index: src/s_utf8.h
> >>>===================================================================
> >>>--- src/s_utf8.h   (revision 0)
> >>>+++ src/s_utf8.h   (revision 0)
> >>>@@ -0,0 +1,88 @@
> >>>+#ifndef S_UTF8_H
> >>>+#define S_UTF8_H
> >>>+
> >>>+/*--moo--*/
> >>>+#ifndef u_int32_t
> >>>+# define u_int32_t unsigned int
> >>>+#endif
> >>>+
> >>>+#ifndef UCS4
> >>>+# define UCS4 u_int32_t
> >>>+#endif
> >>>+
> >>>+/* UTF8_SUPPORT_FULL_UCS4
> >>>+ *  define this to support the full potential range of UCS-4 codepoints
> >>>+ *  (in anticipation of a future UTF-8 standard)
> >>>+ */
> >>>+/*#define UTF8_SUPPORT_FULL_UCS4 1*/
> >>>+#undef UTF8_SUPPORT_FULL_UCS4
> >>>+
> >>>+/* UTF8_MAXBYTES
> >>>+ *   maximum number of bytes required to represent a single character in UTF-8
> >>>+ *
> >>>+ * UTF8_MAXBYTES1 = UTF8_MAXBYTES+1
> >>>+ *  maximum bytes per character including NUL terminator
> >>>+ */
> >>>+#ifdef UTF8_SUPPORT_FULL_UCS4
> >>>+# ifndef UTF8_MAXBYTES
> >>>+#  define UTF8_MAXBYTES  6
> >>>+# endif
> >>>+# ifndef UTF8_MAXBYTES1
> >>>+#  define UTF8_MAXBYTES1 7
> >>>+# endif
> >>>+#else
> >>>+# ifndef UTF8_MAXBYTES
> >>>+#  define UTF8_MAXBYTES  4
> >>>+# endif
> >>>+# ifndef UTF8_MAXBYTES1
> >>>+#  define UTF8_MAXBYTES1 5
> >>>+# endif
> >>>+#endif
> >>>+/*--/moo--*/
> >>>+
> >>>+/* is c the start of a utf8 sequence? */
> >>>+#define isutf(c) (((c)&0xC0)!=0x80)
> >>>+
> >>>+/* convert UTF-8 data to wide character */
> >>>+int u8_toucs(u_int32_t *dest, int sz, char *src, int srcsz);
> >>>+
> >>>+/* the opposite conversion */
> >>>+int u8_toutf8(char *dest, int sz, u_int32_t *src, int srcsz);
> >>>+
> >>>+/* moo: get byte length of character number, or 0 if not supported */
> >>>+int u8_wc_nbytes(u_int32_t ch);
> >>>+
> >>>+/* moo: compute required storage for UTF-8 encoding of 'ucs[0..size-1]' */
> >>>+int u8_wcs_nbytes(u_int32_t *ucs, int size);
> >>>+
> >>>+/* single character to UTF-8, no NUL termination */
> >>>+int u8_wc_toutf8(char *dest, u_int32_t ch);
> >>>+
> >>>+/* moo: single character to UTF-8, with NUL termination */
> >>>+int u8_wc_toutf8_nul(char *dest, u_int32_t ch);
> >>>+
> >>>+/* character number to byte offset */
> >>>+int u8_offset(char *str, int charnum);
> >>>+
> >>>+/* byte offset to character number */
> >>>+int u8_charnum(char *s, int offset);
> >>>+
> >>>+/* return next character, updating an index variable */
> >>>+u_int32_t u8_nextchar(char *s, int *i);
> >>>+
> >>>+/* move to next character */
> >>>+void u8_inc(char *s, int *i);
> >>>+
> >>>+/* move to previous character */
> >>>+void u8_dec(char *s, int *i);
> >>>+
> >>>+/* moo: move pointer to next character */
> >>>+void u8_inc_ptr(char **sp);
> >>>+
> >>>+/* moo: move pointer to previous character */
> >>>+void u8_dec_ptr(char **sp);
> >>>+
> >>>+/* returns length of next utf-8 sequence */
> >>>+int u8_seqlen(char *s);
> >>>+
> >>>+#endif /* S_UTF8_H */
> >>><test-utf8.pd>
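
For anyone reading the header without s_utf8.c at hand, here is a rough sketch of how the Left/Right handling can step over whole characters using the same test as isutf(). The names step_right() and step_left() are made up for illustration, and the explicit length argument is an assumption of this sketch (the declared u8_inc() and u8_dec() take only the string and an index).

```c
#include <assert.h>

/* Same test as the isutf() macro in the patch: a byte starts a sequence
 * unless it is a continuation byte of the form 10xxxxxx. */
#define IS_UTF8_START(c) (((c) & 0xC0) != 0x80)

/* Hypothetical stand-in for u8_inc(): advance *i to the start of the
 * next character, never moving past len. */
static void step_right(const char *s, int len, int *i)
{
    assert(s && i);
    if (*i >= len)
        return;
    do {
        (*i)++;
    } while (*i < len && !IS_UTF8_START(s[*i]));
}

/* Hypothetical stand-in for u8_dec(): move *i back to the start of the
 * previous character, never moving below 0. */
static void step_left(const char *s, int *i)
{
    assert(s && i);
    if (*i <= 0)
        return;
    do {
        (*i)--;
    } while (*i > 0 && !IS_UTF8_START(s[*i]));
}
```

With the four-byte buffer "a", 0xC3, 0xA9, "b", stepping right from offset 0 visits offsets 1, 3 and 4, so the cursor never lands between the two bytes of the accented character.
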
> >>
> >>
> >>
> >>
> >>
> >>----------------------------------------------------------------------------
> >>
> >>"[T]he greatest purveyor of violence in the world today [is] my own
> >>government." - Martin Luther King, Jr.
> >>
> >>
> >>
> >>
> >>_______________________________________________
> >>Pd-dev mailing list
> >>[email protected]
> >>http://lists.puredata.info/listinfo/pd-dev
> >
> 
> 
> 
> ----------------------------------------------------------------------------
> 
> Man has survived hitherto because he was too ignorant to know how to  
> realize his wishes.  Now that he can realize them, he must either  
> change them, or perish.    -William Carlos Williams
> 
> 
> 

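To make the 127 question concrete, here is a small sketch of the two clauses as they stand in the patch, plus the usual UTF-8 length rule. Neither function is taken from the patch: classify_key() and utf8_nbytes() are illustrative names, and the 0x10FFFF upper limit is an assumption based on the 4-byte UTF8_MAXBYTES default, so the real u8_wc_nbytes() may draw the line elsewhere.

```c
enum key_class { KEY_IGNORED, KEY_ASCII, KEY_MULTIBYTE };

/* Mirror the two clauses quoted in the patch: newline and printable
 * ASCII (32..126) are stored as a single byte, anything above 127 goes
 * to the UTF-8 branch, and everything else, including 127 (DEL),
 * matches neither clause. */
static enum key_class classify_key(unsigned int n)
{
    if (n == '\n' || (n > 31 && n < 127))
        return KEY_ASCII;
    if (n > 127)
        return KEY_MULTIBYTE;
    return KEY_IGNORED;
}

/* Number of bytes needed to encode codepoint cp in UTF-8, or 0 if cp is
 * above the assumed 0x10FFFF limit. Surrogates are not rejected here. */
static int utf8_nbytes(unsigned int cp)
{
    if (cp < 0x80)
        return 1;
    if (cp < 0x800)
        return 2;
    if (cp < 0x10000)
        return 3;
    if (cp < 0x110000)
        return 4;
    return 0;
}
```

So 126 ('~') is the last codepoint handled by the one-byte clause, 127 is dropped, and 128 upward is encoded as two or more bytes.
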
