John Darrington <[EMAIL PROTECTED]> writes:

> On Thu, Jun 15, 2006 at 10:29:36AM -0700, Ben Pfaff wrote:
> OK, stipulate for the moment that we decide to move to wide
> characters and strings for syntax files.  The biggest issue in my
> mind is, then, deciding how many assumptions we want to make
> about wchar_t.  There are several levels.  In rough order of
> increasingly strong assumptions:
>
>      1. Don't make any assumptions.  There is no benefit to
>         this above using "char", because C99 doesn't actually
>         say that wide strings can't have stateful or
>         multi-unit encodings.  It also doesn't say that the
>         encoding of wchar_t is locale-independent.
>
>      2. Assume that wchar_t has a stateless encoding.
>
>      3. Assume that wchar_t has a stateless and
>         locale-independent encoding.
>
>      4. Assume that wchar_t is Unicode (one of UCS-2, UTF-16,
>         UTF-32), and for UTF-16 ignore the possibility of
>         surrogate pairs.  C99 recommends but does not require
>         use of Unicode for wchar_t.  (There's a standard macro
>         __STDC_ISO_10646__ that indicates this.)
>
>      5. Assume that wchar_t is UTF-32.
>
> GCC and glibc conform to level 5.  Native Windows conforms to
> level 4.
>
> In the above, I'm assuming that when you say "wchar_t has a stateless
> encoding", you mean that the entity reading the stream is stateless.
> wchar_t is (on my machine at least) just a typedef to int, so it
> can't contain any "state" except its face value.
A stateful encoding is one that needs potentially unbounded
look-behind to interpret.  For example, ISO-2022 has escape sequences
that change the interpretation of all following bytes.  So, yes, it's
the reader of a stateful encoding that has to maintain the state, but
"stateful encoding" is still a fairly common jargon term despite the
minor misnaming.

> So, that being so, I don't think we need to make any assumptions
> beyond level 3.  See below for elaboration:
>
> I'm saying that we can't blindly translate syntax files to UTF-8
> or UTF-32 unless we also translate all of the string and
> character literals that we use in conjunction with them to UTF-8
> or UTF-32 also.  If the execution character set is Unicode, then
> no translation is needed; otherwise, we'd have to call a function
> to do that, which is inconvenient and relatively slow.
>
> Surely, the string and character literals are converted to UTF-32 by
> the compiler?  Just by saying:
>
>     const wchar_t str[] = L"foo";
>
> then str contains a UTF-32 string (or whatever the wchar_t encoding
> for that platform happens to be).

It definitely contains a wchar_t encoding of the string for the "C"
locale.  If we make the assumption of a locale-independent encoding
for wchar_t, then it contains a wchar_t encoding of the string for
the current locale too.

> Currently, syntax is read one line at a time, using ds_read_line from
> str.c.  The way I see it working is that a wchar_t counterpart to
> str.c is created (call it wstr.c).  In dws_read_line, the call to
> getc(stream) is replaced by getwc(stream).

[...]

> This "reasonable" expectation seems to be a statement of your
> assumption #3 above.

Yes, that's the best way to do it.

> So, let us assume that I'm running PSPP on a machine whose wchar_t
> happens to be UTF-32 encoded, and whose native charset is EBCDIC.  So
> long as my LC_CTYPE encoding specifies EBCDIC, syntax files will be
> dutifully converted to UTF-32, and during parsing, compared with
> UTF-32 string constants.
> If I'm provided with a syntax file which is encoded in UTF-8, I can
> use this file simply by changing LANG (or LC_CTYPE) to en_AU.UTF-8
> (or similar).

Yes.

> > Similarly I cannot conceive that there would be many platforms today
> > that have a sizeof(wchar_t) of 16 bits.  If it does, let's just issue
> > a warning at configure time.
>
> The elephant in the room here is Windows.  If we ever want to
> have native Windows support, its wchar_t is 16 bits, and that's
> unlikely to change as I understand it.
>
> I'm treading outside the bounds of my understanding of Unicode
> now.  But I read a bit of the web site, and from what I can infer,
> almost all the glyphs for modern natural languages are located below
> 65536.  The "code points" above that are for ancient languages and
> math symbols etc.

It seems to depend on who you ask.  I've seen claims that some of the
high-plane code points are important, and I've seen claims of the
opposite.

OK, now I have some higher-level i18n issues to raise, so stay tuned :-)
-- 
"...I've forgotten where I was going with this, but you can bet
 it was scathing." --DesiredUsername

_______________________________________________
pspp-dev mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/pspp-dev
