John Darrington <[EMAIL PROTECTED]> writes:

> On Thu, Jun 15, 2006 at 10:29:36AM -0700, Ben Pfaff wrote:
> OK, stipulate for the moment that we decide to move to wide
> characters and strings for syntax files.  The biggest issue in my
> mind is, then, deciding how many assumptions we want to make
> about wchar_t.  There are several levels.  In rough order of
> increasingly strong assumptions:
>
>      1. Don't make any assumptions.  There is no benefit to
>         this above using "char", because C99 doesn't actually
>         say that wide strings can't have stateful or
>         multi-unit encodings.  It also doesn't say that the
>         encoding of wchar_t is locale-independent.
>
>      2. Assume that wchar_t has a stateless encoding.
>
>      3. Assume that wchar_t has a stateless and
>         locale-independent encoding.
>
>      4. Assume that wchar_t is Unicode (one of UCS-2, UTF-16,
>         UTF-32), and for UTF-16 ignore the possibility of
>         surrogate pairs.  C99 recommends but does not require
>         use of Unicode for wchar_t.  (There's a standard macro
>         __STDC_ISO_10646__ that indicates this.)
>
>      5. Assume that wchar_t is UTF-32.
>
> GCC and glibc conform to level 5.  Native Windows conforms to
> level 4.
>
> In the above, I'm assuming that when you say "wchar_t has a stateless
> encoding", you mean that the entity reading the stream is stateless.
> wchar_t is (on my machine at least) just a typedef to int, so it
> can't contain any "state" except its face value.
A stateful encoding is one that needs potentially unbounded
look-behind to interpret.  For example, ISO-2022 has escape sequences
that change the interpretation of all following bytes.  So, yes, it's
the reader of a stateful encoding that has to maintain the state, but
"stateful encoding" is still a fairly common jargon term despite the
minor misnaming.

> So, that being so, I don't think we need to make any assumptions
> beyond level 3.  See below for elaboration:
>
> I'm saying that we can't blindly translate syntax files to UTF-8
> or UTF-32 unless we also translate all of the string and
> character literals that we use in conjunction with them to UTF-8
> or UTF-32 also.  If the execution character set is Unicode, then
> no translation is needed; otherwise, we'd have to call a function
> to do that, which is inconvenient and relatively slow.
>
> Surely, the string and character literals are converted to UTF-32 by
> the compiler?  Just by saying:
>
>     const wchar_t str[] = L"foo";
>
> then str contains a UTF-32 string (or whatever the wchar_t encoding
> for that platform happens to be).

It definitely contains a wchar_t encoding of the string for the "C"
locale.  If we make the assumption of a locale-independent encoding
for wchar_t, then it contains a wchar_t encoding of the string for
the current locale too.

> Currently, syntax is read one line at a time, using ds_read_line from
> str.c.  The way I see it working is that a wchar_t counterpart to
> str.c is created (call it wstr.c).  In dws_read_line, the call to
> getc(stream) is replaced by getwc(stream).

[...]

> This "reasonable" expectation seems to be a statement of your
> assumption #3 above.

Yes, that's the best way to do it.

> So, let us assume that I'm running PSPP on a machine whose wchar_t
> happens to be UTF-32 encoded, and whose native charset is EBCDIC.  So
> long as my LC_CTYPE encoding specifies EBCDIC, syntax files will be
> dutifully converted to UTF-32, and during parsing, compared with
> UTF-32 string constants.
> If I'm provided with a syntax file which is encoded in UTF-8, I can
> use this file simply by changing LANG (or LC_CTYPE) to en_AU.UTF-8
> (or similar).

Yes.

> > Similarly I cannot conceive that there would be many platforms today
> > that have a sizeof(wchar_t) of 16 bits.  If it does, let's just issue
> > a warning at configure time.
>
> The elephant in the room here is Windows.  If we ever want to
> have native Windows support, its wchar_t is 16 bits, and that's
> unlikely to change as I understand it.
>
> I'm treading outside the bounds of my understanding of Unicode
> now.  But I read a bit of the web site, and from what I can infer,
> almost all the glyphs for modern natural languages are located below
> 65536.  The "code points" above that are for ancient languages and
> math symbols etc.

It seems to depend on who you ask.  I've seen claims that some of the
high-plane code points are important, and I've seen claims of the
opposite.

OK, now I have some higher-level i18n issues to raise, so stay tuned :-)
-- 
"...I've forgotten where I was going with this, but you can bet
 it was scathing." --DesiredUsername

_______________________________________________
pspp-dev mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/pspp-dev
