On Sun, Mar 12, 2006 at 07:57:44PM -0800, Ben Pfaff wrote:
thinking about internationalization. I've been reworking the
PostScript driver to better support i18n, and I've been thinking
about how to better support it in PSPP in general. I'll likely
check in a new PostScript driver that does a better job, but it's
hard to say when. I've been reading the Unicode standard and
documentation from Adobe and others, trying to learn as much as I
can about these issues.

Some thoughts on internationalisation, which are only slightly coherent
at the moment, but I thought needed to be aired anyway. Please forgive
me for using this list as scrap paper.

0. Data strings that need internationalisation include:
   * String Variable Data.
   * Variable Names.
   * Value Labels.
   * Variable Labels.
   * File Labels.
   * Document Text.

1. If the system file format had been properly defined, it would have
   stored the encoding used for its strings somewhere in the file. The
   fact of the matter is that it doesn't.

2. Therefore, we have to
   a) make a reasonable guess as to what a system file's encoding is; and
   b) ensure that reasonable behaviour ensues if that assumption is
      incorrect.

   We have to bear in mind that PSPP can deal with more than one system
   file at the same time, e.g. through the MATCH FILES command, and these
   could have been written in different encodings.

   2a might be achieved by
   i) using the LC_CTYPE environment variable;
   ii) using the value set by SET LOCALE; or
   iii) introducing an optional subcommand to the GET command to specify
        the locale.

   2b might be achieved by heuristics, using a library such as unac
   http://home.gna.org/unac/unac.en.html or, if all else fails, by
   replacing unknown byte sequences with "....".

3. At some level within PSPP we need to decide on an interface where all
   strings will have a common encoding. For instance, one possibility
   would be to decide that all strings contained within the dictionary
   would be utf8. In this case, we'd need to convert all string data to
   utf8 within the struct variable (except short_name). Whilst that's
   feasible, casefiles cannot possibly (in the current system) have this
   invariant, because the system files which implement them may not in
   fact be utf8, and converting a casefile doesn't scale. An alternative
   would be to decide that it is the responsibility of the user interface
   and output subsystem to convert to utf8.
   In that case, both these entities need to know the encoding of the
   data they receive. Since (as in the case of MATCH FILES) variables can
   come from different system sources, each variable within a dictionary
   may have a different encoding. Thus it may be desirable to add an
   encoding property to struct variable.

4. However, when writing a system file, it would be sensible to convert
   all variables to a common encoding first.

-- 
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.
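
A rough sketch of how points 2b, 3 and 4 might fit together, written in
Python rather than PSPP's C purely for brevity. Everything named here is
hypothetical: Variable is only a stand-in for struct variable with the
proposed encoding property, and decode_lossy and unify_encoding are
invented for illustration. Undecodable bytes are replaced with U+FFFD
(Python's built-in behaviour) rather than the "...." suggested above.

```python
from dataclasses import dataclass


@dataclass
class Variable:
    # Hypothetical stand-in for PSPP's struct variable; not the real layout.
    name: bytes      # raw bytes as read from the system file
    label: bytes     # raw bytes as read from the system file
    encoding: str    # guessed encoding of the bytes above (point 2a)


def decode_lossy(raw: bytes, encoding: str) -> str:
    # Decode with the guessed encoding. Bytes that are invalid in that
    # encoding become U+FFFD instead of raising, so a wrong guess
    # degrades the text but never aborts (point 2b).
    return raw.decode(encoding, errors="replace")


def unify_encoding(variables, target="utf-8"):
    # Point 4: before writing, re-encode every variable to one encoding.
    # Characters the target cannot represent are replaced with "?".
    unified = []
    for var in variables:
        name = decode_lossy(var.name, var.encoding).encode(target, errors="replace")
        label = decode_lossy(var.label, var.encoding).encode(target, errors="replace")
        unified.append(Variable(name, label, target))
    return unified
```

With this shape, two variables brought together by something like MATCH
FILES, one tagged Latin-1 and one tagged UTF-8, can be passed to
unify_encoding together and come back carrying the same target encoding.
Only the dictionary strings are touched, which sidesteps the casefile
scaling problem in point 3.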
_______________________________________________
pspp-dev mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/pspp-dev
