i18n

John Darrington Fri, 17 Mar 2006 05:53:24 -0800

On Sun, Mar 12, 2006 at 07:57:44PM -0800, Ben Pfaff wrote:
     
     thinking about internationalization.  I've been reworking the
     PostScript driver to better support i18n, and I've been thinking
     about how to better support it in PSPP in general.  I'll likely
     check in a new PostScript driver that does a better job, but it's
     hard to say when.  I've been reading the Unicode standard and
     documentation from Adobe and others, trying to learn as much as I
     can about these issues.


Some thoughts on internationalisation, which are only slightly
coherent at the moment, but I thought needed to be aired anyway.
Please forgive me for using this list as scrap paper.

0. Data strings that need internationalisation include: 

   * String Variable Data.
   * Variable Names.
   * Value Labels.
   * Variable Labels.
   * File Labels.
   * Document Text.

1. If the system file format had been properly defined, it would
   have stored the encoding used for its strings somewhere in the
   file.  The fact of the matter is that it doesn't.

2. Therefore, we have to a) make a reasonable guess as to what a
   system file's encoding is; and b) ensure that reasonable behaviour
   ensues if that assumption is incorrect.  We have to bear in mind
   that PSPP can deal with more than one system file at the same time
   (e.g. through the MATCH FILES command), and these could have been
   written in different encodings.

   2a might be achieved by i) using the LC_CTYPE environment variable;
   ii) using the value set by SET LOCALE; or iii) introducing an
   optional subcommand to the GET command to specify the locale.

   2b might be achieved by heuristics, using a library such as unac
   (http://home.gna.org/unac/unac.en.html), or, if all else fails, by
   replacing unknown byte sequences with "....".
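
To make 2a and 2b concrete, here is a minimal sketch in C.  It is not
PSPP code: guess_locale() and replace_invalid_bytes() are hypothetical
names, the precedence (GET subcommand, then SET LOCALE, then LC_CTYPE)
is only one possible ordering of the three options, and the fallback
assumes the strings are being interpreted as utf8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical precedence for 2a: an explicit GET subcommand wins,
   then the SET LOCALE value, then the LC_CTYPE environment variable.
   Returns NULL if no source supplies a non-empty value. */
static const char *
guess_locale (const char *get_subcommand, const char *set_locale,
              const char *lc_ctype)
{
  const char *candidates[3];
  size_t i;

  candidates[0] = get_subcommand;
  candidates[1] = set_locale;
  candidates[2] = lc_ctype;
  for (i = 0; i < 3; i++)
    if (candidates[i] != NULL && candidates[i][0] != '\0')
      return candidates[i];
  return NULL;
}

/* Returns the length of the well-formed utf8 sequence that starts at
   S (with N bytes available), or 0 if the bytes are not valid utf8. */
static size_t
utf8_sequence_length (const unsigned char *s, size_t n)
{
  size_t len, i;

  if (n == 0)
    return 0;
  if (s[0] < 0x80)
    return 1;
  else if (s[0] >= 0xC2 && s[0] <= 0xDF)
    len = 2;
  else if (s[0] >= 0xE0 && s[0] <= 0xEF)
    len = 3;
  else if (s[0] >= 0xF0 && s[0] <= 0xF4)
    len = 4;
  else
    return 0;
  if (len > n)
    return 0;
  for (i = 1; i < len; i++)
    if ((s[i] & 0xC0) != 0x80)
      return 0;
  /* Reject overlong forms, surrogates and values above U+10FFFF. */
  if ((s[0] == 0xE0 && s[1] < 0xA0) || (s[0] == 0xED && s[1] > 0x9F)
      || (s[0] == 0xF0 && s[1] < 0x90) || (s[0] == 0xF4 && s[1] > 0x8F))
    return 0;
  return len;
}

/* Fallback for 2b: copies IN to OUT, replacing every byte that is not
   part of a valid utf8 sequence with PLACEHOLDER.  Returns 0 on
   success, or -1 if OUT (of OUT_SIZE bytes) is too small. */
static int
replace_invalid_bytes (const char *in, const char *placeholder,
                       char *out, size_t out_size)
{
  const unsigned char *s = (const unsigned char *) in;
  size_t n, ph_len, used = 0;

  assert (in != NULL && placeholder != NULL && out != NULL);
  n = strlen (in);
  ph_len = strlen (placeholder);
  while (n > 0)
    {
      size_t len = utf8_sequence_length (s, n);
      const char *src = len > 0 ? (const char *) s : placeholder;
      size_t src_len = len > 0 ? len : ph_len;
      size_t step = len > 0 ? len : 1;

      if (used + src_len >= out_size)
        return -1;
      memcpy (out + used, src, src_len);
      used += src_len;
      s += step;
      n -= step;
    }
  if (used >= out_size)
    return -1;
  out[used] = '\0';
  return 0;
}
```

With "...." as the placeholder, a Latin-1 "caf\xE9!" wrongly assumed to
be utf8 comes out as "caf....!", whereas a string that really is utf8
passes through unchanged.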

3. At some level within PSPP we need to decide on an interface where
   all strings will have a common encoding.  For instance, one
   possibility would be to decide that all strings contained within
   the dictionary would be utf8.  In this case, we'd need to convert
   all string data to utf8 within the struct variable (except short_name).

   Whilst that's feasible, casefiles cannot possibly (in the current
   system) have this invariant, because the system files which
   implement them may not in fact be utf8, and converting a casefile
   doesn't scale.

   An alternative would be to decide that it is the responsibility of
   the user interface and output subsystem to convert to utf8.  In
   that case, both these entities need to know the encoding of the
   data they receive.  Since (as in the case of MATCH FILES) variables
   can come from different system sources, each variable within a
   dictionary may have a different encoding.  Thus it may be
   desirable to add an encoding property to struct variable.
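
A rough sketch of what such a per-variable encoding property could
look like follows.  Nothing here is PSPP's real struct variable:
struct variable_sketch, its fields and label_to_utf8() are hypothetical,
and the hand-written Latin-1 conversion merely stands in for a general
converter such as iconv.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct variable, with the proposed
   encoding property attached to each variable. */
struct variable_sketch
  {
    const char *name;
    const char *label;      /* Bytes exactly as read from the file. */
    const char *encoding;   /* E.g. "UTF-8" or "ISO-8859-1". */
  };

/* Converts ISO-8859-1 text to utf8.  Returns 0 on success, or -1 if
   OUT (of OUT_SIZE bytes) is too small. */
static int
latin1_to_utf8 (const char *in, char *out, size_t out_size)
{
  const unsigned char *s = (const unsigned char *) in;
  size_t used = 0;

  for (; *s != '\0'; s++)
    {
      size_t need = *s < 0x80 ? 1 : 2;

      if (used + need >= out_size)
        return -1;
      if (*s < 0x80)
        out[used++] = (char) *s;
      else
        {
          /* U+0080..U+00FF become two bytes: 110000xx 10xxxxxx. */
          out[used++] = (char) (0xC0 | (*s >> 6));
          out[used++] = (char) (0x80 | (*s & 0x3F));
        }
    }
  if (used >= out_size)
    return -1;
  out[used] = '\0';
  return 0;
}

/* What the user interface or output subsystem would call: produces
   the label of V in utf8, whatever encoding V carries.  Returns 0 on
   success, or -1 for a short buffer or an encoding this sketch does
   not know about. */
static int
label_to_utf8 (const struct variable_sketch *v, char *out, size_t out_size)
{
  assert (v != NULL && v->label != NULL && v->encoding != NULL);
  if (strcmp (v->encoding, "UTF-8") == 0)
    {
      size_t len = strlen (v->label);

      if (len >= out_size)
        return -1;
      memcpy (out, v->label, len + 1);
      return 0;
    }
  if (strcmp (v->encoding, "ISO-8859-1") == 0)
    return latin1_to_utf8 (v->label, out, out_size);
  return -1;
}
```

Two variables brought together by MATCH FILES, one with a Latin-1
label and one with a utf8 label, would then both reach the output
subsystem as the same utf8 bytes.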

4. However, when writing a system file, it would be sensible to
   convert all variables to a common encoding first.
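
The note above does not say which common encoding to pick.  One
possible policy, purely as an assumption, is sketched below: keep the
existing encoding when every variable already shares it, and otherwise
fall back to utf8, which can represent them all.
common_write_encoding() is a hypothetical helper, and the exact string
comparison it uses ignores encoding-name aliases.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Picks the encoding in which to write a system file, given the
   encodings of its N variables.  Hypothetical policy: if all the
   variables agree, keep their encoding; otherwise use utf8. */
static const char *
common_write_encoding (const char *const encodings[], size_t n)
{
  size_t i;

  assert (encodings != NULL || n == 0);
  for (i = 1; i < n; i++)
    if (strcmp (encodings[i], encodings[0]) != 0)
      return "UTF-8";
  return n > 0 ? encodings[0] : "UTF-8";
}
```

Each variable would then be converted to the chosen encoding before
its strings are written out.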



-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.


_______________________________________________
pspp-dev mailing list
http://lists.gnu.org/mailman/listinfo/pspp-dev
