Re: Non-latin encoding and languages

Paul Rohr Tue, 29 Feb 2000 18:09:52 -0600 (CST)
At 09:29 AM 2/29/00 -0600, [EMAIL PROTECTED] wrote:
>Yes, this does sound like a problem.  I have little experience with
>problems like these, but they'll need to be solved to make AbiWord
>usable to those outside of the Latin-1 set of languages.  
>
>As usual, there seem to be two problems (and I'll try to keep them
>short, so people more qualified than me can jump in).  The first 
>is internationalization of the document, so that users can use their
>native character sets to write documents.  The encoding issues here
>are specific to the fonts the user is using, and should map into
>Unicode space if we do our job correctly.  We haven't done much
>work on this front for non-Latin-1 languages, so I wouldn't be surprised
>if we have no fonts that can handle KOI8-R encodings.
>
>Actually, problem 1.5 is the importers and exporters, which will need
>to do things like character mapping conversions (like you mention).

Nice summary.  To put it another way, there are potentially as many as 3 
charsets in play here:

  input -- from keyboard, word document, etc.
  document -- we're *always* storing Unicode internally 
  output -- fonts for display, and encodings for exporters 
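
To make the three stages concrete, here is a rough Python sketch.  The 
function names and the choice of KOI8-R and CP1251 are mine, picked purely 
for illustration; nothing here is AbiWord code.

```python
# Illustrative only: import_text/export_text are hypothetical helpers,
# and the two charsets are arbitrary examples.

def import_text(raw: bytes, input_charset: str) -> str:
    """Input stage: convert incoming bytes to Unicode exactly once."""
    return raw.decode(input_charset)

def export_text(text: str, output_charset: str) -> bytes:
    """Output stage: convert Unicode to whatever the target expects."""
    return text.encode(output_charset)

greeting = "Привет"
koi8_input = greeting.encode("koi8_r")       # what a KOI8-R source hands us
document = import_text(koi8_input, "koi8_r")  # document stage: real Unicode
exported = export_text(document, "cp1251")    # a different output charset

print(document == greeting)        # the document holds genuine Unicode
print(exported != koi8_input)      # same text, different bytes on output
```

The point is that the document stage never needs to know which charset 
the input or output side happens to use.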

We know that the following special cases work properly:

1.  Import/export of native ABW content should just work.  We store 
internally using UCS-2 and both the importer and exporter crunch that down 
into network-safe ampersand-encoded XML content. 

2.  Typing and display for Latin-1 languages should also just work.  The 
naive mappings from keyboard to internals to display all work trivially. 
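
A small Python sketch of both cases, for anyone who wants to see the 
mechanics.  I'm assuming "ampersand-encoded" means numeric character 
references like &#1055;, and the helper names are made up for this 
example, so treat it as an approximation of the idea rather than a 
description of the actual ABW importer and exporter.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def to_ascii_xml(text: str) -> str:
    """Escape markup characters, then replace every non-ASCII character
    with a numeric character reference."""
    escaped = escape(text)  # handles &, < and >
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")

def from_ascii_xml(fragment: str) -> str:
    """Let an XML parser resolve the references back into Unicode."""
    return ET.fromstring("<p>" + fragment + "</p>").text or ""

# Case 1: Unicode content survives a trip through 7-bit-safe XML.
sample = "AT&T Привет"
wire = to_ascii_xml(sample)
print(wire)
print(from_ascii_xml(wire) == sample)

# Case 2: Latin-1 is the trivial mapping, since each byte value is
# already the Unicode code point of the character it represents.
latin1_bytes = bytes(range(256))
code_points = [ord(c) for c in latin1_bytes.decode("latin-1")]
print(code_points == list(range(256)))
```

The second half is why Latin-1 "just works" even when nobody converts 
anything: the naive byte-to-code-point widening happens to be correct.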

Henrik has started on the charset conversions needed so that incoming 
content gets converted correctly to Unicode.  I'm not sure whether we've 
done any of the display conversions yet for non-Unicode fonts. 

However, there's probably an additional special case which happens to work 
right now, even though it shouldn't:

  input -- some funky code page
  document -- not really Unicode, because it never got converted
  output -- display using fonts in that same funky code page

In this case, it might appear to users that everything just works, but the 
actual content being manipulated is "secretly" in that mystery code page, 
which means that the document is almost guaranteed not to be portable.  
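
Here's a Python sketch of how that failure mode plays out.  I'm modeling 
"never got converted" as widening each input byte straight to a code 
point, which is an assumption about the mechanism, and KOI8-R and CP1251 
are just stand-ins for "some funky code page" and "somebody else's".

```python
original = "Привет"
keyboard_bytes = original.encode("koi8_r")   # what the input side delivers

# The bug: each byte is stored as if it were already a code point.
# Decoding as Latin-1 models that byte-for-byte widening.
stored = keyboard_bytes.decode("latin-1")

def display(document: str, font_charset: str) -> str:
    """Model a font that interprets the low byte of each stored
    character in its own code page."""
    return document.encode("latin-1").decode(font_charset)

shown_locally = display(stored, "koi8_r")    # same code page as the input
shown_elsewhere = display(stored, "cp1251")  # a machine with other fonts

print(shown_locally == original)     # looks fine to the author
print(stored == original)            # but the document is not Unicode
print(shown_elsewhere == original)   # and it breaks once it travels
```

Every stored character sits below U+0100, so nothing in the document 
itself says it was ever meant to be Cyrillic.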

This should definitely be fixed. 

>The second problem, which I just realized AbiWord has, is that localizations
>need encodings too.  All our current menu, toolbar, and string sets
>map into Latin-1 space, but for locales with more than one encoding,
>we may have to figure out a way to represent them all.

Ick and double ick.  This may not be a problem for the strings mechanism, 
since the XML header mentions the charset used and expat presumably is 
converting that content into Unicode for us.  So long as the GUI font is 
also in the right Unicode range, this may just work. 
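
As a sanity check on the "presumably", here's a sketch using Python's 
binding to expat.  The <s> element is a made-up stand-in for a strings 
file entry.  Note that expat itself only knows UTF-8, UTF-16, ISO-8859-1 
and US-ASCII; other charsets go through an unknown-encoding handler, 
which the Python binding supplies for single-byte encodings.  Whether our 
own expat setup does the equivalent is something I'm assuming, not 
asserting.

```python
import xml.parsers.expat

def parse_strings(xml_bytes: bytes) -> str:
    """Return the character data of a tiny XML document as Unicode."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # Character data may arrive in several pieces, so collect them all.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_bytes, True)
    return "".join(chunks)

label = "Файл"

# The same translation stored in two different charsets, each one
# declared in the XML header.
koi8_doc = ('<?xml version="1.0" encoding="koi8-r"?><s>'
            + label + "</s>").encode("koi8_r")
cp1251_doc = ('<?xml version="1.0" encoding="windows-1251"?><s>'
              + label + "</s>").encode("cp1251")

print(koi8_doc != cp1251_doc)                                   # bytes differ
print(parse_strings(koi8_doc) == parse_strings(cp1251_doc))     # text agrees
```

If the header is honest about the charset, the parser hands back the 
same Unicode either way, which is what we'd want from the strings 
mechanism.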

However, the string literals in the hardwired toolbar and menu tables will 
only be portable for Latin-1 languages (where no charset conversion is ever 
needed).  Here again, the issue will be -- does the charset used for storing 
the literal match the GUI font being used on the current platform?  Chances 
are that it doesn't.  Blech.  

This should definitely be fixed, too, but I'm not sure what the right 
solution here will be.  

We certainly don't want to start storing the same translation in multiple 
charsets (one per platform) -- that'd be a maintenance nightmare.  

Sigh. 

Paul