At 09:29 AM 2/29/00 -0600, [EMAIL PROTECTED] wrote:
>Yes, this does sound like a problem.  I have little experience with
>problems like these, but they'll need to be solved to make AbiWord
>usable to those outside of the Latin-1 set of languages.
>
>As usual, there seem to be two problems (and I'll try to keep them
>short, so people more qualified than me can jump in).  The first
>is internationalization of the document, so that users can use their
>native character sets to write documents.  The encoding issues here
>are specific to the fonts the user is using, and should map into
>Unicode space if we do our job correctly.  We haven't done much
>work on this front for non-Latin-1 languages, so I wouldn't be surprised
>if we have no fonts that can handle KOI8-R encodings.
>
>Actually, problem 1.5 is the importers and exporters, which will need
>to do things like character mapping conversions (like you mention).

Nice summary.  To put it another way, there are potentially as many as
three charsets in play here:

  input -- from keyboard, word document, etc.
  document -- we're *always* storing Unicode internally
  output -- fonts for display, and encodings for exporters

We know that the following special cases work properly:

1.  Import/export of native ABW content should just work.  We store
internally using UCS-2, and both the importer and exporter crunch that
down into network-safe ampersand-encoded XML content.

2.  Typing and display for Latin-1 languages should also just work.  The
naive mappings from keyboard to internals to display all work trivially.

Henrik has started on the work needed to invoke the charset conversions
so that incoming content gets converted correctly to Unicode.  I'm not
sure whether we've done any of the display conversions yet for
non-Unicode fonts.

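
To make the three-charset picture concrete, here is a minimal sketch in
Python (not AbiWord's C++ code).  The function names are hypothetical and
chosen only for illustration.  It assumes the input arrives as KOI8-R
bytes, and it uses numeric character references as a stand-in for the
"ampersand-encoded" XML content mentioned above.

```python
# Hypothetical sketch of the input -> document -> output charset flow.
# None of these names come from AbiWord; they are illustrative only.

def import_text(raw: bytes, input_charset: str) -> str:
    """Input stage: convert incoming bytes to Unicode exactly once."""
    return raw.decode(input_charset)


def export_xml_text(document: str) -> bytes:
    """Output stage for XML: pure ASCII, with every non-ASCII character
    written as a numeric character reference (&#NNNN;)."""
    escaped = (document.replace("&", "&amp;")
                       .replace("<", "&lt;")
                       .replace(">", "&gt;"))
    return escaped.encode("ascii", errors="xmlcharrefreplace")


def export_legacy(document: str, output_charset: str) -> bytes:
    """Output stage for a legacy exporter or a non-Unicode font."""
    return document.encode(output_charset)


# Simulated input: Russian text as it would arrive in KOI8-R.
koi8_input = "Привет".encode("koi8_r")

document = import_text(koi8_input, "koi8_r")      # Unicode from here on
xml_bytes = export_xml_text(document)             # safe for ASCII-only transport
koi8_output = export_legacy(document, "koi8_r")   # same bytes we started with
```

The point of the sketch is that the document stage never needs to know
which charset the input or the output uses; each conversion happens only
at the boundary.
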
However, there's probably an additional special case which happens to work
right now, even though it shouldn't:

  input -- some funky code page
  document -- not really Unicode, because it never got converted
  output -- display using fonts in that same funky code page

In this case, it might appear to users that everything just works, but the
actual content being manipulated is "secretly" in that mystery code page,
which means that the document is almost guaranteed not to be portable.

This should definitely be fixed.

>The second problem, which I just realized AbiWord has, is that localizations
>need encodings too.  All our current menu, toolbar, and string sets
>map into Latin-1 space, but for locales with more than one encoding,
>we may have to figure out a way to represent them all.

Ick and double ick.

This may not be a problem for the strings mechanism, since the XML header
mentions the charset used and expat presumably is converting that content
into Unicode for us.  So long as the GUI font also covers the right
Unicode range, this may just work.

However, the string literals in the hardwired toolbar and menu tables will
only be portable for Latin-1 languages (where no charset conversion is ever
needed).  Here again, the issue will be -- does the charset used for
storing the literal match the GUI font being used on the current
platform?  Chances are that it doesn't.  Blech.

This should definitely be fixed, too, but I'm not sure what the right
solution here will be.  We certainly don't want to start storing the same
translation in multiple charsets (one per platform) -- that'd be a
maintenance nightmare.  Sigh.

Paul
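
For illustration, the "secret code page" case described above can be
reproduced in a few lines of Python.  This is a sketch under assumptions,
not a trace of what AbiWord actually does: it assumes the input is KOI8-R
and that the unconverted bytes are copied one-for-one into the internal
character slots, which is the same as decoding them as Latin-1.

```python
# Sketch of the accidental "works locally, breaks elsewhere" case.
# Assumption: unconverted bytes land one-for-one in the internal
# character slots, which is equivalent to decoding them as Latin-1.

raw = "Привет".encode("koi8_r")      # bytes delivered by the input side

# Broken path: no conversion, each byte becomes one "Unicode" value.
stored = raw.decode("latin-1")

# Displayed through a KOI8-R font, the glyphs happen to look right,
# because the same byte values are simply reinterpreted on the way out.
shown_locally = stored.encode("latin-1").decode("koi8_r")

# But the stored content is not the intended Unicode text, so any
# consumer that trusts it (another platform, an exporter) gets garbage.
is_portable = (stored == "Привет")   # False

# Fixed path: convert once at the input boundary.
fixed = raw.decode("koi8_r")
```

With the conversion done at input time, the stored text is real Unicode
and display becomes a separate mapping from Unicode to whatever the font
expects.
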
