On Thu, Jul 01, 1999 at 03:51:59PM -0400, Adam Di Carlo wrote:
> Yes, I think we *can* assume (without further evidence) that the
> output is in ISO Latin 1, that is, the extended ASCII 255 character
> set.

Sorry, I do not understand. How can we assume the output is in ISO
Latin 1 when I see the contrary? :)

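To illustrate why an 8-bit stream cannot simply be assumed to be ISO
Latin 1, here is a minimal Python sketch. The sample byte and the
Russian word are made up for illustration; they are not taken from the
actual debiandoc-sgml output.

```python
# Hypothetical sample: one 8-bit byte with code 169.
raw = bytes([169])

as_latin1 = raw.decode("latin-1")  # read the byte as ISO Latin 1
as_koi8r = raw.decode("koi8_r")    # read the very same byte as KOI8-R

print(repr(as_latin1))             # the copyright sign under ISO Latin 1
print(as_latin1 == as_koi8r)       # the two readings disagree

# A Russian word stored as KOI8-R also decodes "successfully" as
# ISO Latin 1, because every byte value is valid there. The result is
# simply different text, and nothing in the bytes says which is right.
russian_bytes = "\u041f\u0440\u0438\u0432\u0435\u0442".encode("koi8_r")
misread = russian_bytes.decode("latin-1")
print(len(russian_bytes), repr(misread))
```

Both decodings succeed without any error, so the charset has to come
from outside the byte stream itself.
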
> In fact, ISOLat1 can be considered a degenerate case of UTF8.

I doubt that. It cannot. I believe for the upper half of ISOLat1, we
would use two bytes, no?

> >I believe the problem *you* encountered with
> >processing Russian translation has origins in fact that the output files are
> >not in Unicode.
>
> Actually, I think the situation is that they are charset koi8-r, but
> with some ISO Latin 1 characters mixed in.

Yes, but the latter could not be easily distinguished.

> >For all charset that have (C) symbol for code 169, the output will look fine.
> >Then, when you try to process the latex output from debiandoc2latex, you get
> >a lot of errors since in cyrillic font there is no symbol with code 169.
>
> I can assure you that the current system works for American/European
> language fine (german, french, english, probably much more), including
> PDF, HTML, and all other outputs in all the applications (xpdf,
> acroread, netscape, lynx, w3-el) I bothered to check.

Yes, since they all use ISOLat1, where (C) has code 169.

> >So the question is: what to do?
>
> Well, first off, I think moving from ISOLat1 (8 bit chars) to SDATA in
> sgml-data will break a lot more than it's going to fix. For instance,
> all the stuff working above would probably break (maybe you could test
> that?)

Certainly, since the debiandoc2* scripts make no attempt to process
system data entities.

> Secondly, I would suppose that you need to check for 8-bit ISOLat1
> character in your input stream, and convert them to whatever character
> set you are using. And probably do this pretty early in the chain,
> i.e., before we start branching out into TeX, HTML, etc etc.

So, you propose replacing every © with whatever is needed in koi8-r in
the case of the Russian translation? Well, I just cannot do that: the
KOI8-R charset does not have a (C) character. If you meant something
else, please clarify your proposal.

> >The one we use for making the documentation from DebianDoc DTD is Unicode
> >aware?
> >And do we really supply it with Unicode file?
>
> Well, see above. It is producing proper PDF files.

Does your reference to PDF mean that in the case of PDF we must use
Unicode in some way?

[ stuff about sgml-tools v1 skipped ]

OK. I just wanted to give an example of a system where SDATA entities
are used with success. :)

> We should probably make the fact that debiandoc-sgml is default
> encoded in ISOLat1 (or UTF8, if we wanna go there) more explicit in
> the documentation or elsewhere. Perhaps this should be filed as a
> bug on debiandoc-sgml.

I do not quite understand why you think that debiandoc-sgml's output is
encoded in ISOLat1. The output of debiandoc-sgml is just a plain 8-bit
stream. I believe nobody can tell what it is.

One more issue (I just took a more thorough look at the entities
supplied by sgml-data): why do some files provide Unicode equivalents
for entities and some proprietary SDATA? Is this by design?

--
Mike
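
P.S. On the question of whether the upper half of ISOLat1 takes two
bytes in UTF8, a small Python sketch can check it directly. The helper
name utf8_lengths is made up for this illustration and is not part of
any tool discussed in this thread.

```python
def utf8_lengths(lo, hi):
    """Return the set of UTF-8 byte lengths for code points lo..hi-1."""
    return {len(chr(code).encode("utf-8")) for code in range(lo, hi)}

# Lower (ASCII) half: one byte each, identical in ISO Latin 1 and UTF-8.
ascii_half = utf8_lengths(0, 128)

# Upper half of ISO Latin 1 (codes 128..255): two bytes each in UTF-8.
upper_half = utf8_lengths(128, 256)

copyright_latin1 = "\u00a9".encode("latin-1")  # a single byte, code 169
copyright_utf8 = "\u00a9".encode("utf-8")      # a two-byte sequence

def is_valid_utf8(data):
    """Report whether the given bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(ascii_half, upper_half)
print(copyright_latin1, copyright_utf8)
print(is_valid_utf8(copyright_latin1), is_valid_utf8(copyright_utf8))
```

So only the ASCII half is shared byte-for-byte between the two
encodings. An ISO Latin 1 file containing (C) as the single byte 169 is
not a valid UTF8 file.
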

