On Fri, Jul 02, 1999 at 11:52:56AM -0400, Adam Di Carlo wrote:
> >You see, the construct \|...\| can be easily caught since it's a special
> >thing
> >(`\' in input will be escaped with \ giving \\ in output). Well, in case of
> >SDATA-entities, I see how to make use of them.
>
> I don't see why \|...\| just as easily as ©. They are both unique!
> Furthermore, if we can get the charset of the debiandoc char stream
> sorted out, you can hook up *standard*, already written tools to go
> from one char set to another.

Hmm... It looks like I just did not make it clear. I stated that the
output stream is in an unknown character set (that is, CDATA is just
copied to the output). This means that the 8-bit code 169 stands for an
unknown symbol: if we knew that the stream is iso-8859-1, it would be
(C); if it is koi8-r, it is '_|'. If we find a way to make sure that the
output is in UCS-2, UCS-4, UTF* or another encoding that permits a lot
of symbols from different languages, then yes, processing \|...\| is as
easy as ©. But we have a stream of 8-bit characters of unknown charset,
so we have no choice but to create an external logic (like: everything
that starts with \ has a special meaning) for distinguishing what we
need.
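
To make the point about code 169 concrete, here is a minimal Python sketch. The first part decodes the same byte under the two charsets mentioned above. The second part illustrates the "everything that starts with \ has a special meaning" idea. The names `decode_as`, `TOKEN`, `resolve` and the symbol table are hypothetical and only for illustration; this is not how debiandoc actually implements it.

```python
import re

RAW = bytes([169])  # the 8-bit code 169 discussed above


def decode_as(raw: bytes, charset: str) -> str:
    """Interpret the same bytes under a given (assumed) charset."""
    return raw.decode(charset)


# Assumed marker convention, following the description above:
#   a literal backslash in the input is written as \\ in the output,
#   and \|name\| marks a special symbol.
TOKEN = re.compile(r"\\\\|\\\|([^|\\]*)\\\|")


def resolve(stream: str, symbols: dict) -> str:
    """Replace \\|name\\| markers and undo the backslash doubling."""
    def repl(match):
        if match.group(1) is None:
            # Matched an escaped backslash: emit a single backslash.
            return "\\"
        # Matched a marker: look it up, leave unknown markers untouched.
        return symbols.get(match.group(1), match.group(0))
    return TOKEN.sub(repl, stream)


if __name__ == "__main__":
    for charset in ("iso-8859-1", "koi8_r"):
        print(charset, repr(decode_as(RAW, charset)))
    print(resolve("a\\\\b \\|copy\\| c", {"copy": "(C)"}))
```

The markers are plain ASCII, so they survive whatever 8-bit charset the rest of the stream happens to be in, whereas the meaning of byte 169 depends entirely on knowing that charset.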

> >I am sorry to say that the freshly downloaded and unpacked in a separate
> >directory sgml-data package has ISO* files that define SDATA-entities.
>
> Yes indeed. This inconsistency seems to be a bug.

OK. Should I file it?

> >Well, and now returning to `stock' SGML entities. copy, and certain other
> >entities (like nbsp, for example) are from ISOnum, while in sgml-data package
> >they are defined in both of them (and they are different, BTW).
>
> Some overlap may be ok. ISO defines it -- not Debian!

I beg your pardon? How could this be? Unfortunately, I do not have a
copy of the UNICODE standard. But I doubt that a *standard* could define
the same thing in two or more ways: this is not even an ambiguity. Yes,
I agree that we could have two sets of entities: one defining UNICODE
codes and one defining system data. I believe that in the current
situation we have a severe problem: the first included set wins. That's
really bad.

> >As for working out this problem. There are two possibilities: to make use of
> >SDATA entities in all programs that come with Debian; or to use some Unicode
> >encoding for intermediate/output files.
>
> I opt for unicode. Unless there is a standard that the copyright
> circle 'c' glyph needs to be '[copy ]' and not '[copy ]' nor
> '[COPY ]', that is, unless I am given guidelines by which to
> distinguish the proper notation from the impostor, I am very hesitant
> to do that.

Adam, I opt for whatever permits us to deal with the problem: what we
get is not what we want. I believe SDATA just provides a convenient way
of dealing with certain symbols. Please understand that I do not insist
on using SDATA-entities only; I just want to see the circled c in the
text of the Russian documentation as well as in all other versions.

--
Mike

