Hi Rugxulo, long Unicode story ahead :-)

>>> Regarding FD-KEYB 3.0, I've been think of some ways that it could be
>>> improved.  I think one thing that would help immensely is to turn the
>>> "translation" into a two-step process.  The first step would be to
>>> translate the input scancodes to Unicode, rather than trying to
>>> translate them directly to a code page character as is done now.
>> That could work with a suitably compressed representation, e.g. only
>> using the start of Unicode char space, or only discontinous subsets.
> Unicode (now at 6.0) is pretty damn huge. I don't know if...

While Unicode is huge, DOS keyboard layouts tend to be limited to
Latin and Cyrillic and some other symbols, which is a tiny subset.

Of course there are "input methods" where you can type multiple
keys and/or press complex combinations of keys to enter e.g. CJK
glyphs (Chinese Japanese Korean) but that is a quite different
story compared to normal keyboard drivers. You would also have
to use DBCS (double-byte character sets, the DOS way of handling
16 bit characters), which yet again needs extra drivers, and
support by DOS apps might be limited.

If you do not count CJK and right-to-left languages and REALLY
exotic languages and symbols (maths, dingbats), Braille etc etc
then the number of Unicode characters that people are likely to
type on their keyboard in DOS is quite manageable. Of course it
is still fine to have a somewhat more complete font in DISPLAY.
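To make the two-step idea from the quoted proposal a bit more concrete, here is a minimal sketch in C. The function names (key_to_unicode, unicode_to_cp), the table layout and the three sample entries are made up for illustration and are not taken from FD-KEYB; the scancodes are the usual positions of the umlaut keys on a German layout and the bytes are their CP437 positions.

```c
#include <stddef.h>

/* Step 1: scancode -> Unicode (hypothetical excerpt of a German layout). */
struct key_map { unsigned char scancode; unsigned short unicode; };
static const struct key_map layout[] = {
    { 0x1A, 0x00FC },   /* u with diaeresis */
    { 0x27, 0x00F6 },   /* o with diaeresis */
    { 0x28, 0x00E4 }    /* a with diaeresis */
};

/* Step 2: Unicode -> byte in the active code page (CP437 here). */
struct cp_map { unsigned short unicode; unsigned char byte; };
static const struct cp_map cp437[] = {
    { 0x00E4, 0x84 },
    { 0x00F6, 0x94 },
    { 0x00FC, 0x81 }
};

/* Returns the Unicode value for a scancode, or 0 if the table has none. */
unsigned short key_to_unicode(unsigned char scancode)
{
    size_t i;
    for (i = 0; i < sizeof layout / sizeof layout[0]; i++)
        if (layout[i].scancode == scancode)
            return layout[i].unicode;
    return 0;
}

/* Returns the code page byte, or -1 if the code page lacks the character. */
int unicode_to_cp(unsigned short u)
{
    size_t i;
    if (u < 0x80)
        return (int)u;              /* ASCII is the same in every code page */
    for (i = 0; i < sizeof cp437 / sizeof cp437[0]; i++)
        if (cp437[i].unicode == u)
            return cp437[i].byte;
    return -1;
}
```

With only a few hundred characters to cover, a small sorted table per layout and per code page would probably be compact enough, and changing the code page would only swap the second table.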

> that). And then I (erroneously?) thought BMP ("basic multilingual
> plane") was the easy, two-byte Western portion, but apparently that's
> not true.

Interestingly, Unicode has a way (UTF-16) of encoding the characters
beyond 16 bits as pairs of 16 bit surrogates. But even then, the first 64k
pretty much cover all normal stuff and even in that set, about
half is simply the large collection of CJK symbols as you see:

Each row is 4096 chars (each box 256) and about 3 rows worth
of boxes are somehow useful in left-to-right 8x16 DOS fonts.
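For the surrogate pairs mentioned above, the arithmetic is simple enough to sketch in C. The function name to_surrogates is my own choice, not from any library; the constants are the standard UTF-16 ones.

```c
/* Split a code point above U+FFFF into a UTF-16 surrogate pair.
   Returns 1 on success, 0 if cp does not need (or cannot use) surrogates. */
int to_surrogates(unsigned long cp, unsigned short *hi, unsigned short *lo)
{
    if (cp < 0x10000UL || cp > 0x10FFFFUL)
        return 0;
    cp -= 0x10000UL;                                  /* 20 bits remain */
    *hi = (unsigned short)(0xD800 + (cp >> 10));      /* upper 10 bits  */
    *lo = (unsigned short)(0xDC00 + (cp & 0x3FF));    /* lower 10 bits  */
    return 1;
}
```

For example, U+10000 (the first Linear B syllable) becomes the pair D800 DC00, while anything in the first 64k is stored as a single 16 bit value and the function returns 0.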

> http://www.ethnologue.com/ethno_docs/distribution.asp?by=size

Nice site :-)

> 1). Chinese (hard)

See above.

> 2). Spanish (easy)
> 3). English (easy)

Both Latin1, indeed.

> 4). Arabic (easy??)

Unicode lists maybe 300 chars for that, at most.

> 5). Hindi

The writing system is "Devanagari", which has no upper/lower case,
has ligatures, and not many characters, like Bengali?

Similar to what happens with Cyrillic, there is ISCII
which puts ASCII and Devanagari together in 256 chars,
even with Bengali and some other scripts (approx?).

> 6). Bengali

Apparently has ligatures and no upper/lower case?

> 7). Portuguese (easy)


> 8). Russian (easy??)

The well-known Cyrillic codepages squeeze ASCII and Cyrillic
(probably not all theoretically possible accents) in 256 chars.

> 9). Japanese (hard)

See above.

> 10). German, Standard (easy)

Latin1, yes.

> own scripts are a problem, not to mention those like CJK that have
> thousands of special characters. (e.g. Vietnamese won't fit into a
> single code page, even.)

When you have Unicode, you do not need codepages. But as said,
you either have to encode Unicode (or similar encodings) as
16 bit characters with DBCS, possibly even 2 per character in
the surrogate case for 20 bit encodings (Linear B or old Sumerian
cuneiform from 3000 BC anybody? :-D) or as a sequence of single
bytes in UTF-8. The latter is convenient because frequent DOS
charsets like ASCII or Latin need only 1-2 bytes while you can
still encode up to 31 bits in the original design: U+07FF still
fits in 2 bytes and all 16 bit chars need only 3 bytes, the rest
is very rare... But:
Any software which tries to do layout (say, line wrapping or
tables) has to understand how UTF-8 encodes 1 character as 1
or more bytes, otherwise the layout gets messy. Still, if you
have a DISPLAY with UTF-8 support, all ASCII (0-127) will be
as normal and compatible with any ancient software :-)
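The byte counts above follow directly from how UTF-8 packs the bits. A minimal C sketch, with utf8_encode being a name I made up: it stops at U+10FFFF (4 bytes), which is the limit of current Unicode, rather than going up to the 31 bits of the original design, and it does not reject the surrogate range D800..DFFF.

```c
/* Encode one code point as UTF-8 into out (room for 4 bytes needed).
   Returns the number of bytes written, or 0 if cp is above U+10FFFF. */
int utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80UL) {                 /* ASCII: 1 byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {                /* up to U+07FF: 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {              /* rest of the first 64k: 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000UL) {             /* beyond 16 bits: 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Note that every byte after the first one has the form 10xxxxxx, so a program can always tell where a character starts, which is what makes the layout problem solvable at all.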

> Nevertheless, perhaps some way of combining would make the most sense
> to me, at least for Latin / Roman alphabets. 'a' + macron or 'a' +
> circumflex or whatever. Then you wouldn't have to store ten million
> redundant letters that only differ in accents...

On one hand, it saves time with font design. On the other, now
that you mention it, Unicode also has COMBINING characters, in
particular of course diacritics. You put those after any char,
yet you see them in the same column as the char. Some chars can
even have multiple diacritics. Yet if your font cannot combine,
or if the combination does not make sense, software tends to
display the accent AFTER the character as separate char. Also,
you can "normalize" the combinations together. In particular in
Latin codepage languages, the combination of char plus accent
very often already exists as ONE character so software which
can figure that out does not need the ability to graphically
combine chars with separately stored diacritics in the font.

Coincidentally, I wrote a little program which does such a
normalization in Java (but hey, that is almost C) because
the built-in Java function tended to crash with impossible
combinations. In Java that just means throwing an exception
to the caller, but it was still annoying :-p The program is
basically a bunch of look-up tables for the 10 "actually"
occurring combining diacritics (300..304, 308, 30c, 30f, 328
and 331 hex) and maybe 10-20 other individual combinations.
Still, it covers everything in the big pile of text files I tested.
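The look-up table approach could look roughly like this in C. This is not the Java program mentioned above, only a guess at the general shape: the names compose_pair and compose_buffer and the handful of table entries are assumptions for illustration, and a real table would be much longer.

```c
#include <stddef.h>

/* base char + combining diacritic -> precomposed char (tiny sample). */
struct compose_entry { unsigned short base, mark, composed; };
static const struct compose_entry compose_table[] = {
    { 'e', 0x0300, 0x00E8 },   /* e + grave      */
    { 'a', 0x0301, 0x00E1 },   /* a + acute      */
    { 'o', 0x0302, 0x00F4 },   /* o + circumflex */
    { 'a', 0x0304, 0x0101 },   /* a + macron     */
    { 'u', 0x0308, 0x00FC },   /* u + diaeresis  */
    { 'c', 0x030C, 0x010D },   /* c + caron      */
    { 'a', 0x0328, 0x0105 }    /* a + ogonek     */
};

/* Returns the precomposed character, or 0 if the pair is not in the table. */
unsigned short compose_pair(unsigned short base, unsigned short mark)
{
    size_t i;
    for (i = 0; i < sizeof compose_table / sizeof compose_table[0]; i++)
        if (compose_table[i].base == base && compose_table[i].mark == mark)
            return compose_table[i].composed;
    return 0;
}

/* Composes in place and returns the new length. Unknown combinations are
   left untouched instead of being treated as an error. */
size_t compose_buffer(unsigned short *s, size_t n)
{
    size_t in, out = 0;
    for (in = 0; in < n; in++) {
        unsigned short c = out ? compose_pair(s[out - 1], s[in]) : 0;
        if (c)
            s[out - 1] = c;        /* merge mark into the previous char */
        else
            s[out++] = s[in];      /* keep char as it is */
    }
    return out;
}
```

Because the previous output character is tried again for every following mark, chars with multiple diacritics work too, as long as each intermediate step is in the table.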

>>> This may also be the first step towards getting Unicode somewhat
>>> integrated into DOS, which I know people have been talking about for
>>> a long time.  The second step of the process (Unicode -> Code Page)
>>> could even be implemented as a separate API (perhaps an INT 16h
>>> extension, or something related to DISPLAY?)
> Probably easier to just tell them, "Use ICONV.EXE" (or Mined, Blocek,
> Foxtype, etc).   :-)

Of course - conversion and graphical Unicode text editors like Blocek
will work fine and without limitation to 256 chars per codepage :-)

That leads to the question of what else you want to do with Unicode, and one
such thing will be file names. Those, however, are often not really
critical to layout, so UTF-8 might be sufficient there even if your
FreeCOM or 4DOS command.com or similar just believes that UTF-8 is
a very weird codepage, not being aware that UTF-8 can mean multiple
bytes per character.

>> As discussed earlier, there could be a DISPLAY which processes UTF8
>> instead of ASCII, but it will confuse programs which assume that all
>> string lengths are equal to their byte lengths for the layouting...
> Well, we'd have to rebuild those programs. But that means C with
> widechar support, and I'm not sure which compilers support that.

That depends a lot on which programs we really want to recompile.
And I would not be surprised if OpenWatcom or DJGPP had wide chars.

Even if not, not many methods will break if you treat UTF-8 as if
it were 1 byte per char. Of course doing a substring or similar and
cutting 2 or more bytes which are part of the same char apart will
mean that your result will be invalid UTF-8 and will look trashy.

Still, a few carefully chosen macros could be enough to make some
sort of UTF-8 support toolkit even for non-Unicode compilers, so
you could more easily port your software with help of the macros.
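As a rough idea of what such a toolkit could contain, here is a sketch in plain C without any wide char support. The names UTF8_IS_CONT, utf8_chars and utf8_safe_cut are hypothetical, and the code assumes the input is already valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* True for the 2nd, 3rd, ... byte of a multi-byte character (10xxxxxx). */
#define UTF8_IS_CONT(b) (((unsigned char)(b) & 0xC0) == 0x80)

/* Number of characters (not bytes), e.g. for column counting in layout. */
size_t utf8_chars(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (!UTF8_IS_CONT(*s))
            n++;
    return n;
}

/* Largest cut position <= max_bytes which does not split a character. */
size_t utf8_safe_cut(const char *s, size_t max_bytes)
{
    size_t len = strlen(s);
    if (max_bytes >= len)
        return len;
    while (max_bytes > 0 && UTF8_IS_CONT(s[max_bytes]))
        max_bytes--;               /* back off to the start of the char */
    return max_bytes;
}
```

A program that uses utf8_chars where it used strlen for layout, and utf8_safe_cut before any substring operation, would avoid the two problems described above. Combining characters would still count as one column each, so this is only a first approximation.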

>> PS: That DISPLAY could store a big Unicode font in XMS and cache a
>> number of recently used chars, or run entirely in protected mode.
> XMS already assumes 286, so jumping to 386 pmode wouldn't be a far
> stretch. (I would be surprised if anybody besides Japheth understands
> 286 pmode these days. It's certainly 1000x less popular than 386

Correct, but one would have to check the performance of that. Yet
both XMS and software EMS have overhead and given the very coarse
granularity of EMS (4k or 16k) it might not be that cool for other
people apart from Jim Leonard with his 8088 with EMS ISA card ;-)


Freedos-user mailing list