Hi Rugxulo,

>>> Unicode (now at 6.0) is pretty damn huge. I don't know if...
>> While Unicode is huge, DOS keyboard layouts tend to be limited to
>> Latin and Cyrillic and some other symbols, which is a tiny subset.
> Well, determining which "subset" (for us) is the main problem.

You could start using RECODE (DJGPP port if you like) and
convert all DOS keyboard layouts that you can find from
the codepage for which they are made into Unicode, then
make a list of all distinct characters that you found...

I would be surprised if it were more than 1000 of them,
so even the "12000 useful chars" is a very high estimate.
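The counting half of that idea is easy to sketch. A rough Python sketch, with an arbitrary handful of DOS codepages standing in for the full set of keyboard layouts (RECODE would do the codepage-to-Unicode step for real):

```python
# Decode every byte of a few DOS codepages to Unicode and count how
# many distinct characters show up overall. The codepage list here is
# just an example; real keyboard layouts would add a few more pages.
pages = ["cp437", "cp850", "cp852", "cp866", "cp857", "cp737"]
distinct = set()
for cp in pages:
    distinct |= {bytes([b]).decode(cp, errors="ignore") for b in range(256)}
distinct.discard("")  # bytes with no mapping decode to nothing
print(len(distinct))  # well under 1000 even across six codepages
```

Each page shares the 128 ASCII positions, so the union grows slowly as you add pages.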

>> Of course there are "input methods" where you can type multiple
>> keys and/or press complex combinations of keys to enter e.g. CJK
>> glyphs (Chinese Japanese Korean) but that is a quite different

Therefore, as somebody else already said, topic for another thread.

> Right-to-left might be hard to do (I guess?)

Not really, but it often means MIXING directions when you want
to mention ASCII words in a right-to-left system, I would guess.
Yet again something for another thread or simply for Blocek ;-)

> I think I read on Wikipedia the other day that Unicode was originally
> only 16-bit, e.g. they thought it would cover "most popular languages
> currently in use", but it was later expanded.

Yes. And as said, while 20 bits can be encoded as 2 surrogates
of 16 bits each, UTF-8 can be used to encode up to 31 bits, and
apart from UTF-16 there is of course the possibility of UTF-32.
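A small Python illustration of those size relations, using one character beyond the 16-bit range (the surrogate arithmetic is the standard UTF-16 scheme):

```python
ch = "\U0001F600"                   # a character outside the 16-bit BMP
print(ch.encode("utf-8").hex())     # f09f9880: 4 bytes in UTF-8
print(ch.encode("utf-16-be").hex()) # d83dde00: a surrogate pair
print(ch.encode("utf-32-be").hex()) # 0001f600: fixed 4 bytes

# The 20-bit offset split across two 16-bit surrogates, by hand:
code = ord(ch) - 0x10000
hi, lo = 0xD800 + (code >> 10), 0xDC00 + (code & 0x3FF)
print(hex(hi), hex(lo))             # 0xd83d 0xde00, matching UTF-16 above
```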

>>> 1). Chinese (hard)

Well, it just needs multiple infrastructure pieces, such as a
special DISPLAY driver (graphics mode, e.g. a 16x16 font, or
2*8x16 in text mode if you can manage to do everything with 512
half-char shapes?), a special keyboard input method driver, a
kernel with DBCS support, plus of course DBCS awareness in the
apps which want to use CJK...

>>> 4). Arabic (easy??)
>> Unicode lists maybe 300 chars for that, at most.
> Really? Wikipedia lists 28 char alphabet (single case), IIRC.

I was just checking the ranges of char numbers, not how
well they are actually populated. Maybe accents added?
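How densely the range is actually populated is easy to check with Python's unicodedata (the exact count depends on the Unicode version the interpreter ships with):

```python
import unicodedata

# Count assigned code points in the basic Arabic block U+0600..U+06FF.
assigned = sum(1 for cp in range(0x0600, 0x0700)
               if unicodedata.name(chr(cp), ""))
print(assigned)  # far more than the 28 base letters: diacritics,
                 # Koranic marks, digits, extended letters for other
                 # languages written in the Arabic script...
```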

[Hindi Devanagari Bengali...]

Well, that sounds like a case for an ISCII codepage font :-)

>> The well-known cyrillic codepages squeeze ASCII and Cyrillic
>> (probably not all theoretically possible accents) in 256 chars.
> Probably like others only includes the "important" stuff.

Not necessarily. Like Latin, Cyrillic does not have THAT
many accents in the language family. The problem is that
DOS codepages often try to have too many symbols or even
box drawing chars. Which in turn means that there is no
DOS codepage with ALL Latin accented chars in it, even
though I get the impression that 2 * 60 accented chars
would cover the big majority of all Latin-like writing systems.
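As a rough sanity check of that "2 * 60" guess, one can count the precomposed Latin letters with a canonical decomposition in just the first two accented blocks (the full tally across all Latin Extended blocks is larger):

```python
import unicodedata

# Latin-1 Supplement letters + Latin Extended-A: U+00C0..U+017F.
# Keep letters with a canonical (not <compat>) decomposition, i.e.
# base letter + accent pairs that also exist precomposed.
accented = [chr(cp) for cp in range(0x00C0, 0x0180)
            if unicodedata.category(chr(cp)).startswith("L")
            and unicodedata.decomposition(chr(cp))
            and not unicodedata.decomposition(chr(cp)).startswith("<")]
print(len(accented))  # on the order of 160, so 2*60 is the right ballpark
```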

>>> 9). Japanese (hard)
> I didn't even look this one up, but I vaguely remember reading once
> that they use two or three scripts (ugh): hiragana, kanji, katakana

In general the CJK language family seems to have simplified
and more ornamental / older versions for their big set of
word / syllable like glyphs. Plus indeed one or two ways of
writing more alphabetically for e.g. foreign words. And the
latter is small - I remember that small text-only LCD matrix
displays (e.g. with 5x7 font) only use 7-8 bits per char :-)

> BTW, wasn't your major in something like computational linguistics?

You remember my old email address at coli.uni-sb.de correctly ;-)

>> charsets like ASCII or Latin need only 1-2 bytes while you can
>> still encode up to 31 bits: U+07FF still fits 2 bytes and all
>> 16 bit chars need only 3 bytes, the rest is very rare...
> I think the real (proposed) advantage is that it doesn't waste space
> if your main language(s) are Western. Also the byte stream is
> recoverable if interrupted...

Yes, which means if you send UTF-8 to a display which expects
1 byte per char (e.g. Latin) or Latin to a display which wants
UTF-8, the mess will be local to around the non-ASCII parts :-)

Also, while 2 bytes of UTF-8 for hex 0 to 7ff might focus on
western languages, 3 bytes for up to 16 bits of Unicode for
all CJK glyphs and almost all other writing systems is okay.
Of course CJK people might still prefer then-smaller UTF-16?
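Both points, the size classes and the "mess stays local" property, are easy to demonstrate in Python; the resync trick works because UTF-8 continuation bytes are always marked 10xxxxxx:

```python
# One character per UTF-8 size class mentioned above:
for ch, n in {"A": 1, "\u07ff": 2, "\u4e2d": 3, "\U0010ffff": 4}.items():
    assert len(ch.encode("utf-8")) == n

# Self-synchronization: after damage, skip continuation bytes
# (10xxxxxx) until the next lead byte and carry on from there.
data = "héllo".encode("utf-8")[2:]  # chop off part of a character
i = 0
while i < len(data) and data[i] & 0xC0 == 0x80:
    i += 1
print(data[i:].decode("utf-8"))  # "llo": the damage stayed local
```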

> Right, but most Unicode-aware software isn't combining friendly

Dunno. I get the impression that "your mileage may vary", in
particular if you use rare (combinations of) accents. Also,
not all software uses accents in a well-defined way either.
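The precomposed/combining split that trips up such software is visible directly in Python's unicodedata, via the NFC and NFD normalization forms:

```python
import unicodedata

# "é" exists both precomposed and as base letter + combining accent;
# NFD and NFC convert between the two representations.
precomposed = "\u00e9"                        # é as one code point
combining = unicodedata.normalize("NFD", precomposed)
print([hex(ord(c)) for c in combining])       # ['0x65', '0x301']
print(unicodedata.normalize("NFC", combining) == precomposed)  # True

# Software that is not "combining friendly" sees two different strings:
print(precomposed == combining)               # False without normalizing
```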

> Well, I was just thinking how to save space.

Even if you precompose chars with their accents, it will
compress quite well as a font file ;-)

> But do most people even view or edit multiple languages

I remember that the Bitstream Cyberbit TTF font also had
a non-CJK edition which is only a few 100 kB AFAIR, while
still covering many languages. Maybe no Cherokee or such.

> I forgot that the DPMI standard supports 286 and 386, but writing a
> TSR for DPMI is pretty much hard to (not quite) impossible (and ugly).

I somehow doubt that. You could do something small and evil,
such as hooking basic int 10 functions like function 0e, TTY.
No need to do big, complex multi-interrupt, I/O and API
activities. Just receive text and render it as graphics.

Of course it will not work with apps which write to b800:xyz,
so trapping and redirecting that would be the bonus exercise,
but I think I even did that in real mode once. Not as a real
trap, but by keeping 128 kB of graphics RAM from a000 to bfff
enabled and periodically checking b800:xyz for changes, which
would then be rendered with a font as graphics. Very long ago ;-)
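That periodic-checking approach can be sketched abstractly (plain Python here, with a list standing in for the 80x25 buffer of char+attribute words at b800:0000, and a placeholder render callback; the real thing would of course live in the driver):

```python
COLS, ROWS = 80, 25
live = [0x0720] * (COLS * ROWS)   # the "hardware" text buffer: blanks
shadow = list(live)               # snapshot of what was last rendered

def poll(render):
    """Compare the live buffer to the snapshot, redraw changed cells."""
    changed = 0
    for i, cell in enumerate(live):
        if cell != shadow[i]:
            render(i % COLS, i // COLS, cell)  # draw one glyph as graphics
            shadow[i] = cell
            changed += 1
    return changed

live[0] = 0x0748                      # an app "writes" 'H' at row 0, col 0
print(poll(lambda x, y, cell: None))  # 1: one cell redrawn
print(poll(lambda x, y, cell: None))  # 0: nothing changed since
```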

Eric :-)

Freedos-user mailing list
