Hi,

On 7/3/11, Eric Auer <e.a...@jpberlin.de> wrote:
>>>
>>> That could work with a suitably compressed representation, e.g. only
>>> using the start of Unicode char space, or only discontinous subsets.
>>
>> Unicode (now at 6.0) is pretty damn huge. I don't know if...
>
> While Unicode is huge, DOS keyboard layouts tend to be limited to
> Latin and Cyrillic and some other symbols which is a tiny subset.

Well, determining which "subset" (for us) is the main problem.

> Of course there are "input methods" where you can type multiple
> keys and/or press complex combinations of keys to enter e.g. CJK
> glyphs (Chinese Japanese Korean) but that is a quite different
> story compared to normal keyboard drivers. You would also have
> to use DBCS (DOS wide 16 bit character support) which yet again
> needs extra drivers and support by DOS apps might be limited.

I'm actually surprised (I had forgotten) how many languages have pretty
clever workarounds for using 7-bit ASCII. Wikipedia lists a lot of
popular variants for non-Roman-alphabet languages.

> If you do not count CJK and right-to-left languages and REALLY
> exotic languages and symbols (maths, dingbats), Braille etc etc
> then the number of Unicode characters that people are likely to
> type on their keyboard in DOS is quite manageable. Of course it
> is still fine to have a somewhat more complete font in DISPLAY.

Right-to-left might be hard to do (I guess?), but technically as long
as they can see and enter what they want, I'm sure they can get used
to left-to-right. BTW, there was an old Forth for DOS with Korean font
support called hforth ('h' for Han). And Common Forth (cfdjgpp2)
supported Chinese, from a quick look.

>> that). And then I (erroneously?) thought BMP ("basic multilingual
>> plane") was the easy, two-byte Western portion, but apparently that's
>> not true.
>
> Interestingly, Unicode has a way of encoding 20 bit characters
> as pairs of 16 bit surrogates. But even then, the first 64k
> pretty much cover all normal stuff and even in that set, about
> half is simply the large collection of CJK symbols as you see:

I think I read on Wikipedia the other day that Unicode was originally
only 16-bit, i.e. they thought it would cover "most popular languages
currently in use", but it was later expanded.
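
For what it's worth, the surrogate arithmetic Eric describes fits in a
few lines of C. This is purely my own illustration (the function name
is made up, it's not from any of our tools):

#include <stdio.h>

/* Sketch only: split a code point above U+FFFF into a UTF-16
   surrogate pair, as described above. */
static void to_surrogates(unsigned long cp, unsigned *hi, unsigned *lo)
{
    cp -= 0x10000UL;                         /* 20 bits remain       */
    *hi = 0xD800 + (unsigned)(cp >> 10);     /* high surrogate       */
    *lo = 0xDC00 + (unsigned)(cp & 0x3FF);   /* low surrogate        */
}

int main(void)
{
    unsigned hi, lo;
    to_surrogates(0x1D11EUL, &hi, &lo);      /* MUSICAL SYMBOL G CLEF */
    printf("U+1D11E -> %04X %04X\n", hi, lo);  /* prints D834 DD1E   */
    return 0;
}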

>> 1). Chinese (hard)
>
> See above.

We'd have to ask someone "in the know", e.g. Johnson Lam. I think he
had some primitive workaround for PG.

>> 4). Arabic (easy??)
>
> Unicode lists maybe 300 chars for that, at most.

Really? Wikipedia lists a 28-character alphabet (single case), IIRC.

>> 5). Hindi
>
> The writing system is "Devanagari", case insensitive,
> has ligatures, not many characters, like Bengali?

Apparently the Sanskrit alphabet, aka Deva-nagari or just Nagari. Has
some interesting workarounds (e.g. ISCII, I think).

> Similar to what happens with Cyrillic, there is ISCII
> which puts ASCII and Devanagari together in 256 chars,
> even with Bengali and some other scripts (approx?).

There you go, you saw Wikipedia too!   ;-)

>> 6). Bengali
>
> Apparently has ligatures and is case-insensitive?

Aka Bangla (from Bangladesh); it uses Eastern Nagari (similar but not
the same). Looks like it could fit in a code page! Interesting
workarounds include IAST and ITRANS.

>> 7). Portuguese (easy)
>
> Indeed.

Henrique!!!

>> 8). Russian (easy??)
>
> The well-known cyrillic codepages squeeze ASCII and Cyrillic
> (probably not all theoretically possible accents) in 256 chars.

Probably, like the others, it only includes the "important" stuff.

>> 9). Japanese (hard)
>
> See above.

I didn't even look this one up, but I vaguely remember reading once
that they use two or three scripts (ugh): hiragana, kanji, etc. (EDIT:
Seems I forgot katakana.)

>> 10). German, Standard (easy)
>
> Latin1, yes.

BTW, wasn't your major in something like computational linguistics? Of
course, I have no idea what that is. But yeah, languages are fun, even
for a dummy like me.

>> own scripts are a problem, not to mention those like CJK that have
>> thousands of special characters. (e.g. Vietnamese won't fit into a
>> single code page, even.)
>
> When you have Unicode, you do not need codepages.

Right. And when you have a 286 or 386, you don't need to limit to 1 MB
of RAM.   ;-))

> you either have to encode Unicode (or similar encodings) as
> 16 bit characters with DBCS, possibly even 2 per character in
> the surrogate case for 20 bit encodings (Linear B or old Sumer
> Cuneiform from 3000 BC anybody? :-D) or as sequence of single
> bytes in UTF-8. The latter is convenient because frequent DOS
> charsets like ASCII or Latin need only 1-2 bytes while you can
> still encode up to 31 bits: U+07FF still fits 2 bytes and all
> 16 bit chars need only 3 bytes, the rest is very rare...

I think the real (proposed) advantage is that it doesn't waste space
if your main language(s) are Western. Also, the byte stream is
self-synchronizing: if it gets interrupted, you can tell and resume at
the next valid character. I think.  :-/
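
To make Eric's byte counts concrete (1 byte for ASCII, 2 up to U+07FF,
3 for the rest of the 16-bit range, 4 for the rare rest), here is a
rough encoder sketch of my own; utf8_encode is just a name I made up:

#include <stdio.h>

/* Sketch of a UTF-8 encoder (illustration only): writes 1-4 bytes
   for one code point and returns how many bytes were used. */
static int utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {                        /* ASCII: 1 byte         */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {                /* up to U+07FF: 2 bytes */
        out[0] = 0xC0 | (unsigned char)(cp >> 6);
        out[1] = 0x80 | (unsigned char)(cp & 0x3F);
        return 2;
    } else if (cp < 0x10000UL) {            /* rest of the BMP: 3    */
        out[0] = 0xE0 | (unsigned char)(cp >> 12);
        out[1] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
        out[2] = 0x80 | (unsigned char)(cp & 0x3F);
        return 3;
    } else {                                /* beyond the BMP: 4     */
        out[0] = 0xF0 | (unsigned char)(cp >> 18);
        out[1] = 0x80 | (unsigned char)((cp >> 12) & 0x3F);
        out[2] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
        out[3] = 0x80 | (unsigned char)(cp & 0x3F);
        return 4;
    }
}

int main(void)
{
    unsigned char buf[4];
    int i, n = utf8_encode(0x0439UL, buf);  /* Cyrillic small "short i" */
    for (i = 0; i < n; i++) printf("%02X ", buf[i]);
    printf("(%d bytes)\n", n);              /* prints D0 B9 (2 bytes) */
    return 0;
}

Continuation bytes always start with the bits 10, and lead bytes never
do, which is exactly why you can resynchronize after an interruption:
just skip forward to the next byte that isn't 10xxxxxx.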

> Any software which tries to do layout (say, line wrapping or
> tables) has to understand how UTF-8 encodes 1 character as 1-
> or-more bytes, otherwise the layout gets messy. Still, if you
> have a DISPLAY with UTF-8 support, all ASCII (0-127) will be
> as normal and compatible with any ancient software :-)

Ugh, such a pain. But we do have some Unicode-aware tools (e.g. JED
0.99.16+ or VILE or GNU Emacs and of course Mined). I also know that
OpenWatcom's vi became 8-bit friendly not too long ago.
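
That said, the layout headache mostly boils down to counting
characters instead of bytes, and that only needs the program to skip
the continuation bytes. Another throwaway sketch of mine (real layout
code would also need per-character display widths):

#include <stdio.h>

/* Sketch: count code points in a UTF-8 string by skipping
   continuation bytes (10xxxxxx). */
static unsigned long utf8_strlen(const char *s)
{
    unsigned long n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)  /* not a continuation */
            n++;
    return n;
}

int main(void)
{
    /* "Gruesse" with u-umlaut and sharp s, UTF-8 by hand: 7 bytes, 5 chars */
    const char text[] = "Gr\xC3\xBC\xC3\x9F""e";
    printf("%lu bytes, %lu chars\n",
           (unsigned long)(sizeof text - 1), utf8_strlen(text));
    return 0;
}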

Well, old computer languages like Ada83 were 7-bit only, but later
Ada95 was 8-bit friendly (and even Modula-3 defaulted CHAR to
Latin-1). But some (like Java) default to UTF-16 (or maybe UCS-2, is
there a difference?). I'm not sure why I felt the need to mention it,
just saying, "it depends" (and it's complicated). Perhaps my point is
that it wasn't urgent to support "everything" then, and it probably
isn't now either.  :-P

>> Nevertheless, perhaps some way of combining would make the most sense
>> to me, at least for Latin / Roman alphabets. 'a' + macron or 'a' +
>> circumflex or whatever. Then you wouldn't have to store ten million
>> redundant letters that only differ in accents...
>
> On one hand, it saves time with font design.

That's what I was thinking. And yet I was despairing more and more; I
even wondered if just supporting IPA directly would save time / space
somehow.   o_O     Doubt it, approx. 157 chars are needed (too big for
a code page unless you cut out part of the ASCII compatibility). I'm
probably way off base here, just thinking out loud.

> On the other, now
> that you mention it, Unicode also has COMBINING characters, in
> particular of course diacritics. You put those after any char,
> yet you see them in the same column as the char.

Right, but most Unicode-aware software isn't combining friendly (last I heard).

> Some chars can
> even have multiple diacritics. Yet if your font cannot combine,
> or if the combination does not make sense, software tends to
> display the accent AFTER the character as separate char. Also,
> you can "normalize" the combinations together. In particular in
> Latin codepage languages, the combination of char plus accent
> very often already exists as ONE character so software which
> can figure that out does not need the ability to graphically
> combine chars with separately stored diacritics in the font.

Well, I was just thinking how to save space. We don't need a 10 MB
file to lug around, do we? (Well, probably .... And not that anybody
would complain, as long as it worked.)
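
Just to show the idea in (pseudo-)practice: a trivial hardcoded table
can fold a few "base char + combining diacritic" pairs into their
precomposed Latin-1 forms. Everything below is made up by me for
illustration; a real normalizer needs the full Unicode tables:

#include <stdio.h>
#include <stddef.h>

/* Sketch only: fold base + combining diacritic (U+0300 grave,
   U+0301 acute, U+0308 diaeresis) into the precomposed code point
   for a few Latin-1 cases. */
struct compose { unsigned long base, comb, precomposed; };

static const struct compose table[] = {
    { 'a', 0x0300, 0x00E0 },   /* a + grave     -> U+00E0            */
    { 'e', 0x0301, 0x00E9 },   /* e + acute     -> U+00E9            */
    { 'u', 0x0308, 0x00FC },   /* u + diaeresis -> U+00FC            */
};

static unsigned long compose_pair(unsigned long base, unsigned long comb)
{
    size_t i;
    for (i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].base == base && table[i].comb == comb)
            return table[i].precomposed;
    return 0;                  /* no precomposed form in this sketch */
}

int main(void)
{
    printf("u + U+0308 -> U+%04lX\n", compose_pair('u', 0x0308));
    return 0;
}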

> Coincidentally, I wrote a little program which does such a
> normalization in Java (but hey, that is almost C)

<off-topic>

Somebody recently ported DOSBox to Java, BTW, so it's not "that"
different from standard C/C++.    http://jdosbox.sf.net

We used to have Kaffe for DOS, but I never tried it (old!).

BTW, even OS/2 (eCS) got a recent Java port, so if they can get one,
anything's possible!!

</off-topic>

>> Probably easier to just tell them, "Use ICONV.EXE" (or Mined, Blocek,
>> Foxtype, etc).   :-)
>
> Of course - conversion and graphical Unicode text editors like Blocek
> will work fine and without limitation to 256 chars per codepage :-)

But do most people even view or edit multiple languages (of different
families) concurrently???

> Leads to the question what else you want to do with Unicode, and one
> such thing will be file names.

(BARF!) "Modern" software still can't even handle spaces, dollar
signs, periods, tildes, exclamation points, and other "weird"
characters, much less Unicode.

>> Well, we'd have to rebuild those programs. But that means C with
>> widechar support, and I'm not sure which compilers support that.
>
> That depends a lot on which programs we really want to recompile.
> And I would not be surprised if OpenWatcom or DJGPP had wide chars.

I don't know, but I'm pretty sure DJGPP doesn't (or not well, at
least). Not sure about OW; it might (old Japanese compiler texts??).

> Even if not, not many methods will break if you treat UTF-8 as if
> it were 1 byte per char. Of course doing a substring or similar and
> cutting 2 or more bytes which are part of the same char apart will
> mean that your result will be invalid UTF-8 and will look trashy.
>
> Still, a few carefully chosen macros could be enough to make some
> sort of UTF-8 support toolkit even for non-Unicode compilers, so
> you could more easily port your software with help of the macros.

But developers can't even be bothered to do simple things already, so
it's unlikely they want more "workarounds" (sadly). But hey, that's
their problem.
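
Still, the "few carefully chosen macros" Eric mentions wouldn't have
to be much. Something along these lines (the names are mine, not from
any existing toolkit) already lets plain C step through UTF-8 strings
one character at a time:

#include <stdio.h>

/* does this byte start a character, i.e. is it not 10xxxxxx?        */
#define U8_IS_LEAD(c)   ((((unsigned char)(c)) & 0xC0) != 0x80)

/* how many bytes does the character starting at this lead byte use? */
#define U8_LEN(c)       (((unsigned char)(c)) < 0x80 ? 1 : \
                         ((unsigned char)(c)) < 0xE0 ? 2 : \
                         ((unsigned char)(c)) < 0xF0 ? 3 : 4)

/* step a char pointer forward by one whole character                */
#define U8_NEXT(p)      ((p) += U8_LEN(*(p)))

int main(void)
{
    const char *s = "na\xC3\xAFve";         /* "naive" with i-diaeresis */
    const char *p = s;
    int chars = 0;
    while (*p) { U8_NEXT(p); chars++; }
    printf("%d characters in %d bytes\n", chars, (int)(p - s));
    return 0;
}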

>>> PS: That DISPLAY could store a big Unicode font in XMS and cache a
>>> number of recently used chars, or run entirely in protected mode.
>>
>> XMS already assumes 286, so jumping to 386 pmode wouldn't be a far
>> stretch. (I would be surprised if anybody besides Japheth understands
>> 286 pmode these days. It's certainly 1000x less popular than 386
>
> Correct, but one would have to check the performance of that. Yet
> both XMS and software EMS have overhead and given the very coarse
> granularity of EMS (4k or 16k) it might not be that cool for other
> people apart from Jim Leonard with his 8088 with EMS ISA card ;-)

I forgot that the DPMI standard supports both 286 and 386, but
writing a TSR for DPMI ranges from pretty hard to (not quite)
impossible (and ugly). I know we're not necessarily talking about a
TSR here, and 286 pmode tools are fairly rare, but still .... At least
most DOS extenders support various kinds of memory schemes.
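
Back to Eric's PS about DISPLAY: the "cache recently used chars" part
could be as simple as a small direct-mapped table in conventional
memory, with the full font parked in XMS. Purely hypothetical sketch;
xms_fetch_glyph() here is a dummy stand-in for the real XMS
move-memory call (function 0Bh):

#include <stdio.h>
#include <string.h>

#define GLYPH_BYTES  16      /* 8x16 bitmap, one byte per row        */
#define CACHE_SLOTS  256     /* small direct-mapped cache            */

struct slot {
    unsigned long codepoint;             /* which char is cached     */
    unsigned char bitmap[GLYPH_BYTES];   /* its glyph                */
    int           valid;
};

static struct slot cache[CACHE_SLOTS];

/* Dummy stand-in: a real driver would do an XMS block move from the
   big font kept in extended memory. */
static void xms_fetch_glyph(unsigned long cp, unsigned char *out)
{
    memset(out, (int)(cp & 0xFF), GLYPH_BYTES);  /* fake pattern     */
}

static const unsigned char *get_glyph(unsigned long cp)
{
    struct slot *s = &cache[cp % CACHE_SLOTS];   /* direct mapping   */
    if (!s->valid || s->codepoint != cp) {       /* cache miss       */
        xms_fetch_glyph(cp, s->bitmap);          /* load from "XMS"  */
        s->codepoint = cp;
        s->valid = 1;
    }
    return s->bitmap;
}

int main(void)
{
    const unsigned char *g = get_glyph(0x0416UL);  /* Cyrillic ZHE   */
    printf("first row of cached glyph: %02X\n", g[0]);
    return 0;
}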
