Re: [Freedos-devel] ASCII to unicode table

2011-01-16 Thread Eric Auer

Hi Alex,

> * in some cases best readability is the target [best user experience]
> * in some cases exact string representation is the target [copy+paste, debug]
> * in some other cases you simply want to be fast [viewing text/binary files]
...
> for exact representations any sort of escaped character sequences might be 
> used.
> readability instead requires different substitution rules.
...
> for the fast case you don't care about accuracy but only
> send e.g. a dot to the console for anything not simply
> convertible - that's how hex editors have done it for ages.

That is interesting, but I also wonder: Which size of font
do people want and where do they want to process Unicode?

In file contents, file names, URLs? Only in a special app
(e.g. Blocek Unicode text editor for DOS) or everywhere?

Do they also want to type Unicode? Or maybe use some sort
of popup char table to enter Unicode? Or just not type it?

Eric


--
Protect Your Site and Customers from Malware Attacks
Learn about various malware tactics and how to avoid them. Understand 
malware threats, the impact they can have on your business, and how you 
can protect your company and customers by using code signing.
http://p.sf.net/sfu/oracle-sfdevnl
___
Freedos-devel mailing list
Freedos-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freedos-devel


Re: [Freedos-devel] ASCII to unicode table

2011-01-16 Thread Alexander Stohr
My personal summary on the practical aspect is:

* in some cases best readability is the target
* in some cases exact string representation is the target
* in some other cases you simply want to be fast

in the first case you are probably aiming for the best user experience.
in the second case you are targeting e.g. debugging or copy & paste on the
shell.
in the third case you are probably listing a text or binary file on the console.

for exact representations any sort of escaped character sequences might be used.

readability instead requires different substitution rules.
for this it is possible, but not in all cases equally desirable,
to change the character sets and fonts of the displaying console.
and probably the most determining factor: reversing the
substitution rules is ambiguous in most cases - only a few
cases (e.g. the german character set with only some 7 extra
characters) have no ambiguities. so save yourself the headache
and never do backwards translations.

for the fast case you don't care about accuracy but only
send e.g. a dot to the console for anything not simply
convertible - that's how hex editors have done it for ages.
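a minimal sketch of this fast path (python purely for illustration;
a real DOS viewer would do the same in C or assembly):

```python
def printable_filter(data: bytes) -> str:
    """replace anything outside printable ASCII with a dot,
    the way classic hex editors render their text column."""
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in data)

print(printable_filter(b"Hello\x00\xb0world"))   # Hello..world
```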

regards, Alex.



Re: [Freedos-devel] ASCII to unicode table

2011-01-15 Thread Jim Michaels
is DPMI out of the question?




From: Eric Auer 
To: freedos-devel@lists.sourceforge.net
Sent: Wed, December 1, 2010 2:59:56 PM
Subject: Re: [Freedos-devel] ASCII to unicode table




Re: [Freedos-devel] ASCII to unicode table

2010-12-01 Thread Eric Auer

Hi Christian,

Using UTF-8 CON with a codepage based app or vice versa
is worse for block graphics than just using the wrong
codepage, as not only the shape but also the number of
displayed characters will change: Everything outside of
basic ASCII takes 2 or more bytes in UTF-8, so display
on codepage-CON will show block graphics "too wide".

In the other direction, codepage byte sequences can contain
invalid continuation groups or invalid start bytes, so trying
to show block graphics from codepage apps on UTF8-CON will
typically show as many "bad char" replacement characters as,
or fewer than, the number of block graphics characters that
the app wanted to display.
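Both directions are easy to demonstrate with ordinary encoding calls
(a Python sketch for illustration; CP437 stands in for the codepage CON):

```python
# The CP437 "light shade" block graphics byte 0xB0 maps to U+2591.
shade = b"\xb0".decode("cp437")
assert shade == "\u2591"

# UTF-8 text on a codepage CON is "too wide": one block graphics
# character becomes three bytes, i.e. three codepage glyphs.
print(len(shade.encode("utf-8")))                 # 3

# The raw codepage byte 0xB0 is not a valid UTF-8 start byte, so a
# UTF-8 CON shows a "bad char" replacement instead.
print(b"\xb0".decode("utf-8", errors="replace"))  # U+FFFD
```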



>> A possible workaround would be dosver-style, to make
>> a per-app decision who uses Unicode.

Because DOS is not multitasking, you do not have to put
status flags in the PSP...  You just switch to codepage
(or whatever default you want) mode when anything exits
and switch to UTF8 mode (...) when either an app starts
which you know to be UTF8 tolerant or when a modern app
explicitly switches to UTF8 mode. You are right that a
TSR pop-up would not fit in that scheme BUT as far as I
remember, pop-ups always write to the VGA directly, so
they cannot use the UTF8 CON. If the UTF8 CON uses the
graphics mode to render text (because otherwise you can
only keep a small 512 char font in hardware) it is very
possible that you will not see your TSR pop-up at all.



> I'd propose to use a new interface instead - this new interface
> then always uses UTF-8, the normal one will use code pages (or
> reject CP-dependent characters). (Of course using only ASCII it
> doesn't matter which interface you use.)

What that could mean is having a UNICODE$ char device,
similar to the existing MORE$ device which you already
know: It forwards text to CON but waits for a keypress
after every 25 line breaks, so MORE$ (moresys) shows
text immediately where MORE (the app) has to wait for
all text to arrive first before starting to show any
of it (because DOS does not have real | pipelines...).

Well... Coming back from this blatant ad ;-) A driver
which provides this UNICODE$ device could either do a
"best effort" translation of incoming text to whatever
the current codepage is at that moment or it could do
a graphical rendering of the text. In the latter case,
you can only show text while the VGA is in graphics
mode which is acceptable for classic CON (just slow)
but which, as said, will break your TSR pop-up text.



> The DOS LFN API works with code page encoded strings.

Wow. Well at least the DOS LFN directory item data is
based on Unicode already. So it could have been worse.

Eric




Re: [Freedos-devel] ASCII to unicode table

2010-12-01 Thread Christian Masloch
> You would need an Input Method driver which lets you type
> complex key sequences or combinations to type in a language
> which has more than the usual few dozen chars of alphabet.

Yes. The (keyboard) input and (screen) output appear to be the most
complicated exercises here. DBCS or UTF-8 support inside other programs
would appear less complicated - as far as I know, DOSLFN properly supports
DBCS. (UTF-8 appears to be easier than DBCS, but I didn't look into the
details of the latter.)

> In addition, you get a sort of graceful degradation: Tools
> which are not Unicode-aware would treat the strings as if
> they use some unknown codepage. So such tools would think
> that AndrXX where XX is an encoding for an accented e has 6
> characters but at least you can still see the "Andr" in it.
>
> In the other direction, if you accidentally put in a text
> with Latin1 or codepage 858 / 850 encoding, you get AndrY
> where Y is the codepage style encoding of the accented "e"
> and the Y and possibly one char after it would be shown in
> a broken way by a CON driver which expects UTF8 instead.

Arguably, the UTF-8 "compatibility" is worse here: with the actual
encoding in any code page (not DBCS or UTF-8), displaying the string in
another code page will replace each non-ASCII character by one random
character of the active code page. With UTF-8, non-ASCII characters are
encoded as multi-byte sequences - resulting in several random characters
of the active code page, where actually only one code-point is encoded.
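The "one code-point, several random characters" effect is easy to
reproduce (a Python sketch; CP437 plays the active code page):

```python
# One code-point, 'é' (U+00E9), shown on a CP437 console:

# Stored as a single Latin-1 byte, you get exactly one wrong glyph.
print("é".encode("latin-1").decode("cp437"))   # 'Θ'

# Stored as UTF-8 (two bytes), you get two wrong glyphs.
print("é".encode("utf-8").decode("cp437"))     # '├⌐'
```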

> I do not understand the "codepoints are 24 bit numbers"
> issue. Unicode chars with numbers above 65535 are very
> exotic in everyday languages

That is why I said it's not that important.

> If you mean UTF8,

No. That would not make sense. A code-point is usually written like
"U+0038", with 4 to 6 hexadecimal digits that give you the numeric value
of that code-point. The "character set", Unicode, defines code-points. The
encoding, UTF-8, defines how (almost) arbitrary numeric values are to be
encoded into a stream of bytes. UTF-8 easily scales to all currently
reserved code-points, including those which do not fit into a 16-bit
number, if the underlying interface supports them. (A 21-bit number is
large enough for all code-points.)
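The distinction can be shown directly (a Python sketch): the code-point
is just a number, and the encoding decides the byte stream.

```python
# A code-point is written as U+ followed by 4 to 6 hex digits.
cp = ord("8")
print(f"U+{cp:04X}")            # U+0038

# The encoding (UTF-8 here) turns that number into bytes.
print("8".encode("utf-8"))      # b'8'

# 21 bits are enough for every code-point (the maximum is U+10FFFF).
print((0x10FFFF).bit_length())  # 21
```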

> I think Mac / Office sometimes might use
> one of the UTF16 encodings but otherwise they are not
> so widespread.

Don't forget FAT's long file names ;-)

Regards,
Christian



Re: [Freedos-devel] ASCII to unicode table

2010-12-01 Thread Christian Masloch
> Combined with, for example, a UTF-8 enabled Super-NANSI to
> make the step from strings to their display, of course. The
> problem would be loss of "ASCII" art block graphics in apps
> which are not using Unicode.

But that happens for some code pages anyway. (For example, CPs 858 and 850  
drop some of the CP 437 block graphics. CPs that need more characters  
probably drop all of them.)

> A possible workaround would be
> dosver-style, to make a per-app decision who uses Unicode.
>
> [...]
>
> Some old apps will only use ASCII anyway which is the same
> for real ASCII and for UTF8 but some others will assume a
> codepage (often 437) to be active. The block graphics and
> other chars from the non-ASCII half of any codepage differ
> in encoding from UTF8 so, as said, any display or similar
> driver would need some way to switch between "classic code
> page mode" and "UTF8 rendering mode". It could switch on
> UTF8 based on explicit request from a modern app or based
> on app name for old but known compatible apps... It would
> switch off UTF8 when any app exits (int 21.4c / 21.31...).

I don't like such an approach. You would have to keep the current status  
in a PSP field. And even then, pop-up TSRs might *interrupt* the currently  
running process (without switching the PSP or saving/restoring other  
fields). One of the TSRs I'm regularly using displays its pop-up using  
block graphics.

I'd propose to use a new interface instead - this new interface then  
always uses UTF-8, the normal one will use code pages (or reject  
CP-dependent characters). (Of course using only ASCII it doesn't matter  
which interface you use.)

> If yes, I do
> assume that the LFN API already is explicit about whether
> UTF8 or rather codepage style encoding should be used?

The DOS LFN API works with code page encoded strings.

Regards,
Christian



Re: [Freedos-devel] ASCII to unicode table

2010-11-30 Thread Steve Nickolas
On Wed, 1 Dec 2010, Eric Auer wrote:

> Compatible apps would be apps which only display ASCII out
> of themselves and which make no serious assumptions about
> one byte being equal to one character. A good example are
> MORE and TYPE: If you TYPE an UTF8 text with a special CON
> driver which expects and renders UTF8, it will simply work
> because TYPE passes the text file 1:1 and only uses plain
> ASCII for built-in messages, if any. A good counter example
> are PG and EDIT: They make the byte-is-character assumption
> for scrolling (in particular horizontal scrolling) and EDIT
> uses block graphics chars of codepages. So you have to put
> your CON driver in NON-Unicode mode while using EDIT or PG.

Kind-of like "chev us" or "chev jp" in DOS/V.

> Or is the idea to have "Unicode everywhere", even in the
> PrintScreen hotkey, TREE, Undelete, the volume label for
> SYS / FORMAT / VOL / LABEL, tools like FIND or DEBUG...?

Prolly.  Though for PrtSc, isn't that what GRAPHICS is for?

-uso.



Re: [Freedos-devel] ASCII to unicode table

2010-11-30 Thread Eric Auer

Hi Christian,

>> Should the translation be "accurate" or should it be "useful"?

That depends a lot on which languages we are talking about.

For the DISPLAYING of already existing strings such as file
names on some USB stick made by somebody using Linux, MacOS
or Windows, if your language is "something latin", you can
get reasonable results with a simplified display which just
drops accents from characters if your current codepage does
not have the needed accented char but has a similar char.
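That kind of "drop the accent" fallback can be sketched with Unicode
decomposition (Python for illustration; a DOS driver would use a
precomputed table instead):

```python
import unicodedata

def ascii_fallback(name: str) -> str:
    """Decompose accented characters and drop the combining marks,
    keeping the plain base character where one exists."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(ascii_fallback("Café Motörhead"))   # Cafe Motorhead
```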

If you try the same with Russian, you will at least have to
switch to a Cyrillic codepage or maybe have both active at
the same time (VGA supports dual codepages: 512 chars). But
if our imaginary USB stick contains the Anime collection of
your Japanese friend, any attempt to display the file names
in any western or Cyrillic codepage will look really bad.

In the other direction, you may want to GENERATE strings in
Unicode. Of course KEYB, MKEYB and similar support switched
and local codepages. I assume that DOSLFN, KEYB and DISPLAY
can signal each other to let you use a suitable layout and
codepage to give your files Cyrillic names, display them in
the right way and read/write file names as UTF8 on your USB
stick... Somebody should check the documentation for more
details ;-). Yet again, try the same with ASIAN languages:

You would need an Input Method driver which lets you type
complex key sequences or combinations to type in a language
which has more than the usual few dozen chars of alphabet.

For CJK languages, you typically also need a wide font; the
usual 8 or 9 pixels of width will usually not be enough. So
you probably end up using a graphics mode CON driver or some
similar system, probably with a relatively big font with at
least hundreds of different character shapes in RAM, maybe XMS.

> UTF-8 is independent of byte-order. The exact encoding (and byte-order)
> should always either be implicit (in the interface's or format's
> definition) or be marked in some way. The definition of a string's length
> (possibly number of bytes/words/dwords, number of code-points, number of
> "characters") need not be addressed by such an interface. If there is a
> need for a buffer or string length (see below) a new interface should just
> define that all "length" fields/parameters give the length in bytes.

I would also vote for UTF8: It keeps ASCII strings unchanged
and strings with only a few non-ASCII chars will only get a
few bytes longer, e.g. strings with accented chars in them.

In addition, you get a sort of graceful degradation: Tools
which are not Unicode-aware would treat the strings as if
they use some unknown codepage. So such tools would think
that AndrXX where XX is an encoding for an accented e has 6
characters but at least you can still see the "Andr" in it.

In the other direction, if you accidentally put in a text
with Latin1 or codepage 858 / 850 encoding, you get AndrY
where Y is the codepage style encoding of the accented "e"
and the Y and possibly one char after it would be shown in
a broken way by a CON driver which expects UTF8 instead.
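Both the graceful degradation and the failure mode can be shown
concretely (a Python sketch, with "André" as the example string):

```python
name = "André"

# UTF-8: the ASCII prefix survives untouched; a non-Unicode tool
# simply sees six bytes where there are five characters.
utf8 = name.encode("utf-8")
print(len(name), len(utf8))   # 5 6
print(utf8[:4])               # b'Andr' - still readable

# Codepage direction: CP850 stores the accented e as the single
# byte 0x82, which is an invalid start byte for a UTF-8 CON.
cp850 = name.encode("cp850")
print(cp850)                  # b'Andr\x82'
try:
    cp850.decode("utf-8")
except UnicodeDecodeError:
    print("shown as a broken char by a UTF-8 CON")
```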



As you already say, for BETTER compatibility, you always
have to be aware whether or not your string uses UTF8 or
codepage encoding. In theory you could also support DBCS
or UTF16-LE or similar, but I would vote against those.

This awareness will mean that you know how to RENDER the
string (e.g. switch fonts or mode of CON driver or use a
built-in rendering as in Blocek) and how many CHARACTERS
and BYTES the string is long and what is ONE CHARACTER,
for example for sorting or when you replace/edit a char.

As said, UTF8 has relatively graceful degradation, but
you still want explicit support for more heavy uses like
text editors, playlists, file managers and similar :-)

I do not understand the "codepoints are 24 bit numbers"
issue. Unicode chars with numbers above 65535 are very
exotic in everyday languages so I would not even start
to support them in DOS. If you mean UTF8, then what you
get is 2 bytes for characters from U+0080 to U+07FF and
3 bytes for characters from U+0800 to U+FFFF - so only
for chars with numbers above 65535 you would need 4 or
even more bytes to UTF8 encode one character :-)
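Those byte lengths are easy to verify per range (a Python sketch):

```python
# UTF-8 length of one character, by code-point range:
for cp in (0x0041, 0x0080, 0x07FF, 0x0800, 0xFFFD, 0x10000):
    print(f"U+{cp:04X}: {len(chr(cp).encode('utf-8'))} byte(s)")
# 1, 2, 2, 3, 3 and 4 bytes respectively
```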

> define what Unicode encoding to use (UTF-8, -16BE, -16LE, -32BE, -32LE)

Luckily UTF8 is quite common and compact and byte order
independent. I think Mac / Office sometimes might use
one of the UTF16 encodings but otherwise they are not
so widespread. The UTF32 encodings are even VERY rare.

> apps have to figure out on their own what encoding their data uses.

That hopefully only affects text editors ;-)

Regards, Eric



Re: [Freedos-devel] ASCII to unicode table

2010-11-30 Thread Eric Auer

Hi Christian,

> Just noticing that this grows quite large. If someone finds this
> unbearable for this list, please speak up to let me know I should cut down
> the off-topic stuff on my public mails!

No problem :-) I would hope that people talk more about the
"big font" approaches - Having either a big Unicode font in
XMS or maybe a 512 char double code page in the VGA card...

Combined with, for example, a UTF-8 enabled Super-NANSI to
make the step from strings to their display, of course. The
problem would be loss of "ASCII" art block graphics in apps
which are not using Unicode. A possible workaround would be
dosver-style, to make a per-app decision who uses Unicode.



I do not think that you could trust the data for this. Even
on Linux where Unicode is quite common now, usage of BOM is
rare. People try to keep their set of apps consistent to use
either UTF-8 everywhere or Latin1 everywhere or (preferred)
use whichever the LANG etc environment variables select at
the moment when the app starts. Given that DOS has many old
unmaintained apps, you will have to accept mixing in DOS:

Some old apps will only use ASCII anyway which is the same
for real ASCII and for UTF8 but some others will assume a
codepage (often 437) to be active. The block graphics and
other chars from the non-ASCII half of any codepage differ
in encoding from UTF8 so, as said, any display or similar
driver would need some way to switch between "classic code
page mode" and "UTF8 rendering mode". It could switch on
UTF8 based on explicit request from a modern app or based
on app name for old but known compatible apps... It would
switch off UTF8 when any app exits (int 21.4c / 21.31...).

Compatible apps would be apps which only display ASCII out
of themselves and which make no serious assumptions about
one byte being equal to one character. A good example are
MORE and TYPE: If you TYPE an UTF8 text with a special CON
driver which expects and renders UTF8, it will simply work
because TYPE passes the text file 1:1 and only uses plain
ASCII for built-in messages, if any. A good counter example
are PG and EDIT: They make the byte-is-character assumption
for scrolling (in particular horizontal scrolling) and EDIT
uses block graphics chars of codepages. So you have to put
your CON driver in NON-Unicode mode while using EDIT or PG.



As a general question - I would really like to know for
WHICH APPS people want to have Unicode support. Is this
only about proper display of playlists in MPXPLAY and of
CD, USB or local accented filenames in any file manager?

Is the issue also in general command.com style activity,
probably depending on DOSLFN being present? If yes, I do
assume that the LFN API already is explicit about whether
UTF8 or rather codepage style encoding should be used?

Are text editors also a case which should support Unicode
and if yes, why do you not use for example Blocek then?

Or is the idea to have "Unicode everywhere", even in the
PrintScreen hotkey, TREE, Undelete, the volume label for
SYS / FORMAT / VOL / LABEL, tools like FIND or DEBUG...?

Eric




Re: [Freedos-devel] ASCII to unicode table

2010-11-30 Thread Christian Masloch
> I think your attitude is not very constructive. We have to keep this  
> idea as simple as possible or nobody implements it.

I think some of that is important, even if you only want to implement a  
simple translation. Besides, of course it isn't very constructive to  
*discuss* an idea. Go use DOSLFN's source (free/PD) and implement an  
interface if you want to be constructive, should be enough pointers here  
by now.

> I think it is not needed to make tables UNICODE to ASCII.
> It is sufficient to make ASCII to UNICODE.

Please be specific, I think what you are saying is not what you mean.

I assume that when you say "ASCII" you mean "current code page", because  
ASCII to Unicode (and the reverse) translation doesn't require any table  
at all. Strictly, ASCII is a set of 128 codes - these all have the  
same numeric value as the associated Unicode code-points.

You might be proposing that the implementation should be, as Bret put it,  
"accurate" - ie it should only map exact matches, ignoring "pairs of  
characters that look similar enough" (Bret's "useful"). The literal sense  
of your words is that the implementation should be unable (!) to look up  
what a particular Unicode code-point should be mapped to in the current  
code page (only accurate matches). This is undesirable as it would  
unnecessarily hinder many applications.

> Simple table - on one side 256 bytes - on second side 256 words.
> That is all.

You actually need only 128 words for what you have in mind - the lower 128  
word table entries can be dropped, because the ASCII characters/bytes  
always map directly to Unicode code-points. The byte table (containing the  
associated byte in the current code page) can be dropped entirely, because  
its contents will just count upward.

(That table format matches what DOSLFN uses for simple (256-character)  
code pages. DBCS mapping needs to be a lot more complicated. Though you  
might not care, I suggest one consult DOSLFN's source if one is interested  
in DBCS mapping.)

As I mentioned, with a table consisting of (16-bit) words for the Unicode  
side you cannot map all Unicode code-points. Granted, this is not very  
important in practice.
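The reduced table Christian describes (128 words for the high half of a
256-character code page) can be sketched like this (Python for
illustration, with CP437 as the example code page):

```python
# One Unicode code-point per byte 0x80..0xFF; bytes 0x00..0x7F are
# ASCII and map to themselves, so they need no table entries.
high_table = [ord(bytes([b]).decode("cp437")) for b in range(0x80, 0x100)]

def to_codepoint(byte: int) -> int:
    """Translate one code page byte to its Unicode code-point."""
    return byte if byte < 0x80 else high_table[byte - 0x80]

print(hex(to_codepoint(0x41)))   # 0x41   ('A': ASCII maps 1:1)
print(hex(to_codepoint(0xB0)))   # 0x2591 (CP437 light shade)
```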

Regards,
Christian



Re: [Freedos-devel] ASCII to unicode table

2010-11-30 Thread Ladislav Lacina
I think your attitude is not very constructive. We have to keep this idea as 
simple as possible or nobody implements it.
I think it is not needed to make tables UNICODE to ASCII.
It is sufficient to make ASCII to UNICODE.

Simple table - on one side 256 bytes - on second side 256 words.
That is all. 



Re: [Freedos-devel] ASCII to unicode table

2010-11-30 Thread Christian Masloch
> UniCode is not the panacea it's purported to be.

No, but you have to give them that it's certainly an improvement.

>> UTF-8 is independent of byte-order. The exact encoding (and byte-order)
>> should always either be implicit (in the interface's or format's
>> definition) or be marked in some way.
>
> I don't think there is a way to automatically determine the encoding from
> the data itself,

Yes, you cannot reliably automatically determine encoding. That's why I  
said you should *know* what data you deal with. (Automatic determination  
of encoding is a serious problem in dealing with plain text files, but  
that need not concern a kernel code translation interface such as the one  
I have in mind.)

> and the only way to determine the byte-order (assuming it's
> not UTF-8, not a single character, and is unknown from the context) is to
> include the special BOM (Byte Order Mark) character as the first  
> character
> of the string.

Yes.

> In fact, according to the UniCode spec, if the BOM is not
> included and the byte-order is not clear from the context, you're  
> supposed
> to assume big-endian.

I don't know about that. But I guess that is the case if you say so.

> For file system and similar applications, the interface could just always
> assume a specific format (probably either UTF-8 or UTF-16LE).

Yes. For example, the (in)famous FAT "long" file names are stored in  
UTF-16LE. Their length is determined by their ASCIZ ("UTF-16LZ") nature ie  
they are terminated by a 16-bit word of the value zero.
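Reading such a zero-terminated UTF-16LE name can be sketched as follows
(Python for illustration; the input bytes are a made-up example):

```python
def read_utf16le_z(buf: bytes) -> str:
    """Decode a UTF-16LE string terminated by an all-zero 16-bit word."""
    for i in range(0, len(buf) - 1, 2):
        if buf[i] == 0 and buf[i + 1] == 0:
            return buf[:i].decode("utf-16-le")
    return buf.decode("utf-16-le")

raw = "André.txt".encode("utf-16-le") + b"\x00\x00\xff\xff"
print(read_utf16le_z(raw))   # André.txt
```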

If a file system interface (such as Int21/Int21.71) was to be made  
Unicode-capable I would probably use UTF-8. (Particularly because of the  
ASCII compatibility, where only characters >= 80h ("codepage-dependent" so  
to speak) represent code-points >= U+0080.)

> For a
> general-purpose interface, though, you should be able to handle all
> different kinds of possibilities (including things like "UTF-24" and
> "UTF-64").

UTF-24 would be pretty funny. (FAT24 is an actual idea I had. Would work  
well enough.) Even theoretically, UTF-64 doesn't make a lot of sense: a  
24-bit (let alone 32-bit) encoding can already represent more values than  
are currently reserved for all Unicode code-points. Alignment of each  
single code-point is no particularly good reason to unnecessarily double  
(you might speak of "bloat" (-; ) the space required to store any given  
string. 64-bit alignment of the whole string can still be achieved by  
storing an unused dword behind the actual string if it contains an odd  
number of dwords; accesses can be aligned by always accessing a whole  
qword then selecting the appropriate dword and discarding the other.

> Also, even though you're dealing with DOS doesn't necessarily
> mean everything will be little-endian -- it depends on the source of the
> data.  Certain hardware interfaces (like SCSI) are inherently big-endian,
> and data downloaded from external sources can be almost anything.

Yeah.

> Another possibility is what my UNI2ASCI program does, which is accept
> strings terminated with a specific character (in my case, the UniCode NUL
> character, conceptually similar to ASCIIZ).  A general-purpose program
> should provide more than one way to define a string's length.

I guess specifying the length in bytes is good enough. If you want to  
provide such an interface with NUL-terminated (or CP/M-style dollar-terminated  
(-; ) strings, write a wrapper function that counts the number of non-NUL  
bytes/words/tri-bytes/dwords/qwords before passing the string to that  
interface. For non-UTF-8 Unicode encodings, a number of bytes not  
divisible by the length of the expected units (2, 3, 4, 8) could just  
cause an error.
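Such a wrapper, for the UTF-16 case, might look like this minimal sketch (the function name is made up for illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* Count the bytes of a NUL-terminated ("UTF-16LZ") string, excluding the
   terminating zero word, for handing to a length-in-bytes interface. */
static size_t utf16z_bytes(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n * sizeof(uint16_t);
}
```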

Generally speaking, error handling is important. Correct UTF-8 validation  
isn't pretty though.
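To illustrate why it isn't pretty: a correct validator has to reject truncated sequences, continuation bytes in lead position, overlong encodings, surrogate code-points (U+D800..U+DFFF) and values beyond U+10FFFF. A compact sketch (one possible implementation, not taken from any existing library):

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if the len bytes at p are well-formed UTF-8, else 0. */
static int utf8_valid(const uint8_t *p, size_t len)
{
    size_t i = 0;
    while (i < len) {
        uint8_t b = p[i];
        uint32_t cp;
        int n;                       /* continuation bytes expected */
        if (b < 0x80)      { i++; continue; }
        else if (b < 0xC2) return 0; /* continuation byte, or overlong lead C0/C1 */
        else if (b < 0xE0) { n = 1; cp = b & 0x1F; }
        else if (b < 0xF0) { n = 2; cp = b & 0x0F; }
        else if (b < 0xF5) { n = 3; cp = b & 0x07; }
        else return 0;               /* lead byte F5..FF is never valid */
        if (i + n >= len) return 0;  /* sequence truncated at end of buffer */
        for (int k = 1; k <= n; k++) {
            if ((p[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }
        if (n == 2 && cp < 0x800) return 0;         /* overlong 3-byte form */
        if (n == 3 && cp < 0x10000) return 0;       /* overlong 4-byte form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* UTF-16 surrogate */
        if (cp > 0x10FFFF) return 0;                /* beyond Unicode range */
        i += n + 1;
    }
    return 1;
}
```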

> If you limit
> input to only certain encodings or byte-orders or string/character types,
> then it ceases to be "general-purpose".  Maybe a general-purpose program  
> is
> not what we're really talking about here, but I think one needs to be
> developed.

Yes, yes. I don't think a general-purpose translation program is what was  
initially suggested (correct me though).

Regards,
Christian

Just noticing that this grows quite large. If someone finds this  
unbearable for this list, please speak up to let me know I should cut down  
the off-topic stuff on my public mails!

--
Increase Visibility of Your 3D Game App & Earn a Chance To Win $500!
Tap into the largest installed PC base & get more eyes on your game by
optimizing for Intel(R) Graphics Technology. Get started today with the
Intel(R) Software Partner Program. Five $500 cash prizes are up for grabs.
http://p.sf.net/sfu/intelisp-dev2dev
___
Freedos-devel mailing list
Freedos-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freedos-devel


Re: [Freedos-devel] ASCII to unicode table

2010-11-30 Thread BretJ


Christian Masloch wrote:
> 
> I think it should be accurate for file systems. Such a "useful"
> translation is a good concept for displaying output (maybe even that of
> the DIR command) but not for actually working with the file system. 
> Keyboard input can't map one key to several characters at once (unless you
> randomly (-; decide which one to use) so input handling should use
> one-to-one translation too.
> 

Agreed.  Just further fuel to the fire that both types of translations are
needed (depending on the specific application, even if the application is
"the kernel"), and that this is not a trivial matter.  UniCode is not the
panacea it's purported to be.


Christian Masloch wrote:
> 
> UTF-8 is independent of byte-order. The exact encoding (and byte-order)
> should always either be implicit (in the interface's or format's
> definition) or be marked in some way.
> 

I don't think there is a way to automatically determine the encoding from
the data itself, and the only way to determine the byte-order (assuming it's
not UTF-8, not a single character, and is unknown from the context) is to
include the special BOM (Byte Order Mark) character as the first character
of the string.  In fact, according to the UniCode spec, if the BOM is not
included and the byte-order is not clear from the context, you're supposed
to assume big-endian.
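A BOM sniffer along the lines described could look like this sketch (note that UTF-32LE must be tested before UTF-16LE, because its BOM bytes FF FE 00 00 begin with the UTF-16LE mark; all names here are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

enum enc { ENC_UNKNOWN, ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE,
           ENC_UTF32LE, ENC_UTF32BE };

/* Guess the encoding of a buffer from a leading BOM (U+FEFF).
   ENC_UNKNOWN means no BOM; the caller may then fall back to
   big-endian, as the spec suggests, or to context. */
static enum enc sniff_bom(const uint8_t *p, size_t len)
{
    if (len >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0 && p[3] == 0)
        return ENC_UTF32LE;                      /* must precede UTF-16LE test */
    if (len >= 4 && p[0] == 0 && p[1] == 0 && p[2] == 0xFE && p[3] == 0xFF)
        return ENC_UTF32BE;
    if (len >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return ENC_UTF8;
    if (len >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return ENC_UTF16LE;
    if (len >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return ENC_UTF16BE;
    return ENC_UNKNOWN;
}
```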

For file system and similar applications, the interface could just always
assume a specific format (probably either UTF-8 or UTF-16LE).  For a
general-purpose interface, though, you should be able to handle all
different kinds of possibilities (including things like "UTF-24" and
"UTF-64").  Also, even though you're dealing with DOS doesn't necessarily
mean everything will be little-endian -- it depends on the source of the
data.  Certain hardware interfaces (like SCSI) are inherently big-endian,
and data downloaded from external sources can be almost anything.


Christian Masloch wrote:
> 
> The definition of a string's length (possibly number of
> bytes/words/dwords, number of code-points, number of "characters") need
> not be addressed by such an interface. If there is a need for a buffer or
> string length (see below) a new interface should just define that all
> "length" fields/parameters give the length in bytes.
> 

Another possibility is what my UNI2ASCI program does, which is accept
strings terminated with a specific character (in my case, the UniCode NUL
character, conceptually similar to ASCIIZ).  A general-purpose program
should provide more than one way to define a string's length.  If you limit
input to only certain encodings or byte-orders or string/character types,
then it ceases to be "general-purpose".  Maybe a general-purpose program is
not what we're really talking about here, but I think one needs to be
developed.

Bret
-- 
View this message in context: 
http://old.nabble.com/ASCII-to-unicode-table-tp3031p30341668.html
Sent from the FreeDOS - Dev mailing list archive at Nabble.com.




Re: [Freedos-devel] ASCII to unicode table

2010-11-30 Thread Christian Masloch
> Should the translation be "accurate" or should it be "useful"?

I think it should be accurate for file systems. Such a "useful"  
translation is a good concept for displaying output (maybe even that of  
the DIR command) but not for actually working with the file system.  
Keyboard input can't map one key to several characters at once (unless you  
randomly (-; decide which one to use) so input handling should use  
one-to-one translation too.

> From a technical perspective, you will also at a minimum need to concern
> yourself with translating strings vs. translating single characters  
> (UniCode
> strings can/should include an Endian-defining character at the  
> beginning, as
> well as needing to define how the length of the string is determined),  
> UTF-8
> vs. UTF-16 vs. UTF-32, and Big- vs. Little-endian.  None of this is  
> trivial,
> and I think this is WAY too complicated to be in the kernel -- it should  
> be
> a separate program/driver.

UTF-8 is independent of byte-order. The exact encoding (and byte-order)  
should always either be implicit (in the interface's or format's  
definition) or be marked in some way. The definition of a string's length  
(possibly number of bytes/words/dwords, number of code-points, number of  
"characters") need not be addressed by such an interface. If there is a  
need for a buffer or string length (see below) a new interface should just  
define that all "length" fields/parameters give the length in bytes.

If there were a DOS (kernel) interface, it should probably accept a single  
character (usually one byte, two bytes for DBCS) encoded in the currently  
selected code page and return a Unicode code-point. All code-points fit  
into a 24-bit (= 3-byte) number, though such an interface could be limited  
to Unicode's BMP (16-bit numbers (= words)) like the DOSLFN/VC tables. Of  
course there should be an "accurate" reverse interface which accepts a  
24-bit (or 16-bit) number and returns a one- or two-byte character in the  
current code page, if one exists for that Unicode code-point.
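A sketch of what such a pair of "accurate" lookups could look like, assuming a hypothetical full 256-entry BMP table (only a few CP437 entries are shown here; a DOSLFN-style file would store just the 128 non-ASCII entries, since 00h-7Fh map to themselves):

```c
#include <stdint.h>

/* Illustrative fragment of a CP437 -> BMP code-point table.
   Entries 0x00..0x7F are the ASCII identity, handled in code below;
   unlisted upper entries stay 0 (= no mapping) in this sketch. */
static const uint16_t cp437_to_uni[256] = {
    [0x80] = 0x00C7,  /* latin capital C with cedilla */
    [0x81] = 0x00FC,  /* latin small u with diaeresis */
    [0x82] = 0x00E9,  /* latin small e with acute */
    [0xE1] = 0x00DF,  /* latin small sharp s */
};

static uint16_t cp_char_to_unicode(uint8_t c)
{
    return c < 0x80 ? c : cp437_to_uni[c];
}

/* "Accurate" reverse lookup: the code page character for a BMP
   code-point, or -1 if the code page cannot represent it. */
static int unicode_to_cp_char(uint16_t cp)
{
    if (cp < 0x80) return (int)cp;
    for (int c = 0x80; c < 0x100; c++)
        if (cp437_to_uni[c] == cp) return c;
    return -1;
}
```

The reverse direction is a linear scan here; a real driver would probably keep a sorted index or hash to avoid scanning on every character.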

Notably, some code pages might contain characters that should map to  
several code-points, and some code-points might require more than two bytes  
when represented in the current code page's encoding. A string translation  
interface might therefore be more appropriate. (As an aside, this would  
remove the need for a DBCS kludge because multi-byte mappings could be  
supported intrinsically.) In this case, the interface should exactly  
define which Unicode encoding to use (UTF-8, -16BE, -16LE, -32BE, -32LE) -  
applications have to figure out on their own what encoding their data uses.

Regards,
Christian



Re: [Freedos-devel] ASCII to unicode table

2010-11-29 Thread BretJ

I think there's an even larger question than the technical implementation, in
summary: Should the translation be "accurate" or should it be "useful"?

Officially, I believe there is a precise one-to-one relationship between
ASCII and Unicode, but there are dozens of Unicode characters that "look
like" each ASCII character.

In my UNI2ASCI program (included with my USB drivers) the translation
tables, perhaps, go overboard.  If it receives a UniCode character that
looks (to me) to be close enough to one of the ASCII characters, or a
"string" of ASCII characters, that I think it can be "reasonably"
represented on screen, it gets translated.  UNI2ASCI only translates one way
(UniCode to ASCII), only works with Code Page 437, and is not one-to-one (a
single UniCode character may be translated into a "string" of ASCII
characters).

Keeping this type of translation table totally in memory is probably
impractical because of the amount of memory that would be needed.  However,
I think this type of translation should at least be an option available to
the user.

***

From a technical perspective, you will also at a minimum need to concern
yourself with translating strings vs. translating single characters (UniCode
strings can/should include an Endian-defining character at the beginning, as
well as needing to define how the length of the string is determined), UTF-8
vs. UTF-16 vs. UTF-32, and Big- vs. Little-endian.  None of this is trivial,
and I think this is WAY too complicated to be in the kernel -- it should be
a separate program/driver.

Bret
-- 
View this message in context: 
http://old.nabble.com/ASCII-to-unicode-table-tp3031p30335092.html
Sent from the FreeDOS - Dev mailing list archive at Nabble.com.




Re: [Freedos-devel] ASCII to unicode table

2010-11-27 Thread Steve Nickolas
On Sat, 27 Nov 2010, Eric Auer wrote:

> You could even have a separately loaded CON driver that
> keeps a full unicode font in XMS (with some caching of
> recently used sections in faster memory maybe?).

That would be something like what DOS/V does.  It switches to VGA 640x480 
mode and emulates the standard 80x25 console (with a larger 8x19/16x19 
font), with the font stored in XMS.  It would be quite difficult, I 
suppose, to implement, and all the TUI software would break unless it was 
coded to expect the possibility of a console being run in Unicode mode, so 
it would be necessary to be able to turn it on and off at will (again, 
like DOS/V).

-uso.



Re: [Freedos-devel] ASCII to unicode table

2010-11-27 Thread Christian Masloch
> Now programs do it themselves by looking into their own datafiles with a .TBL  
> extension. Look at DOSLFN or Volkov commander 4.99. They have a few files  
> like cp852uni.tbl, cp866uni.tbl and so on.
> It is a very good solution but the problem is that there is currently no way to  
> determine which file should be used.

At least DOSLFN queries DOS for the currently used codepage and tries to  
load that table. This query is in its Int21 handler so it will catch  
codepage changes and try to load the new table then.

> It fully relies on manual configuration.

No.

> Another point is that ASCII-unicode conversion should somewhat be  
> handled by the OS, I think. I think it is not smart if every unicode program  
> has its own TBL library. There should be one somewhere in the FreeDOS directories.

Yes.

> So how to solve it?
> * let the user call the function for international info, and by the returned  
> codepage manually decide which .TBL file to use?

As currently done by DOSLFN.

> * .TBL files should be in LANG or NLSPATH environment variable?

A centralized location might be useful. It might also be possible to  
create a file format where several tables share one file. I think such a  
format could be a COUNTRY.SYS extension without breaking other users of  
that file.

> * somehow extend the kernel function for international info to say which  
> .TBL files to use?
> * preload .TBL into memory in COUNTRY initialization and even further  
> extend international info to provide ASCII <-> unicode conversion?

Both would be useful. Such a table (if limited to Unicode's BMP, as  
DOSLFN's format currently is) needs 256 bytes plus some info such as which  
codepage the currently loaded table corresponds to.

Regards,
Christian



Re: [Freedos-devel] ASCII to unicode table

2010-11-27 Thread Eric Auer

Hi Ladislav,

> I think we should discuss how to implement unicode.

There already is some interface for double byte chars
in DOS, which we could implement. However, it was made
for Chinese as far as I remember and needed support by
more drivers even if you had a DBCS-enabled DOS version.

> In the fact only one small thing is necesarry: we need a mechanism
> for translating unicode chars into ASCII chars and vice versa.

Technically speaking, that translation is "chars 0-7f of
unicode are ASCII, the rest are not". What you probably
mean is "for any _other_ unicode char, if a char 80-ff of
the current font codepage looks similar enough, display
that char"... Which is very limited, given that unicode
has thousands of chars while any codepage only gives you
at most 128 non-ASCII chars.

You could also support unicode for strings which are
relevant to DOS... That would probably mean that you
allow UTF8 in filenames. You could even use it without
changing the kernel, if it is okay for you that search
wildcards match a byte and not necessarily a character.

The rest would depend on the ability of your CON driver
to show UTF8 properly as far as the current font allows.

You could even have a separately loaded CON driver that
keeps a full unicode font in XMS (with some caching of
recently used sections in faster memory maybe?).

Note that many programs do not use CON, in particular
if they want to have user interfaces with fancy layout.
For example text editors do not normally use CON in DOS
but you could have one which uses CON and needs NANSI.
Actually you would want an UTF8-enabled super NANSI :-)

> Now programs do it themselves by looking into their own datafiles with a .TBL
> extension. Look at DOSLFN or Volkov commander 4.99. They have a few
> files like cp852uni.tbl, cp866uni.tbl and so on.

As said above, that only allows you to display very few
unicode chars - those which happen to be supported by
your current codepage font. Still useful, of course. Be
aware that UTF8 or unicode in general needs more bytes
per character, so outside the LFN world, file names can
reach their limit at less than 8+3 chars. But then, it
is easy to load DOSLFN.

> It is a very good solution but the problem is that there is currently no
> way to determine which file should be used.

There is. DISPLAY has an interface to query the codepage.

> It fully relies on manual configuration.

See 2 lines above this.

> Another point is that ASCII-unicode conversion should somewhat be
> handled by the OS, I think. I think it is not smart if every unicode
> program has its own TBL library. There should be one somewhere in the
> FreeDOS directories.

See above - but you could have some translation service.
You could even have that UTF8 super NANSI described above,
but your software then needs to understand the PRINCIPLE of
UTF8. In other words, it has to understand in which way
a sequence of two or more bytes can still mean only one
character, which can be important for layout and search.
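That principle - several bytes, one character - can be captured in a few lines: UTF8 continuation bytes always have the bit pattern 10xxxxxx, so counting characters means counting every byte that is not a continuation byte (a sketch, not from NANSI or any existing driver):

```c
#include <stddef.h>

/* Count code-points in a NUL-terminated UTF-8 string by skipping
   continuation bytes (10xxxxxx), so each multi-byte sequence
   counts as one character - the awareness layout/search code needs. */
static size_t utf8_count_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```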

> So how to solve it?

> * let the user call the function for international info, and by the returned
> codepage manually decide which .TBL file to use?

Such functions are available, yes.

> * .TBL files should be in LANG or NLSPATH environment variable?

Probably better to have a new variable for those, if any.

> * somehow extend the kernel function for international info to say
> which .TBL files to use?

I would not put that in the kernel. Better in a driver.

> * preload .TBL into memory in COUNTRY initialization and even further
> extend international info to provide ASCII <-> unicode conversion?

As above, if anything, this should be handled by a driver.

Regards, Eric





[Freedos-devel] ASCII to unicode table

2010-11-27 Thread Ladislav Lacina
I think we should discuss how to implement unicode.

In fact only one small thing is necessary: we need a mechanism for 
translating unicode chars into ASCII chars and vice versa.
Right now programs do it themselves by looking into their own datafiles with a 
.TBL extension. Look at DOSLFN or Volkov commander 4.99. They have a few files 
like cp852uni.tbl, cp866uni.tbl and so on.
It is a very good solution, but the problem is that there is currently no way to 
determine which file should be used.
It fully relies on manual configuration.
Another point is that ASCII-unicode conversion should somewhat be handled by the 
OS, I think. I think it is not smart if every unicode program has its own TBL 
library. There should be one somewhere in the FreeDOS directories.

So how to solve it?
* let the user call the function for international info, and by the returned 
codepage manually decide which .TBL file to use?
* .TBL files should be in a LANG or NLSPATH environment variable?
* somehow extend the kernel function for international info to say which .TBL 
files to use?
* preload .TBL into memory in COUNTRY initialization and even further extend 
international info to provide ASCII <-> unicode conversion?
