Re: [Freedos-devel] UTF-8 input and output

2019-06-01 Thread Steve Nickolas

On Sat, 1 Jun 2019, TK Chia wrote:


> Hello Steve Nickolas,
>
>>> May I know what particular method(s) and data you use to do the
>>> conversions?  I was thinking that something like POSIX iconv( ) will
>>> come in useful, but Open Watcom does not seem to have such a function.
>>
>> I had some function that went through a byte at a time.  I think it came
>> from an old version of
>> https://github.com/benkasminbullock/unicode-c/blob/master/unicode.c
>> I had it output to words, and then translated the words to bytes using a
>> lookup table.
>
> Thanks!  It seems to me that the main difficulty --- in terms of keeping
> things fast _and_ small on an IBM PC --- will be in translating from
> UCS-2 to the native codepage.
>
> Thank you!
>
> --
> https://github.com/tkchia


___
Freedos-devel mailing list
Freedos-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freedos-devel



Here are the UCS-2 translation tables I did:

static unsigned int cp437[256]={
 0x2007, 0x263A, 0x263B, 0x2665, 0x2666, 0x2663, 0x2660, 0x0007,
 0x25D8, 0x25CB, 0x000A, 0x2642, 0x2640, 0x000D, 0x266B, 0x263C,
 0x25BA, 0x25C4, 0x2195, 0x203C, 0x00B6, 0x00A7, 0x25AC, 0x21A8,
 0x2191, 0x2193, 0x2192, 0x2190, 0x221F, 0x2194, 0x25B2, 0x25BC,
 0x0020, 0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027,
 0x0028, 0x0029, 0x002A, 0x002B, 0x002C, 0x002D, 0x002E, 0x002F,
 0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037,
 0x0038, 0x0039, 0x003A, 0x003B, 0x003C, 0x003D, 0x003E, 0x003F,
 0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045, 0x0046, 0x0047,
 0x0048, 0x0049, 0x004A, 0x004B, 0x004C, 0x004D, 0x004E, 0x004F,
 0x0050, 0x0051, 0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057,
 0x0058, 0x0059, 0x005A, 0x005B, 0x005C, 0x005D, 0x005E, 0x005F,
 0x0060, 0x0061, 0x0062, 0x0063, 0x0064, 0x0065, 0x0066, 0x0067,
 0x0068, 0x0069, 0x006A, 0x006B, 0x006C, 0x006D, 0x006E, 0x006F,
 0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075, 0x0076, 0x0077,
 0x0078, 0x0079, 0x007A, 0x007B, 0x007C, 0x007D, 0x007E, 0x2302,
 0x00C7, 0x00FC, 0x00E9, 0x00E2, 0x00E4, 0x00E0, 0x00E5, 0x00E7,
 0x00EA, 0x00EB, 0x00E8, 0x00EF, 0x00EE, 0x00EC, 0x00C4, 0x00C5,
 0x00C9, 0x00E6, 0x00C6, 0x00F4, 0x00F6, 0x00F2, 0x00FB, 0x00F9,
 0x00FF, 0x00D6, 0x00DC, 0x00A2, 0x00A3, 0x00A5, 0x20A7, 0x0192,
 0x00E1, 0x00ED, 0x00F3, 0x00FA, 0x00F1, 0x00D1, 0x00AA, 0x00BA,
 0x00BF, 0x2310, 0x00AC, 0x00BD, 0x00BC, 0x00A1, 0x00AB, 0x00BB,
 0x2591, 0x2592, 0x2593, 0x2502, 0x2524, 0x2561, 0x2562, 0x2556,
 0x2555, 0x2563, 0x2551, 0x2557, 0x255D, 0x255C, 0x255B, 0x2510,
 0x2514, 0x2534, 0x252C, 0x251C, 0x2500, 0x253C, 0x255E, 0x255F,
 0x255A, 0x2554, 0x2569, 0x2566, 0x2560, 0x2550, 0x256C, 0x2567,
 0x2568, 0x2564, 0x2565, 0x2559, 0x2558, 0x2552, 0x2553, 0x256B,
 0x256A, 0x2518, 0x250C, 0x2588, 0x2584, 0x258C, 0x2590, 0x2580,
 0x03B1, 0x00DF, 0x0393, 0x03C0, 0x03A3, 0x03C3, 0x00B5, 0x03C4,
 0x03A6, 0x0398, 0x03A9, 0x03B4, 0x221E, 0x03C6, 0x03B5, 0x2229,
 0x2261, 0x00B1, 0x2265, 0x2264, 0x2320, 0x2321, 0x00F7, 0x2248,
 0x00B0, 0x2219, 0x00B7, 0x221A, 0x207F, 0x00B2, 0x25A0, 0x00A0
};
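
For the output direction (a codepage byte looked up in a table like the one above, then written out as UTF-8), a minimal sketch of the encoding step might look like the following. The function name ucs2_to_utf8 is hypothetical and not from the posted code; it assumes the input is a plain UCS-2 value, so it never needs more than three bytes.

```c
/* Encode one UCS-2 code unit as UTF-8 (illustrative sketch).
 * Writes 1 to 3 bytes into out and returns the number of bytes written.
 * Surrogate values (0xD800-0xDFFF) are not characters in UCS-2, so this
 * sketch writes nothing and returns 0 for them. */
static int ucs2_to_utf8(unsigned short wc, unsigned char *out)
{
    if (wc < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)wc;
        return 1;
    }
    if (wc < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (wc >> 6));
        out[1] = (unsigned char)(0x80 | (wc & 0x3F));
        return 2;
    }
    if (wc >= 0xD800 && wc <= 0xDFFF)     /* surrogate: reject */
        return 0;
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (wc >> 12));
    out[1] = (unsigned char)(0x80 | ((wc >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (wc & 0x3F));
    return 3;
}
```

With the table above, cp437[0x82] is 0x00E9, which this encodes as the two bytes C3 A9, and cp437[0xB3] is 0x2502, which becomes E2 94 82.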

/* 850 and 852 translation tables are official from 0x80-0xFF */
static unsigned int cp850[256]={
 0x2007, 0x263A, 0x263B, 0x2665, 0x2666, 0x2663, 0x2660, 0x0007,
 0x25D8, 0x25CB, 0x000A, 0x2642, 0x2640, 0x000D, 0x266B, 0x263C,
 0x25BA, 0x25C4, 0x2195, 0x203C, 0x00B6, 0x00A7, 0x25AC, 0x21A8,
 0x2191, 0x2193, 0x2192, 0x2190, 0x221F, 0x2194, 0x25B2, 0x25BC,
 0x0020, 0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027,
 0x0028, 0x0029, 0x002A, 0x002B, 0x002C, 0x002D, 0x002E, 0x002F,
 0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037,
 0x0038, 0x0039, 0x003A, 0x003B, 0x003C, 0x003D, 0x003E, 0x003F,
 0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045, 0x0046, 0x0047,
 0x0048, 0x0049, 0x004A, 0x004B, 0x004C, 0x004D, 0x004E, 0x004F,
 0x0050, 0x0051, 0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057,
 0x0058, 0x0059, 0x005A, 0x005B, 0x005C, 0x005D, 0x005E, 0x005F,
 0x0060, 0x0061, 0x0062, 0x0063, 0x0064, 0x0065, 0x0066, 0x0067,
 0x0068, 0x0069, 0x006A, 0x006B, 0x006C, 0x006D, 0x006E, 0x006F,
 0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075, 0x0076, 0x0077,
 0x0078, 0x0079, 0x007A, 0x007B, 0x007C, 0x007D, 0x007E, 0x2302,
 0x00c7, 0x00fc, 0x00e9, 0x00e2, 0x00e4, 0x00e0, 0x00e5, 0x00e7,
 0x00ea, 0x00eb, 0x00e8, 0x00ef, 0x00ee, 0x00ec, 0x00c4, 0x00c5,
 0x00c9, 0x00e6, 0x00c6, 0x00f4, 0x00f6, 0x00f2, 0x00fb, 0x00f9,
 0x00ff, 0x00d6, 0x00dc, 0x00f8, 0x00a3, 0x00d8, 0x00d7, 0x0192,
 0x00e1, 0x00ed, 0x00f3, 0x00fa, 0x00f1, 0x00d1, 0x00aa, 0x00ba,
 0x00bf, 0x00ae, 0x00ac, 0x00bd, 0x00bc, 0x00a1, 0x00ab, 0x00bb,
 0x2591, 0x2592, 0x2593, 0x2502, 0x2524, 0x00c1, 0x00c2, 0x00c0,
 0x00a9, 0x2563, 0x2551, 0x2557, 0x255d, 0x00a2, 0x00a5, 0x2510,
 0x2514, 0x2534, 0x252c, 0x251c, 0x2500, 0x253c, 0x00e3, 0x00c3,
 0x255a, 0x2554, 0x2569, 0x2566, 0x2560, 0x2550, 0x256c, 0x00a4,
 0x00f0, 0x00d0, 0x00ca, 0x00cb, 0x00c8, 0x0131, 0x00cd, 0x00ce,
 0x00cf, 0x2518, 0x250c, 0x2588, 0x2584, 0x00a6, 0x00cc, 0x2580,
 0x00d3, 0x0

Re: [Freedos-devel] UTF-8 input and output

2019-05-31 Thread TK Chia

Hello Steve Nickolas,


>> May I know what particular method(s) and data you use to do the
>> conversions?  I was thinking that something like POSIX iconv( ) will
>> come in useful, but Open Watcom does not seem to have such a function.
>
> I had some function that went through a byte at a time.  I think it came
> from an old version of
> https://github.com/benkasminbullock/unicode-c/blob/master/unicode.c
> I had it output to words, and then translated the words to bytes using a
> lookup table.


Thanks!  It seems to me that the main difficulty --- in terms of keeping
things fast _and_ small on an IBM PC --- will be in translating from
UCS-2 to the native codepage.
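
One way to keep the UCS-2 to codepage direction both small and reasonably fast is sketched below. It is only an illustration under stated assumptions, not anyone's actual implementation in this thread: code points below 0x80 pass straight through, and everything else is found by binary search over a 128-byte index sorted by code point. The names cp437_hi, order, cp437_init and ucs2_to_cp437 are hypothetical; the table values follow the standard CP437 mapping for 0x80-0xFF.

```c
#include <stdlib.h>

/* Upper half (bytes 0x80-0xFF) of the CP437 -> UCS-2 mapping. */
static const unsigned short cp437_hi[128] = {
    0x00C7, 0x00FC, 0x00E9, 0x00E2, 0x00E4, 0x00E0, 0x00E5, 0x00E7,
    0x00EA, 0x00EB, 0x00E8, 0x00EF, 0x00EE, 0x00EC, 0x00C4, 0x00C5,
    0x00C9, 0x00E6, 0x00C6, 0x00F4, 0x00F6, 0x00F2, 0x00FB, 0x00F9,
    0x00FF, 0x00D6, 0x00DC, 0x00A2, 0x00A3, 0x00A5, 0x20A7, 0x0192,
    0x00E1, 0x00ED, 0x00F3, 0x00FA, 0x00F1, 0x00D1, 0x00AA, 0x00BA,
    0x00BF, 0x2310, 0x00AC, 0x00BD, 0x00BC, 0x00A1, 0x00AB, 0x00BB,
    0x2591, 0x2592, 0x2593, 0x2502, 0x2524, 0x2561, 0x2562, 0x2556,
    0x2555, 0x2563, 0x2551, 0x2557, 0x255D, 0x255C, 0x255B, 0x2510,
    0x2514, 0x2534, 0x252C, 0x251C, 0x2500, 0x253C, 0x255E, 0x255F,
    0x255A, 0x2554, 0x2569, 0x2566, 0x2560, 0x2550, 0x256C, 0x2567,
    0x2568, 0x2564, 0x2565, 0x2559, 0x2558, 0x2552, 0x2553, 0x256B,
    0x256A, 0x2518, 0x250C, 0x2588, 0x2584, 0x258C, 0x2590, 0x2580,
    0x03B1, 0x00DF, 0x0393, 0x03C0, 0x03A3, 0x03C3, 0x00B5, 0x03C4,
    0x03A6, 0x0398, 0x03A9, 0x03B4, 0x221E, 0x03C6, 0x03B5, 0x2229,
    0x2261, 0x00B1, 0x2265, 0x2264, 0x2320, 0x2321, 0x00F7, 0x2248,
    0x00B0, 0x2219, 0x00B7, 0x221A, 0x207F, 0x00B2, 0x25A0, 0x00A0
};

/* Indices into cp437_hi, sorted by the code point they refer to. */
static unsigned char order[128];

static int cmp_order(const void *a, const void *b)
{
    unsigned short x = cp437_hi[*(const unsigned char *)a];
    unsigned short y = cp437_hi[*(const unsigned char *)b];
    return (x > y) - (x < y);
}

/* Build the sorted index once at startup. */
static void cp437_init(void)
{
    int i;
    for (i = 0; i < 128; i++)
        order[i] = (unsigned char)i;
    qsort(order, 128, sizeof order[0], cmp_order);
}

/* Return the CP437 byte for wc, or fallback if CP437 has no such
 * character. Values below 0x80 are passed through unchanged; the glyphs
 * CP437 shows for bytes 0x00-0x1F and 0x7F are not handled here. */
static unsigned char ucs2_to_cp437(unsigned short wc, unsigned char fallback)
{
    int lo = 0, hi = 127;
    if (wc < 0x80)
        return (unsigned char)wc;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        unsigned short v = cp437_hi[order[mid]];
        if (v == wc)
            return (unsigned char)(0x80 + order[mid]);
        if (v < wc)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return fallback;
}
```

After calling cp437_init(), ucs2_to_cp437(0x00E9, '?') gives 0x82, while a Cyrillic letter such as 0x0416 gives '?'. Each lookup costs at most seven comparisons, and the extra data is 128 bytes on top of the forward table. A pre-sorted table shipped in the binary would avoid the startup sort.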

Thank you!

--
https://github.com/tkchia




Re: [Freedos-devel] UTF-8 input and output

2019-05-27 Thread Steve Nickolas

On Tue, 28 May 2019, TK Chia wrote:


> Hello Steve Nickolas,
>
>> My IRC client (which runs fine on a 386/16 at least) internally
>> translates from UTF-8 -> UCS-2 -> native codepage (usually CP437).
>
> May I know what particular method(s) and data you use to do the
> conversions?  I was thinking that something like POSIX iconv( ) will
> come in useful, but Open Watcom does not seem to have such a function.


I had some function that went through a byte at a time.  I think it came 
from an old version of 
https://github.com/benkasminbullock/unicode-c/blob/master/unicode.c


I had it output to words, and then translated the words to bytes using a 
lookup table.
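
A byte-at-a-time decoder of the kind described here could be sketched as follows. This is not the unicode-c code and not the code from the IRC client; the names utf8_state, utf8_feed, UTF8_MORE and UCS2_REPLACE are made up for illustration. It assumes only UCS-2 output is wanted, so 4-byte sequences are reported as U+FFFD, and it does not reject overlong forms or encoded surrogates.

```c
/* Incremental UTF-8 decoder state. Zero-initialise before first use. */
struct utf8_state {
    unsigned short acc;   /* code point bits collected so far */
    unsigned char need;   /* continuation bytes still expected */
};

#define UTF8_MORE    (-1)     /* byte consumed, character not complete */
#define UCS2_REPLACE 0xFFFD   /* U+FFFD REPLACEMENT CHARACTER */

/* Feed one input byte. Returns a UCS-2 value (0..0xFFFF) when a
 * character is complete, or UTF8_MORE if more bytes are needed.
 * Malformed input yields UCS2_REPLACE. The return type is long so the
 * -1 sentinel and 0xFFFD both fit where int is 16 bits wide. */
static long utf8_feed(struct utf8_state *st, unsigned char b)
{
    if (st->need) {
        if ((b & 0xC0) != 0x80) {   /* expected 10xxxxxx */
            st->need = 0;           /* resync; the offending byte is dropped */
            return UCS2_REPLACE;
        }
        st->acc = (unsigned short)((st->acc << 6) | (b & 0x3F));
        if (--st->need)
            return UTF8_MORE;
        return st->acc;
    }
    if (b < 0x80)                   /* plain ASCII */
        return b;
    if ((b & 0xE0) == 0xC0) {       /* lead byte of a 2-byte sequence */
        st->acc = (unsigned short)(b & 0x1F);
        st->need = 1;
        return UTF8_MORE;
    }
    if ((b & 0xF0) == 0xE0) {       /* lead byte of a 3-byte sequence */
        st->acc = (unsigned short)(b & 0x0F);
        st->need = 2;
        return UTF8_MORE;
    }
    /* stray continuation byte, 4-byte lead, or invalid byte */
    return UCS2_REPLACE;
}
```

Feeding the bytes C3 A9 returns UTF8_MORE and then 0x00E9; the resulting word can then go through a codepage lookup table. Keeping the state in two small fields means the decoder can sit directly in a character-output routine without buffering the input.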


-uso.




Re: [Freedos-devel] UTF-8 input and output

2019-05-27 Thread Eric Auer

Hi David,

the DOS way of supporting charsets with more than 256 different
characters was called DBCS and used only in Asian / CJK countries:

https://en.wikipedia.org/wiki/DBCS

This is not exactly UTF-8. Normally, DOS users configure their
system to use one (or switch between a few) 256 character code
pages (character sets). Of course software for DOS is free to
process more complex data internally, but DOS does not, generally
speaking.

You could make your software interpret UTF-8 internally and
convert to and from the currently active codepage using some
kind of Unicode look up tables. Displaying 1000s of different
characters simultaneously only works with graphics anyway,
but you can at least preserve UTF-8 and display the parts
which can be displayed in the current charset.

It would be interesting to test and implement various DBCS
things with FreeDOS: On top of that, you could implement I/O
libraries which translate between 16-bit DBCS and the parts
of Unicode which can be represented as UTF-16, for example,
with additional support for converting UTF-8 strings to and
from UTF-16 or DBCS.

For DOS as operating system kernel itself, Unicode basically
does not exist, but all applications for DOS are free to
interpret raw data under the assumption of UTF-8, UTF-16 or
other Unicode encodings. For special cases such as long file
name support, the corresponding drivers already have special
approaches for handling a few Unicode characters beyond ASCII.

As said, for display, you will often need graphics anyway and
that will often mean that it only works within your application.

Regards, Eric

> I'm coordinating a bunch of updates to Frotz[1], including the DOS port.
> One of the big enhancements is UTF-8 support for input and output.

> This would allow effortless support for accented characters and alternate
> alphabets.  We've tested games written for Spanish (diacritical marks)
> and Russian (Cyrillic alphabet).

> So far, I've found absolutely nothing on doing UTF-8 IO on DOS.  Is this
> something that can be done without bogging down an original IBM PC?  How
> would I go about doing it?

> [1] Z-machine interpreter for Infocom games and others.
> See https://gitlab.com/DavidGriffith/frotz/





Re: [Freedos-devel] UTF-8 input and output

2019-05-27 Thread TK Chia

Hello Steve Nickolas,


> My IRC client (which runs fine on a 386/16 at least) internally
> translates from UTF-8 -> UCS-2 -> native codepage (usually CP437).


May I know what particular method(s) and data you use to do the
conversions?  I was thinking that something like POSIX iconv( ) will
come in useful, but Open Watcom does not seem to have such a function.

Thank you!

--
https://github.com/tkchia




Re: [Freedos-devel] UTF-8 input and output

2019-05-27 Thread Steve Nickolas

On Mon, 27 May 2019, David Griffith wrote:



> I'm coordinating a bunch of updates to Frotz[1], including the DOS port. One
> of the big enhancements is UTF-8 support for input and output.  This would
> allow effortless support for accented characters and alternate alphabets.
> We've tested games written for Spanish (diacritical marks) and Russian
> (Cyrillic alphabet).
>
> So far, I've found absolutely nothing on doing UTF-8 IO on DOS.  Is this
> something that can be done without bogging down an original IBM PC?  How
> would I go about doing it?
>
> [1] Z-machine interpreter for Infocom games and others.  See
> https://gitlab.com/DavidGriffith/frotz/





My IRC client (which runs fine on a 386/16 at least) internally translates 
from UTF-8 -> UCS-2 -> native codepage (usually CP437).


-uso.




Re: [Freedos-devel] UTF-8 input and output

2019-05-27 Thread Pär Moberg
On Mon, 27 May 2019 at 09:38, David Griffith wrote:

>
> I'm coordinating a bunch of updates to Frotz[1], including the DOS port.
> One of the big enhancements is UTF-8 support for input and output.  This
> would allow effortless support for accented characters and alternate
> alphabets.  We've tested games written for Spanish (diacritical marks) and
> Russian (Cyrillic alphabet).
>
> So far, I've found absolutely nothing on doing UTF-8 IO on DOS.  Is this
> something that can be done without bogging down an original IBM PC?  How
> would I go about doing it?
>
>
> [1] Z-machine interpreter for Infocom games and others.  See
> https://gitlab.com/DavidGriffith/frotz/
>
> --
> David Griffith
> d...@661.org
>
> A: Because it fouls the order in which people normally read text.
> Q: Why is top-posting such a bad thing?
> A: Top-posting.
> Q: What is the most annoying thing in e-mail?


Because it intrigued me, here is a timeline:
MS-DOS 5 was released in 1991
UTF-8 was defined in 1993
MS-DOS 6.22 was released in 1994
Around 2002, Google encountered the first Web pages encoded in UTF-8.

All this according to Wikipedia.


[Freedos-devel] UTF-8 input and output

2019-05-27 Thread David Griffith



I'm coordinating a bunch of updates to Frotz[1], including the DOS port. 
One of the big enhancements is UTF-8 support for input and output.  This 
would allow effortless support for accented characters and alternate 
alphabets.  We've tested games written for Spanish (diacritical marks) and 
Russian (Cyrillic alphabet).


So far, I've found absolutely nothing on doing UTF-8 IO on DOS.  Is this 
something that can be done without bogging down an original IBM PC?  How 
would I go about doing it?



[1] Z-machine interpreter for Infocom games and others.  See 
https://gitlab.com/DavidGriffith/frotz/


--
David Griffith
d...@661.org

A: Because it fouls the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?

