Re: [Freedos-devel] UTF-8 input and output
On Sat, 1 Jun 2019, TK Chia wrote:

> Hello Steve Nickolas,
>
>>> May I know what particular method(s) and data you use to do the
>>> conversions? I was thinking that something like POSIX iconv() will
>>> come in useful, but Open Watcom does not seem to have such a function.
>>
>> I had some function that went through a byte at a time. I think it
>> came from an old version of
>> https://github.com/benkasminbullock/unicode-c/blob/master/unicode.c
>>
>> I had it output to words, and then translated the words to bytes using
>> a lookup table.
>
> Thanks! It seems to me that the main difficulty --- in terms of keeping
> things fast _and_ small on an IBM PC --- will be in translating from
> UCS-2 to the native codepage.
>
> Thank you!
>
> --
> https://github.com/tkchia
>
> ___
> Freedos-devel mailing list
> Freedos-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/freedos-devel

Here are the UCS-2 translation tables I did:

static unsigned int cp437[256]={
    0x2007, 0x263A, 0x263B, 0x2665, 0x2666, 0x2663, 0x2660, 0x0007,
    0x25D8, 0x25CB, 0x000A, 0x2642, 0x2640, 0x000D, 0x266B, 0x263C,
    0x25BA, 0x25C4, 0x2195, 0x203C, 0x00B6, 0x00A7, 0x25AC, 0x21A8,
    0x2191, 0x2193, 0x2192, 0x2190, 0x221F, 0x2194, 0x25B2, 0x25BC,
    0x0020, 0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027,
    0x0028, 0x0029, 0x002A, 0x002B, 0x002C, 0x002D, 0x002E, 0x002F,
    0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037,
    0x0038, 0x0039, 0x003A, 0x003B, 0x003C, 0x003D, 0x003E, 0x003F,
    0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045, 0x0046, 0x0047,
    0x0048, 0x0049, 0x004A, 0x004B, 0x004C, 0x004D, 0x004E, 0x004F,
    0x0050, 0x0051, 0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057,
    0x0058, 0x0059, 0x005A, 0x005B, 0x005C, 0x005D, 0x005E, 0x005F,
    0x0060, 0x0061, 0x0062, 0x0063, 0x0064, 0x0065, 0x0066, 0x0067,
    0x0068, 0x0069, 0x006A, 0x006B, 0x006C, 0x006D, 0x006E, 0x006F,
    0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075, 0x0076, 0x0077,
    0x0078, 0x0079, 0x007A, 0x007B, 0x007C, 0x007D, 0x007E, 0x2302,
    0x00C7, 0x00FC, 0x00E9, 0x00E2, 0x00E4, 0x00E0, 0x00E5, 0x00E7,
    0x00EA, 0x00EB, 0x00E8, 0x00EF, 0x00EE, 0x00EC, 0x00C4, 0x00C5,
    0x00C9, 0x00E6, 0x00C6, 0x00F4, 0x00F6, 0x00F2, 0x00FB, 0x00F9,
    0x00FF, 0x00D6, 0x00DC, 0x00A2, 0x00A3, 0x00A5, 0x20A7, 0x0192,
    0x00E1, 0x00ED, 0x00F3, 0x00FA, 0x00F1, 0x00D1, 0x00AA, 0x00BA,
    0x00BF, 0x2310, 0x00AC, 0x00BD, 0x00BC, 0x00A1, 0x00AB, 0x00BB,
    0x2591, 0x2592, 0x2593, 0x2502, 0x2524, 0x2561, 0x2562, 0x2556,
    0x2555, 0x2563, 0x2551, 0x2557, 0x255D, 0x255C, 0x255B, 0x2510,
    0x2514, 0x2534, 0x252C, 0x251C, 0x2500, 0x253C, 0x255E, 0x255F,
    0x255A, 0x2554, 0x2569, 0x2566, 0x2560, 0x2550, 0x256C, 0x2567,
    0x2568, 0x2564, 0x2565, 0x2559, 0x2558, 0x2552, 0x2553, 0x256B,
    0x256A, 0x2518, 0x250C, 0x2588, 0x2584, 0x258C, 0x2590, 0x2580,
    0x03B1, 0x00DF, 0x0393, 0x03C0, 0x03A3, 0x03C3, 0x00B5, 0x03C4, /* 0xE6 is the micro sign */
    0x03A6, 0x0398, 0x03A9, 0x03B4, 0x221E, 0x03C6, 0x03B5, 0x2229, /* 0xEB is delta, 0xEE is epsilon */
    0x2261, 0x00B1, 0x2265, 0x2264, 0x2320, 0x2321, 0x00F7, 0x2248,
    0x00B0, 0x2219, 0x00B7, 0x221A, 0x207F, 0x00B2, 0x25A0, 0x00A0
};

/* 850 and 852 translation tables are official from 0x80-0xFF */
static unsigned int cp850[256]={
    0x2007, 0x263A, 0x263B, 0x2665, 0x2666, 0x2663, 0x2660, 0x0007,
    0x25D8, 0x25CB, 0x000A, 0x2642, 0x2640, 0x000D, 0x266B, 0x263C,
    0x25BA, 0x25C4, 0x2195, 0x203C, 0x00B6, 0x00A7, 0x25AC, 0x21A8,
    0x2191, 0x2193, 0x2192, 0x2190, 0x221F, 0x2194, 0x25B2, 0x25BC,
    0x0020, 0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027,
    0x0028, 0x0029, 0x002A, 0x002B, 0x002C, 0x002D, 0x002E, 0x002F,
    0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037,
    0x0038, 0x0039, 0x003A, 0x003B, 0x003C, 0x003D, 0x003E, 0x003F,
    0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045, 0x0046, 0x0047,
    0x0048, 0x0049, 0x004A, 0x004B, 0x004C, 0x004D, 0x004E, 0x004F,
    0x0050, 0x0051, 0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057,
    0x0058, 0x0059, 0x005A, 0x005B, 0x005C, 0x005D, 0x005E, 0x005F,
    0x0060, 0x0061, 0x0062, 0x0063, 0x0064, 0x0065, 0x0066, 0x0067,
    0x0068, 0x0069, 0x006A, 0x006B, 0x006C, 0x006D, 0x006E, 0x006F,
    0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075, 0x0076, 0x0077,
    0x0078, 0x0079, 0x007A, 0x007B, 0x007C, 0x007D, 0x007E, 0x2302,
    0x00c7, 0x00fc, 0x00e9, 0x00e2, 0x00e4, 0x00e0, 0x00e5, 0x00e7,
    0x00ea, 0x00eb, 0x00e8, 0x00ef, 0x00ee, 0x00ec, 0x00c4, 0x00c5,
    0x00c9, 0x00e6, 0x00c6, 0x00f4, 0x00f6, 0x00f2, 0x00fb, 0x00f9,
    0x00ff, 0x00d6, 0x00dc, 0x00f8, 0x00a3, 0x00d8, 0x00d7, 0x0192,
    0x00e1, 0x00ed, 0x00f3, 0x00fa, 0x00f1, 0x00d1, 0x00aa, 0x00ba,
    0x00bf, 0x00ae, 0x00ac, 0x00bd, 0x00bc, 0x00a1, 0x00ab, 0x00bb,
    0x2591, 0x2592, 0x2593, 0x2502, 0x2524, 0x00c1, 0x00c2, 0x00c0,
    0x00a9, 0x2563, 0x2551, 0x2557, 0x255d, 0x00a2, 0x00a5, 0x2510,
    0x2514, 0x2534, 0x252c, 0x251c, 0x2500, 0x253c, 0x00e3, 0x00c3,
    0x255a, 0x2554, 0x2569, 0x2566, 0x2560, 0x2550, 0x256c, 0x00a4,
    0x00f0, 0x00d0, 0x00ca, 0x00cb, 0x00c8, 0x0131, 0x00cd, 0x00ce,
    0x00cf, 0x2518, 0x250c, 0x2588, 0x2584, 0x00a6, 0x00cc, 0x2580,
    0x00d3, 0x0
Re: [Freedos-devel] UTF-8 input and output
Hello Steve Nickolas,

>> May I know what particular method(s) and data you use to do the
>> conversions? I was thinking that something like POSIX iconv() will
>> come in useful, but Open Watcom does not seem to have such a function.
>
> I had some function that went through a byte at a time. I think it
> came from an old version of
> https://github.com/benkasminbullock/unicode-c/blob/master/unicode.c
>
> I had it output to words, and then translated the words to bytes using
> a lookup table.

Thanks! It seems to me that the main difficulty --- in terms of keeping
things fast _and_ small on an IBM PC --- will be in translating from
UCS-2 to the native codepage.

Thank you!

--
https://github.com/tkchia
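One possible way to keep the UCS-2 to codepage direction both small and reasonably fast is a reverse table sorted by Unicode value and searched with a binary search. The following is only a sketch under stated assumptions: the names `rev_entry`, `build_reverse` and `ucs2_to_cp` are hypothetical, the forward table is an 8-entry excerpt (bytes 0x80..0x87 of CP437) rather than the full upper half, and values below 0x7F are simply passed through.

```c
typedef unsigned short ucs2_t;

/* One reverse-mapping entry (hypothetical layout, 3 or 4 bytes each). */
struct rev_entry {
    ucs2_t ucs;          /* Unicode value */
    unsigned char byte;  /* codepage byte that displays it */
};

/* Excerpt only: bytes 0x80..0x87 of a CP437 forward table. */
#define HI_BASE  0x80
#define HI_COUNT 8
static const ucs2_t cp437_hi[HI_COUNT] = {
    0x00C7, 0x00FC, 0x00E9, 0x00E2, 0x00E4, 0x00E0, 0x00E5, 0x00E7
};

static struct rev_entry rev[HI_COUNT];

/* Build rev[] from the forward table, sorted by Unicode value
 * (insertion sort). Meant to run once at startup. */
static void build_reverse(void)
{
    int i, j;

    for (i = 0; i < HI_COUNT; i++) {
        struct rev_entry e;
        e.ucs = cp437_hi[i];
        e.byte = (unsigned char)(HI_BASE + i);
        for (j = i; j > 0 && rev[j - 1].ucs > e.ucs; j--)
            rev[j] = rev[j - 1];
        rev[j] = e;
    }
}

/* Map one UCS-2 word to a codepage byte, or return fallback when the
 * character has no glyph in the table. */
static unsigned char ucs2_to_cp(ucs2_t w, unsigned char fallback)
{
    int lo = 0, hi = HI_COUNT - 1;

    if (w < 0x7F)
        return (unsigned char)w;   /* assumption: ASCII passes through */

    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (rev[mid].ucs == w)
            return rev[mid].byte;
        if (rev[mid].ucs < w)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return fallback;
}
```

With the full 128-entry upper half, a lookup takes at most 8 comparisons and the reverse table costs a few hundred bytes per codepage. A linear scan over the forward table would need no extra memory at all, at the price of up to 128 comparisons per non-ASCII character.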
Re: [Freedos-devel] UTF-8 input and output
On Tue, 28 May 2019, TK Chia wrote:

> Hello Steve Nickolas,
>
>> My IRC client (which runs fine on a 386/16 at least) internally
>> translates from UTF-8 -> UCS-2 -> native codepage (usually CP437).
>
> May I know what particular method(s) and data you use to do the
> conversions? I was thinking that something like POSIX iconv() will
> come in useful, but Open Watcom does not seem to have such a function.

I had some function that went through a byte at a time. I think it came
from an old version of
https://github.com/benkasminbullock/unicode-c/blob/master/unicode.c

I had it output to words, and then translated the words to bytes using a
lookup table.

-uso.
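The byte-at-a-time step described above could look roughly like the sketch below. It is not the code from unicode.c or from the IRC client; `utf8_state` and `utf8_feed` are hypothetical names. It assumes that anything outside the 16-bit range and any broken sequence is reported as U+FFFD, and it does not reject overlong encodings.

```c
typedef unsigned short ucs2_t;

/* Decoder state carried between bytes (hypothetical layout). */
struct utf8_state {
    unsigned long cp;   /* code point being assembled; long because
                           int is only 16 bits on 16-bit DOS compilers */
    int remaining;      /* continuation bytes still expected */
};

/* Feed one input byte. Returns 1 and stores a 16-bit word in *out when
 * a character is complete, or 0 when more bytes are needed. */
static int utf8_feed(struct utf8_state *s, unsigned char b, ucs2_t *out)
{
    if (s->remaining > 0) {
        if ((b & 0xC0) == 0x80) {
            s->cp = (s->cp << 6) | (unsigned long)(b & 0x3F);
            if (--s->remaining > 0)
                return 0;
            /* UCS-2 cannot hold code points above U+FFFF. */
            *out = (s->cp > 0xFFFFUL) ? (ucs2_t)0xFFFD : (ucs2_t)s->cp;
            return 1;
        }
        /* Sequence cut short: report U+FFFD. Simplification: the
         * offending byte is dropped rather than re-examined. */
        s->remaining = 0;
        *out = 0xFFFD;
        return 1;
    }

    if (b < 0x80) {                 /* plain ASCII */
        *out = b;
        return 1;
    }
    if ((b & 0xE0) == 0xC0) {       /* lead byte of a 2-byte sequence */
        s->cp = b & 0x1F;
        s->remaining = 1;
        return 0;
    }
    if ((b & 0xF0) == 0xE0) {       /* lead byte of a 3-byte sequence */
        s->cp = b & 0x0F;
        s->remaining = 2;
        return 0;
    }
    if ((b & 0xF8) == 0xF0) {       /* lead byte of a 4-byte sequence */
        s->cp = b & 0x07;
        s->remaining = 3;
        return 0;
    }

    *out = 0xFFFD;                  /* stray continuation or invalid byte */
    return 1;
}
```

A caller would zero-initialise one `utf8_state` per input stream, pass each received byte to `utf8_feed`, and hand every completed word to the codepage lookup. Because the state is only a few bytes and each input byte costs a handful of shifts and masks, this shape should be cheap even on slow machines.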
Re: [Freedos-devel] UTF-8 input and output
Hi David,

the DOS way of supporting charsets with more than 256 different
characters was called DBCS and used only in Asian / CJK countries:

https://en.wikipedia.org/wiki/DBCS

This is not exactly UTF-8. Normally, DOS users configure their system to
use one (or switch between a few) 256-character code pages (character
sets). Of course software for DOS is free to process more complex data
internally, but DOS does not, generally speaking.

You could make your software interpret UTF-8 internally and convert to
and from the currently active codepage using some kind of Unicode
lookup tables. Displaying 1000s of different characters simultaneously
only works with graphics anyway, but you can at least preserve UTF-8 and
display the parts which can be displayed in the current charset.

It would be interesting to test and implement various DBCS things with
FreeDOS: On top of that, you could implement I/O libraries which
translate between 16-bit DBCS and the parts of Unicode which can be
represented as UTF-16, for example, with additional support for
converting UTF-8 strings to and from UTF-16 or DBCS.

For DOS as an operating system kernel itself, Unicode basically does not
exist, but all applications for DOS are free to interpret raw data under
the assumption of UTF-8, UTF-16 or other Unicode encodings. For special
cases such as long file name support, the corresponding drivers already
have special approaches for handling a few Unicode characters beyond
ASCII. As said, for display, you will often need graphics anyway and
that will often mean that it only works within your application.

Regards, Eric

> I'm coordinating a bunch of updates to Frotz[1], including the DOS port.
> One of the big enhancements is UTF-8 support for input and output.
> This would allow effortless support for accented characters and alternate
> alphabets. We've tested games written for Spanish (diacritical marks)
> and Russian (Cyrillic alphabet).
>
> So far, I've found absolutely nothing on doing UTF-8 IO on DOS. Is this
> something that can be done without bogging down an original IBM PC? How
> would I go about doing it?
>
> [1] Z-machine interpreter for Infocom games and others.
> See https://gitlab.com/DavidGriffith/frotz/
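For the "from the codepage" half of the conversion Eric describes (for instance, turning a typed byte in the active codepage into UTF-8 for the interpreter), a forward table lookup followed by a UTF-8 encoder is enough. The sketch below uses hypothetical names (`ucs2_to_utf8`, `cp_byte_to_utf8`) and assumes a 256-entry forward table of 16-bit values indexed by codepage byte. It does not treat the surrogate range 0xD800..0xDFFF specially.

```c
typedef unsigned short ucs2_t;

/* Encode one 16-bit Unicode value as UTF-8. Writes 1 to 3 bytes into
 * out and returns how many were written. */
static int ucs2_to_utf8(ucs2_t w, unsigned char *out)
{
    if (w < 0x80) {
        out[0] = (unsigned char)w;
        return 1;
    }
    if (w < 0x800) {
        out[0] = (unsigned char)(0xC0 | (w >> 6));
        out[1] = (unsigned char)(0x80 | (w & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (w >> 12));
    out[1] = (unsigned char)(0x80 | ((w >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (w & 0x3F));
    return 3;
}

/* Convert one byte in the active codepage to UTF-8. The table argument
 * is an assumed 256-entry forward table (codepage byte -> Unicode). */
static int cp_byte_to_utf8(const ucs2_t *table, unsigned char b,
                           unsigned char *out)
{
    return ucs2_to_utf8(table[b], out);
}
```

This direction is a single array index plus a few shifts, so it should not be a concern on an original IBM PC. The costly direction is the reverse one, from Unicode back to a codepage byte.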
Re: [Freedos-devel] UTF-8 input and output
Hello Steve Nickolas,

> My IRC client (which runs fine on a 386/16 at least) internally
> translates from UTF-8 -> UCS-2 -> native codepage (usually CP437).

May I know what particular method(s) and data you use to do the
conversions? I was thinking that something like POSIX iconv() will come
in useful, but Open Watcom does not seem to have such a function.

Thank you!

--
https://github.com/tkchia
Re: [Freedos-devel] UTF-8 input and output
On Mon, 27 May 2019, David Griffith wrote:

> I'm coordinating a bunch of updates to Frotz[1], including the DOS port.
> One of the big enhancements is UTF-8 support for input and output. This
> would allow effortless support for accented characters and alternate
> alphabets. We've tested games written for Spanish (diacritical marks)
> and Russian (Cyrillic alphabet).
>
> So far, I've found absolutely nothing on doing UTF-8 IO on DOS. Is this
> something that can be done without bogging down an original IBM PC? How
> would I go about doing it?
>
> [1] Z-machine interpreter for Infocom games and others.
> See https://gitlab.com/DavidGriffith/frotz/

My IRC client (which runs fine on a 386/16 at least) internally
translates from UTF-8 -> UCS-2 -> native codepage (usually CP437).

-uso.
Re: [Freedos-devel] UTF-8 input and output
On Mon, 27 May 2019 at 09:38, David Griffith wrote:

> I'm coordinating a bunch of updates to Frotz[1], including the DOS port.
> One of the big enhancements is UTF-8 support for input and output. This
> would allow effortless support for accented characters and alternate
> alphabets. We've tested games written for Spanish (diacritical marks) and
> Russian (Cyrillic alphabet).
>
> So far, I've found absolutely nothing on doing UTF-8 IO on DOS. Is this
> something that can be done without bogging down an original IBM PC? How
> would I go about doing it?
>
> [1] Z-machine interpreter for Infocom games and others. See
> https://gitlab.com/DavidGriffith/frotz/
>
> --
> David Griffith
> d...@661.org
>
> A: Because it fouls the order in which people normally read text.
> Q: Why is top-posting such a bad thing?
> A: Top-posting.
> Q: What is the most annoying thing in e-mail?

Because it intrigued me, here is a timeline:

MS-DOS 5 was released in 1991.
UTF-8 was defined in 1993.
MS-DOS 6.22 was released in 1994.
Around 2002, Google encountered the first web pages encoded in UTF-8.

All this according to Wikipedia.
[Freedos-devel] UTF-8 input and output
I'm coordinating a bunch of updates to Frotz[1], including the DOS port.
One of the big enhancements is UTF-8 support for input and output. This
would allow effortless support for accented characters and alternate
alphabets. We've tested games written for Spanish (diacritical marks)
and Russian (Cyrillic alphabet).

So far, I've found absolutely nothing on doing UTF-8 IO on DOS. Is this
something that can be done without bogging down an original IBM PC? How
would I go about doing it?

[1] Z-machine interpreter for Infocom games and others.
See https://gitlab.com/DavidGriffith/frotz/

--
David Griffith
d...@661.org

A: Because it fouls the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?