Christian Masloch wrote:
>
> I think it should be accurate for file systems. Such a "useful"
> translation is a good concept for displaying output (maybe even that of
> the DIR command) but not for actually working with the file system.
> Keyboard input can't map one key to several characters at once (unless you
> randomly (-; decide which one to use) so input handling should use
> one-to-one translation too.
>

Agreed. This is just further fuel to the fire that both types of translation are needed (depending on the specific application, even if the application is "the kernel"), and that this is not a trivial matter. Unicode is not the panacea it's purported to be.

Christian Masloch wrote:
>
> UTF-8 is independent of byte-order. The exact encoding (and byte-order)
> should always either be implicit (in the interface's or format's
> definition) or be marked in some way.
>

I don't think there is a way to automatically determine the encoding from the data itself, and the only way to determine the byte-order (assuming it's not UTF-8, not a single character, and is unknown from the context) is to include the special BOM (Byte Order Mark) character as the first character of the string. In fact, according to the Unicode spec, if the BOM is not included and the byte-order is not clear from the context, you're supposed to assume big-endian.

For file system and similar applications, the interface could just always assume a specific format (probably either UTF-8 or UTF-16LE). For a general-purpose interface, though, you should be able to handle all different kinds of possibilities (including things like "UTF-24" and "UTF-64"). Also, just because you're dealing with DOS doesn't necessarily mean everything will be little-endian -- it depends on the source of the data. Certain hardware interfaces (like SCSI) are inherently big-endian, and data downloaded from external sources can be almost anything.

Christian Masloch wrote:
>
> The definition of a string's length (possibly number of
> bytes/words/dwords, number of code-points, number of "characters") need
> not be addressed by such an interface. If there is a need for a buffer or
> string length (see below) a new interface should just define that all
> "length" fields/parameters give the length in bytes.
>

Another possibility is what my UNI2ASCI program does, which is to accept strings terminated with a specific character (in my case, the Unicode NUL character, conceptually similar to ASCIIZ). A general-purpose program should provide more than one way to define a string's length. If you limit input to only certain encodings or byte-orders or string/character types, then it ceases to be "general-purpose". Maybe a general-purpose program is not what we're really talking about here, but I think one needs to be developed.

Bret

--
View this message in context: http://old.nabble.com/ASCII-to-unicode-table-tp30317777p30341668.html
Sent from the FreeDOS - Dev mailing list archive at Nabble.com.

_______________________________________________
Freedos-devel mailing list
Freedos-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freedos-devel
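
As an illustration of the two points discussed in this message (the BOM check with its big-endian fallback, and a string whose length is given by a NUL terminator rather than a length field), here is a rough C sketch. It assumes the data is already known to be UTF-16; a UTF-32LE buffer starts with the same two bytes as a UTF-16LE BOM, so this check alone cannot tell those apart. The names utf16_byte_order and utf16z_length_bytes are made up for this example and are not taken from UNI2ASCI or from any FreeDOS interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for this sketch only. */
enum byte_order { ORDER_BIG_ENDIAN, ORDER_LITTLE_ENDIAN };

/* Decide the byte order of a buffer that is assumed to hold UTF-16.
 * If it starts with a BOM (U+FEFF), report the order the BOM implies and
 * set *skip to 2 so the caller can step over it. With no BOM, fall back
 * to big-endian and set *skip to 0. */
static enum byte_order utf16_byte_order(const unsigned char *buf,
                                        size_t len, size_t *skip)
{
    assert(buf != NULL && skip != NULL);
    *skip = 0;
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) {
            *skip = 2;
            return ORDER_BIG_ENDIAN;
        }
        if (buf[0] == 0xFF && buf[1] == 0xFE) {
            *skip = 2;
            return ORDER_LITTLE_ENDIAN;
        }
    }
    return ORDER_BIG_ENDIAN;    /* no BOM: assume big-endian */
}

/* Length in bytes of a UTF-16 string terminated by a NUL code unit
 * (0x0000), not counting the terminator. A NUL code unit is two zero
 * bytes in either byte order, so no byte-order argument is needed.
 * max_len bounds the scan; if no terminator is found, the even-sized
 * part of the buffer is reported. */
static size_t utf16z_length_bytes(const unsigned char *buf, size_t max_len)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i + 1 < max_len; i += 2) {
        if (buf[i] == 0 && buf[i + 1] == 0)
            return i;
    }
    return max_len & ~(size_t)1;
}
```

A caller would typically call utf16_byte_order first, advance the pointer by the returned skip value, and then use utf16z_length_bytes on the remainder. An interface that takes explicit byte lengths, as Christian suggests, would simply not need the second function.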