Christian Masloch wrote:
> 
> I think it should be accurate for file systems. Such a "useful"
> translation is a good concept for displaying output (maybe even that of
> the DIR command) but not for actually working with the file system. 
> Keyboard input can't map one key to several characters at once (unless you
> randomly (-; decide which one to use) so input handling should use
> one-to-one translation too.
> 

Agreed.  That's further evidence that both types of translation are
needed (depending on the specific application, even if the application is
"the kernel"), and that this is not a trivial matter.  Unicode is not the
panacea it's purported to be.


Christian Masloch wrote:
> 
> UTF-8 is independent of byte-order. The exact encoding (and byte-order)
> should always either be implicit (in the interface's or format's
> definition) or be marked in some way.
> 

I don't think there is a reliable way to automatically determine the
encoding from the data itself, and the only way to determine the byte-order
(assuming it's not UTF-8, not a single character, and is unknown from the
context) is to include the special BOM (Byte Order Mark) character as the
first character of the string.  In fact, according to the Unicode spec, if
the BOM is not included and the byte-order is not clear from the context,
you're supposed to assume big-endian.
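
As a rough sketch of the BOM rule described above: the function below
looks at the first bytes of a buffer and falls back to big-endian when no
BOM is present.  The names (text_encoding, detect_bom) are made up for
illustration, and it assumes the caller already knows the data is either
UTF-8 with a signature or some form of UTF-16; it makes no attempt to
recognize 32-bit encodings.

```c
#include <stddef.h>

/* Hypothetical names for this sketch; not taken from any existing library. */
enum text_encoding { ENC_UTF8, ENC_UTF16BE, ENC_UTF16LE };

/*
 * Inspect the start of a buffer assumed to hold UTF-8 (with signature)
 * or UTF-16 text.  Stores the number of BOM bytes to skip in *bom_len.
 * With no BOM and no other context, UTF-16 is taken to be big-endian.
 */
enum text_encoding detect_bom(const unsigned char *buf, size_t len,
                              size_t *bom_len)
{
    *bom_len = 0;

    /* U+FEFF encoded as UTF-8 is EF BB BF. */
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bom_len = 3;
        return ENC_UTF8;
    }
    /* U+FEFF stored high byte first. */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bom_len = 2;
        return ENC_UTF16BE;
    }
    /* U+FEFF stored low byte first. */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bom_len = 2;
        return ENC_UTF16LE;
    }
    /* No BOM: fall back to the big-endian default. */
    return ENC_UTF16BE;
}
```

A caller would skip bom_len bytes and then decode the rest according to
the returned value.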

For file system and similar applications, the interface could just always
assume a specific format (probably either UTF-8 or UTF-16LE).  For a
general-purpose interface, though, you should be able to handle all
different kinds of possibilities (including things like "UTF-24" and
"UTF-64").  Also, the fact that you're dealing with DOS doesn't necessarily
mean everything will be little-endian -- it depends on the source of the
data.  Certain hardware interfaces (like SCSI) are inherently big-endian,
and data downloaded from external sources can be almost anything.
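
To illustrate why the byte-order of the data and the byte-order of the
machine are separate questions, here is a small sketch that assembles a
16-bit code unit one byte at a time, so it gives the same answer on a
little-endian DOS box as anywhere else.  The function name read_unit16 and
its big_endian flag are hypothetical, chosen only for this example.

```c
/*
 * Read one 16-bit code unit from two bytes in the stated byte-order.
 * Works byte by byte, so the result does not depend on the CPU's own
 * endianness.  The casts to unsigned keep the shift well-defined even
 * where int is only 16 bits wide.
 */
unsigned int read_unit16(const unsigned char *p, int big_endian)
{
    if (big_endian)
        return ((unsigned int)p[0] << 8) | (unsigned int)p[1];
    return ((unsigned int)p[1] << 8) | (unsigned int)p[0];
}
```

The same two bytes 00 41 come out as U+0041 ('A') when read big-endian and
as U+4100 when read little-endian, which is why the order has to be known
or marked.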


Christian Masloch wrote:
> 
> The definition of a string's length (possibly number of
> bytes/words/dwords, number of code-points, number of "characters") need
> not be addressed by such an interface. If there is a need for a buffer or
> string length (see below) a new interface should just define that all
> "length" fields/parameters give the length in bytes.
> 

Another possibility is what my UNI2ASCI program does, which is to accept
strings terminated with a specific character (in my case, the Unicode NUL
character, conceptually similar to ASCIIZ).  A general-purpose program
should provide more than one way to define a string's length.  If you limit
input to only certain encodings or byte-orders or string/character types,
then it ceases to be "general-purpose".  Maybe a general-purpose program is
not what we're really talking about here, but I think one needs to be
developed.
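
The two ways of defining length can coexist.  As a minimal sketch (this is
not how UNI2ASCI is actually written, and the name utf16z_length_bytes is
made up), the function below takes a NUL-terminated 16-bit string plus a
buffer limit and returns its length in bytes.  It assumes 16-bit code
units; since a NUL unit is 00 00 in either byte-order, the byte-order does
not need to be known just to find the end.

```c
#include <stddef.h>

/*
 * Return the length in bytes of a string of 16-bit code units, not
 * counting the terminating NUL unit.  Scanning never goes past
 * max_bytes; if no terminator is found, the result is the number of
 * whole units that fit in max_bytes, so a caller who cares must check
 * for the terminator separately.
 */
size_t utf16z_length_bytes(const unsigned char *s, size_t max_bytes)
{
    size_t i = 0;

    /* Step one code unit (two bytes) at a time until both bytes are 0. */
    while (i + 1 < max_bytes && (s[i] != 0 || s[i + 1] != 0))
        i += 2;
    return i;
}
```

An interface that wants byte counts can call something like this once at
the boundary and pass explicit lengths from there on.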

Bret
-- 
View this message in context: 
http://old.nabble.com/ASCII-to-unicode-table-tp30317777p30341668.html
Sent from the FreeDOS - Dev mailing list archive at Nabble.com.

