Re: [lazarus] UTF-8 vs Unicode - Could someone explain?

Marc Weustink Fri, 20 Oct 2006 01:54:46 -0700

Graeme Geldenhuys wrote:

Ok, I'll start up front by announcing my ignorance on these two items:
UTF-8 and Unicode.


After reading some discussions of implementing Unicode/UTF-8 support
in Lazarus I thought I would ask.  Could someone give the watered down
explanation to me (and probably others too)?  My mind is trying to
wrap around the two concepts as one issue, but I believe I'm just
getting myself more confused.

I read in Wikipedia about Unicode and UTF-8, but still it makes no sense.

Also as an example in Object Pascal, what is involved in changing a
function (or app) that uses standard ANSI strings to support UTF-8 or
Unicode or whatever it should be called.

When am I supposed to use String and WideString? Must I change all
references of String to WideString?  Is WideString = UTF-8 or Unicode
or UTF-16 or UCS-2 (whatever the hell that is)?   See my problem... I
am totally lost. :-)


Okay, from what I recall (I'm too lazy to look it up):

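Unicode is the name for the character set itself: it gives every
character a number (a code point), and most of those need more than one
byte to represent. In the early days (and in MS docs) the name also
referred to the UCS-2 encoding. You use that with the Wide functions.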

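UCS and UTF are ways to encode those chars, where UCS was meant as fixed
width and UTF as variable width. However, a few years ago, one came to
the conclusion that not all characters would fit in a single UCS-2 set
of 65536 values (you need several of them, called planes). UTF-16 was
defined so that it matches UCS-2 for everything in the first plane and
uses two 16-bit units (a surrogate pair) for anything beyond it. So
nowadays what used to be UCS-2 is in practice treated as UTF-16.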
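
To make the "does not fit in one 16-bit value" point concrete, here is a
minimal sketch of how UTF-16 splits a code point above $FFFF into two
16-bit units. LeadSurrogate and TrailSurrogate are names made up for
this illustration, not RTL routines, and the code assumes Free Pascal in
objfpc mode.

```pascal
program SurrogateSketch;
{$mode objfpc}{$H+}

// LeadSurrogate and TrailSurrogate are hypothetical helpers, not RTL
// routines. Both assume CodePoint is in the range $10000..$10FFFF.
function LeadSurrogate(CodePoint: Cardinal): Word;
begin
  // Top 10 of the 20 bits left after subtracting $10000
  Result := Word($D800 + ((CodePoint - $10000) shr 10));
end;

function TrailSurrogate(CodePoint: Cardinal): Word;
begin
  // Low 10 of those 20 bits
  Result := Word($DC00 + ((CodePoint - $10000) and $3FF));
end;

begin
  // U+1D11E (musical G clef) lies outside the first plane
  WriteLn(HexStr(LeadSurrogate($1D11E), 4), ' ',
          HexStr(TrailSurrogate($1D11E), 4));   // prints D834 DD1E
end.
```

Code points up to $FFFF (outside the surrogate range) need no such
split, which is why old UCS-2 data is still valid UTF-16.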

If you read older documents, then UCS-2 is called Unicode.

Now, what does this mean for us?

AnsiStrings are 1 byte wide per element and WideStrings are 2 bytes wide
per element. That is fixed.

The Delphi WideStrings are the same as used by MS in their Wide
functions, being UCS-2 initially. So all Unicode chars would fit in
there (initially).
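
A small sketch of the fixed element widths, assuming Free Pascal in
objfpc mode. The last part stores one character from outside the first
plane (U+1D11E, chosen only as an example) to show that Length counts
16-bit elements, not characters.

```pascal
program WidthSketch;
{$mode objfpc}{$H+}
var
  A: AnsiString;
  W: WideString;
begin
  A := 'abc';
  W := 'abc';
  WriteLn(SizeOf(AnsiChar), ' ', SizeOf(WideChar));  // prints 1 2
  WriteLn(Length(A), ' ', Length(W));                // prints 3 3

  // One character outside the first plane (U+1D11E) stored as two
  // 16-bit elements: Length counts elements, not characters.
  SetLength(W, 2);
  W[1] := WideChar($D834);
  W[2] := WideChar($DD1E);
  WriteLn(Length(W));                                // prints 2
end.
```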

However, returning to this century, where everything is UTF, there is no
direct mapping between char and string index, since chars vary in width
(in theory).

For storage, you can use an AnsiString for storing UTF-8 and a
WideString for storing UTF-16.

It is just storage: since we don't have a UTF8String or UTF16String
type, the compiler doesn't know the contents of the string, so no
conversion is done.
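
As a rough illustration of "just storage", the sketch below puts UTF-8
bytes into a plain AnsiString and compares Length (bytes) with the
number of characters. Utf8CodePoints is a helper invented for this
example, not an RTL routine; it assumes the input is valid UTF-8, and
the string is spelled out as raw bytes so the source file encoding does
not matter.

```pascal
program Utf8LengthSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: counts code points by skipping continuation
// bytes (bit pattern 10xxxxxx). Assumes S holds valid UTF-8.
function Utf8CodePoints(const S: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S: AnsiString;
begin
  // "cafe" with e-acute as the last letter; e-acute is C3 A9 in UTF-8
  S := 'caf'#$C3#$A9;
  WriteLn(Length(S));          // prints 5 (bytes)
  WriteLn(Utf8CodePoints(S));  // prints 4 (characters)
end.
```

Length, Copy and indexing all keep working, but they count bytes, so
S[5] here is the second half of a character rather than a character.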


Can we use them like we used to? In most cases, yes.

(Speaking for UTF-8; the same goes for UTF-16.)

The way UTF is constructed, a byte value that represents a single-byte
char will never occur inside a multibyte sequence. This means that, for
instance, we can still use Pos() to find a character. However, the index
returned is a byte index, and it may be the start of a multibyte
sequence.

So the next char is located at index + length(utfchar).
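
A minimal sketch of both points, assuming Free Pascal in objfpc mode and
valid UTF-8 input. Utf8SeqLen is a made-up helper for this example, not
an RTL routine; it reads the sequence length from the lead byte. The
test string is spelled out as raw bytes.

```pascal
program Utf8StepSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: byte length of the UTF-8 sequence whose lead
// byte is B. A stray continuation or invalid byte is treated as
// length 1 so that the caller always advances.
function Utf8SeqLen(B: Byte): Integer;
begin
  if B < $80 then Result := 1                   // 0xxxxxxx: ASCII
  else if (B and $E0) = $C0 then Result := 2    // 110xxxxx
  else if (B and $F0) = $E0 then Result := 3    // 1110xxxx
  else if (B and $F8) = $F0 then Result := 4    // 11110xxx
  else Result := 1;
end;

var
  S: AnsiString;
  I: Integer;
begin
  // "naive" with i-diaeresis (bytes C3 AF), followed by ';x'
  S := 'na'#$C3#$AF've;x';

  WriteLn(Pos(';', S));        // prints 7: a byte index, not a char index
  WriteLn(Pos(#$C3#$AF, S));   // prints 3: start of the 2-byte sequence

  // Step from character start to character start
  I := 1;
  while I <= Length(S) do
  begin
    Write(I, ' ');             // writes 1 2 3 5 6 7 8
    Inc(I, Utf8SeqLen(Ord(S[I])));
  end;
  WriteLn;
end.
```

The ';' is the 6th character but sits at byte 7, which is the offset
Pos() reports and the one Copy() and Delete() expect.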

I hope this made some things clear.

Marc

Example:

function MyFooBar(const AStrValue: String): String;
begin
 // .... do whatever in here
 Result := <some string value>;
end;

....

var
 s, r: String;
begin
  s := 'Graeme';
  r := MyFooBar(s);
  ...
end;


Many thanks in advance,
 - Graeme -


