Re: [lazarus] UTF-8 vs Unicode - Could someone explain?

Marc Weustink Fri, 20 Oct 2006 08:14:07 -0700

Graeme Geldenhuys wrote:

Thanks Marc,


It was a really informative explanation, and helped a lot.  Now let's see
if I understood it correctly. :-)

AnsiStrings are 1 byte wide and WideStrings are 2 bytes wide. Fixed.
The Delphi WideStrings are the same as used by MS in their Wide
functions, being UCS-2 initially. So all Unicode chars would fit in
there (initially).


Now this brings me to another point which makes no sense! Naming
convention of functions.
Let's look at the following RTL function as an example:

* StrPos is used for 1 byte (8 bit) ANSI strings.

* AnsiStrPos is used for multi-byte (or is that 2 bytes max) UTF-8
strings. aka WideString.


AFAIR, UTF-8 characters can be built from up to 4 bytes (the original design allowed up to 6).

WideStrings are strings where each element consists of 1 word (= 2 bytes). Note that I call it an element and not a character, since a character may be built up from one or more elements.
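
To make the element/character distinction concrete, here is a minimal sketch in Free Pascal, assuming the WideString holds UTF-16. The helper ToSurrogates is a hypothetical name made up for this illustration; it splits one code point above U+FFFF into the two 16-bit elements that represent it.

```pascal
program SurrogateSketch;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper (not an RTL routine): split a code point above
// U+FFFF into two 16-bit elements, i.e. a UTF-16 surrogate pair.
procedure ToSurrogates(CodePoint: LongWord; out HighSur, LowSur: Word);
begin
  CodePoint := CodePoint - $10000;
  HighSur := Word($D800 + (CodePoint shr 10));    // upper 10 bits
  LowSur  := Word($DC00 + (CodePoint and $3FF));  // lower 10 bits
end;

var
  W: WideString;
  HighSur, LowSur: Word;
begin
  ToSurrogates($1D11E, HighSur, LowSur);  // U+1D11E, one character
  SetLength(W, 2);
  W[1] := WideChar(HighSur);
  W[2] := WideChar(LowSur);
  // Length counts elements, not characters.
  WriteLn('elements: ', Length(W));
  WriteLn(IntToHex(HighSur, 4), ' ', IntToHex(LowSur, 4));
end.
```

So a single character can occupy two elements of the WideString, and Length reports 2 for it.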

So why did Borland name it AnsiStrPos, when it doesn't operate on ANSI
strings!!  Why not name it Utf8StrPos or WideStrPos?  The prefix Ansi*
completely goes against what it does (operates on)! It doesn't work
with Ansi strings, but rather WideStrings.


nope, it does work on AnsiStrings

To summarize:
  Ansistring: array of 1 byte elements
  WideString: array of 1 word (=2 byte) elements
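
The same point for an AnsiString that happens to carry UTF-8 can be sketched as follows, assuming the present-day UTF-8 definition (1 to 4 bytes per code point) and well-formed input. Utf8SeqLen and Utf8CharCount are hypothetical helpers written for this example, not RTL functions.

```pascal
program Utf8LenSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: number of bytes in the UTF-8 sequence that
// starts with lead byte B, or 0 if B cannot start a sequence.
function Utf8SeqLen(B: Byte): Integer;
begin
  if B < $80 then Result := 1
  else if (B and $E0) = $C0 then Result := 2
  else if (B and $F0) = $E0 then Result := 3
  else if (B and $F8) = $F0 then Result := 4
  else Result := 0;
end;

// Hypothetical helper: count code points by skipping continuation
// bytes (10xxxxxx). Assumes S is well-formed UTF-8.
function Utf8CharCount(const S: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S: AnsiString;
begin
  SetLength(S, 3);
  S[1] := 'a';
  S[2] := Chr($C3);  // lead byte of U+00E9
  S[3] := Chr($A9);  // continuation byte of U+00E9
  WriteLn('lead byte $C3 starts a sequence of ', Utf8SeqLen($C3), ' bytes');
  WriteLn('elements: ', Length(S), ', characters: ', Utf8CharCount(S));
end.
```

Length still reports the number of 1-byte elements (3 here), while the string holds only 2 characters.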

Now here is another piece of code - parts removed for the purpose of
simplicity. To protect the innocent, we will keep the author
anonymous. ;-)

The code below is a function that outputs Unicode or Ansi text to a canvas.
Are my assumptions about this code correct?

1... I assume that the String type, as used in the parameter "AText:
String", can hold ANSI or Unicode text. The function supports both,
but isn't sure what it is going to get.

Yes

2... The function Utf8ToUnicode tells me, that it is going to process
a String (AText) to Unicode using the UTF-8 algorithm.

It assumes the string is UTF-8 encoded, and it returns the number of bytes needed to construct this string as a WideString.

3... The function Utf8ToAnsi tells me that it is going to convert a
String (AText), which might contain UTF-8 encoded text, to 8-bit ANSI
text.


Yes.

4... If this code only supported Unicode (UTF-8), we could have defined
the parameter AText as UTF8String and removed the second part of the if
statement inside the function.


The problem is that windows only supports 2 encodings:
1) Ansi, each char is coded in exactly 1 byte

2) Wide, where I don't know what MS currently supports. It used to be that each char was coded in exactly 1 word (= 2 bytes), being the old UCS-2 interpretation. Maybe they currently support UTF-16 on newer OSes.


So you either call Windows.TextOutA or Windows.TextOutW.

Note:

In the C headers the function TextOut is defined as either TextOutA or TextOutW, depending on a define that selects whether you want the Ansi or the Wide API.

On most pascal translations, the function TextOut is mapped to TextOutA
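
For illustration only, the compile-time selection done by the C headers can be imitated in Free Pascal with its macro support. This is a sketch, not how any actual Windows unit is written: DemoTextOutA, DemoTextOutW, DemoTextOut and the DEMO_UNICODE define are all hypothetical stand-ins for the real entry points and the real define.

```pascal
program AWMappingSketch;
{$mode objfpc}{$H+}
{$macro on}

// Hypothetical stand-ins for the two real API entry points.
procedure DemoTextOutA(const S: AnsiString);
begin
  WriteLn('A variant, ', Length(S), ' byte elements');
end;

procedure DemoTextOutW(const S: WideString);
begin
  WriteLn('W variant, ', Length(S), ' word elements');
end;

// Pick one variant at compile time, the way a C header does
// depending on a define. DEMO_UNICODE is a made-up symbol.
{$ifdef DEMO_UNICODE}
  {$define DemoTextOut := DemoTextOutW}
{$else}
  {$define DemoTextOut := DemoTextOutA}
{$endif}

begin
  // Resolves to DemoTextOutA unless DEMO_UNICODE is defined.
  DemoTextOut('abc');
end.
```

Compiled without DEMO_UNICODE, the unsuffixed name resolves to the A variant, which matches the default mapping described above.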

Marc

Is all this correct?

--------------------------------
procedure TGDICanvas.DoTextOut(const AText: String);
var
 Size: Integer;       // buffer size reported by Utf8ToUnicode
 WideText: PWideChar;
 AnsiText: string;
begin
 if UnicodeEnabledOS then
 begin
   Size := Utf8ToUnicode(nil, PChar(AText), 0);
   WideText := GetMem(Size);
   Utf8ToUnicode(WideText, PChar(AText), Size);
   dynWindows.TextOutW(.... WideText....);
   FreeMem(WideText);
 end
 else
 begin
   AnsiText := Utf8ToAnsi(AText);
   Windows.TextOut(.... PChar(AnsiText) .....);
 end;
end;
--------------------------------
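
As a side note to the listing above, the UTF-8 to wide conversion can also be sketched with the RTL's UTF8Decode, which returns a managed string so no manual GetMem/FreeMem is needed. This is only a minimal, self-contained sketch under that assumption; the variable names are made up and no drawing call is made.

```pascal
program DecodeSketch;
{$mode objfpc}{$H+}

uses
  SysUtils;

var
  Utf8Text: AnsiString;   // 1-byte elements holding UTF-8
  WideText: WideString;   // 2-byte elements
begin
  // Build the UTF-8 bytes of U+00E9 explicitly, so the result does
  // not depend on the encoding of this source file.
  SetLength(Utf8Text, 2);
  Utf8Text[1] := Chr($C3);
  Utf8Text[2] := Chr($A9);

  WideText := UTF8Decode(Utf8Text);

  WriteLn('UTF-8 elements: ', Length(Utf8Text));
  WriteLn('wide elements:  ', Length(WideText));
  WriteLn('first wide element: $', IntToHex(Ord(WideText[1]), 4));
end.
```

The two bytes of the UTF-8 input become a single 2-byte element, which could then be handed to a Wide API call as PWideChar(WideText).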


Regards,
 - Graeme -

_________________________________________________________________
    To unsubscribe: mail [EMAIL PROTECTED] with
               "unsubscribe" as the Subject
  archives at http://www.lazarus.freepascal.org/mailarchives

