Graeme Geldenhuys wrote:
Ok, I'll start up front by announcing my ignorance on these two items:
UTF-8 and Unicode.
After reading some discussions of implementing Unicode/UTF-8 support
in Lazarus I thought I would ask. Could someone give the watered down
explanation to me (and probably others too)? My mind is trying to
wrap around the two concepts as one issue, but I believe I'm just
getting myself more confused.
I read in Wikipedia about Unicode and UTF-8, but still it makes no sense.
Also, as an example in Object Pascal, what is involved in changing a
function (or app) that uses standard ANSI strings to support UTF-8 or
Unicode or whatever it should be called?
When am I supposed to use String and WideString? Must I change all
references of String to WideString? Is WideString = UTF-8 or Unicode
or UTF-16 or UCS-2 (whatever the hell that is)? See my problem... I
am totally lost. :-)
Okay, from what I recall (I'm too lazy to look it up):
Unicode is the standard that assigns a number (a code point) to every
character; there are far too many of them to fit in a single byte. In the
early days (and in MS docs) "Unicode" referred to UCS-2. You use it with
the Wide functions.
UCS and UTF are ways to encode those chars, where UCS was meant as
fixed width and UTF as variable width. However, some years ago it became
clear that not all characters would fit in the 65536 values of UCS-2
(you need more than one plane of that size).
UTF-16 was therefore defined as an extension of UCS-2: it is identical
for characters in the first plane and encodes everything else as a pair
of 2-byte units. So nowadays what used to be UCS-2 is in practice
treated as UTF-16.
If you read older documents, then UCS-2 is called Unicode.
Now what does this mean for us?
The elements of an AnsiString are 1 byte wide and those of a WideString
are 2 bytes wide. Fixed.
The Delphi WideStrings are the same as used by MS in their Wide
functions, being UCS-2 initially. So all Unicode chars would fit in
there (initially).
However, returning to this century, where everything is UTF, there is no
direct mapping between character and string index any more, since chars
vary in width (in theory).
For storage, you can use an AnsiString for storing UTF-8 and a WideString
for storing UTF-16.
It is just storage: since we don't have a UTF8String or UTF16String
type, the compiler doesn't know the contents of the string, so no
conversion is done.
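To make the "just storage" point concrete, here is a minimal sketch, assuming Free Pascal in objfpc mode with {$H+} (so String is an AnsiString). The helper CafeAsUtf8 is made up for illustration; it builds the UTF-8 bytes by hand so the result does not depend on the encoding of the source file.

```pascal
program Utf8Storage;
{$mode objfpc}{$H+}

// Hypothetical helper: returns 'café' encoded as UTF-8, built byte by byte.
function CafeAsUtf8: AnsiString;
begin
  Result := 'caf' + Chr($C3) + Chr($A9); // 'é' is the two bytes $C3 $A9
end;

var
  s: AnsiString;
begin
  s := CafeAsUtf8;
  // Length counts bytes, not characters: 4 characters occupy 5 bytes here.
  WriteLn('bytes: ', Length(s));
  // s[4] is only the first byte of 'é'; the compiler neither knows nor cares.
  WriteLn('s[4] = $', HexStr(Ord(s[4]), 2));
end.
```

The string behaves like any other AnsiString; it is up to the program to remember that the bytes are UTF-8.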
Can we use them like we used to? In most cases yes.
(Speaking for UTF-8; the same holds for UTF-16 with its 2-byte units.)
The way UTF-8 is constructed, a byte value that represents a single-byte
char never occurs inside a multibyte sequence.
This means that, for instance, we can still use Pos() to find a
character; however, the index returned is a byte index and may be the
start of a multibyte sequence.
So the next char is located at index + length(utfchar).
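A rough sketch of the "index + length(utfchar)" stepping, again assuming Free Pascal in objfpc mode with {$H+}. Utf8CharLen and Utf8CharCount are hypothetical helpers written for this mail, not RTL routines, and they do no validation of malformed input.

```pascal
program Utf8Step;
{$mode objfpc}{$H+}

// Hypothetical helper: length in bytes of the UTF-8 sequence that starts
// with the given lead byte.
function Utf8CharLen(Lead: Char): Integer;
begin
  case Ord(Lead) of
    $00..$7F: Result := 1;  // plain ASCII
    $C0..$DF: Result := 2;
    $E0..$EF: Result := 3;
    $F0..$F7: Result := 4;
  else
    Result := 1;            // continuation or invalid byte: step one byte
  end;
end;

// Hypothetical helper: number of characters (not bytes) in a UTF-8 string.
function Utf8CharCount(const S: AnsiString): Integer;
var
  i: Integer;
begin
  Result := 0;
  i := 1;
  while i <= Length(S) do
  begin
    Inc(Result);
    Inc(i, Utf8CharLen(S[i]));  // next char is at index + length(utfchar)
  end;
end;

var
  s: AnsiString;
begin
  s := 'caf' + Chr($C3) + Chr($A9) + '!';  // 'café!' as UTF-8 bytes
  WriteLn(Pos('!', s));       // byte index of '!', not its character index
  WriteLn(Utf8CharCount(s));  // characters, counted by stepping over sequences
end.
```

Here Pos() reports 6 for the '!' although it is the 5th character, because the 'é' before it takes two bytes.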
I hope this made some things clear.
Marc
Example:
function MyFooBar(const AStrValue: String): String;
begin
  // .... do whatever in here
  Result := <some string value>;
end;
....
var
  s, r: String;
begin
  s := 'Graeme';
  r := MyFooBar(s);
  ...
end;
Many thanks in advance,
- Graeme -