Apart from the implementation flaws already mentioned, I came across severe problems with the new AnsiString *model* in general. Let's play around with the Pos() function, which certainly is an indispensable part of any string handling.

A general function
 function Pos(SubStr: T1; Str: T2): integer;
should return the character index i of SubStr in Str, i.e. Str[i] must be the beginning of SubStr within Str.

It should also be possible to find the end of SubStr within Str, e.g. in order to return the remainder of the text.

With multiple coexisting string encodings we have to solve the following problems:

A reasonable result, i.e. an index into the given string in its own encoding T2, requires converting the search string SubStr into exactly that encoding. This takes two conversions, from T1 into UTF-8 (or UTF-16) and then into T2. Clearly this can be avoided by using strings of only one encoding, but what about string literals? When a string literal has to be converted, it most probably ends up in UTF-8/16 encoding, which would cause the Unicode version of Pos() to be called. The returned index then refers to the converted Unicode copy of Str, not to Str itself, so the result is wrong. Even if we assume that string literals are stored as native (CP_ACP) strings or as Unicode, which actually depends on compiler directives, a couple of overloaded Pos() functions would have to be added if an unwanted conversion of *both* arguments into UTF-16 is to be avoided.

The only possible solution, IMO, would be a
 function Pos(SubStr: UnicodeString; Str: RawByteString): integer;
in the *hope* that this version takes precedence over the all-Unicode version.

But once we have the beginning of the substring, how do we find its end?
Here Length(SubStr) is of little help, since it gives the number of bytes (or code units) in encoding T1, which is useless for T2. So we need a way to determine the length of a string in any (supported) encoding, like:
  function EncodedLength(s: string; cp: TEncoding): integer;
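
To illustrate why Length() cannot be carried over from one encoding to another, here is a minimal sketch. It assumes FPC 3.0 or later with code-page-aware AnsiStrings; the program name and the type TCP1252String are made up for this example, and on Unix the cwstring unit is assumed to be available for the conversion.

```pascal
program EncodedLengthDemo;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cwstring,{$endif}  // assumption: needed for conversions on Unix
  SysUtils;

type
  // Hypothetical name: an AnsiString that declares code page 1252.
  TCP1252String = type AnsiString(1252);

var
  u: UnicodeString;
  s8: UTF8String;
  s1: TCP1252String;
begin
  // Build "f" + u-umlaut + "r" without depending on the source file encoding.
  u := 'f';
  u := u + WideChar($00FC) + 'r';

  s8 := UTF8Encode(u);       // same text, stored as UTF-8
  s1 := TCP1252String(u);    // same text, converted to code page 1252

  WriteLn(Length(u));        // 3 UTF-16 code units
  WriteLn(Length(s8));       // 4 bytes, the umlaut takes two bytes in UTF-8
  WriteLn(Length(s1));       // 3 bytes, one byte per character
end.
```

The text is the same in all three variables, yet the length found in one encoding says nothing about where the substring ends in another.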

Or we add a function
 function EndPos(SubStr: T1; Str: T2): integer;
returning the index of the char following SubStr in Str.

Or we combine both, into
 function Pos2(SubStr: T1; Str: T2; out begIndex, endIndex: integer): boolean;
with the result indicating whether SubStr was found in Str.
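
A rough sketch of such a Pos2, only for the UnicodeString/RawByteString combination, could look like the following. It assumes FPC 3.0 or later, where SetCodePage, StringCodePage and a RawByteString overload of Pos() exist; the program name and the demo values are made up. It converts the search string once into the code page stored in Str and then searches byte-wise, so it is not a general solution: it does not handle a Str tagged CP_NONE, and in some double-byte code pages a byte-wise match can start in the middle of a character.

```pascal
program Pos2Demo;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cwstring,{$endif}  // assumption: needed for conversions on Unix
  SysUtils;

// Sketch only: begIndex is the byte index of the match in Str,
// endIndex the byte index of the first element after the match.
function Pos2(const SubStr: UnicodeString; const Str: RawByteString;
  out begIndex, endIndex: integer): boolean;
var
  needle: RawByteString;
begin
  // One conversion of the search string into the code page of Str.
  needle := UTF8Encode(SubStr);
  SetCodePage(needle, StringCodePage(Str), True);

  // Both arguments are RawByteString now, so no further conversion occurs.
  begIndex := Pos(needle, Str);
  Result := begIndex > 0;
  if Result then
    endIndex := begIndex + Length(needle)
  else
    endIndex := 0;
end;

var
  text16, sub16: UnicodeString;
  hay: UTF8String;
  b, e: integer;
begin
  text16 := 'x';
  text16 := text16 + WideChar($00FC) + 'yz';  // "x", u-umlaut, "yz"
  sub16 := WideChar($00FC);
  sub16 := sub16 + 'y';                       // u-umlaut, "y"

  hay := UTF8Encode(text16);                  // 5 bytes in UTF-8
  if Pos2(sub16, hay, b, e) then
    WriteLn(b, ' ', e);                       // for this input: 2 5
end.
```

Note that Length(sub16) is 2 here, while the match occupies 3 bytes in hay; only the length of the converted search string gives the right end index.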


But even if we implement all that, and use it *everywhere* in our code, the risk of any number of implicit encoding conversions remains :-(

Do you see any chance to reduce the number of possible conversions, other than by using only one single encoding throughout RTL and application code?

But what's the use of strings with a stored encoding, then? Except for strict compatibility with a flawed Delphi model and implementation, which may be dropped again in the next Delphi version?

DoDi
