Re: [fpc-devel] new string - question on usage

Hans-Peter Diettrich Thu, 13 Oct 2011 03:59:34 -0700

Paul Ishenin wrote:

What's CP_NONE? Value and purpose?
RawByteString codepage. Value $FFFF and purpose - to indicate that the string has no codepage assigned. I think at the moment the compiler does not produce strings of codepage $FFFF anymore, but before it did. So now we can probably clear the RTL of these codepage checks.


Thanks :-)

It turned out that the result only is correct when at least one of the
strings is an UnicodeString. Otherwise Pos seems to end up in a
RawByteString compare, with the encoding ignored.
That's because when one argument is a UnicodeString, another Pos() overload is used. In this case the second, RawByteString argument is converted to UnicodeString, taking its encoding into account.

Pos accepts only strings of the same type, with AnsiStrings (any codepage) being passed as RawByteStrings. When one argument is a UnicodeString, the other argument is converted to Unicode as well. This again is a source of trouble, because

  pos(string(s1251), s866)

will return the index in the *Unicode* string, into which s866 is implicitly converted :-(


The following test also tends to fail:
  i := pos(string(s1251), sUtf8);
  rest := Copy(s866, i+Length(sUtf8), 10);

The first bug is the index, which is wrong when sUtf8 contains MBCS characters; the second bug is the possibly different Length of the substring in cp_866 and cp_UTF8.
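
A minimal sketch of that mismatch, assuming FPC 3.x-style codepage-aware AnsiStrings. The type name TStr866 and the sample text are made up for illustration, and the cwstring unit is assumed to be needed for codepage conversions on Unix. Cyrillic letters take one byte each in CP866 but two bytes each in UTF-8, so both the length and the position of the trailing ASCII word should differ between the two strings:

```pascal
program IndexMismatch;
{$mode objfpc}{$H+}
{$ifdef unix}
uses
  cwstring; // widestring manager, assumed necessary for conversions on Unix
{$endif}

type
  TStr866 = type AnsiString(866); // illustrative name, not an RTL type

var
  U: UnicodeString;
  S866: TStr866;
  SUtf8: UTF8String;
begin
  // Six Cyrillic letters given as code points, followed by ASCII text,
  // so the encoding of this source file does not matter.
  U := #$043F#$0440#$0438#$0432#$0435#$0442' world';

  S866 := U;              // implicit conversion to CP866 (compiler may warn)
  SUtf8 := UTF8Encode(U); // same text, UTF-8 encoded

  // Byte lengths of the same text in the two encodings
  Writeln('Length in CP866: ', Length(S866));
  Writeln('Length in UTF-8: ', Length(SUtf8));

  // Byte index of the same ASCII word in the two encodings
  Writeln('Pos in CP866:    ', Pos('world', S866));
  Writeln('Pos in UTF-8:    ', Pos('world', SUtf8));
end.
```

An index obtained from one of these strings is therefore not usable for Copy() on the other one.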

Unless the new AnsiString support is improved considerably (in Delphi or FPC), such string types are quite useless. At least it looks mandatory that the RTL, other packages *and* the application use only strings of the same encoding, so that no implicit conversions are necessary (except between AnsiString and UnicodeString, as is). Then also the old string header record can be used, no need to put in an encoding.

As a workaround I'd suggest that RawByteString Pos() converts the SubStr into the encoding of the *second* string, so that the comparison finds the correct index, applicable to the original string.
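
Such a workaround might look roughly like the following sketch. PosTranscoded and the two string types are hypothetical names, not RTL routines or types; the sketch assumes FPC 3.x, where StringCodePage and SetCodePage are available and Pos() on two RawByteStrings compares bytes. It is not how the RTL implements Pos():

```pascal
program PosWorkaround;
{$mode objfpc}{$H+}
{$ifdef unix}
uses
  cwstring; // widestring manager, assumed necessary for conversions on Unix
{$endif}

type
  TStr866 = type AnsiString(866);   // illustrative names, not RTL types
  TStr1251 = type AnsiString(1251);

// Hypothetical helper: transcode SubStr into the dynamic codepage of S,
// then search, so the result is a byte index valid for S itself.
function PosTranscoded(const SubStr, S: RawByteString): SizeInt;
var
  Needle: RawByteString;
begin
  Needle := SubStr;
  // Only transcode when both strings carry a real, differing codepage.
  if (Length(Needle) > 0) and (Length(S) > 0) and
     (StringCodePage(Needle) <> CP_NONE) and
     (StringCodePage(S) <> CP_NONE) and
     (StringCodePage(Needle) <> StringCodePage(S)) then
    SetCodePage(Needle, StringCodePage(S), True); // True = convert the bytes
  Result := Pos(Needle, S);
end;

var
  UText, USub: UnicodeString;
  S866: TStr866;
  Sub1251: TStr1251;
begin
  // Cyrillic sample text given as code points (two words, one space)
  UText := #$043F#$0440#$0438#$0432#$0435#$0442' '#$043C#$0438#$0440;
  USub := #$043C#$0438#$0440;

  S866 := UText;   // implicit conversion to CP866
  Sub1251 := USub; // implicit conversion to CP1251

  // The index refers to S866, so it can be used with Copy(S866, ...)
  Writeln(PosTranscoded(Sub1251, S866));
end.
```

Because the search runs in the encoding of the second string, Length(Needle) after the conversion, not Length(SubStr), would be the right amount to skip past the match.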

Old Pos() works without codepage conversions. The test I gave and other tests show this.

Old Pos() and old AnsiString, as well as ShortString, assumed native encoding, so there existed no need for codepage conversions. UTF-8 strings deserved special care, because no subroutine could detect the encoding of a string parameter. The new AnsiString types *should* cure that problem, but obviously they don't (yet) :-(


DoDi

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
