Paul Ishenin wrote:

What's CP_NONE? Value and purpose?

It is the RawByteString codepage. Its value is $FFFF, and its purpose is to indicate that a string has no codepage assigned. I think the compiler no longer produces strings with codepage $FFFF, although it did before. So we can probably remove these codepage checks from the RTL now.

Thanks :-)


It turned out that the result is correct only when at least one of the
strings is a UnicodeString. Otherwise Pos seems to end up in a
RawByteString compare, with the encoding ignored.

That's because a different Pos() overload is used when one argument is a UnicodeString. In this case the second, RawByteString argument is converted to UnicodeString, taking its encoding into account.

Pos accepts only strings of the same type, with AnsiStrings (any codepage) being passed as RawByteStrings. When one argument is a UnicodeString, the other argument is converted to Unicode as well. This again is a source of trouble, because
  pos(string(s1251), s866)
will return the index in the *Unicode* string, into which s866 is implicitly converted :-(

The following test also tends to fail:
  i := pos(string(s1251), sUtf8);
  rest := Copy(s866, i+Length(sUtf8), 10);
The first bug is the index, which is wrong when sUtf8 contains MBCS characters; the second bug is that the Length of the substring may differ between cp_866 and cp_UTF8.
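
A rough illustration of both effects, written in Python rather than Pascal so that the byte arithmetic is easy to check. Python bytes objects stand in for AnsiStrings of a given codepage; the sample text and the helper pos() are assumptions made for this sketch, and it says nothing about what the FPC RTL actually does.

```python
# Illustration only: bytes objects stand in for codepage-tagged AnsiStrings.
text = "привет мир"    # hypothetical sample text
needle = "мир"         # hypothetical substring

s866 = text.encode("cp866")     # single-byte codepage: 1 byte per character
s_utf8 = text.encode("utf-8")   # Cyrillic letters take 2 bytes each


def pos(sub: bytes, s: bytes) -> int:
    """1-based byte index in the style of Pascal's Pos(); 0 when not found."""
    return s.find(sub) + 1


# The same substring sits at different byte indexes in the two encodings.
i_866 = pos(needle.encode("cp866"), s866)
i_utf8 = pos(needle.encode("utf-8"), s_utf8)

# The substring's byte length differs as well.
len_866 = len(needle.encode("cp866"))
len_utf8 = len(needle.encode("utf-8"))

# A raw byte compare across codepages (cp1251 bytes searched in cp866 data)
# does not find the text at all.
i_raw = pos(needle.encode("cp1251"), s866)

print(i_866, i_utf8, len_866, len_utf8, i_raw)
```

An index or length obtained from one of these strings is therefore not safe to apply to the other.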

Unless the new AnsiString support is improved considerably (in Delphi or FPC), such string types are quite useless. At least it looks mandatory that the RTL, other packages *and* the application use only strings of the same encoding, so that no implicit conversions are necessary (except between AnsiString and UnicodeString, as is). Then also the old string header record can be used, no need to put in an encoding.


As a workaround I'd suggest that RawByteString Pos() converts the SubStr into the encoding of the *second* string, so that the comparison finds the correct index, applicable to the original string.
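
The suggested workaround could look roughly like the following sketch, again in Python with bytes objects standing in for AnsiStrings. The function name pos_converted and its explicit codepage parameters are hypothetical (in FPC the codepage would come from the string itself), and returning 0 when the substring cannot be represented in the target codepage is an assumption of this sketch.

```python
def pos_converted(sub: bytes, sub_cp: str, s: bytes, s_cp: str) -> int:
    """Re-encode sub into the codepage of s, then compare bytes.

    Returns a 1-based byte index into s (usable with s itself),
    or 0 when sub is empty or not found.
    """
    if not sub:
        return 0
    try:
        # Convert the substring into the encoding of the searched string.
        sub_in_s_cp = sub.decode(sub_cp).encode(s_cp)
    except UnicodeEncodeError:
        # sub contains characters that s's codepage cannot represent,
        # so it cannot occur in s (assumption: report "not found").
        return 0
    return s.find(sub_in_s_cp) + 1


s866 = "привет мир".encode("cp866")
sub1251 = "мир".encode("cp1251")

i = pos_converted(sub1251, "cp1251", s866, "cp866")
# The index applies directly to the original cp866 string.
found = s866[i - 1:i - 1 + 3].decode("cp866")
print(i, found)
```

Because only the substring is converted, the returned index refers to bytes of the original second string, which is the property the workaround is after.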


Old Pos() works without codepage conversions. The test I gave, as well as other tests, shows this.

Old Pos() and old AnsiString, as well as ShortString, assumed native encoding, so there was no need for codepage conversions. UTF-8 strings required special care, because no subroutine could detect the encoding of a string parameter. The new AnsiString types *should* cure that problem, but obviously they don't (yet) :-(

DoDi

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
