On 11/27/2014 03:44 PM, Hans-Peter Diettrich wrote:
>> The "universal paradigm" would allow for extensions (e.g. UTF-32, multiple 16-bit code pages, an additional fully dynamic string type, n-byte "un-encoded" string types), as I described in the Wiki page.

> Even if feasible, such arbitrary string storage can dramatically increase the number of implicit string conversions.

Of course it can do harm in that respect, if the user is silly enough to *explicitly* define variables with an encoding brand without thinking about what he is doing. But this is exactly the same when he just uses the stuff currently offered by Delphi and fpc. If you arbitrarily define code pages for the variables holding your 8-bit ("ANSI") strings, you will force many conversions.

Currently in Delphi, if you don't define special code pages, everything will be UTF-16. So there are no unnecessary conversions.

In fpc (and maybe Lazarus, as well) I suppose the way currently in the works is (when not changing the default behavior by certain options):
- when compiling for Windows, "String" is UTF-16, and the RTL and LCL ubiquitously use "String": so no unnecessary conversion
- when compiling for Linux, "String" is UTF-8, and the RTL and LCL ubiquitously use "String": so no unnecessary conversion, either.

If this is done in the libraries (e.g. RTL and LCL) and in user code, it would allow for as few conversions as possible and thus the best performance. Here, you would need different library binaries, which might or might not be a problem.

But of course the portability is very questionable (including, but not limited to, the fact that the result of "pos" is different).
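To make the "pos" remark concrete, here is a small sketch in Python (not Pascal, and not fpc's actual implementation). It assumes that "pos" returns a 1-based index counted in code units, i.e. bytes for a UTF-8 "String" and 16-bit units for a UTF-16 "String". The helper name pos_code_units is made up for this illustration.

```python
def pos_code_units(needle: str, haystack: str, encoding: str) -> int:
    """Return the 1-based index of needle in haystack, counted in code
    units of the given encoding, or 0 if it is not found.

    Hypothetical helper that only mimics the counting convention assumed
    above; it is not the fpc RTL routine.
    """
    unit_size = {"utf-8": 1, "utf-16-le": 2}[encoding]
    if not needle:
        return 0
    hay = haystack.encode(encoding)
    ndl = needle.encode(encoding)
    start = 0
    while True:
        i = hay.find(ndl, start)
        if i < 0:
            return 0
        # Accept only matches that begin on a code unit boundary.
        if i % unit_size == 0:
            return i // unit_size + 1
        start = i + 1


text = "Grüße, Welt"
print(pos_code_units("Welt", text, "utf-8"))      # index in bytes
print(pos_code_units("Welt", text, "utf-16-le"))  # index in 16-bit units
```

The same search yields 10 for the UTF-8 representation and 8 for the UTF-16 one, because "ü" and "ß" each take two bytes in UTF-8 but only one 16-bit unit in UTF-16. User code that stores or compares such indices is therefore not portable between the two "String" flavors.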

When (on top of this) doing the interfaces to libraries (including TStrings) with "DynamicString" (encoding brand "CP_ANY"), no additional conversions would be necessary, because all other strings use the same encoding brand (either UTF-16 or UTF-8, depending on the OS), and hence the dynamic encoding of all DynamicStrings in use would always be exactly that brand. Hence, IMHO, this would do no harm at all, as the overhead of the code the compiler needs to generate to just check the dynamic brand and find that no conversion is necessary is extremely small.
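The brand check described here can be modeled roughly as follows. This is only a sketch in Python under my own assumptions, not the proposed fpc implementation: BrandedString and BrandConverter are made-up names, and Python codec names stand in for code page numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BrandedString:
    """Payload bytes plus the encoding brand they are stored in.

    Hypothetical stand-in for a string with a dynamic encoding brand.
    """
    data: bytes
    brand: str  # a Python codec name, used here instead of a code page number


class BrandConverter:
    """Converts only when the brands differ and counts real conversions."""

    def __init__(self) -> None:
        self.conversions = 0

    def to_brand(self, s: BrandedString, target: str) -> BrandedString:
        if s.brand == target:
            # Cheap path: one comparison, the payload is passed through as is.
            return s
        self.conversions += 1
        return BrandedString(s.data.decode(s.brand).encode(target), target)


native = "utf-8"  # assumption: the platform default brand
conv = BrandConverter()
s = BrandedString("Grüße".encode(native), native)

same = conv.to_brand(s, native)        # brands match, nothing is converted
other = conv.to_brand(s, "utf-16-le")  # brands differ, one conversion
print(conv.conversions)
```

As long as user code and libraries all use the platform default brand, only the cheap comparison is ever executed. A conversion is counted only when someone deliberately picks a different brand.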

But now the user has a choice!

- If he does not do anything regarding the encoding brand of his strings, he will not notice the existence of the DynamicString type at all, not even performance-wise. (But he might encounter portability issues.)
- If he decides that he wants to use a dedicated encoding brand in all or parts of his code, he of course needs to know what he is doing. This can result
   - in improved portability (if decently done)
   - in improved performance (if decently done), e.g. by using one-byte strings for compact storage of the information and two-byte strings for e.g. search loops, or by using the best-fitting encoding in the loops in the user code while allowing auto-conversion when accessing the libraries in case the underlying OS enforces a different encoding
   - in a disastrous increase of auto-conversions and thus performance degradation (if not decently done).


> An *efficient* implementation would be based on a single program-wide string representation, with different encodings being handled only in an exchange with external data sources.

Yep. But it would result in severe user code portability issues (see above). IMHO using DynamicString at the correct locations would not be (noticeably) less efficient but a lot more versatile.


> <Cassandra>
> After all I have the impression that the known RawByteString flaws will never be fixed in Delphi, in order to encourage the users to take the step to UnicodeString. Now the question is whether these flaws are fixed in FPC, or whether Lazarus will become the first project that definitely requires a complete move to UnicodeString for reliable operation.
> For best support of non-UTF-16 platforms I'd suggest to fix the flaws...
> </Cassandra>

I also don't think we will ever see a fix for the poor implementation of RawByteString (avoiding the word "flaw" and the suggestion of a bad purpose), because it would break existing user code. Regarding fpc, "correcting the flaws" while keeping the name RawByteString would result in incompatibility issues vs. Delphi and would break code that is ported from Delphi.

That is why fpc would need to define an additional type name (e.g. "DynamicString") and encoding brand number (e.g. "CP_ANY" = $FF00) for a decently usable type that can intermediately hold string content of any encoding. (see Wiki -> http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support )

RawXxxString can be used for really "uncoded" data, as is done with old-style strings in a lot of applications. Even if the "seriously flawed" auto-conversion should be implemented in fpc for RawByteString (for Delphi compatibility), the user can easily avoid it by not directly combining raw strings and strings with a different static encoding in one operation.

-Michael



_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
