Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"

Michael Schnell Fri, 28 Nov 2014 04:42:33 -0800

On 11/27/2014 03:44 PM, Hans-Peter Diettrich wrote:

The "universal paradigm" would allow for extensions (e.g. UTF-32, multiple 16 Bit Code pages, an additional fully dynamic String type, n-byte "un-encoded" string types), as I described in the Wiki page.
Even if feasible, such arbitrary string storage can dramatically increase the number of implicit string conversions.

Of course it can do harm in that respect, if the user is silly enough to *explicitly* define variables in a brand without thinking about what he is doing. But this is exactly the same when he just uses the stuff currently offered by Delphi and fpc. If you arbitrarily define code pages for variables for your 8 bit ("ANSI") strings you will enforce many conversions.

Currently in Delphi, if you don't define special code pages, everything will be UTF-16. So no unnecessary conversions.

In fpc (and maybe Lazarus, as well) I suppose the way currently in the works is (when not changing the Default behavior by certain options):
- when compiling for Windows, "String" is UTF-16, and the RTL and LCL ubiquitously use "String": So no unnecessary conversion
- when compiling for Linux, "String" is UTF-8, and the RTL and LCL ubiquitously use "String": So no unnecessary conversion, either.

If this is done in the libraries (e.g. RTL and LCL) and in user code, this would allow for as few conversions as possible and thus best performance. Here, you would need different library binaries, which might or might not be a problem.

But of course the portability is very questionable (including, but not limited to the fact that the result of "pos" is different).
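To make the "pos" remark concrete, here is a small model in Python (not FPC code). It counts positions in code units, which is how a Pascal-style Pos behaves on a UTF-8 or UTF-16 string. The helper name pos_in_code_units and the sample text are made up for this illustration.

```python
def pos_in_code_units(substr: str, s: str, encoding: str) -> int:
    """1-based index of substr in s, counted in code units of `encoding`.

    Returns 0 when substr is not found, mirroring the Pascal Pos convention.
    Only the two encodings discussed in the thread are modelled.
    """
    unit_size = {"utf-8": 1, "utf-16-le": 2}[encoding]
    haystack = s.encode(encoding)
    needle = substr.encode(encoding)
    start = 0
    while True:
        byte_index = haystack.find(needle, start)
        if byte_index < 0:
            return 0
        # Accept only matches that start on a code-unit boundary.
        if byte_index % unit_size == 0:
            return byte_index // unit_size + 1
        start = byte_index + 1


text = "Grüße, Welt"  # "ü" and "ß" take two bytes each in UTF-8
utf8_pos = pos_in_code_units("Welt", text, "utf-8")
utf16_pos = pos_in_code_units("Welt", text, "utf-16-le")
```

The same search yields a different index depending on the encoding of "String", so user code that stores or compares such indices is not portable between the two configurations.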

When (on top of this) doing the interfaces to libraries (including TStrings) with "DynamicString" (encoding brand "CP_ANY"), no additional conversions would be necessary: all other Strings use the same encoding brand (either UTF-16 or UTF-8, depending on the OS), and hence the dynamic encoding of all DynamicStrings used would always be exactly that brand. Hence, IMHO, this would not harm at all, as the overhead the compiler needs to implement to just check the dynamic type brand and find that no conversion is necessary is extremely small.
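The brand check described above can be sketched as follows. This is a Python model of the idea only, not the proposed FPC implementation: BrandedString, assign_to and the stats counter are hypothetical names, and Python codec names stand in for the encoding brand numbers.

```python
from dataclasses import dataclass

# Counts real transcoding operations; exists only for this illustration.
stats = {"conversions": 0}


@dataclass(frozen=True)
class BrandedString:
    """A string payload that carries its encoding brand at run time."""
    data: bytes
    brand: str  # e.g. "utf-8" or "utf-16-le"


def from_text(text: str, brand: str) -> BrandedString:
    """Build a branded string from Python text (helper for the example)."""
    return BrandedString(text.encode(brand), brand)


def assign_to(value: BrandedString, target_brand: str) -> BrandedString:
    """Assign a dynamically branded value to a statically branded variable."""
    if value.brand == target_brand:
        # Brands match: only the cheap comparison was paid, no transcoding.
        return value
    stats["conversions"] += 1
    text = value.data.decode(value.brand)
    return BrandedString(text.encode(target_brand), target_brand)


native = from_text("Grüße", "utf-8")
same = assign_to(native, "utf-8")       # same brand, passed through
other = assign_to(native, "utf-16-le")  # different brand, converted once
```

As long as all strings in a program share the platform's brand, every assignment takes the first branch, which is the case argued above to cost next to nothing.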


But now the user has a choice !

- If he does not do anything regarding the encoding brand of his strings, he will not notice the existence of the DynamicString Type at all. Not even Performance-wise. (But he might encounter portability issues.)
- If he decides that he wants to use a dedicated encoding brand in all or parts of his code, he of course needs to know what he is doing. This can result

   - in improved portability (if decently done)

   - in improved performance (if decently done), e.g. by using one-byte strings for compactly storing the information and two-byte strings for e.g. search loops, or using the best fitting encoding in the loops in the user code while allowing auto-conversion when accessing the libraries in case the underlying OS enforces a different encoding

   - in disastrous increase of auto-conversions and thus performance degradation (if not decently done).

An *efficient* implementation would be based on a single program-wide string representation, with different encodings being handled only in an exchange with external data sources.

Yep. But it would result in severe user code portability issues (see above). IMHO using DynamicString at the correct locations would not be (noticeably) less efficient but a lot more versatile.

<Cassandra>
After all I have the impression that the known RawByteString flaws will never be fixed in Delphi, in order to encourage the users to take the step to UnicodeString. Now the question is whether these flaws are fixed in FPC, or whether Lazarus will become the first project that definitely requires a complete move to UnicodeString, for reliable operation.
For best support of non-UTF-16 platforms I'd suggest to fix the flaws...
</Cassandra>

I also don't think we will ever see a fix for the poor implementation of RawByteString (avoiding the word flaw and the suggestion of a bad purpose), because it would break existing user code. Regarding fpc, "correcting the flaws" and keeping the name RawByteString would result in incompatibility issues vs Delphi and would break code that is ported from Delphi.

That is why fpc would need to define an additional type name (e.g. "DynamicString") and encoding brand number (e.g. "CP_ANY" = $FF00) for a decently usable type for intermediately holding a String content. (see Wiki -> http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support)

RawXxxString can be used for really "uncoded" data as done with old-style strings in a lot of applications. Even if "seriously flawed" auto-conversion might be implemented in fpc for RawByteString (for Delphi-compatibility), the user can easily avoid it by not directly combining RAW and differently statically encoded strings in an operation.


-Michael



