Re: [fpc-pascal] Unicode chars losing information
On Tue, 9 Mar 2021, Florian Klämpfl via fpc-pascal wrote:

>> By using the necessary IFDEF mechanism in the config file, we can avoid
>> inserting it for windows (which does not need it) or the smaller embedded
>> platforms (which cannot handle it).
>>
>> People that don't need/want this can remove the config setting from the
>> file. All the others leave it as-is and will get their desired conversion
>> mechanisms 'for free'.
>>
>> This way a default choice is made for you on those platforms, but you can
>> still 100% control it.
>
> I am very much against this because this means that a default FPC
> executable would link against libc. And this is far too much only because
> a few people complain because they didn’t read the docs.

Well, maybe the Lazarus IDE can insert the necessary units, just like it is
done for cthreads...

Michael.
___
fpc-pascal maillist - fpc-pascal@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Re: [fpc-pascal] Unicode chars losing information
> On 09.03.2021 at 10:06, Michael Van Canneyt via fpc-pascal wrote:
>
>> On Tue, 9 Mar 2021, Graeme Geldenhuys via fpc-pascal wrote:
>>
>>> On 09/03/2021 1:44 am, Tomas Hajny via fpc-pascal wrote:
>>> UnicodeString may be used in a program simply because the included unit
>>> has it used in its interface. That may be the case even if there's no
>>> use of characters outside of US ASCII at all.
>>
>> So FPC rather goes with the fact that data may be *silently* lost during
>> encoding conversions? That doesn't seem like a safe default behaviour to
>> me.
>
> No, we give the programmer a choice:
>
> * Not use unicode conversion at all.
> * Use the C library to handle conversion (cwstring).
> * Use FPC native code to handle conversion (fpwidestring).
> * Some other means.
>
> Since the compiler cannot reliably detect that a choice was made, it also
> cannot make the choice for you, because the choice also cannot be undone
> by the compiler.
>
> This mechanism implies the programmer *has* to make that choice.
>
> This is not different from the threading driver mechanism, for which
> Lazarus adds some {$IFDEF } mechanisms in the program uses clause.
>
> But, I have been thinking about this. What we can do to alleviate this is
> the following:
>
> Use the -FaNNN option of the command line.
>
> This option will insert NNN implicitly in the uses clause of the program.
>
> So, we can add
>
> -Fafpwidestring
> or
> -Facwstring
>
> in the default generated fpc.cfg config file for selected platforms
> (mac, linux i386,64-bit, *bsd). The result will be that a widestring
> driver unit will be inserted by default for those platforms.
>
> By using the necessary IFDEF mechanism in the config file, we can avoid
> inserting it for windows (which does not need it) or the smaller embedded
> platforms (which cannot handle it).
>
> People that don't need/want this can remove the config setting from the
> file. All the others leave it as-is and will get their desired conversion
> mechanisms 'for free'.
>
> This way a default choice is made for you on those platforms, but you can
> still 100% control it.

I am very much against this because this means that a default FPC
executable would link against libc. And this is far too much only because a
few people complain because they didn’t read the docs.

> Michael.
Re: [fpc-pascal] Unicode chars losing information
On 2021-03-09 09:46, Graeme Geldenhuys via fpc-pascal wrote:

> On 09/03/2021 1:44 am, Tomas Hajny via fpc-pascal wrote:
>> UnicodeString may be used in a program simply because the included unit
>> has it used in its interface. That may be the case even if there's no
>> use of characters outside of US ASCII at all.
>
> So FPC rather goes with the fact that data may be *silently* lost during
> encoding conversions? That doesn't seem like a safe default behaviour to
> me.

The same happens e.g. if you configure your terminal to use a font that
doesn't contain all the characters which may appear in the output - the
compiler cannot know all the circumstances and thus cannot handle all of
them; among others due to the fact that there are decisions to be made
based on weighing pros and cons in the particular use case and those simply
aren't 'one size fits all'.

Tomas
Re: [fpc-pascal] Unicode chars losing information
On Tue, 9 Mar 2021, Mattias Gaertner via fpc-pascal wrote:

> On Tue, 9 Mar 2021 08:04:54 +0100
> Sven Barth via fpc-pascal wrote:
>
>> [...]
>> FPC is not Java. In FPC you have more fine-grained control over the
>> resulting binary than "install big, fat runtime". Not to mention that
>> FPC can target resource constrained systems as well.
>
> Optional is good.
>
> Maybe the defaults can be changed. For example the macOS dmg and
> Linux-x86-64 debs/rpms could install an fpc.cfg containing
>
>   #ifndef FPNonUnicode
>   -Facwstring
>   -Fcutf-8
>   #endif
>
> For minimal programs pass -dFPNonUnicode

Our mails crossed. That corresponds to what I proposed, with minor
differences. The additional #ifndef FPNonUnicode is also a good idea.

Michael.
Re: [fpc-pascal] Unicode chars losing information
On Tue, 9 Mar 2021, Graeme Geldenhuys via fpc-pascal wrote:

> On 09/03/2021 1:44 am, Tomas Hajny via fpc-pascal wrote:
>> UnicodeString may be used in a program simply because the included unit
>> has it used in its interface. That may be the case even if there's no
>> use of characters outside of US ASCII at all.
>
> So FPC rather goes with the fact that data may be *silently* lost during
> encoding conversions? That doesn't seem like a safe default behaviour to
> me.

No, we give the programmer a choice:

* Not use unicode conversion at all.
* Use the C library to handle conversion (cwstring).
* Use FPC native code to handle conversion (fpwidestring).
* Some other means.

Since the compiler cannot reliably detect that a choice was made, it also
cannot make the choice for you, because the choice also cannot be undone by
the compiler.

This mechanism implies the programmer *has* to make that choice.

This is not different from the threading driver mechanism, for which
Lazarus adds some {$IFDEF } mechanisms in the program uses clause.

But, I have been thinking about this. What we can do to alleviate this is
the following:

Use the -FaNNN option of the command line.

This option will insert NNN implicitly in the uses clause of the program.

So, we can add

-Fafpwidestring
or
-Facwstring

in the default generated fpc.cfg config file for selected platforms
(mac, linux i386,64-bit, *bsd). The result will be that a widestring driver
unit will be inserted by default for those platforms.

By using the necessary IFDEF mechanism in the config file, we can avoid
inserting it for windows (which does not need it) or the smaller embedded
platforms (which cannot handle it).

People that don't need/want this can remove the config setting from the
file. All the others leave it as-is and will get their desired conversion
mechanisms 'for free'.

This way a default choice is made for you on those platforms, but you can
still 100% control it.

Michael.
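For readers who have not seen the threading-driver pattern mentioned above,
here is a rough sketch of how the same kind of guard could look for a
widestring driver unit. The define name UseCWString is hypothetical (it is
modeled on the guard Lazarus generates around cthreads) and the program
name is made up; only the cwstring unit itself comes from the thread.

```pascal
program optin;

uses
  {$IF DEFINED(UNIX) AND DEFINED(UseCWString)}
  cwstring,  // libc-based widestring manager, only with -dUseCWString on Unix
  {$ENDIF}
  SysUtils;  // keeps the uses clause valid when the guard above is inactive

begin
  {$IF DEFINED(UNIX) AND DEFINED(UseCWString)}
  writeln('cwstring was included in this build');
  {$ELSE}
  writeln('no widestring driver unit was included in this build');
  {$ENDIF}
end.
```

Under these assumptions, compiling with "fpc -dUseCWString optin.pas" on a
Unix-like target would pull the unit in, and compiling without the define
would leave the binary free of the libc dependency.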
Re: [fpc-pascal] Unicode chars losing information
On Tue, 9 Mar 2021 08:04:54 +0100
Sven Barth via fpc-pascal wrote:

> [...]
> FPC is not Java. In FPC you have more fine-grained control over the
> resulting binary than "install big, fat runtime". Not to mention that
> FPC can target resource constrained systems as well.

Optional is good.

Maybe the defaults can be changed. For example the macOS dmg and
Linux-x86-64 debs/rpms could install an fpc.cfg containing

  #ifndef FPNonUnicode
  -Facwstring
  -Fcutf-8
  #endif

For minimal programs pass -dFPNonUnicode

Mattias
Re: [fpc-pascal] Unicode chars losing information
On 09/03/2021 1:44 am, Tomas Hajny via fpc-pascal wrote:
> UnicodeString may be used in a program simply because the included unit
> has it used in its interface. That may be the case even if there's no
> use of characters outside of US ASCII at all.

So FPC rather goes with the fact that data may be *silently* lost during
encoding conversions? That doesn't seem like a safe default behaviour to
me.

Regards,
Graeme
Re: [fpc-pascal] Unicode chars losing information
On Tue, 9 Mar 2021, Graeme Geldenhuys via fpc-pascal wrote:

> On 08/03/2021 2:49 pm, Michael Van Canneyt via fpc-pascal wrote:
>> In that sense, unicode conversion support is something optional and so
>> we require you to enable it explicitly, since enabling it has some
>> drawbacks:
>
> Surely if you explicitly use the UnicodeString type, the compiler should
> know you are using UTF-16 (the default encoding of said type), so why not
> include the required units implicitly. It doesn't make sense otherwise.

The system unit is full of unicodestring typed routines, same for sysutils.
Mostly they are overloads of single-byte versions of the same call.

Being on Linux, I use only the UTF8 single-byte version of these calls. So
no, I don't need UTF16 despite that these calls are present in units that I
am using.

So I know this, but the compiler does not.

Maybe with WPO the compiler would be able to deduce it, but even then I am
not sure it can establish this with 100% certainty.

Michael.
Re: [fpc-pascal] Unicode chars losing information
Graeme Geldenhuys via fpc-pascal wrote on Tue, 9 March 2021, 00:56:

> On 07/03/2021 5:48 pm, Nikolay Nikolov via fpc-pascal wrote:
>> It depends on what you mean by "just working".
>
> No, "just worked" is exactly what it says on the tin. It is FPC that is
> overcomplicating matters.
>
> As an example, here is Java that also uses UTF-16 encoding, just like
> FPC's UnicodeString type.
>
> $ cat UnicodeTest.java
> class UnicodeTest {
>
>     public static void main(String[] args) {
>         String s = "⌘⌥⌫⇧^";
>         System.out.println(s);
>         System.out.println(s.charAt(0));
>         System.out.println(String.format("%x",s.codePointAt(0)));
>     }
> }
>
> Now let's compile and run that.
>
> $> javac UnicodeTest.java
> $> java UnicodeTest
> Picked up _JAVA_OPTIONS: -Dawt.useSystemAAFontSettings=on
> ⌘⌥⌫⇧^
> ⌘
> 2318
>
> Yes, it just worked.

That is because the Java runtime contains all the conversion code
necessary. In FPC we simply don't do that, because it either requires
linking to the C library (which can easily be avoided, especially for
simple utilities) or requires a huge amount of conversion tables. Thus
developers need to explicitly opt in to Unicode conversions by including a
WideString manager.

FPC is not Java. In FPC you have more fine-grained control over the
resulting binary than "install big, fat runtime". Not to mention that FPC
can target resource constrained systems as well.

Regards,
Sven
Re: [fpc-pascal] Unicode chars losing information
On 3/9/21 2:18 AM, Graeme Geldenhuys via fpc-pascal wrote:

> On 08/03/2021 7:49 pm, Jonas Maebe via fpc-pascal wrote:
>> It's not possible to safely use unicodestring without
>> knowing how 16bit unicode works. The compiler can't solve that.
>
> I disagree. Java does just that! The issue is the assumption of using
> array indexing into a string. I guess developers should stop doing that.
>
> The important point is: But developer should be able to use Unicode
> strings without needing to know the ins and outs of Unicode and UTF-16
> encoding. At least that's what's possible with Java and other languages.

Yes, you absolutely need to know the ins and outs of Unicode in order to
know how to extract the first character of a string. First of all, what is
a character? A UTF-16 code unit, a Unicode code point or an extended
grapheme cluster? Your Java code only does the expected thing for a certain
subset of characters. If you write your code like that, you're going to
think your code works, but it would fail on strings with either non-BMP
characters (if you use charAt) or strings with combining characters (if you
use codePointAt). To split the string into user perceived characters you
need to do this in FPC trunk:

    uses
      graphemebreakproperty, fpwidestring;
    var
      EGC, S: UnicodeString;
    begin
      S := 'Хей, помисли́ си!';
      for EGC in TUnicodeStringExtendedGraphemeClustersEnumerator.Create(S) do
        Writeln(EGC);
    end.

Can Java do that? No, it appears it can't:

https://stackoverflow.com/questions/40878804/how-to-count-grapheme-clusters-or-perceived-emoji-characters-in-java

Neither charAt, nor codePointAt will work for the 'и́'. CharAt will also
fail at ''. Please correct me if I'm wrong, I didn't test this in Java.

> FPC (and Delphi) really need to get with the times.

If by "get with the times" you mean always include the fpwidestring unit
and still produce less bloat than the JVM, then sure, we can do that, but
some people appreciate the flexibility of choosing your own wide string
manager or not including it for programs that don't need it. And for things
like splitting a string into characters, you really need to know what
you're doing anyway, since a Unicode codepoint very rarely corresponds to
what users perceive as a character.

Nikolay
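The distinction between code units and code points can be illustrated
without any widestring manager, because the sketch below only prints
integers. The program name and the helper CountCodePoints are made up for
this illustration and are not RTL routines; the helper assumes well-formed
UTF-16 and simply skips the second half of each surrogate pair.

```pascal
{$mode objfpc}
program codepoints;

{ Hypothetical helper: counts code points by not counting low surrogates.
  Assumes the string is well-formed UTF-16. }
function CountCodePoints(const S: UnicodeString): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  for i := 1 to Length(S) do
    // $DC00..$DFFF is the low-surrogate range, i.e. the second half of a pair
    if (Ord(S[i]) < $DC00) or (Ord(S[i]) > $DFFF) then
      Inc(Result);
end;

var
  NonBmp, Combined: UnicodeString;
begin
  // U+1F600 as a surrogate pair, built from code units so that the result
  // does not depend on the source file encoding
  SetLength(NonBmp, 2);
  NonBmp[1] := WideChar($D83D);
  NonBmp[2] := WideChar($DE00);

  // U+0438 followed by U+0301 (combining acute accent)
  SetLength(Combined, 2);
  Combined[1] := WideChar($0438);
  Combined[2] := WideChar($0301);

  writeln('non-BMP:   ', Length(NonBmp), ' code units, ',
          CountCodePoints(NonBmp), ' code point(s)');
  writeln('combining: ', Length(Combined), ' code units, ',
          CountCodePoints(Combined), ' code point(s)');
end.
```

The first string has two code units but one code point; the second has two
code units and two code points although a user perceives a single
character. Counting code points is therefore still not the same as
splitting into extended grapheme clusters, which is what the enumerator in
the message above is for.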
Re: [fpc-pascal] Unicode chars losing information
On 08/03/2021 23:26, Tomas Hajny via fpc-pascal wrote:

> On 2021-03-08 21:36, Martin Frb via fpc-pascal wrote:
>> I can think of 2 groups already.
>>
>> 1) Conversion due to explicit declared different encoding.
>>    AnAnsiString := SomeWideString;
>>    AnAsciiString := AnUtf8String; // declared as
>>    "type AnsiString(CP_ASCII);" and "type AnsiString(CP_UTF8);"
>
> Do you mean a compile-time warning? The trouble is that the compiler
> wouldn't know whether a real widestring manager would get included in the
> final binary when such conversions are encountered. And remember that the
> final binary may be compiled at a different time from the moment when the
> unit containing such conversions is compiled. In other words,
> compile-time warnings would be rather difficult to implement.

Yes, I mean a compile time warning.
But, not in the above case. In the above case the users could kind of
reasonably be expected to know a widestring manager is needed.

However, IMHO that differs in the below case:

>> 2) Conversion where at least one string is not explicitly declared for a
>> certain codepage. This should include indirection via $codepage
>
> No, this is not the case. $codepage defines the source file encoding. The
> compiler translates the string constants declared this way to a UTF-16
> constant stored within the compiled binary. Specifying $codepage has no
> implications on runtime conversions by itself.

So "const Foo = 'abäö';" is always stored as utf-16?
That is something IMHO unexpected.

But more to the point:

    var
      s: AnsiString;
      s2: UnicodeString;
      s3: WideString;

    s := Foo;   s := 'abäö';
    s2 := Foo;  s2 := 'abäö';
    s3 := Foo;  s3 := 'abäö';

Does any of the assignments "s:=" or "s2:=", "s3:=" cause a conversion?
(For this it does not matter if this depends or does not depend on a
$Codepage / all that matters is, if there is some case in which it causes
conversion)

If it never causes a conversion, then I misread/misunderstood something.

If it does, it is IMHO very unexpected. After all why include a constant in
a way that it must still be computed before it can be used? I do not
include pi as a formula to be computed at runtime, I define it to the
precision I will need (and/or can store) as pre-computed constant of
3.14159

So if that causes a conversion, then that is worth a warning/note.
And IMHO it is worth a warning, even if a widestring manager is present.
Because that conversion which it causes is most likely not wanted by the
user.

>> - This could be given, even if the presence/absence of a widestring
>>   manager is not known. Because
>
> Because what?

Reason above. I hit send accidentally. I then decided to wait, and answer
it with the next response (i.e. now)

>> Obviously knowing the presence/absence of a widestring manager allows to
>> refine warnings. But I guess that comes at a higher price, as each unit
>> when compiled could only set flags in the ppu (including forwarding
>> flags from used units). And the compiling the final program would read
>> which warning flags are present, and if any unit flagged the inclusion
>> of a widestring manager.
>
> Yes, this would be indeed the only possibility.

On 08/03/2021 23:23, Michael Van Canneyt via fpc-pascal wrote:
> The compiler has no way to know if the widestring manager actually does a
> complete or even a good job. Maybe it just does logging

Even then, the mere fact that the user added a W.M. other than default,
would indicate that the user is aware, and hence does not need a
hint/warning.
Sure the user might not be aware..., but it's to catch common problems, not
every border/edge case.

Still, I agree that the "unit flag" solution is too costly to
implement/maintain.
Re: [fpc-pascal] Unicode chars losing information
On 08/03/2021 7:49 pm, Jonas Maebe via fpc-pascal wrote:
> It's not possible to safely use unicodestring without
> knowing how 16bit unicode works. The compiler can't solve that.

I disagree. Java does just that! The issue is the assumption of using array
indexing into a string. I guess developers should stop doing that.

The important point is: But developer should be able to use Unicode strings
without needing to know the ins and outs of Unicode and UTF-16 encoding. At
least that's what's possible with Java and other languages.

FPC needs to introduce class helpers or something with methods like
MyUnicodeString.CharAt(x) and if the char at position x is a surrogate,
then return the surrogate. Implicitly include whatever is needed to make
that work. Other helper methods could return the Byte or CodePoint at
position x - depending on what the developer wants. Naming these methods in
a logical way is key, as they become self-documenting. No need for 10 web
pages explaining how to work with a [unicode] string.

FPC (and Delphi) really need to get with the times.

Regards,
Graeme
Re: [fpc-pascal] Unicode chars losing information
On 08/03/2021 2:49 pm, Michael Van Canneyt via fpc-pascal wrote:
> In that sense, unicode conversion support is something optional and so we
> require you to enable it explicitly, since enabling it has some
> drawbacks:

Surely if you explicitly use the UnicodeString type, the compiler should
know you are using UTF-16 (the default encoding of said type), so why not
include the required units implicitly. It doesn't make sense otherwise.

Regards,
Graeme
Re: [fpc-pascal] Unicode chars losing information
On 07/03/2021 5:48 pm, Nikolay Nikolov via fpc-pascal wrote:
> It depends on what you mean by "just working".

No, "just worked" is exactly what it says on the tin. It is FPC that is
overcomplicating matters.

As an example, here is Java that also uses UTF-16 encoding, just like FPC's
UnicodeString type.

$ cat UnicodeTest.java
class UnicodeTest {

    public static void main(String[] args) {
        String s = "⌘⌥⌫⇧^";
        System.out.println(s);
        System.out.println(s.charAt(0));
        System.out.println(String.format("%x",s.codePointAt(0)));
    }
}

Now let's compile and run that.

$> javac UnicodeTest.java
$> java UnicodeTest
Picked up _JAVA_OPTIONS: -Dawt.useSystemAAFontSettings=on
⌘⌥⌫⇧^
⌘
2318

Yes, it just worked.

And contrary to what Marco was trying to imply, the "Place of Interest"
(aka MacOS CMD symbol) is within the BMP, thus only takes up 2 bytes
encoded as UTF-16, and should be able to be represented in FPC's Unicode
Char type.

Regards,
Graeme
Re: [fpc-pascal] Unicode chars losing information
On 2021-03-08 21:36, Martin Frb via fpc-pascal wrote:

> [...]
> In the example the index access should have returned a single codeunit,
> which was known to be a complete codepoint.
>
> As far as I understand the unexpected part was, that the unicode string
> did not contain the content of the string constant, because the
> assignment had caused an encoding conversion to be inserted. That
> conversion caused the need for a widestring manager.
>
> Maybe to help the search when/where and what for notes/warnings
> should/could be produced, those implicit conversions can be broken down
> into groups.
>
> I can think of 2 groups already.
>
> 1) Conversion due to explicit declared different encoding.
>    AnAnsiString := SomeWideString;
>    AnAsciiString := AnUtf8String; // declared as
>    "type AnsiString(CP_ASCII);" and "type AnsiString(CP_UTF8);"

Do you mean a compile-time warning? The trouble is that the compiler
wouldn't know whether a real widestring manager would get included in the
final binary when such conversions are encountered. And remember that the
final binary may be compiled at a different time from the moment when the
unit containing such conversions is compiled. In other words, compile-time
warnings would be rather difficult to implement.

It might be possible to error-out at runtime when such conversions are
invoked, but note that technically the conversion may not lead to incorrect
results if the string doesn't contain characters beyond US-ASCII. In other
words, a run-time error might be appropriate only if the conversion
encounters a character it cannot handle. However, adding such a check would
probably slow down processing even for cases when the strings don't contain
any problematic characters.

> 2) Conversion where at least one string is not explicitly declared for a
> certain codepage. This should include indirection via $codepage

No, this is not the case. $codepage defines the source file encoding. The
compiler translates the string constants declared this way to a UTF-16
constant stored within the compiled binary. Specifying $codepage has no
implications on runtime conversions by itself.

> Then maybe as a first step, a note/warning could be given, if a constant
> string is assigned to a variable, and a change of encoding is needed for
> this.
> - "constant string" here would be any string that does not have a direct
>   explicit declared encoding.
> - This could be given, even if the presence/absence of a widestring
>   manager is not known. Because

Because what?

> Obviously knowing the presence/absence of a widestring manager allows to
> refine warnings. But I guess that comes at a higher price, as each unit
> when compiled could only set flags in the ppu (including forwarding flags
> from used units). And the compiling the final program would read which
> warning flags are present, and if any unit flagged the inclusion of a
> widestring manager.

Yes, this would be indeed the only possibility.

Tomas
Re: [fpc-pascal] Unicode chars losing information
On Mon, 8 Mar 2021, Martin Frb via fpc-pascal wrote:

> Obviously knowing the presence/absence of a widestring manager allows to
> refine warnings.

It does not.

The compiler has no way to know if the widestring manager actually does a
complete or even a good job. Maybe it just does logging and calls the
previously registered widestringmanager. Maybe it replaces all with a
single chinese character for testing purposes, or replaces everything with
a 0.

Michael.
Re: [fpc-pascal] Unicode chars losing information
On 08/03/2021 20:49, Jonas Maebe via fpc-pascal wrote:

> On 08/03/2021 19:16, Ryan Joseph via fpc-pascal wrote:
>> I agree it would be nice to have some warning that indexing the
>> unicodeString wouldn't work as expected.
>
> Then the compiler would have to give a warning for any indexing of
> unicodestring. That would render it useless, because everyone would just
> turn it off. It's not possible to safely use unicodestring without
> knowing how 16bit unicode works. The compiler can't solve that.

Indexed access to a string, is different from implicitly inserted call to
encoding conversions.

In the example the index access should have returned a single codeunit,
which was known to be a complete codepoint.

As far as I understand the unexpected part was, that the unicode string did
not contain the content of the string constant, because the assignment had
caused an encoding conversion to be inserted. That conversion caused the
need for a widestring manager.

Maybe to help the search when/where and what for notes/warnings
should/could be produced, those implicit conversions can be broken down
into groups.

I can think of 2 groups already.

1) Conversion due to explicit declared different encoding.
   AnAnsiString := SomeWideString;
   AnAsciiString := AnUtf8String; // declared as
   "type AnsiString(CP_ASCII);" and "type AnsiString(CP_UTF8);"

2) Conversion where at least one string is not explicitly declared for a
   certain codepage. This should include indirection via $codepage

Then maybe as a first step, a note/warning could be given, if a constant
string is assigned to a variable, and a change of encoding is needed for
this.
- "constant string" here would be any string that does not have a direct
  explicit declared encoding.
- This could be given, even if the presence/absence of a widestring
  manager is not known. Because

Obviously knowing the presence/absence of a widestring manager allows to
refine warnings. But I guess that comes at a higher price, as each unit
when compiled could only set flags in the ppu (including forwarding flags
from used units). And the compiling the final program would read which
warning flags are present, and if any unit flagged the inclusion of a
widestring manager.
Re: [fpc-pascal] Unicode chars losing information
On 08/03/2021 19:16, Ryan Joseph via fpc-pascal wrote:
> I agree it would be nice to have some warning that indexing the
> unicodeString wouldn't work as expected.

Then the compiler would have to give a warning for any indexing of
unicodestring. That would render it useless, because everyone would just
turn it off. It's not possible to safely use unicodestring without knowing
how 16bit unicode works. The compiler can't solve that.

Jonas
Re: [fpc-pascal] Unicode chars losing information
So I was indeed able to solve the problem using {$codepage utf8} and using
the CWString unit. Does this do anything besides change the backend of the
UnicodeString/UnicodeChar type? I am using other string types in that unit
and I'm curious if I've put some kind of performance burden on the other
strings.

I agree it would be nice to have some warning that indexing the
unicodeString wouldn't work as expected.

Regards,
Ryan Joseph
Re: [fpc-pascal] Unicode chars losing information
On Mon, 8 Mar 2021, Tomas Hajny via fpc-pascal wrote:

> On 2021-03-08 15:49, Michael Van Canneyt via fpc-pascal wrote:
>> On Mon, 8 Mar 2021, Adriaan van Os via fpc-pascal wrote:
>>> Michael Van Canneyt via fpc-pascal wrote:
>>>> You didn't configure your environment to deal correctly with Unicode.
>>>
>>> Wow ! what a sentence ! That sounds like "you didn't configure your car
>>> correctly to also take corners to the right."
>>
>> A car that does not turn is unusable.
>>
>> Programs that don't need unicode conversions exist and are perfectly
>> usable.
>>
>> In that sense, unicode conversion support is something optional and so
>> we require you to enable it explicitly, since enabling it has some
>> drawbacks:
>>
>> - Links to C libs if you use cwstring
>> - Increases your binary substantially if you use fpwidestring and
>>   include all needed characters.
>
> The trouble is - when exactly should the supposed warning be issued? At
> compile time if there are Unicodestring variables and/or constants
> involved, but the Widestring manager is not included in the final binary

Provided you can detect to begin with that a "real" widestring manager is
included in the final binary...

Michael.
Re: [fpc-pascal] Unicode chars losing information
On Mon, 8 Mar 2021, Adriaan van Os via fpc-pascal wrote:

> Michael Van Canneyt wrote:
>> The output for me is the same, regardless of the -FcUTF-8 flag being
>> present or not: question marks.
>>
>> But if I add
>>
>>   uses cwstring;
>>
>> all will be well.
>>
>> Rationale:
>> Without that, the RTL cannot convert whatever the compiler wrote in the
>> binary to UTF8 to display it on the console.
>>
>> The compiler people will need to explain what exactly the compiler
>> writes with or without the flag.
>
> Well, this should at least produce a warning, if not an error. Silently
> producing the wrong code is not a good idea.

Strictly speaking, there is no wrong code produced:

You didn't configure your environment to deal correctly with Unicode.
You're using the default widestring manager, which simply skips any
non-ascii characters.

All this is documented in various places, for example:

https://www.freepascal.org/docs-html/rtl/system/unicodesupport.html

Michael.
Re: [fpc-pascal] Unicode chars losing information
Michael Van Canneyt wrote:

> The output for me is the same, regardless of the -FcUTF-8 flag being
> present or not: question marks.
>
> But if I add
>
>   uses cwstring;
>
> all will be well.
>
> Rationale:
> Without that, the RTL cannot convert whatever the compiler wrote in the
> binary to UTF8 to display it on the console.
>
> The compiler people will need to explain what exactly the compiler writes
> with or without the flag.

Well, this should at least produce a warning, if not an error. Silently
producing the wrong code is not a good idea.

Regards,

Adriaan van Os
Re: [fpc-pascal] Unicode chars losing information
On 2021-03-08 11:59, Adriaan van Os via fpc-pascal wrote:

Hi,

> adriaan% cat uniquizz-utf8.pas
> {$codepage utf8}
> program uniquizz;
> var
>   chars: UnicodeString;
> begin
>   chars := '⌘ key';
>   writeln(chars);
>   writeln(chars[1]);
>   writeln( 'size ', sizeOf( chars));
>   writeln( 'length ', length( chars));
> end.
>
> adriaan% fpc uniquizz-utf8.pas -FcUTF-8
> Free Pascal Compiler version 3.0.4 [2018/09/30] for x86_64
> Copyright (c) 1993-2017 by Florian Klaempfl and others
> Target OS: Darwin for x86_64
> Compiling uniquizz-utf8.pas
> Assembling (pipe) uniquizz-utf8.s
> Linking uniquizz-utf8
> 14 lines compiled, 0.1 sec
>
> [l24:~/gpc/testfpc] adriaan% ./uniquizz-utf8
> ? key
> ?
> size 8
> length 5
>
> This leaves me with a question mark too.

UnicodeString is a pointer from technical point of view, SizeOf
(UnicodeString) thus always returns 8 on 64-bit platforms regardless of the
string content. Michael already answered regarding the question mark output
- you need a widestring manager to translate the character from the
internal storage (UTF-16 - see uniquizz-utf8.s if compiled with -a) to your
terminal charset.

Tomas
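The point about SizeOf can be checked with a small sketch. The program name
is made up, and the string is kept to ASCII so that nothing here depends on
a widestring manager; only integers are printed.

```pascal
program sizeofdemo;

var
  s: UnicodeString;
begin
  s := 'abc';
  // size of the string reference itself, independent of the content
  writeln('SizeOf(s)       = ', SizeOf(s));
  writeln('SizeOf(Pointer) = ', SizeOf(Pointer));
  // number of UTF-16 code units currently stored
  writeln('Length(s)       = ', Length(s));
  // size of the character data in bytes
  writeln('byte count      = ', Length(s) * SizeOf(UnicodeChar));
end.
```

The first two values should match each other on any target, while the last
two change with the content. To measure the text, use Length rather than
SizeOf.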
Re: [fpc-pascal] Unicode chars losing information
On Mon, 8 Mar 2021, Adriaan van Os via fpc-pascal wrote:

> adriaan% cat uniquizz-utf8.pas
> {$codepage utf8}
> program uniquizz;
> var
>   chars: UnicodeString;
> begin
>   chars := '⌘ key';
>   writeln(chars);
>   writeln(chars[1]);
>   writeln( 'size ', sizeOf( chars));
>   writeln( 'length ', length( chars));
> end.
>
> adriaan% fpc uniquizz-utf8.pas -FcUTF-8
> Free Pascal Compiler version 3.0.4 [2018/09/30] for x86_64
> Copyright (c) 1993-2017 by Florian Klaempfl and others
> Target OS: Darwin for x86_64
> Compiling uniquizz-utf8.pas
> Assembling (pipe) uniquizz-utf8.s
> Linking uniquizz-utf8
> 14 lines compiled, 0.1 sec
> [l24:~/gpc/testfpc] adriaan% ./uniquizz-utf8
> ? key
> ?
> size 8
> length 5
>
> This leaves me with a question mark too.

The output for me is the same, regardless of the -FcUTF-8 flag being present or not: question marks.

But if I add

  uses cwstring;

all will be well.

Rationale: Without that, the RTL cannot convert whatever the compiler wrote in the binary to UTF8 to display it on the console.

The compiler people will need to explain what exactly the compiler writes with or without the flag.

Michael.
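For reference, the suggested change applied to the test program might look like the sketch below. The {$ifdef unix} guard is an addition made here, on the assumption that the same source should still compile on Windows, where the unit is not needed. Note that cwstring links the program against the C library.

```pascal
{$codepage utf8}
program uniquizz;
{$ifdef unix}
uses
  cwstring;  { installs the C library based widestring manager }
{$endif}
var
  chars: UnicodeString;
begin
  chars := '⌘ key';
  writeln(chars);     { converted to the terminal charset by the manager }
  writeln(chars[1]);  { a single WideChar goes through the same conversion }
end.
```

With a UTF-8 terminal, both writeln calls should then show the ⌘ character instead of a question mark, as described above.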
Re: [fpc-pascal] Unicode chars losing information
adriaan% cat uniquizz-utf8.pas
{$codepage utf8}
program uniquizz;
var
  chars: UnicodeString;
begin
  chars := '⌘ key';
  writeln(chars);
  writeln(chars[1]);
  writeln( 'size ', sizeOf( chars));
  writeln( 'length ', length( chars));
end.

adriaan% fpc uniquizz-utf8.pas -FcUTF-8
Free Pascal Compiler version 3.0.4 [2018/09/30] for x86_64
Copyright (c) 1993-2017 by Florian Klaempfl and others
Target OS: Darwin for x86_64
Compiling uniquizz-utf8.pas
Assembling (pipe) uniquizz-utf8.s
Linking uniquizz-utf8
14 lines compiled, 0.1 sec
[l24:~/gpc/testfpc] adriaan% ./uniquizz-utf8
? key
?
size 8
length 5

This leaves me with a question mark too.

Regards,

Adriaan van Os
Re: [fpc-pascal] Unicode chars losing information
On 2021-03-07 at 22:26, Bart via fpc-pascal wrote:

> On Sun, Mar 7, 2021 at 5:31 PM Marco van de Voort via fpc-pascal wrote:
>> Probably it is not in the BMP and thus needs more position than one.
>
> Length(Char) is 5 according to fpc, I see 5 "graphemes"

Indeed:

.Ld1$strlab:
        .short  1200,2
        .long   -1,5
.Ld1:
        .short  8984,8997,9003,8679,94,0

On win32 a quick test is hard, since displaying Unicode in the terminal is hard. But a write for "widechar" is called:

        movl    U_$P$PROGRAM_$$_CHARS,%eax
        movw    (%eax),%cx
        movl    %ebx,%edx
        movl    $0,%eax
        call    fpc_write_text_widechar

so it should be ok then.
Re: [fpc-pascal] Unicode chars losing information
On Sun, Mar 7, 2021 at 5:31 PM Marco van de Voort via fpc-pascal wrote:

> Probably it is not in the BMP and thus needs more position than one.

Length(Char) is 5 according to fpc, and I see 5 "graphemes", which suggests that all of them fit into 1 WideChar?

--
Bart
Re: [fpc-pascal] Unicode chars losing information
On 3/7/21 7:21 PM, Ryan Joseph via fpc-pascal wrote:

> On Mar 7, 2021, at 10:11 AM, Marco van de Voort via fpc-pascal wrote:
>> Yes it is. And there are about 1114000 unicode codepoints, or about 17
>> times what fits in a 2-byte wide char.
>>
>> https://en.wikipedia.org/wiki/Code_point
>>
>> https://en.wikipedia.org/wiki/UTF-16
>
> I thought unicode strings "just worked" but maybe that's UTF-8 and the
> character I want is maybe UTF-16. What are you supposed to do then?
> UnicodeString knows how to print the full string so all the data is
> there but I can't index to get characters unless I know their size.

It depends on what you mean by "just working". UnicodeString is a UTF-16 encoded string and a WideChar is just a UTF-16 code unit. Both UTF-8 and UTF-16 are variable length encodings. UTF-16 is just simpler to decode.

Note also that, even though a single Unicode codepoint might need two UTF-16 code units (i.e. WideChars), that is still not enough to represent what users perceive as a character. There are also plenty of Unicode combining characters. What most users perceive as a character is actually called an Extended Grapheme Cluster and is a sequence of Unicode code points. There's an algorithm (an enumerator) that splits a string into grapheme clusters, and that's implemented in FPC trunk in the GraphemeBreakProperty unit. It implements this algorithm:

http://www.unicode.org/reports/tr29/

This was done by me for the Unicode Free Vision port in the unicodekvm SVN branch, but it was already committed to trunk (the rest of the Unicode Free Vision still isn't), because it's a new unit that is relatively self-contained and provides new functionality (so, won't break existing code) that wasn't provided by the RTL before.

Note that normally, most programs wouldn't actually need to split a string into grapheme clusters, unless they implement something like a UI toolkit or a text editor or something of that sort. That's why it was needed for the Unicode Free Vision.
Nikolay
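The first level described in this message, going from UTF-16 code units to code points, can be sketched by hand as follows. This is not the GraphemeBreakProperty unit and it does not split grapheme clusters; NextCodePoint and the program name are hypothetical helpers written only for this illustration.

```pascal
program codepoints;  { hypothetical name, for illustration only }
{$mode objfpc}{$H+}
{$codepage utf8}

{ Hypothetical helper: returns the code point that starts at s[i] and
  advances i past the one or two WideChars that were consumed. }
function NextCodePoint(const s: UnicodeString; var i: SizeInt): Cardinal;
var
  w1, w2: Word;
begin
  w1 := Ord(s[i]);
  Inc(i);
  Result := w1;
  { a high surrogate followed by a low surrogate encodes one code point }
  if (w1 >= $D800) and (w1 <= $DBFF) and (i <= Length(s)) then
  begin
    w2 := Ord(s[i]);
    if (w2 >= $DC00) and (w2 <= $DFFF) then
    begin
      Result := $10000 + ((Cardinal(w1) - $D800) shl 10) + (Cardinal(w2) - $DC00);
      Inc(i);
    end;
  end;
end;

var
  chars: UnicodeString;
  i: SizeInt;
begin
  chars := '⌘⌥⌫⇧^';
  i := 1;
  while i <= Length(chars) do
    writeln(NextCodePoint(chars, i));  { code point as a decimal number }
end.
```

All five characters of the test string are in the BMP, so the surrogate branch is never taken for this input. The printed values should be the same numbers that appear in the .short line of the assembler listing posted elsewhere in this thread.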
Re: [fpc-pascal] Unicode chars losing information
> On Mar 7, 2021, at 10:21 AM, Ryan Joseph wrote:
>
> I thought unicode strings "just worked" but maybe that's UTF-8 and the
> character I want is maybe UTF-16. What are you supposed to do then?
> UnicodeString knows how to print the full string so all the data is
> there but I can't index to get characters unless I know their size.

Since this looks like it could be complicated, here is what I was actually trying to do with the FreeType library. This works for ASCII but broke down with those unicode chars. I'm confused now because you say the characters are more than 2 bytes, so I don't know what the actual size of an element is.

  for glyph in '⌘⌥⌫⇧^' do
    FT_Load_Char(m_face, ord(glyph), FT_LOAD_RENDER);

Regards,

Ryan Joseph
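One way to approach the loop above is to put the text into a UnicodeString first and pass the ordinal value of each element on. The sketch below makes several assumptions: the source is saved as UTF-8 and compiled with {$codepage utf8}, all characters are in the BMP (one WideChar each), and the glyph loader expects a Unicode code point. LoadGlyph is a made-up stand-in for the FT_Load_Char call so that the example runs without FreeType.

```pascal
program glyphloop;  { hypothetical name, for illustration only }
{$mode objfpc}{$H+}
{$codepage utf8}

{ Stand-in for the FT_Load_Char call: it only reports the character
  code it would have passed on. }
procedure LoadGlyph(charCode: Cardinal);
begin
  writeln('would load char code ', charCode);
end;

var
  glyphs: UnicodeString;
  i: Integer;
begin
  glyphs := '⌘⌥⌫⇧^';
  { each element is one 2-byte UTF-16 code unit (WideChar) }
  for i := 1 to Length(glyphs) do
    LoadGlyph(Ord(glyphs[i]));
end.
```

Characters outside the BMP would occupy two elements (a surrogate pair) and would have to be combined into one code point before being handed to the loader.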
Re: [fpc-pascal] Unicode chars losing information
> On Mar 7, 2021, at 10:11 AM, Marco van de Voort via fpc-pascal wrote:
>
> Yes it is. And there are about 1114000 unicode codepoints, or about 17
> times what fits in a 2-byte wide char.
>
> https://en.wikipedia.org/wiki/Code_point
>
> https://en.wikipedia.org/wiki/UTF-16

I thought unicode strings "just worked" but maybe that's UTF-8 and the character I want is maybe UTF-16. What are you supposed to do then? UnicodeString knows how to print the full string so all the data is there but I can't index to get characters unless I know their size.

Regards,

Ryan Joseph
Re: [fpc-pascal] Unicode chars losing information
On 2021-03-07 at 17:38, Ryan Joseph via fpc-pascal wrote:

> On Mar 7, 2021, at 9:31 AM, Marco van de Voort via fpc-pascal wrote:
>> Probably it is not in the BMP and thus needs more position than one.
>
> Isn't char[1] a 2 byte wide char? Not sure I understand "more position
> than one" though.

Yes it is. And there are about 1114000 unicode codepoints, or about 17 times what fits in a 2-byte wide char.

https://en.wikipedia.org/wiki/Code_point

https://en.wikipedia.org/wiki/UTF-16
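The numbers in this reply can be checked with a few lines of arithmetic. This is only a sketch with made-up constant names; it counts the whole Unicode code space from 0 to $10FFFF.

```pascal
program planes;  { hypothetical name, for illustration only }
const
  MaxCodePoint   = $10FFFF;  { highest Unicode code point }
  WideCharValues = $10000;   { distinct values of a 2-byte WideChar }
begin
  writeln('code points:     ', MaxCodePoint + 1);                       { 1114112 }
  writeln('WideChar values: ', WideCharValues);                         { 65536 }
  writeln('ratio:           ', (MaxCodePoint + 1) div WideCharValues);  { 17 }
end.
```

The ratio of 17 corresponds to the 17 planes of 65536 code points each, of which the BMP is the first.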
Re: [fpc-pascal] Unicode chars losing information
> On Mar 7, 2021, at 9:31 AM, Marco van de Voort via fpc-pascal wrote:
>
> Probably it is not in the BMP and thus needs more position than one.

Isn't char[1] a 2 byte wide char? Not sure I understand "more position than one" though.

Regards,

Ryan Joseph
Re: [fpc-pascal] Unicode chars losing information
On 2021-03-07 at 17:21, Ryan Joseph via fpc-pascal wrote:

> I came across a bug which was caused by a unicode character losing
> information and narrowed it down to this. Why doesn't the chars[1]
> print the same character as appeared in the string?
>
> var
>   chars: UnicodeString;
> begin
>   chars := '⌘⌥⌫⇧^';
>   writeln(chars);
>   writeln(chars[1]);
> end.
>
> Prints:
>
> ⌘⌥⌫⇧^
> ?

Probably it is not in the BMP and thus needs more position than one.
[fpc-pascal] Unicode chars losing information
I came across a bug which was caused by a unicode character losing information and narrowed it down to this. Why doesn't the chars[1] print the same character as appeared in the string?

var
  chars: UnicodeString;
begin
  chars := '⌘⌥⌫⇧^';
  writeln(chars);
  writeln(chars[1]);
end.

Prints:

⌘⌥⌫⇧^
?

Regards,

Ryan Joseph