Re: [fpc-pascal] Unicode chars losing information

2021-03-09 Thread Michael Van Canneyt via fpc-pascal



On Tue, 9 Mar 2021, Florian Klämpfl via fpc-pascal wrote:


>> By using the necessary IFDEF mechanism in the config file, we can avoid
>> inserting it for windows (which does not need it) or the smaller embedded
>> platforms (which cannot handle it).
>>
>> People that don't need/want this can remove the config setting from the file.
>> All the others leave it as-is and will get their desired conversion mechanisms
>> 'for free'.
>>
>> This way a default choice is made for you on those platforms, but you can still
>> 100% control it.
>
> I am very much against this because this means that a default FPC
> executable would link against libc.  And this is far too much only because
> a few people complain because they didn’t read the docs.


Well, maybe the Lazarus IDE can insert the necessary units, just like it is
done for cthreads...

Michael.
___
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal


Re: [fpc-pascal] Unicode chars losing information

2021-03-09 Thread Florian Klämpfl via fpc-pascal


> On 09.03.2021 at 10:06, Michael Van Canneyt via fpc-pascal wrote:
> 
> 
> 
>> On Tue, 9 Mar 2021, Graeme Geldenhuys via fpc-pascal wrote:
>> 
>>> On 09/03/2021 1:44 am, Tomas Hajny via fpc-pascal wrote:
>>> UnicodeString may be used in a program simply because the included unit has 
>>> it used in its interface. That may be the case even if there's no use of 
>>> characters outside of US ASCII at all.
>> 
>> So FPC rather goes with the fact that data may be *silently* lost during
>> encoding conversions? That doesn't seem like a safe default behaviour to
>> me.
> 
> No, we give the programmer a choice:
> * Not use unicode conversion at all.
> * Use the C library to handle conversion (cwstring).
> * Use FPC native code to handle conversion (fpwidestring).
> * Some other means.
> 
> Since the compiler cannot reliably detect that a choice was made, it also 
> cannot make the choice for you, because the choice also cannot be undone by 
> the compiler.
> 
> This mechanism implies the programmer *has* to make that choice.
> 
> This is not different from the threading driver mechanism, for which Lazarus 
> adds
> some {$IFDEF } mechanisms in the program uses clause.
> 
> But, I have been thinking about this. What we can do to alleviate this is the 
> following:
> 
> Use the -FaNNN option of the command line.
> 
> This option will insert NNN implicitly in the uses clause of the program.
> 
> So, we can add -Fafpwidestring
> or
> -Facwstring
> 
> in the default generated fpc.cfg config file for selected platforms (mac, 
> linux
> i386,64-bit, *bsd). The result will be that a widestring driver unit will be 
> inserted by default for those platforms.
> 
> By using the necessary IFDEF mechanism in the config file, we can avoid
> inserting it for windows (which does not need it) or the smaller embedded 
> platforms
> (which cannot handle it).
> 
> People that don't need/want this can remove the config setting from the file. 
> All the others leave it as-is and will get their desired conversion mechanisms
> 'for free'.
> 
> This way a default choice is made for you on those platforms, but you can 
> still 100% control
> it.

I am very much against this because this means that a default FPC executable 
would link against libc. And this is far too much only because a few people 
complain because they didn’t read the docs.

> 
> Michael.


Re: [fpc-pascal] Unicode chars losing information

2021-03-09 Thread Tomas Hajny via fpc-pascal

On 2021-03-09 09:46, Graeme Geldenhuys via fpc-pascal wrote:

> On 09/03/2021 1:44 am, Tomas Hajny via fpc-pascal wrote:
>> UnicodeString may be used in a program simply because the included unit
>> has it used in its interface. That may be the case even if there's no
>> use of characters outside of US ASCII at all.
>
> So FPC rather goes with the fact that data may be *silently* lost during
> encoding conversions? That doesn't seem like a safe default behaviour to
> me.


The same happens e.g. if you configure your terminal to use a font that 
doesn't contain all the characters which may appear in the output - the 
compiler cannot know all the circumstances and thus cannot handle all of 
them; among others due to the fact that there are decisions to be made 
based on weighing pros and cons in the particular use case and those 
simply aren't 'one size fits all'.


Tomas


Re: [fpc-pascal] Unicode chars losing information

2021-03-09 Thread Michael Van Canneyt via fpc-pascal



On Tue, 9 Mar 2021, Mattias Gaertner via fpc-pascal wrote:


> On Tue, 9 Mar 2021 08:04:54 +0100
> Sven Barth via fpc-pascal wrote:
>
>> [...]
>> FPC is not Java. In FPC you have more fine-grained control over the
>> resulting binary than "install big, fat runtime". Not to mention that
>> FPC can target resource constrained systems as well.
>
> Optional is good.
>
> Maybe the defaults can be changed. For example the macOS dmg and
> Linux-x86-64 debs/rpms could install an fpc.cfg containing
>
> #ifndef FPNonUnicode
> -Facwstring
> -Fcutf-8
> #endif
>
> For minimal programs pass -dFPNonUnicode


Our mails crossed.

That corresponds to what I proposed, with minor differences. 
The additional #ifndef FPNonUnicode is also a good idea.


Michael.


Re: [fpc-pascal] Unicode chars losing information

2021-03-09 Thread Michael Van Canneyt via fpc-pascal



On Tue, 9 Mar 2021, Graeme Geldenhuys via fpc-pascal wrote:


> On 09/03/2021 1:44 am, Tomas Hajny via fpc-pascal wrote:
>> UnicodeString may be used in a program simply because the included unit
>> has it used in its interface. That may be the case even if there's no
>> use of characters outside of US ASCII at all.
>
> So FPC rather goes with the fact that data may be *silently* lost during
> encoding conversions? That doesn't seem like a safe default behaviour to
> me.


No, we give the programmer a choice:
* Not use unicode conversion at all.
* Use the C library to handle conversion (cwstring).
* Use FPC native code to handle conversion (fpwidestring).
* Some other means.

Since the compiler cannot reliably detect that a choice was made, 
it also cannot make the choice for you, because the choice also cannot 
be undone by the compiler.


This mechanism implies the programmer *has* to make that choice.

This is not different from the threading driver mechanism, for which Lazarus adds
some {$IFDEF } mechanisms in the program uses clause.
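
For illustration, a minimal sketch of what making that choice explicitly can look like in a program's uses clause. It assumes a Unix target and a source file saved as UTF-8; the program name is made up, and the {$IFDEF UNIX} guard mirrors the pattern used for cthreads.

```pascal
program widestringchoice;

{$mode objfpc}{$H+}
{$codepage utf8}   // assumption: this source file is saved as UTF-8

uses
  {$IFDEF UNIX}
  cwstring,        // widestring manager backed by the C library
  {$ENDIF}
  SysUtils;        // a unit outside the IFDEF keeps the clause valid elsewhere

var
  S: UnicodeString;
begin
  S := '⌘ key';
  // Writing a UnicodeString to the console goes through the installed
  // widestring manager, which is why the unit above matters here.
  WriteLn(S);
end.
```

Swapping cwstring for fpwidestring would select the native implementation instead, which avoids the libc dependency at the cost of a larger binary.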

But, I have been thinking about this. What we can do to alleviate this is the 
following:

Use the -FaNNN option of the command line.

This option will insert NNN implicitly in the uses clause of the program.

So, we can add
-Fafpwidestring
or
-Facwstring
in the default generated fpc.cfg config file for selected platforms (mac, linux
i386,64-bit, *bsd). The result will be that a widestring driver unit will be
inserted by default for those platforms.

By using the necessary IFDEF mechanism in the config file, we can avoid
inserting it for windows (which does not need it) or the smaller embedded
platforms (which cannot handle it).

People that don't need/want this can remove the config setting from the file.
All the others leave it as-is and will get their desired conversion mechanisms
'for free'.

This way a default choice is made for you on those platforms, but you can still
100% control it.

Michael.


Re: [fpc-pascal] Unicode chars losing information

2021-03-09 Thread Mattias Gaertner via fpc-pascal
On Tue, 9 Mar 2021 08:04:54 +0100
Sven Barth via fpc-pascal  wrote:

>[...]
> FPC is not Java. In FPC you have more fine-grained control over the
> resulting binary than "install big, fat runtime". Not to mention that
> FPC can target resource constrained systems as well.

Optional is good.

Maybe the defaults can be changed. For example the macOS dmg and
Linux-x86-64 debs/rpms could install an fpc.cfg containing

#ifndef FPNonUnicode
-Facwstring
-Fcutf-8
#endif

For minimal programs pass -dFPNonUnicode

Mattias


Re: [fpc-pascal] Unicode chars losing information

2021-03-09 Thread Graeme Geldenhuys via fpc-pascal
On 09/03/2021 1:44 am, Tomas Hajny via fpc-pascal wrote:
> UnicodeString may be used in a program simply because the included unit 
> has it used in its interface. That may be the case even if there's no 
> use of characters outside of US ASCII at all.

So FPC rather goes with the fact that data may be *silently* lost during
encoding conversions? That doesn't seem like a safe default behaviour to
me.

Regards,
  Graeme


Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Michael Van Canneyt via fpc-pascal



On Tue, 9 Mar 2021, Graeme Geldenhuys via fpc-pascal wrote:



> On 08/03/2021 2:49 pm, Michael Van Canneyt via fpc-pascal wrote:
>> In that sense, unicode conversion support is something optional and so we
>> require you to enable it explicitly, since enabling it has some drawbacks:
>
> Surely if you explicitly use the UnicodeString type, the compiler should
> know you are using UTF-16 (the default encoding of said type), so why not
> include the required units implicitly. It doesn't make sense otherwise.


The system unit is full of unicodestring typed routines, same for sysutils.
Mostly they are overloads of single-byte versions of the same call.

Being on Linux, I use only the UTF8 single-byte version of these calls.

So no, I don't need UTF16 even though these calls are present in units that I am
using. So I know this, but the compiler does not.

Maybe with WPO the compiler would be able to deduce it, but even then I am
not sure it can establish this with 100% certainty.

Michael.


Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Sven Barth via fpc-pascal
Graeme Geldenhuys via fpc-pascal wrote on Tue, 9 March 2021, 00:56:

>
> On 07/03/2021 5:48 pm, Nikolay Nikolov via fpc-pascal wrote:
> > It depends on what you mean by "just working".
>
> No, "just worked" is exactly what it says on the tin. It is FPC that is
> overcomplicating matters.
>
>
> As an example, here is Java that also uses UTF-16 encoding, just like
> FPC's UnicodeString type.
>
> 
> $ cat UnicodeTest.java
> class UnicodeTest {
>
> public static void main(String[] args) {
> String s = "⌘⌥⌫⇧^";
> System.out.println(s);
> System.out.println(s.charAt(0));
> System.out.println(String.format("%x",s.codePointAt(0)));
> }
> }
> 
>
> Now let's compile and run that.
>
> $> javac UnicodeTest.java
> $> java UnicodeTest
> Picked up _JAVA_OPTIONS: -Dawt.useSystemAAFontSettings=on
> ⌘⌥⌫⇧^
> ⌘
> 2318
>
> Yes, it just worked.
>

That is because the Java runtime contains all the conversion code
necessary. In FPC we simply don't do that, because it either requires linking
to the C library (which can easily be avoided, especially for simple
utilities) or requires a huge amount of conversion tables. Thus developers
need to explicitly opt in to Unicode conversions by including a
WideString manager.

FPC is not Java. In FPC you have more fine-grained control over the
resulting binary than "install big, fat runtime". Not to mention that FPC
can target resource constrained systems as well.

Regards,
Sven


Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Nikolay Nikolov via fpc-pascal


On 3/9/21 2:18 AM, Graeme Geldenhuys via fpc-pascal wrote:

> On 08/03/2021 7:49 pm, Jonas Maebe via fpc-pascal wrote:
>> It's not possible to safely use unicodestring without
>> knowing how 16bit unicode works. The compiler can't solve that.
>
> I disagree. Java does just that! The issue is the assumption of using
> array indexing into a string. I guess developers should stop doing
> that.
>
> The important point is:
> But developers should be able to use Unicode strings without needing
> to know the ins and outs of Unicode and UTF-16 encoding. At least
> that's what's possible with Java and other languages.


Yes, you absolutely need to know the ins and outs of Unicode in order to 
know how to extract the first character of a string. First of all, what 
is a character? A UTF-16 code unit, a Unicode code point or an extended 
grapheme cluster? Your Java code only does the expected thing for a 
certain subset of characters. If you write your code like that, you're 
going to think your code works, but it would fail on strings with either 
non-BMP characters (if you use charAt) or strings with combining 
characters (if you use codePointAt). To split the string into user 
perceived characters you need to do this in FPC trunk:


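{$mode objfpc}{$H+}  // added here so that for..in and classes are available
{$codepage utf8}     // assumes the source file is saved as UTF-8

uses
  graphemebreakproperty, fpwidestring;

var
  EGC, S: UnicodeString;
begin
  S := 'Хей, помисли́ си!';
  // Each EGC is one user-perceived character (extended grapheme cluster)
  for EGC in TUnicodeStringExtendedGraphemeClustersEnumerator.Create(S) do
    Writeln(EGC);
end.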


Can Java do that? No, it appears it can't:

https://stackoverflow.com/questions/40878804/how-to-count-grapheme-clusters-or-perceived-emoji-characters-in-java


Neither charAt, nor codePointAt will work for the 'и́'. CharAt will also 
fail at ''. Please correct me if I'm wrong, I didn't test this in Java.




> FPC (and Delphi) really need to get with the times.


If by "get with the times" you mean always include the fpwidestring unit 
and still produce less bloat than the JVM, then sure, we can do that, 
but some people appreciate the flexibility of choosing your own wide 
string manager or not including it for programs that don't need it.


And for things like splitting a string into characters, you really need 
to know what you're doing anyway, since a Unicode codepoint very rarely 
corresponds to what users perceive as a character.
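
To make the 'и́' case concrete, here is a small sketch (program name made up) that builds that character from its two code points directly, so no source codepage or widestring manager is involved, and prints what indexing actually sees:

```pascal
program combiningdemo;

{$mode objfpc}{$H+}

var
  S: UnicodeString;
  I: SizeInt;
begin
  // One user-perceived character made of two BMP code points
  SetLength(S, 2);
  S[1] := WideChar($0438);  // CYRILLIC SMALL LETTER I
  S[2] := WideChar($0301);  // COMBINING ACUTE ACCENT

  WriteLn('UTF-16 code units: ', Length(S));
  for I := 1 to Length(S) do
    WriteLn('  U+', HexStr(Ord(S[I]), 4));
end.
```

Both units are in the BMP, so code units and code points coincide here, and S[1] alone still yields only the base letter without its accent.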


Nikolay




Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Martin Frb via fpc-pascal

On 08/03/2021 23:26, Tomas Hajny via fpc-pascal wrote:

> On 2021-03-08 21:36, Martin Frb via fpc-pascal wrote:
>> I can think of 2 groups already.
>> 1) Conversion due to explicit declared different encoding.
>>    AnAnsiString := SomeWideString;
>>    AnAsciiString := AnUtf8String; // declared as "type
>> AnsiString(CP_ASCII);" and "type AnsiString(CP_UTF8);"
>
> Do you mean a compile-time warning? The trouble is that the compiler
> wouldn't know whether a real widestring manager would get included in
> the final binary when such conversions are encountered. And remember
> that the final binary may be compiled at a different time from the
> moment when the unit containing such conversions is compiled. In other
> words, compile-time warnings would be rather difficult to implement.

Yes, I mean a compile time warning.
But, not in the above case. In the above case the users could kind of 
reasonably be expected to know a widestring manager is needed.


However, IMHO that differs in the below case:


>> 2) Conversion where at least one string is not explicitly declared for
>> a certain codepage.
>>    This should include indirection via $codepage
>
> No, this is not the case. $codepage defines the source file encoding.
> The compiler translates the string constants declared this way to a
> UTF-16 constant stored within the compiled binary. Specifying
> $codepage has no implications on runtime conversions by itself.

So "const Foo = 'abäö';" is always stored as utf-16?
That is something IMHO unexpected.

But more to the point
  var s: AnsiString
  var s2: UnicodeString
  var s3: WideString
  s := Foo;
  s := 'abäö';
  s2 := Foo;
  s2 := 'abäö';
  s3 := Foo;
  s3 := 'abäö';

Do any of the assignments "s:=", "s2:=" or "s3:=" cause a conversion?
(For this it does not matter if this depends or does not depend on a 
$Codepage / all that matters is, if there is some case in which it 
causes conversion)


If it never causes a conversion, then I misread/misunderstood something.

If it does, it is IMHO very unexpected. After all why include a constant 
in a way that it must still be computed before it can be used?
I do not include pi as a formula to be computed at runtime, I define it 
to the precision I will need (and/or can store) as pre-computed constant 
of 3.14159


So if that causes a conversion, then that is worth a warning/note.
And IMHO it is worth a warning, even if a widestring manager is present. 
Because that conversion which it causes is most likely not wanted by the 
user.




>> - This could be given, even if the presence/absence of a widestring
>> manager is not known. Because
>
> Because what?

Reason above.
I hit send accidentally. I then decided to wait, and answer it with the
next response (i.e. now)




>> Obviously knowing the presence/absence of a widestring manager allows
>> to refine warnings.
>> But I guess that comes at a higher price, as each unit when compiled
>> could only set flags in the ppu (including forwarding flags from used
>> units).
>> And then compiling the final program would read which warning flags are
>> present, and if any unit flagged the inclusion of a widestring
>> manager.
>
> Yes, this would be indeed the only possibility.

On 08/03/2021 23:23, Michael Van Canneyt via fpc-pascal wrote:
> The compiler has no way to know if the widestring manager actually does a
> complete or even a good job. Maybe it just does logging


Even then, the mere fact that the user added a W.M. other than default, 
would indicate that the user is aware, and hence does not need a 
hint/warning.
Sure the user might not be aware..., but it's to catch common problems, 
not every border/edge case.


Still, I agree that the "unit flag" solution is too costly to 
implement/maintain.











Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Graeme Geldenhuys via fpc-pascal

On 08/03/2021 7:49 pm, Jonas Maebe via fpc-pascal wrote:
> It's not possible to safely use unicodestring without
> knowing how 16bit unicode works. The compiler can't solve that.

I disagree. Java does just that! The issue is the assumption of using
array indexing into a string. I guess developers should stop doing
that.

The important point is:
But developers should be able to use Unicode strings without needing
to know the ins and outs of Unicode and UTF-16 encoding. At least
that's what's possible with Java and other languages.

FPC needs to introduce class helpers or something with methods like
MyUnicodeString.CharAt(x) and if the char at position x is a
surrogate, then return the surrogate. Implicitly include whatever is
needed to make that work. Other helper methods could return
the Byte or CodePoint at position x - depending on what the developer
wants. Naming these methods in a logical way is key, as they become
self-documenting. No need for 10 web pages explaining how to work with
a [unicode] string.
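
A rough sketch of what such a helper could look like, using FPC's type helper syntax. The helper type and its CodePointAt method are hypothetical names chosen for illustration, not existing RTL routines, and the sketch only combines surrogate pairs; it does nothing about combining characters.

```pascal
program helperdemo;

{$mode objfpc}{$H+}
{$modeswitch typehelpers}

type
  // Hypothetical helper, not part of the RTL
  TUnicodeStringHelperSketch = type helper for UnicodeString
    function CodePointAt(Index: SizeInt): LongInt;
  end;

function TUnicodeStringHelperSketch.CodePointAt(Index: SizeInt): LongInt;
var
  Lead, Trail: LongInt;
begin
  Lead := Ord(Self[Index]);
  Result := Lead;
  // A high surrogate followed by a low surrogate encodes one code point
  if (Lead >= $D800) and (Lead <= $DBFF) and (Index < Length(Self)) then
  begin
    Trail := Ord(Self[Index + 1]);
    if (Trail >= $DC00) and (Trail <= $DFFF) then
      Result := $10000 + ((Lead - $D800) shl 10) + (Trail - $DC00);
  end;
end;

var
  S: UnicodeString;
begin
  // U+1F600 written as its two UTF-16 code units, so no source
  // codepage or widestring manager is involved
  SetLength(S, 2);
  S[1] := WideChar($D83D);
  S[2] := WideChar($DE00);
  WriteLn(HexStr(S.CodePointAt(1), 5));  // prints 1F600
end.
```

Note that this is pure arithmetic on code units, so it needs no widestring manager; only conversions to other encodings do.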

FPC (and Delphi) really need to get with the times.


Regards,
  Graeme


Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Graeme Geldenhuys via fpc-pascal

On 08/03/2021 2:49 pm, Michael Van Canneyt via fpc-pascal wrote:
> In that sense, unicode conversion support is something optional and so we 
> require you to enable it explicitly, since enabling it has some drawbacks:

Surely if you explicitly use the UnicodeString type, the compiler should
know you are using UTF-16 (the default encoding of said type), so why not
include the required units implicitly. It doesn't make sense otherwise.


Regards,
  Graeme


Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Graeme Geldenhuys via fpc-pascal

On 07/03/2021 5:48 pm, Nikolay Nikolov via fpc-pascal wrote:
> It depends on what you mean by "just working".

No, "just worked" is exactly what it says on the tin. It is FPC that is
overcomplicating matters.


As an example, here is Java that also uses UTF-16 encoding, just like
FPC's UnicodeString type.


$ cat UnicodeTest.java
class UnicodeTest {

public static void main(String[] args) {
String s = "⌘⌥⌫⇧^";
System.out.println(s);
System.out.println(s.charAt(0));
System.out.println(String.format("%x",s.codePointAt(0)));
}
}


Now let's compile and run that.

$> javac UnicodeTest.java
$> java UnicodeTest
Picked up _JAVA_OPTIONS: -Dawt.useSystemAAFontSettings=on
⌘⌥⌫⇧^
⌘
2318

Yes, it just worked.

And contrary to what Marco was trying to imply, the "Place of Interest"
(aka MacOS CMD symbol) is within the BMP, thus only takes up 2 bytes
encoded as UTF-16, and should be able to be represented in FPC's
UnicodeChar type.


Regards,
  Graeme


Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Tomas Hajny via fpc-pascal

On 2021-03-08 21:36, Martin Frb via fpc-pascal wrote:
 .
 .

> In the example the index access should have returned a single
> codeunit, which was known to be a complete codepoint.
> As far as I understand the unexpected part was, that the unicode
> string did not contain the content of the string constant, because the
> assignment had caused an encoding conversion to be inserted.
> That conversion caused the need for a widestring manager.
>
> Maybe to help the search when/where and what for notes/warnings
> should/could be produced, those implicit conversions can be broken
> down into groups.
> I can think of 2 groups already.
> 1) Conversion due to explicit declared different encoding.
>    AnAnsiString := SomeWideString;
>    AnAsciiString := AnUtf8String; // declared as "type
> AnsiString(CP_ASCII);" and "type AnsiString(CP_UTF8);"


Do you mean a compile-time warning? The trouble is that the compiler 
wouldn't know whether a real widestring manager would get included in 
the final binary when such conversions are encountered. And remember 
that the final binary may be compiled at a different time from the 
moment when the unit containing such conversions is compiled. In other 
words, compile-time warnings would be rather difficult to implement. It 
might be possible to error-out at runtime when such conversions are 
invoked, but note that technically the conversion may not lead to 
incorrect results if the string doesn't contain characters beyond 
US-ASCII. In other words, a run-time error might be appropriate only if 
the conversion encounters a character it cannot handle. However, adding 
such a check would probably slow-down processing even for cases when the 
strings don't contain any problematic characters.
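
To illustrate the cost, here is a sketch of such a check done by hand. IsUsAscii is a hypothetical helper written for this example, not an RTL function; it has to look at every code unit before the conversion can be declared safe.

```pascal
program asciicheck;

{$mode objfpc}{$H+}

// Hypothetical helper: True when every UTF-16 code unit is below 128,
// i.e. when a conversion to a single-byte string cannot lose anything.
function IsUsAscii(const S: UnicodeString): Boolean;
var
  I: SizeInt;
begin
  for I := 1 to Length(S) do
    if Ord(S[I]) > 127 then
      Exit(False);
  Result := True;
end;

var
  U: UnicodeString;
  A: AnsiString;
begin
  U := 'key';
  if IsUsAscii(U) then
  begin
    A := AnsiString(U);  // explicit conversion, lossless for pure US-ASCII
    WriteLn(A);
  end
  else
    WriteLn('conversion would need a real widestring manager');
end.
```

The scan is linear in the string length, which is the slow-down meant above when it is applied to every conversion.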




> 2) Conversion where at least one string is not explicitly declared for
> a certain codepage.
>    This should include indirection via $codepage


No, this is not the case. $codepage defines the source file encoding. 
The compiler translates the string constants declared this way to a 
UTF-16 constant stored within the compiled binary. Specifying $codepage 
has no implications on runtime conversions by itself.




> Then maybe as a first step, a note/warning could be given, if a
> constant string is assigned to a variable, and a change of encoding is
> needed for this.
> - "constant string" here would be any string that does not have a
> direct explicit declared encoding.
> - This could be given, even if the presence/absence of a widestring
> manager is not known. Because


Because what?



> Obviously knowing the presence/absence of a widestring manager allows
> to refine warnings.
> But I guess that comes at a higher price, as each unit when compiled
> could only set flags in the ppu (including forwarding flags from used
> units).
> And then compiling the final program would read which warning flags are
> present, and if any unit flagged the inclusion of a widestring
> manager.


Yes, this would be indeed the only possibility.

Tomas


Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Michael Van Canneyt via fpc-pascal



On Mon, 8 Mar 2021, Martin Frb via fpc-pascal wrote:

> Obviously knowing the presence/absence of a widestring manager allows to
> refine warnings.


It does not.

The compiler has no way to know if the widestring manager actually does a
complete or even a good job. Maybe it just does logging and calls the
previously registered widestringmanager. 
Maybe it replaces all with a single Chinese character for testing purposes, 
or replaces everything with a 0.


Michael.


Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Martin Frb via fpc-pascal

On 08/03/2021 20:49, Jonas Maebe via fpc-pascal wrote:

> On 08/03/2021 19:16, Ryan Joseph via fpc-pascal wrote:
>> I agree it would be nice to have some warning that indexing the unicodeString
>> wouldn't work as expected.
>
> Then the compiler would have to give a warning for any indexing of
> unicodestring. That would render it useless, because everyone would just
> turn it off. It's not possible to safely use unicodestring without
> knowing how 16bit unicode works. The compiler can't solve that.




Indexed access to a string, is different from implicitly inserted call 
to encoding conversions.


In the example the index access should have returned a single codeunit, 
which was known to be a complete codepoint.
As far as I understand the unexpected part was, that the unicode string 
did not contain the content of the string constant, because the 
assignment had caused an encoding conversion to be inserted.

That conversion caused the need for a widestring manager.

Maybe to help the search when/where and what for notes/warnings 
should/could be produced, those implicit conversions can be broken down 
into groups.

I can think of 2 groups already.
1) Conversion due to explicit declared different encoding.
   AnAnsiString := SomeWideString;
  AnAsciiString := AnUtf8String; // declared as "type 
AnsiString(CP_ASCII);" and "type AnsiString(CP_UTF8);"
2) Conversion where at least one string is not explicitly declared for a 
certain codepage.

   This should include indirection via $codepage


Then maybe as a first step, a note/warning could be given, if a constant 
string is assigned to a variable, and a change of encoding is needed for 
this.
- "constant string" here would be any string that does not have a direct 
explicit declared encoding.
- This could be given, even if the presence/absence of a widestring 
manager is not known. Because




Obviously knowing the presence/absence of a widestring manager allows to 
refine warnings.
But I guess that comes at a higher price, as each unit when compiled 
could only set flags in the ppu (including forwarding flags from used 
units).
And then compiling the final program would read which warning flags are 
present, and if any unit flagged the inclusion of a widestring manager.







Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Jonas Maebe via fpc-pascal
On 08/03/2021 19:16, Ryan Joseph via fpc-pascal wrote:
> I agree it would be nice to have some warning that indexing the unicodeString 
> wouldn't work as expected.

Then the compiler would have to give a warning for any indexing of
unicodestring. That would render it useless, because everyone would just
turn it off. It's not possible to safely use unicodestring without
knowing how 16bit unicode works. The compiler can't solve that.


Jonas


Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Ryan Joseph via fpc-pascal

So I was indeed able to solve the problem using {$codepage utf8} and using the 
CWString unit. Does this do anything besides change the backend of the 
UnicodeString/UnicodeChar type? I am using other string types in that unit and 
I'm curious if I've put some kind of performance burden on the other strings.

I agree it would be nice to have some warning that indexing the unicodeString 
wouldn't work as expected.


Regards,
Ryan Joseph



Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Michael Van Canneyt via fpc-pascal



On Mon, 8 Mar 2021, Tomas Hajny via fpc-pascal wrote:


> On 2021-03-08 15:49, Michael Van Canneyt via fpc-pascal wrote:
>> On Mon, 8 Mar 2021, Adriaan van Os via fpc-pascal wrote:
>>> Michael Van Canneyt via fpc-pascal wrote:
>>>> You didn't configure your environment to deal correctly with Unicode.
>>>
>>> Wow ! what a sentence !
>>>
>>> That sounds like "you didn't configure your car correctly to also take
>>> corners to the right."
>>
>> A car that does not turn is unusable.
>>
>> Programs that don't need unicode conversions exist and are perfectly
>> usable.
>>
>> In that sense, unicode conversion support is something optional and so
>> we require you to enable it explicitly, since enabling it has some
>> drawbacks:
>> - Links to C libs if you use cwstring
>> - Increases your binary substantially if you use fpwidestring and
>> include all needed characters.
>
> The trouble is - when exactly should the supposed warning be issued? At
> compile time if there are Unicodestring variables and/or constants
> involved, but the Widestring manager is not included in the final binary


Provided you can detect to begin with that a "real" widestring manager 
is included in the final binary...


Michael.


Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Michael Van Canneyt via fpc-pascal



On Mon, 8 Mar 2021, Adriaan van Os via fpc-pascal wrote:


> Michael Van Canneyt wrote:
>> The output for me is the same, regardless of the -FcUTF-8 flag being
>> present or not: question marks.
>>
>> But if I add
>>
>> uses cwstring;
>>
>> all will be well.
>>
>> Rationale:
>> Without that, the RTL cannot convert whatever the compiler wrote in
>> the binary to UTF8 to display it on the console.
>>
>> The compiler people will need to explain what exactly the compiler writes
>> with or without the flag.
>
> Well, this should at least produce a warning, if not an error. Silently
> producing the wrong code is not a good idea.


Strictly speaking, there is no wrong code produced:

You didn't configure your environment to deal correctly with Unicode.
You're using the default widestring manager, which simply skips any non-ascii
characters.

All this is documented in various places, for example:

https://www.freepascal.org/docs-html/rtl/system/unicodesupport.html

Michael.


Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Adriaan van Os via fpc-pascal

Michael Van Canneyt wrote:

> The output for me is the same, regardless of the -FcUTF-8 flag being
> present or not: question marks.
>
> But if I add
>
> uses cwstring;
>
> all will be well.
>
> Rationale:
> Without that, the RTL cannot convert whatever the compiler wrote in
> the binary to UTF8 to display it on the console.
>
> The compiler people will need to explain what exactly the compiler writes
> with or without the flag.


Well, this should at least produce a warning, if not an error. Silently producing the wrong code is 
not a good idea.


Regards,

Adriaan van Os



Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Tomas Hajny via fpc-pascal

On 2021-03-08 11:59, Adriaan van Os via fpc-pascal wrote:


Hi,


adriaan% cat uniquizz-utf8.pas

{$codepage utf8}

program uniquizz;
var
  chars: UnicodeString;
begin
  chars := '⌘ key';
  writeln(chars);
  writeln(chars[1]);
  writeln( 'size ', sizeOf( chars));
  writeln( 'length ', length( chars));
end.

adriaan% fpc uniquizz-utf8.pas -FcUTF-8
Free Pascal Compiler version 3.0.4 [2018/09/30] for x86_64
Copyright (c) 1993-2017 by Florian Klaempfl and others
Target OS: Darwin for x86_64
Compiling uniquizz-utf8.pas
Assembling (pipe) uniquizz-utf8.s
Linking uniquizz-utf8
14 lines compiled, 0.1 sec

[l24:~/gpc/testfpc] adriaan% ./uniquizz-utf8
? key
?
size 8
length 5



This leaves me with a question mark too.


From a technical point of view, a UnicodeString is a pointer, so
SizeOf(UnicodeString) always returns 8 on 64-bit platforms regardless of
the string content. Michael already answered regarding the question mark
output - you need a widestring manager to translate the character from
the internal storage (UTF-16 - see uniquizz-utf8.s if compiled with -a)
to your terminal charset.
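
To make the SizeOf/Length distinction concrete, here is a minimal sketch. The program name and the helper PayloadBytes are made up for illustration and are not part of the RTL. It uses a plain ASCII literal so that it does not depend on the source code page or on a widestring manager:

```pascal
program sizedemo;
{$mode objfpc}{$H+}

// Hypothetical helper: number of bytes occupied by the UTF-16 code units
// of s (not counting the string header or the terminating zero).
function PayloadBytes(const s: UnicodeString): SizeInt;
begin
  Result := Length(s) * SizeOf(WideChar);
end;

var
  s: UnicodeString;
begin
  s := 'abcde';
  // Size of the reference itself: equals SizeOf(Pointer), whatever s contains.
  writeln('SizeOf(s)    = ', SizeOf(s));
  // Number of UTF-16 code units in the string.
  writeln('Length(s)    = ', Length(s));
  // Bytes used by the character data: 2 bytes per code unit.
  writeln('PayloadBytes = ', PayloadBytes(s));
end.
```

SizeOf(s) reports the pointer size of the target (8 on a 64-bit platform), while Length(s) is 5 and the character data takes 10 bytes.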


Tomas


Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Michael Van Canneyt via fpc-pascal



On Mon, 8 Mar 2021, Adriaan van Os via fpc-pascal wrote:


adriaan% cat uniquizz-utf8.pas

{$codepage utf8}

program uniquizz;
var
  chars: UnicodeString;
begin
  chars := '⌘ key';
  writeln(chars);
  writeln(chars[1]);
  writeln( 'size ', sizeOf( chars));
  writeln( 'length ', length( chars));
end.

adriaan% fpc uniquizz-utf8.pas -FcUTF-8
Free Pascal Compiler version 3.0.4 [2018/09/30] for x86_64
Copyright (c) 1993-2017 by Florian Klaempfl and others
Target OS: Darwin for x86_64
Compiling uniquizz-utf8.pas
Assembling (pipe) uniquizz-utf8.s
Linking uniquizz-utf8
14 lines compiled, 0.1 sec

[l24:~/gpc/testfpc] adriaan% ./uniquizz-utf8
? key
?
size 8
length 5



This leaves me with a question mark too.


The output for me is the same, regardless of the -FcUTF-8 flag being present
or not: question marks.

But if I add

uses cwstring;

all will be well.

Rationale:
Without that, the RTL cannot convert whatever the compiler wrote in
the binary to UTF8 to display it on the console.

The compiler people will need to explain what exactly the compiler writes
with or without the flag.
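
For reference, the test program with that one change could look roughly like the sketch below. It assumes a Unix-like target (such as the Darwin system above), a source file saved as UTF-8, and a terminal that uses a UTF-8 locale; the {$ifdef unix} guard is my addition so the same file still compiles where cwstring is not needed:

```pascal
program uniquizz;
{$mode objfpc}{$H+}
{$codepage utf8}   // assumption: this source file is saved as UTF-8

{$ifdef unix}
uses
  cwstring;   // installs a widestring manager backed by the C library
{$endif}

var
  chars: UnicodeString;
begin
  chars := '⌘ key';
  writeln(chars);       // whole string, converted to the console encoding
  writeln(chars[1]);    // a single WideChar goes through the same conversion
  writeln('length ', Length(chars));  // 5 UTF-16 code units
end.
```

Note that this links the program against libc, which is the drawback mentioned elsewhere in this thread.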

Michael.


Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Adriaan van Os via fpc-pascal

adriaan% cat uniquizz-utf8.pas

{$codepage utf8}

program uniquizz;
var
  chars: UnicodeString;
begin
  chars := '⌘ key';
  writeln(chars);
  writeln(chars[1]);
  writeln( 'size ', sizeOf( chars));
  writeln( 'length ', length( chars));
end.

adriaan% fpc uniquizz-utf8.pas -FcUTF-8
Free Pascal Compiler version 3.0.4 [2018/09/30] for x86_64
Copyright (c) 1993-2017 by Florian Klaempfl and others
Target OS: Darwin for x86_64
Compiling uniquizz-utf8.pas
Assembling (pipe) uniquizz-utf8.s
Linking uniquizz-utf8
14 lines compiled, 0.1 sec

[l24:~/gpc/testfpc] adriaan% ./uniquizz-utf8
? key
?
size 8
length 5



This leaves me with a question mark too.

Regards,

Adriaan van Os


Re: [fpc-pascal] Unicode chars losing information

2021-03-07 Thread Marco van de Voort via fpc-pascal


On 2021-03-07 at 22:26, Bart via fpc-pascal wrote:

> On Sun, Mar 7, 2021 at 5:31 PM Marco van de Voort via fpc-pascal wrote:
>
>> Probably it is not in the BMP and thus needs more position than one.
>
> Length(Char) is 5 according to fpc, I see 5 "graphemes"

Indeed:

.Ld1$strlab:
    .short    1200,2
    .long    -1,5
.Ld1:
    .short    8984,8997,9003,8679,94,0

On win32 a quick test is hard since displaying unicode in the terminal 
is hard.



But a write for "widechar" is called:

    movl    U_$P$PROGRAM_$$_CHARS,%eax
    movw    (%eax),%cx
    movl    %ebx,%edx
    movl    $0,%eax
    call    fpc_write_text_widechar

so it should be ok then.
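
The same check can be done without reading the assembler output. The sketch below (program name and the helper IsSurrogate are made up for illustration) prints each UTF-16 code unit of the string and whether it is a surrogate, i.e. half of a code point outside the BMP. It assumes the source file is saved as UTF-8; writing integers and booleans does not need a widestring manager:

```pascal
program codeunits;
{$mode objfpc}{$H+}
{$codepage utf8}   // assumption: this source file is saved as UTF-8

// Hypothetical helper: True if cu is one half of a UTF-16 surrogate pair,
// i.e. part of a code point outside the BMP.
function IsSurrogate(cu: WideChar): Boolean;
begin
  Result := (Ord(cu) >= $D800) and (Ord(cu) <= $DFFF);
end;

var
  chars: UnicodeString;
  i: SizeInt;
begin
  chars := '⌘⌥⌫⇧^';
  // One line per code unit: index, numeric value, surrogate flag.
  for i := 1 to Length(chars) do
    writeln(i, ': ', Ord(chars[i]), ' surrogate=', IsSurrogate(chars[i]));
end.
```

This should list the same five values as the .short line above (8984, 8997, 9003, 8679, 94), none of them a surrogate, so every character here fits in one WideChar.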



Re: [fpc-pascal] Unicode chars losing information

2021-03-07 Thread Bart via fpc-pascal
On Sun, Mar 7, 2021 at 5:31 PM Marco van de Voort via fpc-pascal
 wrote:

> Probably it is not in the BMP and thus needs more position than one.

Length(Char) is 5 according to fpc, I see 5 "graphemes", which suggests
that all of them fit into 1 WideChar?

-- 
Bart


Re: [fpc-pascal] Unicode chars losing information

2021-03-07 Thread Nikolay Nikolov via fpc-pascal


On 3/7/21 7:21 PM, Ryan Joseph via fpc-pascal wrote:

> On Mar 7, 2021, at 10:11 AM, Marco van de Voort via fpc-pascal wrote:
>
>> Yes it is. And there are about 1114000 unicode codepoints, or about 17 times
>> what fits in a 2-byte wide char.
>>
>> https://en.wikipedia.org/wiki/Code_point
>>
>> https://en.wikipedia.org/wiki/UTF-16
>
> I thought unicode strings "just worked" but maybe that's UTF-8 and the
> character I want is maybe UTF-16. What are you supposed to do then?
> UnicodeString knows how to print the full string so all the data is there but
> I can't index to get characters unless I know their size.


It depends on what you mean by "just working". UnicodeString is a
UTF-16 encoded string and a WideChar is just a UTF-16 code unit. Both
UTF-8 and UTF-16 are variable-length encodings; UTF-16 is just simpler
to decode. Note also that, even though a single Unicode code point
might need two UTF-16 code units (i.e. WideChars), that is still not
enough to represent what users perceive as a character. There are also
plenty of Unicode combining characters. What most users perceive as a
character is actually called an Extended Grapheme Cluster and is
a sequence of Unicode code points. There's an algorithm (an
enumerator) that splits a string into grapheme clusters, and that's
implemented in FPC trunk in the GraphemeBreakProperty unit. It
implements this algorithm:


http://www.unicode.org/reports/tr29/

This was done by me for the Unicode Free Vision port in the unicodekvm 
SVN branch, but it was already committed to trunk (the rest of the 
Unicode Free Vision still isn't), because it's a new unit that is 
relatively self-contained and provides new functionality (so, won't 
break existing code) that wasn't provided by the RTL before.


Note that normally, most programs wouldn't actually need to split a 
string into grapheme clusters, unless they implement something like a UI 
toolkit or a text editor or something of that sort. That's why it was 
needed for the Unicode Free Vision.
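
For the simpler case of walking a UnicodeString one code point at a time (for example to pass each value to a function that expects a character code rather than a UTF-16 code unit), a minimal sketch could look like this. NextCodePoint is a hypothetical helper written for this mail, not an RTL routine, and it does not deal with combining characters or grapheme clusters at all:

```pascal
program codepoints;
{$mode objfpc}{$H+}

// Hypothetical helper: decode the code point starting at s[i] and advance i
// past it. An unpaired surrogate is returned unchanged as its own value.
function NextCodePoint(const s: UnicodeString; var i: SizeInt): Cardinal;
var
  lead, trail: Cardinal;
begin
  lead := Ord(s[i]);
  Inc(i);
  Result := lead;
  // A lead surrogate followed by a trail surrogate encodes one code point
  // in the range U+10000..U+10FFFF.
  if (lead >= $D800) and (lead <= $DBFF) and (i <= Length(s)) then
  begin
    trail := Ord(s[i]);
    if (trail >= $DC00) and (trail <= $DFFF) then
    begin
      Result := $10000 + ((lead - $D800) shl 10) + (trail - $DC00);
      Inc(i);
    end;
  end;
end;

var
  s: UnicodeString;
  i: SizeInt;
begin
  // Built from numeric code units so the example does not depend on the
  // source code page: U+2318 (one unit) followed by U+1F600 (two units).
  SetLength(s, 3);
  s[1] := WideChar($2318);
  s[2] := WideChar($D83D);
  s[3] := WideChar($DE00);

  i := 1;
  while i <= Length(s) do
    writeln(NextCodePoint(s, i));  // prints 8984, then 128512
end.
```

The string has Length 3 but only two code points; indexing with s[2] or s[3] alone would yield half of a surrogate pair.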


Nikolay



Re: [fpc-pascal] Unicode chars losing information

2021-03-07 Thread Ryan Joseph via fpc-pascal


> On Mar 7, 2021, at 10:21 AM, Ryan Joseph  wrote:
> 
> I thought unicode strings "just worked" but maybe that's UTF-8 and the 
> character I want is maybe UTF-16. What are you supposed to do then? 
> UnicodeString knows how to print the full string so all the data is there but 
> I can't index to get characters unless I know their size.

Since this looks like it could be complicated, here is what I was actually
trying to do with the FreeType library. This works for ASCII but broke down
with those unicode chars. I'm confused now because you say the characters are
more than 2 bytes, so I don't know what the actual size of an element is.

  for glyph in '⌘⌥⌫⇧^' do
    FT_Load_Char(m_face, ord(glyph), FT_LOAD_RENDER);

Regards,
Ryan Joseph



Re: [fpc-pascal] Unicode chars losing information

2021-03-07 Thread Ryan Joseph via fpc-pascal


> On Mar 7, 2021, at 10:11 AM, Marco van de Voort via fpc-pascal 
>  wrote:
> 
> 
> Yes it is. And there are about 1114000 unicode codepoints, or about 17 times 
> what fits in a 2-byte wide char.
> 
> https://en.wikipedia.org/wiki/Code_point
> 
> https://en.wikipedia.org/wiki/UTF-16

I thought unicode strings "just worked" but maybe that's UTF-8 and the 
character I want is maybe UTF-16. What are you supposed to do then? 
UnicodeString knows how to print the full string so all the data is there but I 
can't index to get characters unless I know their size.

Regards,
Ryan Joseph



Re: [fpc-pascal] Unicode chars losing information

2021-03-07 Thread Marco van de Voort via fpc-pascal


On 2021-03-07 at 17:38, Ryan Joseph via fpc-pascal wrote:

> On Mar 7, 2021, at 9:31 AM, Marco van de Voort via fpc-pascal wrote:
>
>> Probably it is not in the BMP and thus needs more position than one.
>
> Isn't char[1] a 2 byte wide char? Not sure I understand "more position than one"
> though.


Yes it is. And there are about 1114000 unicode codepoints, or about 17 
times what fits in a 2-byte wide char.


https://en.wikipedia.org/wiki/Code_point

https://en.wikipedia.org/wiki/UTF-16




Re: [fpc-pascal] Unicode chars losing information

2021-03-07 Thread Ryan Joseph via fpc-pascal


> On Mar 7, 2021, at 9:31 AM, Marco van de Voort via fpc-pascal 
>  wrote:
> 
> Probably it is not in the BMP and thus needs more position than one.

Isn't char[1] a 2 byte wide char? Not sure I understand "more position than one"
though.

Regards,
Ryan Joseph



Re: [fpc-pascal] Unicode chars losing information

2021-03-07 Thread Marco van de Voort via fpc-pascal


On 2021-03-07 at 17:21, Ryan Joseph via fpc-pascal wrote:

> I came across a bug which was caused by a unicode character losing information
> and narrowed it down to this. Why doesn't the chars[1] print the same character
> as appeared in the string?
>
> var
>    chars: UnicodeString;
> begin
>    chars := '⌘⌥⌫⇧^';
>    writeln(chars);
>    writeln(chars[1]);
> end.
>
> Prints:
>
> ⌘⌥⌫⇧^
> ?

Probably it is not in the BMP and thus needs more position than one.



[fpc-pascal] Unicode chars losing information

2021-03-07 Thread Ryan Joseph via fpc-pascal
I came across a bug which was caused by a unicode character losing information
and narrowed it down to this. Why doesn't the chars[1] print the same character
as appeared in the string?

var
  chars: UnicodeString;
begin
  chars := '⌘⌥⌫⇧^';
  writeln(chars);
  writeln(chars[1]);
end.

Prints:

⌘⌥⌫⇧^
?


Regards,
Ryan Joseph
