Re: [fpc-pascal] Parse unicode scalar
> On Jul 3, 2023, at 2:04 PM, Tomas Hajny via fpc-pascal wrote:
>
> No - in this case, the "header" is the highest bit of that byte being 0.

Oh, it's the header BIT. Admittedly I don't understand how this function returns the highest bit using that case, which I think he was suggesting.

function UTF8CodepointSizeFast(p: PChar): integer;
begin
  case p^ of
    #0..#191   : Result := 1;
    #192..#223 : Result := 2;
    #224..#239 : Result := 3;
    #240..#247 : Result := 4;
    // An optimization + prevents compiler warning about uninitialized Result.
    else Result := 1;
  end;
end;

Regards,
Ryan Joseph

___
fpc-pascal maillist - fpc-pascal@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Re: [fpc-pascal] Parse unicode scalar
On 3 July 2023 8:42:05 +0200, Hairy Pixels via fpc-pascal wrote:
>> On Jul 3, 2023, at 12:04 PM, Mattias Gaertner via fpc-pascal wrote:
>>
>> No, the header of a codepoint to figure out the length.
>
> So the smallest character UTF-8 can represent is 2 bytes? 1 for the
> header and 1 for the character?
>
> ASCII #100 is the same character in UTF-8 but it needs a header byte,
> so 2 bytes?

No - in this case, the "header" is the highest bit of that byte being 0.

Tomas
Re: [fpc-pascal] Parse unicode scalar
On 3 July 2023 9:12:03 +0200, Hairy Pixels via fpc-pascal wrote:
>> On Jul 3, 2023, at 2:04 PM, Tomas Hajny via fpc-pascal wrote:
>>
>> No - in this case, the "header" is the highest bit of that byte being 0.
>
> Oh, it's the header BIT. Admittedly I don't understand how this function
> returns the highest bit using that case, which I think he was suggesting.
>
> function UTF8CodepointSizeFast(p: PChar): integer;
> begin
>   case p^ of
>     #0..#191   : Result := 1;
>     #192..#223 : Result := 2;
>     #224..#239 : Result := 3;
>     #240..#247 : Result := 4;
>     else Result := 1;
>   end;
> end;

That's why I wrote "in this case". The "header" itself is not fixed size either, but the algorithm above shows how you can derive the length from the first byte.

Tomas
Re: [fpc-pascal] Parse unicode scalar
> On Jul 3, 2023, at 12:04 PM, Mattias Gaertner via fpc-pascal wrote:
>
> No, the header of a codepoint to figure out the length.

So the smallest character UTF-8 can represent is 2 bytes? 1 for the header and 1 for the character?

ASCII #100 is the same character in UTF-8 but it needs a header byte, so 2 bytes?

Regards,
Ryan Joseph
Re: [fpc-pascal] Parse unicode scalar
> On Jul 3, 2023, at 3:05 PM, Mattias Gaertner via fpc-pascal wrote:
>
> I wonder, is this thread about testing ChatGPT or do you want to
> implement something useful?
> There are already plenty of optimized UTF-8 functions in the FPC and
> Lazarus sources. Maybe too many, and you have trouble finding the right
> one? Just ask what your function needs to do.

I was just curious how ChatGPT's implementation compared to the other programmer's.

What I'm really trying to do is improve a parser so it can read UTF-8 files and decode Unicode literals in the grammar. Right now I've just read the file into an AnsiString and I index it assuming a fixed character size, which of course breaks if any multi-byte characters exist.

I also need to handle escapes: if I come across something like \u1F496, I need to convert that to a Unicode character.

Regards,
Ryan Joseph
Re: [fpc-pascal] Parse unicode scalar
On Mon, 3 Jul 2023 15:27:10 +0700, Hairy Pixels via fpc-pascal wrote:
> [...]
> I was just curious how ChatGPT's implementation compared to the other
> programmer's.

Apparently the quality is often terrible. But it can be useful.

> What I'm really trying to do is improve a parser so it can read UTF-8
> files and decode Unicode literals in the grammar.

First of all: Is it valid UTF-8 or do you have to check for broken or malicious sequences?

> Right now I've just read the file into an AnsiString and indexing
> assuming a fixed character size, which breaks of course if non-1 byte
> characters exist

Sounds like UTF8CodepointToUnicode in unit LazUTF8 could be useful:

function UTF8CodepointToUnicode(p: PChar; out CodepointLen: integer): Cardinal;

> I also need to know if I come across something like \u1F496 I need
> to convert that to a unicode character.

I guess you know how to convert a hex to a dword. Then:

function UnicodeToUTF8(CodePoint: cardinal): string; // UTF32 to UTF8
function UnicodeToUTF8(CodePoint: cardinal; Buf: PChar): integer; // UTF32 to UTF8

Mattias
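[Putting the two suggestions together, a minimal sketch of decoding a \uXXXX escape might look like this. The program and variable names are my own; the LazUTF8 signatures are the ones quoted above, and StrToInt with a '$' prefix is the standard SysUtils way to parse hex.]

```pascal
program DecodeEscape;
{$mode objfpc}{$H+}
uses
  SysUtils, LazUTF8; // LazUTF8 ships with Lazarus (LazUtils package)

var
  CodePoint: Cardinal;
  S: string;
begin
  // The escape \u1F496 carries the hex scalar value 1F496;
  // prefixing '$' makes StrToInt parse it as hexadecimal.
  CodePoint := StrToInt('$1F496');
  S := UnicodeToUTF8(CodePoint); // UTF-32 -> UTF-8, here 4 bytes
  WriteLn(Length(S));            // 4 (the bytes 240 159 146 150)
end.
```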
Re: [fpc-pascal] Parse unicode scalar
> On Jul 3, 2023, at 4:29 PM, Mattias Gaertner via fpc-pascal wrote:
>
>> What I'm really trying to do is improve a parser so it can read UTF-8
>> files and decode Unicode literals in the grammar.
>
> First of all: Is it valid UTF-8 or do you have to check for broken or
> malicious sequences?

If they give the parser broken files, that's their problem to fix. The user has control over the file, so it's their responsibility I think.

>> Right now I've just read the file into an AnsiString and indexing
>> assuming a fixed character size, which breaks of course if non-1 byte
>> characters exist
>
> Sounds like UTF8CodepointToUnicode in unit LazUTF8 could be useful:
>
> function UTF8CodepointToUnicode(p: PChar; out CodepointLen: integer): Cardinal;

Not sure how this works. You need to advance by character, so the return value should be the byte location of the next character or something like that.

>> I also need to know if I come across something like \u1F496 I need
>> to convert that to a unicode character.
>
> I guess you know how to convert a hex to a dword.

Is there anything better than StrToInt? I wouldn't be able to do it myself without that function.

> Then
>
> function UnicodeToUTF8(CodePoint: cardinal): string; // UTF32 to UTF8
> function UnicodeToUTF8(CodePoint: cardinal; Buf: PChar): integer; // UTF32 to UTF8

Ok, I think this is basically what the other programmer submitted and what ChatGPT tried to do.

Regards,
Ryan Joseph
Re: [fpc-pascal] Parse unicode scalar
On Mon, 3 Jul 2023 12:01:11 +0700, Hairy Pixels via fpc-pascal wrote:
>> On Jul 3, 2023, at 11:36 AM, Mattias Gaertner via fpc-pascal wrote:
>>
>> Useless array of.
>> And it does not return the bytecount.
>
> it's an open array so what's the problem?
> [...]
>> Wrong for byteCount=1
>
> really? How so?
>
> ChatGPT is risky because it will give wrong information with perfect
> confidence and there's no way for the ignorant person to know.

I wonder, is this thread about testing ChatGPT or do you want to implement something useful?

There are already plenty of optimized UTF-8 functions in the FPC and Lazarus sources. Maybe too many, and you have trouble finding the right one? Just ask what your function needs to do.

Mattias
Re: [fpc-pascal] Parse unicode scalar
El 03/07/2023 a las 10:27, Hairy Pixels via fpc-pascal escribió:
> Right now I've just read the file into an AnsiString and indexing
> assuming a fixed character size, which breaks of course if non-1 byte
> characters exist
>
> I also need to know if I come across something like \u1F496 I need to
> convert that to a unicode character.

Hello,

You are mixing up a lot of concepts: ASCII, Unicode, grapheme, representation, content, etc.

When talking about Unicode you must forget ASCII. The text is a sequence of bytes encoded in a specific format (UTF-8, UTF-16, UTF-32, ...), and it must be represented on screen using Unicode representation rules, which are not the same as ASCII's.

To keep this message short, think of a text with only one "letter": "á". This text (a text, not one letter - Unicode is about texts) can be transmitted or stored using Unicode encoding rules, each a sequence of bytes with its own rules to encode the information. Each byte in hexadecimal:

UTF8:    C3 A1
UTF16BE: 00 E1
UTF32BE: 00 00 00 E1

You must know the encoding format in advance to get the text back from the byte sequence. There is also a BOM (Byte Order Mark) which is sometimes used in files as a header to indicate the encoding, but in general it is not used.

Now, decoding that byte sequence with the right format, you get a text which represents the letter "a" with an acute accent. But Unicode is *not* so *simple*: the same text could be represented on screen using the letter "a" + "combining acute accent", whose byte sequence is totally different - different at the encoding level but identical at the rendering level. So these two UTF-8 sequences, "C3 A1" and "61 CC 81", are different at the grapheme and encoding levels but identical at the representation level.

Just as a final note, this is the UTF-8 byte sequence for one single "character" on screen:

F0 9F 8F B4 F3 A0 81 A7 F3 A0 81 A2 F3 A0 81 B3 F3 A0 81 A3 F3 A0 81 B4 F3 A0 81 BF

Unicode is far, far from easy.

Have a nice day.
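[The two "á" encodings described above can be reproduced directly. A minimal sketch; on a UTF-8 terminal both lines should render identically, yet the strings compare unequal.]

```pascal
program CombiningDemo;
{$mode objfpc}{$H+}
var
  Precomposed, Decomposed: string;
begin
  Precomposed := #$C3#$A1;    // C3 A1    = U+00E1 LATIN SMALL LETTER A WITH ACUTE
  Decomposed  := 'a'#$CC#$81; // 61 CC 81 = U+0061 + U+0301 COMBINING ACUTE ACCENT
  WriteLn(Precomposed);
  WriteLn(Decomposed);
  WriteLn(Precomposed = Decomposed); // FALSE: the byte sequences differ
end.
```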
[fpc-pascal] Lazarus Release Candidate 1 of 3.0
The Lazarus team is glad to announce the first release candidate of Lazarus 3.0. This release was built with FPC 3.2.2.

Here is the list of changes for Lazarus and Free Pascal:
http://wiki.lazarus.freepascal.org/Lazarus_3.0_release_notes
http://wiki.lazarus.freepascal.org/User_Changes_3.2.2

Here is the list of fixes for Lazarus 3.x:
https://gitlab.com/freepascal.org/lazarus/lazarus/-/commits/fixes_3_0/

The release is available for download on SourceForge:
http://sourceforge.net/projects/lazarus/files/

Choose your CPU, OS, distro and then the "Lazarus 3.0RC1" directory.

Checksums for the SourceForge files:
https://www.lazarus-ide.org/index.php?page=checksums#3_0RC1

Minimum requirements:
Windows: 2k, 32 or 64bit.
FreeBSD/Linux: gtk 2.24 for gtk2, qt4.5 for qt, qt5.6 for qt5, 32 or 64bit.
Mac OS X: Cocoa (64bit) 10.12, Carbon (32bit) 10.5 to 10.14, qt and qt5 (32 or 64bit).

The gitlab page:
https://gitlab.com/freepascal.org/lazarus/lazarus/-/tree/lazarus_3_0_RC1

For people who are blocked by SF, the Lazarus releases from SourceForge are mirrored at:
ftp://ftp.freepascal.org/pub/lazarus/releases/

== Why should everybody (including you) test the release candidate? ==

In the past weeks the Lazarus team has stabilized the 3.0 fixes branch. The resulting 3.0RC1 is now stable enough to be used by anyone for test purposes.

However, many of the fixes and new features that were committed since the release of 2.2.6 required changes to the code of existing features too. While we have tested those ourselves, there may still be problems that only occur with very specific configurations or one project in a million. Yes, it may be that you are the only person with a project that will not work in the new IDE. So if you do not test, we cannot fix it.

Please do not wait for the final release to test. It may be too late. Once the release is out we will have to be more selective about which fixes can be merged for further 3.x releases.
So it may be that we cannot merge the fix you require. And then you will miss out on all the new features.

== How to test ==

Download and install the 3.0 RC1.

- On Windows you can install it as a secondary install that will not affect your current install:
  http://wiki.lazarus.freepascal.org/Multiple_Lazarus#Installation_of_multiple_Lazarus
- On other platforms, if you install to a new location you need to use --primary-config-path

In either case you should make backups (including your primary config).

Open your project in the current Lazarus, and use "Publish Project" from the project menu. This creates a clean copy of your project. You can then open that copy in the RC1.

Please test:
- If you can edit forms in the designer
- Rename components / change properties in the Object Inspector / add new events
- Add components to a form / move components on a form
- Frames, if you use them
- If you can navigate the source code (e.g. jump to implementation)
- Auto completion in source code
- Compile, debug and run
- Anything else you use in your daily work

Mattias
Re: [fpc-pascal] Parse unicode scalar
Hi Ryan,

I've created the attached unit, which takes a code point and returns the UTF-8 char as a string. It's based on the Wikipedia article on UTF-8: UTF-8 encodes code points in one to four bytes, depending on the value of the code point, and the placeholder x bits in each byte pattern are replaced by the bits of the code point (the table is copied from Wikipedia).

uencoding.pas
Description: Binary data

Hope it's useful for you. If you improve the code pls let me know.

Best regards,
Jeroen

On 2 Jul 2023, at 15:30, Hairy Pixels via fpc-pascal wrote:
> I'm interested in parsing unicode scalars (I think they're called) to
> byte sized values but I'm not sure where to start. First thing I did
> was choose the unicode scalar U+1F496 ().
>
> Next I cheated and asked ChatGPT. :) Amazingly from my question it was
> able to tell me the scalar is comprised of these 4 bytes:
> 240 159 146 150
>
> I was able to correctly concatenate these characters and writeln
> printed the correct character.
>
> var
>   s: String;
> begin
>   s := char(240)+char(159)+char(146)+char(150);
>   writeln(s);
> end.
>
> The question is, how was 1F496 decomposed into 4 bytes?
>
> Regards,
> Ryan Joseph
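[To make the decomposition concrete: the attached unit isn't reproduced here, but a minimal sketch of the encoding rules from that Wikipedia table might look like this. The function name is my own, not necessarily the one in uencoding.pas, and it assumes a valid scalar value below U+110000.]

```pascal
// Distribute the code point's bits over 1..4 bytes, per the UTF-8 table:
//   up to U+007F  : 0xxxxxxx
//   up to U+07FF  : 110xxxxx 10xxxxxx
//   up to U+FFFF  : 1110xxxx 10xxxxxx 10xxxxxx
//   up to U+10FFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
function EncodeUTF8(CodePoint: Cardinal): string;
begin
  if CodePoint <= $7F then
    Result := Chr(CodePoint)
  else if CodePoint <= $7FF then
    Result := Chr($C0 or (CodePoint shr 6)) +
              Chr($80 or (CodePoint and $3F))
  else if CodePoint <= $FFFF then
    Result := Chr($E0 or (CodePoint shr 12)) +
              Chr($80 or ((CodePoint shr 6) and $3F)) +
              Chr($80 or (CodePoint and $3F))
  else
    Result := Chr($F0 or (CodePoint shr 18)) +
              Chr($80 or ((CodePoint shr 12) and $3F)) +
              Chr($80 or ((CodePoint shr 6) and $3F)) +
              Chr($80 or (CodePoint and $3F));
end;
```

Working through EncodeUTF8($1F496) by hand reproduces exactly the bytes 240 159 146 150 ($F0 $9F $92 $96) from the original post.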
Re: [fpc-pascal] Parse unicode scalar
On Mon, 3 Jul 2023 14:12:03 +0700, Hairy Pixels via fpc-pascal wrote:
>> On Jul 3, 2023, at 2:04 PM, Tomas Hajny via fpc-pascal wrote:
>>
>> No - in this case, the "header" is the highest bit of that byte
>> being 0.
>
> Oh it's the header BIT. Admittedly I don't understand how this
> function returns the highest bit using that case, which I think he
> was suggesting.

The first byte of a UTF-8 codepoint is in 0..127 or 192..247. The second, third and fourth bytes are in 128..191, so you can easily detect where a codepoint starts. And from the first byte you can derive the length of the codepoint. If you just want to skip over n codepoints, then the below function does the job:

> function UTF8CodepointSizeFast(p: PChar): integer;
> begin
>   case p^ of
>     #0..#191   : Result := 1;
>     #192..#223 : Result := 2;
>     #224..#239 : Result := 3;
>     #240..#247 : Result := 4;
>     // An optimization + prevents compiler warning about uninitialized Result.
>     else Result := 1;
>   end;
> end;

Mattias
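[Skipping over n codepoints with that size function could be sketched like this. UTF8SkipCodepoints is my own name, and the sketch assumes valid, NUL-terminated UTF-8.]

```pascal
// Advance a PChar by n codepoints using the first-byte ranges above.
// Assumes the buffer holds valid, NUL-terminated UTF-8.
function UTF8SkipCodepoints(p: PChar; n: Integer): PChar;
begin
  Result := p;
  while (n > 0) and (Result^ <> #0) do
  begin
    Inc(Result, UTF8CodepointSizeFast(Result)); // 1..4 bytes per codepoint
    Dec(n);
  end;
end;
```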
Re: [fpc-pascal] Parse unicode scalar
On Mon, 3 Jul 2023 17:18:56 +0700, Hairy Pixels via fpc-pascal wrote:
> [...]
>> First of all: Is it valid UTF-8 or do you have to check for broken
>> or malicious sequences?
>
> If they give the parser broken files that's their problem they need
> to fix? the user has control over the file so it's their
> responsibility I think.

The user's responsibility? - I recommend checking for malicious codes. ;)

>>> Right now I've just read the file into an AnsiString and indexing
>>> assuming a fixed character size, which breaks of course if non-1
>>> byte characters exist
>>
>> Sounds like UTF8CodepointToUnicode in unit LazUTF8 could be useful:
>>
>> function UTF8CodepointToUnicode(p: PChar; out CodepointLen: integer): Cardinal;
>
> Not sure how this works. You need to advance by character so the
> return value should be the byte location of the next character or
> something like that.

function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
// returns the number of codepoints
var
  CodePointLen: longint;
  CodePoint: longword;
begin
  Result:=0;
  while (ByteCount>0) do begin
    inc(Result);
    CodePoint:=UTF8CodepointToUnicode(p,CodePointLen);
    ...do something with the CodePoint...
    inc(p,CodePointLen);
    dec(ByteCount,CodePointLen);
  end;
end;

>>> I also need to know if I come across something like \u1F496 I need
>>> to convert that to a unicode character.
>>
>> I guess you know how to convert a hex to a dword.
>
> Is there anything better than StrToInt?

Good start.

> I wouldn't be able to do it myself though without that function.

Hex to dword. That's easy enough for ChatGPT.

>> function UnicodeToUTF8(CodePoint: cardinal): string; // UTF32 to UTF8
>> function UnicodeToUTF8(CodePoint: cardinal; Buf: PChar): integer; // UTF32 to UTF8
>
> Ok I think this is basically what the other programmer submitted and
> what ChatGPT tried to do.

Yes, no need to reinvent the wheel.
Mattias
Re: [fpc-pascal] Parse unicode scalar
> On Jul 4, 2023, at 1:15 AM, Mattias Gaertner via fpc-pascal wrote:
>
> function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
> // returns the number of codepoints
> var
>   CodePointLen: longint;
>   CodePoint: longword;
> begin
>   Result:=0;
>   while (ByteCount>0) do begin
>     inc(Result);
>     CodePoint:=UTF8CodepointToUnicode(p,CodePointLen);
>     ...do something with the CodePoint...
>     inc(p,CodePointLen);
>     dec(ByteCount,CodePointLen);
>   end;
> end;

Thanks, this looks right. I guess this is how we need to iterate over Unicode now.

Btw, why isn't there a for-loop we can use over Unicode strings? It seems like that should be supported out of the box. I had this same problem in Swift, where it's extremely confusing to merely iterate over a string and look at each character. Replacing characters will be tricky also, so we need some good library functions.

Swift is especially terrible because there's NO ANSI string, so even a 1-byte sequence needs all these confusing-as-hell functions to do any work with strings at all. A terrible experience, and slow.

Regards,
Ryan Joseph
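[A codepoint for..in isn't built in, but FPC's for..in works with any type that exposes GetEnumerator/MoveNext/Current, so one can be sketched. All type and method names here are hypothetical, and UTF8CodepointSizeFast is the helper quoted earlier in the thread; this is a sketch, not an RTL facility.]

```pascal
{$mode objfpc}{$H+}
{$modeswitch advancedrecords}
type
  // Yields each codepoint of a UTF-8 string as a 1..4-byte substring.
  TUTF8Enumerator = record
  private
    FStr: string;
    FPos: Integer; // 1-based byte index of the next codepoint
    FCurrent: string;
  public
    function MoveNext: Boolean;
    property Current: string read FCurrent;
  end;

  TUTF8Codepoints = record
    Str: string;
    function GetEnumerator: TUTF8Enumerator;
  end;

function TUTF8Codepoints.GetEnumerator: TUTF8Enumerator;
begin
  Result.FStr := Str;
  Result.FPos := 1;
  Result.FCurrent := '';
end;

function TUTF8Enumerator.MoveNext: Boolean;
var
  Len: Integer;
begin
  Result := FPos <= Length(FStr);
  if not Result then Exit;
  Len := UTF8CodepointSizeFast(@FStr[FPos]); // 1..4 bytes
  FCurrent := Copy(FStr, FPos, Len);
  Inc(FPos, Len);
end;

// Usage sketch:
//   var cp: string; u: TUTF8Codepoints;
//   u.Str := SomeUTF8Text;
//   for cp in u do
//     WriteLn(cp);
```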
[fpc-pascal] ShortString still relevant today?
I've been exploring the string types and I'm curious now: does the classic Pascal "ShortString" even make sense anymore on modern computers?

I'm running tests and I can't seem to find a way in which AnsiString overall performs worse than ShortString. Are there any examples where AnsiString is worse? I think if you passed strings around a lot, that would trigger the ref counting and InterlockedExchange (I saw this in my own code before and it unnerved me), but that's been hard to test.

Regards,
Ryan Joseph
Re: [fpc-pascal] Parse unicode scalar
On 7/4/23 04:03, Hairy Pixels via fpc-pascal wrote:
>> On Jul 4, 2023, at 1:15 AM, Mattias Gaertner via fpc-pascal wrote:
>>
>> function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
>> // returns the number of codepoints
>> [...]
>
> Thanks, this looks right. I guess this is how we need to iterate over
> unicode now. Btw, why isn't there a for-loop we can use over unicode
> strings? seems like that should be supported out of the box. I had this
> same problem in Swift also where it's extremely confusing to merely
> iterate over a string and look at each character. Replacing characters
> will be tricky also so we need some good library functions.

You're still confusing the Unicode terms. The above code iterates over Unicode code points, not "characters" in a UTF-8 encoded string. A Unicode code point is not a "character":

https://unicode.org/glossary/#character
https://unicode.org/glossary/#code_point

There are also graphemes, grapheme clusters and extended grapheme clusters - these terms can also be perceived as "characters":

https://unicode.org/glossary/#grapheme
https://unicode.org/glossary/#grapheme_cluster
https://unicode.org/glossary/#extended_grapheme_cluster

If you want to iterate over extended grapheme clusters, for example, there's an iterator (written by me) in the unit graphemebreakproperty.pp in the rtl-unicode package.

If you use the 'char' type in Pascal to iterate over a UTF-8 encoded string, you're iterating over Unicode code units (units! not code points! https://unicode.org/glossary/#code_unit). If you use the 'widechar' type in Pascal to iterate over a UnicodeString (which is a UTF-16 encoded string), you're also iterating over Unicode code units, this time in UTF-16 encoding.

If you want to iterate over Unicode code points (not units! not characters! not graphemes!) in a UTF-8 string, you need something like the ReadUTF8 function above. If you want to iterate over Unicode code points in a UTF-16 string, you need different code.

You need to understand all these terms and know exactly what you need to do. E.g. are you dealing with keyboard input, are you dealing with the low-level parts of text display, are you searching for something in the text, are you just passing strings around and letting the GUI deal with it? These are all different use cases, and they require careful understanding of what Unicode thing you need to iterate over.

Nikolay
Re: [fpc-pascal] ShortString still relevant today?
On 7/4/23 04:19, Hairy Pixels via fpc-pascal wrote:
> I've been exploring the string types and I'm curious now, does the
> classic Pascal "ShortString" even make sense anymore on modern
> computers? I'm running tests and I can't seem to find a way in which
> AnsiString overall performs worse than ShortString. Are there any
> examples where AnsiString is worse?

ShortString is mainly for compatibility with Turbo Pascal, not for performance, IMHO. Although the FPC compiler itself still uses ShortString for performance reasons (I think the main advantage is the avoidance of the implicit try..finally blocks needed for ansistrings). It might be interesting to benchmark the compiler with AnsiStrings instead of ShortStrings and see if there's a performance difference. But even if there is, a compiler is an extreme example. For 99% of programs, the performance impact of AnsiString is not an issue. I put {$H+} in almost all my new programs.

I'd say that in 99% of the legit use cases, ShortString is used and needed for compatibility with legacy code, not for performance. Switching legacy code to {$H+} doesn't always work and may need additional fixes. Old code does things like S[0] := x instead of SetLength(S, x), etc. It also does uglier things, like FillChar() or Move() directly to/from string memory, or saves ShortStrings to files as part of a record, etc.

Nikolay
Re: [fpc-pascal] Parse unicode scalar
On 7/4/23 07:40, Hairy Pixels via fpc-pascal wrote:
>> On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal wrote:
>>
>> For what grammar? What characters are allowed in a token? [...]
>> So, the tokenizer just works with UTF-8 like with any other 8-bit
>> code page.
>
> yes this works until you reach a non-ASCII ranged character and then
> the character index no longer matches the string 1 to 1. For example
> consider this was pascal:
>
> i := '';
>
> You can advance by index like:
>
> Inc(currentIndex);
> c := text[currentIndex];
>
> but once you hit the bear the offset is now wrong so you can't advance
> to the next character by doing +1.

But you just don't need to do this in order to tokenize Pascal. The beginning and the end of the string literal is the apostrophe, which is ASCII. The bear is a sequence of UTF-8 code units (opaque to the compiler) that will not be mistaken for an apostrophe or end of line, because they will have their high bit set. There's simply no need for a Pascal tokenizer to iterate over UTF-8 code points instead of code units.

Nikolay
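[That byte-level scan can be sketched as follows. The function name is my own, and the sketch deliberately ignores Pascal's doubled-apostrophe escape ('') to stay short; multi-byte UTF-8 sequences inside the literal are skipped over safely because every one of their bytes has the high bit set.]

```pascal
// Find the closing apostrophe of a string literal in a UTF-8 buffer,
// scanning byte by byte. Src[Start] is assumed to be the opening quote.
function ScanStringLiteral(const Src: AnsiString; Start: Integer): Integer;
var
  i: Integer;
begin
  i := Start + 1;
  while (i <= Length(Src)) and (Src[i] <> '''') do
    Inc(i); // bytes >= #128 can never equal the ASCII apostrophe
  Result := i; // index of the closing apostrophe (or past the end)
end;
```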
Re: [fpc-pascal] Parse unicode scalar
> On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal wrote:
>
> For what grammar? What characters are allowed in a token? For example,
> Free Pascal also has a parser/tokenizer, but since Pascal keywords are
> ASCII only, it doesn't need to understand Unicode characters, so it
> works on the byte (Pascal's char type) level (for UTF-8 files, this
> means UTF-8 Unicode code units). That's because UTF-8 has two nice
> properties:
>
> 1) ASCII characters are encoded as they are - by using bytes in the
> range #0..#127
>
> 2) non-ASCII characters will always use a sequence of bytes that are
> all in the range #128..#255 (they have their highest bit set), so they
> will never be misinterpreted as ASCII.
>
> So, the tokenizer just works with UTF-8 like with any other 8-bit code
> page.

Yes, this works until you reach a non-ASCII-ranged character, and then the character index no longer matches the string 1 to 1. For example, consider if this were Pascal:

i := '';

You can advance by index like:

Inc(currentIndex);
c := text[currentIndex];

but once you hit the bear, the offset is now wrong, so you can't advance to the next character by doing +1.

Regards,
Ryan Joseph
Re: [fpc-pascal] Parse unicode scalar
On 7/4/23 07:56, Hairy Pixels via fpc-pascal wrote:
>> On Jul 4, 2023, at 11:50 AM, Hairy Pixels wrote:
>>
>> You know you're right, with properly enclosed patterns you can
>> capture everything inside and it works. [...]
>
> Sorry I'm still curious even though it's not my current problem :) How
> can I make this program output the expected results:
> [...]
> The user doesn't know about unicode they just want to get an array of
> characters and not worry about all these little details. What can FPC
> do to solve this problem?

Depends on what you need, but I suppose in this case you want to count the number of extended grapheme clusters (a.k.a. "user perceived characters" - how many character-like things are displayed on the screen). You might be tempted to count the number of Unicode code points, but that's not the same, due to the existence of combining characters:

https://en.wikipedia.org/wiki/Combining_character

For extended grapheme clusters, there's an iterator in the graphemebreakproperty unit. I implemented this for the Unicode KVM and FreeVision, where it's needed for figuring out how many character blocks in the console are needed to display a certain string.

For the console or other GUIs that use fixed-width fonts, there's also the East Asian Width property: some characters (East Asian - Chinese, Japanese, Korean) take double the space. So, to figure out where to move the cursor, you need to take East Asian Width into account as well.

Nikolay
Re: [fpc-pascal] Parse unicode scalar
> On Jul 4, 2023, at 9:58 AM, Nikolay Nikolov via fpc-pascal wrote:
>
> You need to understand all these terms and know exactly what you need
> to do. E.g. are you dealing with keyboard input, are you dealing with
> the low level parts of text display, are you searching for something
> in the text, are you just passing strings around and letting the GUI
> deal with it? These are all different use cases, and they require
> careful understanding what Unicode thing you need to iterate over.

Thanks for trying to help, but this is more complicated than I thought and I don't have the patience for a deep dive right now :)

Unicode is complicated under the hood, but we should have some libraries to help, right? I mean, the user thinks of these things as "characters", be it "A" or a Unicode symbol, so we should be able to operate on that basis as well. Something like an iterator that returns the character (wide char) and byte offset would be a nice place to start.

I have a parser/tokenizer I want to update, so I'm trying to find tokens by advancing one character at a time. That's why I have a requirement to know which character is next in the file, and probably the byte offset also, so it can be referenced later.

Regards,
Ryan Joseph
Re: [fpc-pascal] Parse unicode scalar
On 7/4/23 07:17, Hairy Pixels via fpc-pascal wrote:
> [...]
> I have a parser/tokenizer I want to update so I'm trying to find
> tokens by advancing one character at a time. That's why I have a
> requirement to know which character is next in the file and probably
> the byte offset also so it can be referenced later.

For what grammar? What characters are allowed in a token? For example, Free Pascal also has a parser/tokenizer, but since Pascal keywords are ASCII only, it doesn't need to understand Unicode characters, so it works on the byte (Pascal's char type) level (for UTF-8 files, this means UTF-8 Unicode code units). That's because UTF-8 has two nice properties:

1) ASCII characters are encoded as they are - by using bytes in the range #0..#127

2) non-ASCII characters will always use a sequence of bytes that are all in the range #128..#255 (they have their highest bit set), so they will never be misinterpreted as ASCII.

So, the tokenizer just works with UTF-8 like with any other 8-bit code page.
Nikolay
Re: [fpc-pascal] Parse unicode scalar
> On Jul 4, 2023, at 11:50 AM, Hairy Pixels wrote:
>
> You know you're right: with properly enclosed patterns you can capture
> everything inside and it works. You won't know whether you had Unicode
> in your string or not, though, but that depends on what's being parsed
> and whether you care (I'm doing a TOML parser).

Sorry, I'm still curious even though it's not my current problem :) How can I make this program output the expected results?

var
  w: WideChar;
  a: array of WideChar;
begin
  for w in 'abc🐻' do
    a += [w];
  // Outputs 7 instead of 4
  writeln(Length(a));
end.

The user doesn't know about Unicode; they just want to get an array of characters and not worry about all these little details. What can FPC do to solve this problem?

Regards,
Ryan Joseph
Re: [fpc-pascal] ShortString still relevant today?
> On Jul 4, 2023, at 10:11 AM, Nikolay Nikolov via fpc-pascal wrote:
>
> ShortString is mainly for compatibility with Turbo Pascal, not for
> performance, IMHO. Although the FPC compiler itself still uses
> ShortString for performance reasons (I think the main advantage is the
> avoidance of the implicit try..finally blocks needed for AnsiStrings).
> It might be interesting to benchmark the compiler with AnsiStrings
> instead of ShortStrings and see if there's a performance difference.
> But even if there is, a compiler is an extreme example. For 99% of
> programs, the performance impact of AnsiString is not an issue. I put
> {$H+} in almost all my new programs. I'd say that in 99% of the
> legitimate use cases, ShortString is used and needed for compatibility
> with legacy code, not for performance. Switching legacy code to {$H+}
> doesn't always work and may need additional fixes. Old code does things
> like S[0] := x instead of SetLength(S, x), etc. It also does uglier
> things, like FillChar() or Move() directly to/from string memory, or
> saves ShortStrings to files as part of a record, etc.

One thing I can think of now is that adding an AnsiString to a record or class makes that type "managed", so it needs extra finalization when going out of scope. Static arrays need to finalize their members too, and the RTL has to do extra work in the list classes to ensure this happens, which bloats the generic container types.

Regards,
Ryan Joseph
Re: [fpc-pascal] Parse unicode scalar
> On Jul 4, 2023, at 11:45 AM, Nikolay Nikolov via fpc-pascal wrote:
>
> But you just don't need to do this in order to tokenize Pascal. The
> beginning and the end of the string literal is the apostrophe, which is
> ASCII. The bear is a sequence of UTF-8 code units (opaque to the
> compiler) that will not be mistaken for an apostrophe or end of line,
> because they will have their high bit set. There's simply no need for a
> Pascal tokenizer to iterate over UTF-8 code points, instead of code
> units.

You know you're right: with properly enclosed patterns you can capture everything inside and it works. You won't know whether you had Unicode in your string or not, though, but that depends on what's being parsed and whether you care (I'm doing a TOML parser). Maybe I can skip that part and just focus on decoding the Unicode scalars.

Regards,
Ryan Joseph
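[Editor's note: for the decoding step itself, here is a minimal, hedged sketch (not an existing RTL or LazUtils routine; the name is made up) that turns one UTF-8 sequence into a Unicode scalar value, using the same first-byte ranges as UTF8CodepointSizeFast from earlier in the thread. It assumes well-formed input.]

```pascal
{ Decodes the code point starting at p; returns its scalar value and sets
  Len to the number of bytes consumed. Assumes well-formed UTF-8: a
  production decoder must also validate the continuation bytes and reject
  overlong forms and surrogate ranges (U+D800..U+DFFF). }
function DecodeUTF8(p: PChar; out Len: Integer): Cardinal;
begin
  case p^ of
    #0..#127:                     { 0xxxxxxx }
      begin Len := 1; Result := Ord(p[0]); end;
    #192..#223:                   { 110xxxxx 10xxxxxx }
      begin Len := 2; Result := (Ord(p[0]) and $1F) shl 6
                              or (Ord(p[1]) and $3F); end;
    #224..#239:                   { 1110xxxx 10xxxxxx 10xxxxxx }
      begin Len := 3; Result := (Ord(p[0]) and $0F) shl 12
                              or (Ord(p[1]) and $3F) shl 6
                              or (Ord(p[2]) and $3F); end;
    #240..#247:                   { 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx }
      begin Len := 4; Result := (Ord(p[0]) and $07) shl 18
                              or (Ord(p[1]) and $3F) shl 12
                              or (Ord(p[2]) and $3F) shl 6
                              or (Ord(p[3]) and $3F); end;
    else                          { stray continuation byte }
      begin Len := 1; Result := $FFFD; end;
  end;
end;
```

For example, the four bytes $F0 $9F $90 $BB decode to $1F43B, the bear from earlier in the thread.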
Re: [fpc-pascal] Parse unicode scalar
On 7/4/23 08:08, Nikolay Nikolov wrote:
> On 7/4/23 07:56, Hairy Pixels via fpc-pascal wrote:
>>> On Jul 4, 2023, at 11:50 AM, Hairy Pixels wrote:
>>>
>>> You know you're right: with properly enclosed patterns you can capture
>>> everything inside and it works. You won't know whether you had Unicode
>>> in your string or not, though, but that depends on what's being parsed
>>> and whether you care (I'm doing a TOML parser).
>>
>> Sorry, I'm still curious even though it's not my current problem :) How
>> can I make this program output the expected results?
>>
>> var
>>   w: WideChar;
>>   a: array of WideChar;
>> begin
>>   for w in 'abc🐻' do
>>     a += [w];
>>   // Outputs 7 instead of 4
>>   writeln(Length(a));
>> end.
>>
>> The user doesn't know about Unicode; they just want to get an array of
>> characters and not worry about all these little details. What can FPC
>> do to solve this problem?
>
> Depends on what you need, but I suppose in this case you want to count
> the number of extended grapheme clusters (a.k.a. "user-perceived
> characters": how many character-like things are displayed on the
> screen). You might be tempted to count the number of Unicode code
> points, but that's not the same, due to the existence of combining
> characters:
>
> https://en.wikipedia.org/wiki/Combining_character
>
> For extended grapheme clusters, there's an iterator in the
> graphemebreakproperty unit. I implemented this for the Unicode KVM and
> FreeVision. There it's needed for figuring out how many character cells
> in the console will be needed to display a certain string. For the
> console, or other GUIs that use fixed-width fonts, there's also the
> East Asian Width property: some characters (East Asian: Chinese,
> Japanese, Korean) take double the space. So, to figure out where to
> move the cursor, you need to take East Asian Width into account as well.
For console apps that use the Unicode KVM video unit, I've introduced two functions for determining the display width of a Unicode string:

function ExtendedGraphemeClusterDisplayWidth(const EGC: UnicodeString): Integer;
{ Returns the number of display columns needed for the given extended grapheme cluster }

function StringDisplayWidth(const S: UnicodeString): Integer;
{ Returns the number of display columns needed for the given string }

Remember, the display width is different from the number of graphemes, due to East Asian double-width characters. And these work with UnicodeString, which is UTF-16, not UTF-8, but Free Pascal can convert between the two.

Nikolay
Re: [fpc-pascal] ShortString still relevant today?
Here is my test unit I'm playing with. It's crude, but can anyone suggest what other things I could test? I'm playing with a string pointer as well, to see how ref counting/finalization plays in. Making your own managed type using management operators is not tested, but I'm sure it will be terrible compared to everything else.

* test_short_string time: 143ms
* test_ansi_string time: 115ms
* test_mem_string time: 115ms
* test_short_string_record time: 165ms
* test_ansi_string_record time: 75ms
* test_mem_string_record time: 47ms
* test_short_string_mutate time: 203ms
* test_ansi_string_mutate time: 181ms

===

{$mode objfpc}{$H+}
program string_test;

uses
  SysUtils, DateUtils;

const
  ITERATIONS = 1000 * 1000;
  TEST_STRING = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit';

type
  TTestProc = procedure;

procedure test_mem_string;

  procedure do_pass(const s: PString; len: Integer);
  var
    c: Char;
    i: Integer;
  begin
    { c is assigned only to force the reads }
    for i := 1 to len do
      c := s^[i];
  end;

var
  s: PString;
  i, len: Integer;
begin
  for i := 0 to ITERATIONS - 1 do
  begin
    len := Length(TEST_STRING);
    { allocate room for the string descriptor, not the characters, and
      zero it so the assignment below doesn't try to release garbage }
    s := GetMem(SizeOf(AnsiString));
    FillChar(s^, SizeOf(AnsiString), 0);
    s^ := TEST_STRING;
    do_pass(s, len);
    s^ := ''; { release the string data before freeing its holder }
    FreeMem(s);
  end;
end;

procedure test_short_string;

  procedure do_pass(const s: ShortString);
  var
    c: Char;
    i: Integer;
  begin
    for i := 1 to Length(s) do
      c := s[i];
  end;

var
  s: ShortString;
  i: Integer;
begin
  for i := 0 to ITERATIONS - 1 do
  begin
    s := TEST_STRING;
    do_pass(s);
  end;
end;

procedure test_ansi_string;

  procedure ansi_string_pass(const s: AnsiString);
  var
    c: Char;
    i: Integer;
  begin
    for i := 1 to Length(s) do
      c := s[i];
  end;

var
  s: AnsiString;
  i: Integer;
begin
  for i := 0 to ITERATIONS - 1 do
  begin
    s := TEST_STRING;
    ansi_string_pass(s);
  end;
end;

procedure test_ansi_string_mutate;
var
  i, j: Integer;
  s1, s2: AnsiString;
begin
  for i := 0 to ITERATIONS - 1 do
  begin
    s1 := TEST_STRING;
    s2 := s1 + IntToStr(i);
    for j := 1 to Length(s2) do
      s2[j] := 'x';
  end;
end;

procedure test_short_string_mutate;
var
  i, j: Integer;
  s1, s2: ShortString;
begin
  for i := 0 to ITERATIONS - 1 do
  begin
    s1 := TEST_STRING;
    s2 := s1 + IntToStr(i);
    for j := 1 to Length(s2) do
      s2[j] := 'x';
  end;
end;

procedure test_short_string_record;
type
  TMyRecord = record
    a: ShortString;
    b: ShortString;
    c: ShortString;
  end;

  function do_pass(rec: TMyRecord): TMyRecord;
  begin
    Result := rec;
  end;

var
  i: Integer;
  r: TMyRecord;
begin
  for i := 0 to ITERATIONS - 1 do
  begin
    r.a := TEST_STRING;
    r.b := TEST_STRING;
    r.c := TEST_STRING;
    do_pass(r);
  end;
end;

procedure test_ansi_string_record;
type
  TMyRecord = record
    a: AnsiString;
    b: AnsiString;
    c: AnsiString;
  end;

  function do_pass(rec: TMyRecord): TMyRecord;
  begin
    Result := rec;
  end;

var
  i: Integer;
  r: TMyRecord;
begin
  for i := 0 to ITERATIONS - 1 do
  begin
    r.a := TEST_STRING;
    r.b := TEST_STRING;
    r.c := TEST_STRING;
    do_pass(r);
  end;
end;

procedure test_mem_string_record;
type
  TMyRecord = record
    a: PString;
    b: PString;
    c: PString;
  end;

  function do_pass(rec: TMyRecord): TMyRecord;
  begin
    Result := rec;
  end;

var
  i: Integer;
  r: TMyRecord;
begin
  for i := 0 to ITERATIONS - 1 do
  begin
    r.a := GetMem(SizeOf(AnsiString));
    r.b := GetMem(SizeOf(AnsiString));
    r.c := GetMem(SizeOf(AnsiString));
    { zero the cells so the assignments don't try to release garbage }
    FillChar(r.a^, SizeOf(AnsiString), 0);
    FillChar(r.b^, SizeOf(AnsiString), 0);
    FillChar(r.c^, SizeOf(AnsiString), 0);
    r.a^ := TEST_STRING;
    r.b^ := TEST_STRING;
    r.c^ := TEST_STRING;
    do_pass(r);
    r.a^ := ''; FreeMem(r.a);
    r.b^ := ''; FreeMem(r.b);
    r.c^ := ''; FreeMem(r.c);
  end;
end;

procedure run_test(name: String; test: TTestProc);
var
  startTime: TDateTime;
begin
  startTime := Now;
  test;
  writeln('* ', name, ' time: ', MilliSecondsBetween(Now, startTime), 'ms');
end;

begin
  run_test('test_short_string', @test_short_string);
  run_test('test_ansi_string', @test_ansi_string);
  run_test('test_mem_string', @test_mem_string);
  run_test('test_short_string_record', @test_short_string_record);
  run_test('test_ansi_string_record', @test_ansi_string_record);
  run_test('test_mem_string_record', @test_mem_string_record);
  run_test('test_short_string_mutate', @test_short_string_mutate);
  run_test('test_ansi_string_mutate', @test_ansi_string_mutate);
end.
Regards,
Ryan Joseph
Re: [fpc-pascal] Parse unicode scalar
On 7/4/23 07:45, Nikolay Nikolov wrote:
> On 7/4/23 07:40, Hairy Pixels via fpc-pascal wrote:
>>> On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal wrote:
>>>
>>> For what grammar? What characters are allowed in a token? For example,
>>> Free Pascal also has a parser/tokenizer, but since Pascal keywords are
>>> ASCII only, it doesn't need to understand Unicode characters, so it
>>> works on the byte level (Pascal's Char type; for UTF-8 files, this
>>> means UTF-8 code units). That's because UTF-8 has two nice properties:
>>>
>>> 1) ASCII characters are encoded as themselves, using bytes in the
>>> range #0..#127.
>>>
>>> 2) Non-ASCII characters always use a sequence of bytes that are all in
>>> the range #128..#255 (they have their highest bit set), so they will
>>> never be misinterpreted as ASCII.
>>>
>>> So the tokenizer just works with UTF-8 like with any other 8-bit code
>>> page.
>>
>> Yes, this works until you reach a non-ASCII character, and then the
>> character index no longer matches the string 1 to 1. For example,
>> consider this was Pascal:
>>
>> i := '🐻';
>>
>> You can advance by index like:
>>
>> Inc(currentIndex);
>> c := text[currentIndex];
>>
>> but once you hit the bear the offset is now wrong, so you can't advance
>> to the next character by doing +1.
>
> But you just don't need to do this in order to tokenize Pascal. The
> beginning and the end of the string literal is the apostrophe, which is
> ASCII. The bear is a sequence of UTF-8 code units (opaque to the
> compiler) that will not be mistaken for an apostrophe, or end of line,
> because they will have their high bit set. There's simply no need for a
> Pascal tokenizer to iterate over UTF-8 code points, instead of code
> units.

Sorry, the last sentence should read: "There's simply no need for a Pascal tokenizer to iterate over Unicode code points, instead of UTF-8 code units." Hope that makes it more clear and accurate.

Nikolay