Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Hairy Pixels via fpc-pascal
> On Jul 3, 2023, at 2:04 PM, Tomas Hajny via fpc-pascal wrote:
>
> No - in this case, the "header" is the highest bit of that byte being 0.
Oh it's the header BIT. Admittedly I don't understand how this function returns the highest bit using that case, which I think he was suggesting.

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Tomas Hajny via fpc-pascal
On 3 July 2023 8:42:05 +0200, Hairy Pixels via fpc-pascal wrote:
>> On Jul 3, 2023, at 12:04 PM, Mattias Gaertner via fpc-pascal wrote:
>>
>> No, the header of a codepoint to figure out the length.
>
> so the smallest character UTF-8 can represent is 2 bytes? 1 for the header and 1 for

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Tomas Hajny via fpc-pascal
On 3 July 2023 9:12:03 +0200, Hairy Pixels via fpc-pascal wrote:
>> On Jul 3, 2023, at 2:04 PM, Tomas Hajny via fpc-pascal wrote:
>>
>> No - in this case, the "header" is the highest bit of that byte being 0.
>
> Oh it's the header BIT. Admittedly I don't understand how this function

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Hairy Pixels via fpc-pascal
> On Jul 3, 2023, at 12:04 PM, Mattias Gaertner via fpc-pascal wrote:
>
> No, the header of a codepoint to figure out the length.
so the smallest character UTF-8 can represent is 2 bytes? 1 for the header and 1 for the character? ASCII #100 is the same character in UTF-8 but it needs a

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Hairy Pixels via fpc-pascal
> On Jul 3, 2023, at 3:05 PM, Mattias Gaertner via fpc-pascal wrote:
>
> I wonder, is this thread about testing ChatGPT or do you want to implement something useful?
> There are already plenty of optimized UTF-8 functions in the FPC and Lazarus sources. Maybe too many, and you have

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Mattias Gaertner via fpc-pascal
On Mon, 3 Jul 2023 15:27:10 +0700 Hairy Pixels via fpc-pascal wrote:
>[...]
> I was just curious how ChatGPT's implementation compared to other programmers.
Apparently the quality is often terrible. But it can be useful.
> What I'm really trying to do is improve a parser so it can read UTF-8

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Hairy Pixels via fpc-pascal
> On Jul 3, 2023, at 4:29 PM, Mattias Gaertner via fpc-pascal wrote:
>
>> What I'm really trying to do is improve a parser so it can read UTF-8 files and decode unicode literals in the grammar.
>
> First of all: Is it valid UTF-8 or do you have to check for broken or malicious

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Mattias Gaertner via fpc-pascal
On Mon, 3 Jul 2023 12:01:11 +0700 Hairy Pixels via fpc-pascal wrote:
> > On Jul 3, 2023, at 11:36 AM, Mattias Gaertner via fpc-pascal wrote:
> >
> > Useless array of.
> > And it does not return the bytecount.
>
> it's an open array so what's the problem?
>[...]
> > Wrong for byteCount=1

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread José Mejuto via fpc-pascal
On 03/07/2023 at 10:27, Hairy Pixels via fpc-pascal wrote:
> Right now I've just read the file into an AnsiString and indexing assuming a fixed character size, which breaks of course if non-1 byte characters exist. I also need to know if I come across something like \u1F496 I need to

[fpc-pascal] Lazarus Release Candidate 1 of 3.0

2023-07-03 Thread Mattias Gaertner via fpc-pascal
The Lazarus team is glad to announce the first release candidate of Lazarus 3.0. This release was built with FPC 3.2.2. Here is the list of changes for Lazarus and Free Pascal: http://wiki.lazarus.freepascal.org/Lazarus_3.0_release_notes http://wiki.lazarus.freepascal.org/User_Changes_3.2.2

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Jer Haan via fpc-pascal
Hi Ryan,
I’ve created attached unit, which takes a code point and returns the utf8 char as a string. It’s based on the Wikipedia article on UTF8
UTF-8 encodes code points in one to four bytes, depending on the value of the code point. The x characters are replaced by the bits of the code point:
This
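
The one-to-four-byte layout mentioned in this message can be sketched roughly as follows. This is not the unit attached to the mail, which is not reproduced in this listing; the function name `CodePointToUTF8` and the choice to return an empty string for invalid input are assumptions made only for illustration.

```pascal
program EncodeUtf8Demo;
{$mode objfpc}{$H+}

// Hypothetical helper (not the unit attached to the mail): encodes one
// Unicode scalar value as UTF-8. Returns '' for surrogates ($D800..$DFFF)
// and for values above $10FFFF, which are not scalar values.
function CodePointToUTF8(cp: LongWord): AnsiString;
begin
  Result := '';
  if cp <= $7F then
  begin
    // 0xxxxxxx
    SetLength(Result, 1);
    Result[1] := AnsiChar(Byte(cp));
  end
  else if cp <= $7FF then
  begin
    // 110xxxxx 10xxxxxx
    SetLength(Result, 2);
    Result[1] := AnsiChar(Byte($C0 or (cp shr 6)));
    Result[2] := AnsiChar(Byte($80 or (cp and $3F)));
  end
  else if cp <= $FFFF then
  begin
    if (cp >= $D800) and (cp <= $DFFF) then
      Exit; // surrogates are not encodable
    // 1110xxxx 10xxxxxx 10xxxxxx
    SetLength(Result, 3);
    Result[1] := AnsiChar(Byte($E0 or (cp shr 12)));
    Result[2] := AnsiChar(Byte($80 or ((cp shr 6) and $3F)));
    Result[3] := AnsiChar(Byte($80 or (cp and $3F)));
  end
  else if cp <= $10FFFF then
  begin
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    SetLength(Result, 4);
    Result[1] := AnsiChar(Byte($F0 or (cp shr 18)));
    Result[2] := AnsiChar(Byte($80 or ((cp shr 12) and $3F)));
    Result[3] := AnsiChar(Byte($80 or ((cp shr 6) and $3F)));
    Result[4] := AnsiChar(Byte($80 or (cp and $3F)));
  end;
end;

var
  s: AnsiString;
  i: Integer;
begin
  // U+1F496 is the code point used as an example elsewhere in the thread.
  s := CodePointToUTF8($1F496);
  for i := 1 to Length(s) do
    Write(HexStr(LongInt(Ord(s[i])), 2), ' ');
  WriteLn;
end.
```

For U+1F496 the bit layout gives the four bytes F0 9F 92 96. The result is built byte by byte with SetLength so that no string code page conversion can touch the encoded bytes.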

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Mattias Gaertner via fpc-pascal
On Mon, 3 Jul 2023 14:12:03 +0700 Hairy Pixels via fpc-pascal wrote:
> > On Jul 3, 2023, at 2:04 PM, Tomas Hajny via fpc-pascal wrote:
> >
> > No - in this case, the "header" is the highest bit of that byte being 0.
>
> Oh it's the header BIT. Admittedly I don't understand how this

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Mattias Gaertner via fpc-pascal
On Mon, 3 Jul 2023 17:18:56 +0700 Hairy Pixels via fpc-pascal wrote:
>[...]
> > First of all: Is it valid UTF-8 or do you have to check for broken or malicious sequences?
>
> If they give the parser broken files that's their problem they need to fix? the user has control over the file

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Hairy Pixels via fpc-pascal
> On Jul 4, 2023, at 1:15 AM, Mattias Gaertner via fpc-pascal wrote:
>
> function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
> // returns the number of codepoints
> var
>   CodePointLen: longint;
>   CodePoint: longword;
> begin
>   Result:=0;
>   while (ByteCount>0) do begin
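
The quoted ReadUTF8 is cut off in this listing, so what follows is not Mattias's code. It is a minimal sketch of the same idea (walk the buffer, read each sequence length from the lead byte, count code points), assuming the input is mostly well formed. The names `UTF8SeqLen` and `CountCodePoints` are hypothetical, and treating a malformed byte as one unit is just one possible policy.

```pascal
program CountCodePointsDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: length of a UTF-8 sequence judged from its first byte.
// Returns 0 for a continuation byte (10xxxxxx) or an invalid lead byte.
function UTF8SeqLen(b: Byte): Integer;
begin
  if b < $80 then
    Result := 1                      // 0xxxxxxx: highest bit is 0
  else if (b and $E0) = $C0 then
    Result := 2                      // 110xxxxx
  else if (b and $F0) = $E0 then
    Result := 3                      // 1110xxxx
  else if (b and $F8) = $F0 then
    Result := 4                      // 11110xxx
  else
    Result := 0;
end;

// Hypothetical counterpart to the truncated ReadUTF8: returns the number of
// code points in ByteCount bytes starting at p. Continuation bytes are not
// validated here; this only follows the lead bytes.
function CountCodePoints(p: PChar; ByteCount: PtrInt): PtrInt;
var
  n: Integer;
begin
  Result := 0;
  while ByteCount > 0 do
  begin
    n := UTF8SeqLen(Ord(p^));
    if (n = 0) or (n > ByteCount) then
      n := 1; // assumption: a stray or cut-off byte is counted as one unit
    Inc(p, n);
    Dec(ByteCount, n);
    Inc(Result);
  end;
end;

const
  // 'a', U+00E9 (2 bytes), U+1F496 (4 bytes)
  Sample: array[0..6] of Byte = ($61, $C3, $A9, $F0, $9F, $92, $96);
begin
  WriteLn(CountCodePoints(PChar(@Sample[0]), Length(Sample)));
end.
```

The seven sample bytes hold three code points, so the program prints 3. A parser that has to reject broken or malicious input would also need to check the continuation bytes, overlong forms and surrogates, which this sketch leaves out.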

[fpc-pascal] ShortString still relevant today?

2023-07-03 Thread Hairy Pixels via fpc-pascal
I've been exploring the string types and I'm curious now, does the classic Pascal "ShortString" even make sense anymore on modern computers? I'm running tests and I can't seem to find a way in which AnsiString overall performs worse than ShortString. Are there any examples where AnsiString is

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Nikolay Nikolov via fpc-pascal
On 7/4/23 04:03, Hairy Pixels via fpc-pascal wrote:
> On Jul 4, 2023, at 1:15 AM, Mattias Gaertner via fpc-pascal wrote:
>> function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
>> // returns the number of codepoints
>> var
>>   CodePointLen: longint;
>>   CodePoint: longword;
>> begin
>>   Result:=0;
>>   while

Re: [fpc-pascal] ShortString still relevant today?

2023-07-03 Thread Nikolay Nikolov via fpc-pascal
On 7/4/23 04:19, Hairy Pixels via fpc-pascal wrote:
> I've been exploring the string types and I'm curious now, does the classic Pascal "ShortString" even make sense anymore on modern computers? I'm running tests and I can't seem to find a way in which AnsiString overall performs worse than

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Nikolay Nikolov via fpc-pascal
On 7/4/23 07:40, Hairy Pixels via fpc-pascal wrote:
> On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal wrote:
>> For what grammar? What characters are allowed in a token? For example, Free Pascal also has a parser/tokenizer, but since Pascal keywords are ASCII only, it doesn't need

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Hairy Pixels via fpc-pascal
> On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal wrote:
>
> For what grammar? What characters are allowed in a token? For example, Free Pascal also has a parser/tokenizer, but since Pascal keywords are ASCII only, it doesn't need to understand Unicode characters, so it

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Nikolay Nikolov via fpc-pascal
On 7/4/23 07:56, Hairy Pixels via fpc-pascal wrote:
> On Jul 4, 2023, at 11:50 AM, Hairy Pixels wrote:
>> You know you're right, with properly enclosed patterns you can capture everything inside and it works. You won't know if you had unicode in your string or not though but that depends on

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Hairy Pixels via fpc-pascal
> On Jul 4, 2023, at 9:58 AM, Nikolay Nikolov via fpc-pascal wrote:
>
> You need to understand all these terms and know exactly what you need to do. E.g. are you dealing with keyboard input, are you dealing with the low level parts of text display, are you searching for something in

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Nikolay Nikolov via fpc-pascal
On 7/4/23 07:17, Hairy Pixels via fpc-pascal wrote:
> On Jul 4, 2023, at 9:58 AM, Nikolay Nikolov via fpc-pascal wrote:
>> You need to understand all these terms and know exactly what you need to do. E.g. are you dealing with keyboard input, are you dealing with the low level parts of text

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Hairy Pixels via fpc-pascal
> On Jul 4, 2023, at 11:50 AM, Hairy Pixels wrote:
>
> You know you're right, with properly enclosed patterns you can capture everything inside and it works. You won't know if you had unicode in your string or not though but that depends on what's being parsed and if you care or not

Re: [fpc-pascal] ShortString still relevant today?

2023-07-03 Thread Hairy Pixels via fpc-pascal
> On Jul 4, 2023, at 10:11 AM, Nikolay Nikolov via fpc-pascal wrote:
>
> ShortString is mainly for compatibility with Turbo Pascal, not for performance, IMHO. Although the FPC compiler itself still uses ShortString for performance reasons (I think the main advantage is the avoidance

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Hairy Pixels via fpc-pascal
> On Jul 4, 2023, at 11:45 AM, Nikolay Nikolov via fpc-pascal wrote:
>
> But you just don't need to do this, in order to tokenize Pascal. The beginning and the end of the string literal is the apostrophe, which is ASCII. The bear is a sequence of UTF-8 code units (opaque to the
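
The point quoted here, that a tokenizer can treat everything between the apostrophes as opaque bytes, can be illustrated with a small sketch. It is not taken from the FPC scanner; `SkipStringLiteral` is a hypothetical name, and the only property it relies on is that every byte of a multi-byte UTF-8 sequence is $80 or above, so none of them can be mistaken for the ASCII apostrophe ($27).

```pascal
program ScanLiteralDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: s[start] must be the opening apostrophe of a Pascal
// string literal. Returns the index just past the closing apostrophe, or 0
// if the literal is unterminated or start does not point at an apostrophe.
function SkipStringLiteral(const s: AnsiString; start: SizeInt): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  if (start < 1) or (start > Length(s)) or (s[start] <> '''') then
    Exit;
  i := start + 1;
  while i <= Length(s) do
  begin
    if s[i] = '''' then
    begin
      if (i < Length(s)) and (s[i + 1] = '''') then
        Inc(i, 2)        // doubled apostrophe: a quote inside the literal
      else
        Exit(i + 1);     // closing apostrophe
    end
    else
      Inc(i);            // any other byte, UTF-8 lead/continuation included
  end;
end;

var
  src: AnsiString;
  stop: SizeInt;
begin
  // Source text: x := '<U+1F43B as four UTF-8 bytes>'; y
  src := 'x := '''#$F0#$9F#$90#$BB'''; y';
  stop := SkipStringLiteral(src, 6);
  WriteLn('scan resumes at index ', stop);
  WriteLn('bytes inside the literal: ', stop - 6 - 2);
end.
```

The scanner never decodes the four bytes of the emoji; it only compares each byte against the apostrophe. Decoding is needed only if the grammar itself gives meaning to non-ASCII characters.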

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Nikolay Nikolov via fpc-pascal
On 7/4/23 08:08, Nikolay Nikolov wrote:
> On 7/4/23 07:56, Hairy Pixels via fpc-pascal wrote:
>> On Jul 4, 2023, at 11:50 AM, Hairy Pixels wrote:
>>> You know you're right, with properly enclosed patterns you can capture everything inside and it works. You won't know if you had unicode in your

Re: [fpc-pascal] ShortString still relevant today?

2023-07-03 Thread Hairy Pixels via fpc-pascal
Here is my test unit I'm playing with. It's crude but can anyone suggest what other things I could test? I'm playing with a string pointer also to see how ref counting/finalization plays in. Making your own managed type using management operators is not tested but I'm sure it will be

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Nikolay Nikolov via fpc-pascal
On 7/4/23 07:45, Nikolay Nikolov wrote:
> On 7/4/23 07:40, Hairy Pixels via fpc-pascal wrote:
>> On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal wrote:
>>> For what grammar? What characters are allowed in a token? For example, Free Pascal also has a parser/tokenizer, but since Pascal