Re: [fpc-pascal] Parse unicode scalar
> On Jul 3, 2023, at 2:04 PM, Tomas Hajny via fpc-pascal wrote:
>
> No - in this case, the "header" is the highest bit of that byte being 0.

Oh, it's the header BIT. Admittedly I don't understand how this function returns the highest bit using that case, which I think he was suggesting.

function UTF8CodepointSizeFast(p: PChar): integer;
begin
  case p^ of
    #0..#191   : Result := 1;
    #192..#223 : Result := 2;
    #224..#239 : Result := 3;
    #240..#247 : Result := 4;
    // An optimization + prevents compiler warning about uninitialized Result.
    else Result := 1;
  end;
end;

Regards,
Ryan Joseph

___
fpc-pascal maillist - fpc-pascal@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Re: [fpc-pascal] Parse unicode scalar
On 3 July 2023 8:42:05 +0200, Hairy Pixels via fpc-pascal wrote:
>> On Jul 3, 2023, at 12:04 PM, Mattias Gaertner via fpc-pascal wrote:
>>
>> No, the header of a codepoint to figure out the length.
>
> So the smallest character UTF-8 can represent is 2 bytes? 1 for the
> header and 1 for the character?
>
> ASCII #100 is the same character in UTF-8 but it needs a header byte,
> so 2 bytes?

No - in this case, the "header" is the highest bit of that byte being 0.

Tomas
Re: [fpc-pascal] Parse unicode scalar
On 3 July 2023 9:12:03 +0200, Hairy Pixels via fpc-pascal wrote:
>> On Jul 3, 2023, at 2:04 PM, Tomas Hajny via fpc-pascal wrote:
>>
>> No - in this case, the "header" is the highest bit of that byte being 0.
>
> Oh, it's the header BIT. Admittedly I don't understand how this function
> returns the highest bit using that case, which I think he was suggesting.
>
> function UTF8CodepointSizeFast(p: PChar): integer;
> begin
>   case p^ of
>     #0..#191   : Result := 1;
>     #192..#223 : Result := 2;
>     #224..#239 : Result := 3;
>     #240..#247 : Result := 4;
>     else Result := 1;
>   end;
> end;

That's why I wrote "in this case". The "header" itself is not fixed size either, but the algorithm above shows how you can derive the length from the first byte.

Tomas
Re: [fpc-pascal] Parse unicode scalar
> On Jul 3, 2023, at 12:04 PM, Mattias Gaertner via fpc-pascal wrote:
>
> No, the header of a codepoint to figure out the length.

So the smallest character UTF-8 can represent is 2 bytes? 1 for the header and 1 for the character?

ASCII #100 is the same character in UTF-8 but it needs a header byte, so 2 bytes?

Regards,
Ryan Joseph
Re: [fpc-pascal] Parse unicode scalar
> On Jul 3, 2023, at 3:05 PM, Mattias Gaertner via fpc-pascal wrote:
>
> I wonder, is this thread about testing ChatGPT or do you want to
> implement something useful?
> There are already plenty of optimized UTF-8 functions in the FPC and
> Lazarus sources. Maybe too many, and you have trouble finding the right
> one? Just ask what your function needs to do.

I was just curious how ChatGPT's implementation compared to the other programmer's.

What I'm really trying to do is improve a parser so it can read UTF-8 files and decode Unicode literals in the grammar. Right now I've just read the file into an AnsiString and I index it assuming a fixed character size, which of course breaks if any multi-byte characters exist.

I also need to handle escapes: if I come across something like \u1F496, I need to convert that to a Unicode character.

Regards,
Ryan Joseph
Re: [fpc-pascal] Parse unicode scalar
On Mon, 3 Jul 2023 15:27:10 +0700, Hairy Pixels via fpc-pascal wrote:
> [...]
> I was just curious how ChatGPT's implementation compared to the other
> programmer's.

Apparently the quality is often terrible. But it can be useful.

> What I'm really trying to do is improve a parser so it can read UTF-8
> files and decode Unicode literals in the grammar.

First of all: Is it valid UTF-8 or do you have to check for broken or malicious sequences?

> Right now I've just read the file into an AnsiString and indexing
> assuming a fixed character size, which breaks of course if non-1 byte
> characters exist

Sounds like UTF8CodepointToUnicode in unit LazUTF8 could be useful:

function UTF8CodepointToUnicode(p: PChar; out CodepointLen: integer): Cardinal;

> I also need to know if I come across something like \u1F496 I need
> to convert that to a unicode character.

I guess you know how to convert a hex to a dword. Then:

function UnicodeToUTF8(CodePoint: cardinal): string; // UTF32 to UTF8
function UnicodeToUTF8(CodePoint: cardinal; Buf: PChar): integer; // UTF32 to UTF8

Mattias
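[Putting the two suggestions together, a minimal sketch of decoding a \uXXXX escape might look like this. The program and variable names are my own; the LazUTF8 signatures are the ones quoted above, and StrToInt with a '$' prefix is the standard SysUtils way to parse hex.]

```pascal
program DecodeEscape;
{$mode objfpc}{$H+}
uses
  SysUtils, LazUTF8; // LazUTF8 ships with Lazarus (LazUtils package)

var
  CodePoint: Cardinal;
  S: string;
begin
  // The escape \u1F496 carries the hex scalar value 1F496;
  // prefixing '$' makes StrToInt parse it as hexadecimal.
  CodePoint := StrToInt('$1F496');
  S := UnicodeToUTF8(CodePoint); // UTF-32 -> UTF-8, here 4 bytes
  WriteLn(Length(S));            // 4 (the bytes 240 159 146 150)
end.
```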
Re: [fpc-pascal] Parse unicode scalar
> On Jul 3, 2023, at 4:29 PM, Mattias Gaertner via fpc-pascal wrote:
>
>> What I'm really trying to do is improve a parser so it can read UTF-8
>> files and decode Unicode literals in the grammar.
>
> First of all: Is it valid UTF-8 or do you have to check for broken or
> malicious sequences?

If they give the parser broken files, that's their problem to fix. The user has control over the file, so it's their responsibility I think.

>> Right now I've just read the file into an AnsiString and indexing
>> assuming a fixed character size, which breaks of course if non-1 byte
>> characters exist
>
> Sounds like UTF8CodepointToUnicode in unit LazUTF8 could be useful:
>
> function UTF8CodepointToUnicode(p: PChar; out CodepointLen: integer): Cardinal;

Not sure how this works. You need to advance by character, so the return value should be the byte location of the next character or something like that.

>> I also need to know if I come across something like \u1F496 I need
>> to convert that to a unicode character.
>
> I guess you know how to convert a hex to a dword.

Is there anything better than StrToInt? I wouldn't be able to do it myself without that function.

> Then
>
> function UnicodeToUTF8(CodePoint: cardinal): string; // UTF32 to UTF8
> function UnicodeToUTF8(CodePoint: cardinal; Buf: PChar): integer; // UTF32 to UTF8

Ok, I think this is basically what the other programmer submitted and what ChatGPT tried to do.

Regards,
Ryan Joseph
Re: [fpc-pascal] Parse unicode scalar
On Mon, 3 Jul 2023 12:01:11 +0700, Hairy Pixels via fpc-pascal wrote:
>> On Jul 3, 2023, at 11:36 AM, Mattias Gaertner via fpc-pascal wrote:
>>
>> Useless array of.
>> And it does not return the bytecount.
>
> it's an open array so what's the problem?
> [...]
>> Wrong for byteCount=1
>
> really? How so?
>
> ChatGPT is risky because it will give wrong information with perfect
> confidence and there's no way for the ignorant person to know.

I wonder, is this thread about testing ChatGPT or do you want to implement something useful?

There are already plenty of optimized UTF-8 functions in the FPC and Lazarus sources. Maybe too many, and you have trouble finding the right one? Just ask what your function needs to do.

Mattias
Re: [fpc-pascal] Parse unicode scalar
El 03/07/2023 a las 10:27, Hairy Pixels via fpc-pascal escribió:
> Right now I've just read the file into an AnsiString and indexing
> assuming a fixed character size, which breaks of course if non-1 byte
> characters exist
>
> I also need to know if I come across something like \u1F496 I need to
> convert that to a unicode character.

Hello,

You are mixing up a lot of concepts: ASCII, Unicode, grapheme, representation, content, etc.

When talking about Unicode you must forget ASCII. The text is a sequence of bytes encoded in a specific format (UTF-8, UTF-16, UTF-32, ...), and it must be represented on screen using Unicode representation rules, which are not the same as ASCII's.

To keep this message short, think of a text with only one "letter": "á". This text (a text, not one letter - Unicode is about texts) can be transmitted or stored using Unicode encoding rules, each a sequence of bytes with its own rules to encode the information. Each byte in hexadecimal:

UTF8:    C3 A1
UTF16BE: 00 E1
UTF32BE: 00 00 00 E1

You must know the encoding format in advance to get the text back from the byte sequence. There is also a BOM (Byte Order Mark) which is sometimes used in files as a header to indicate the encoding, but in general it is not used.

Now, decoding that byte sequence with the right format, you get a text which represents the letter "a" with an acute accent. But Unicode is *not* so *simple*: the same text could be represented on screen using the letter "a" + "combining acute accent", whose byte sequence is totally different - different at the encoding level but identical at the rendering level. So these two UTF-8 sequences, "C3 A1" and "61 CC 81", are different at the grapheme and encoding levels but identical at the representation level.

Just as a final note, this is the UTF-8 byte sequence for one single "character" on screen:

F0 9F 8F B4 F3 A0 81 A7 F3 A0 81 A2 F3 A0 81 B3 F3 A0 81 A3 F3 A0 81 B4 F3 A0 81 BF

Unicode is far, far from easy.

Have a nice day.
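[The two "á" encodings described above can be reproduced directly. A minimal sketch; on a UTF-8 terminal both lines should render identically, yet the strings compare unequal.]

```pascal
program CombiningDemo;
{$mode objfpc}{$H+}
var
  Precomposed, Decomposed: string;
begin
  Precomposed := #$C3#$A1;    // C3 A1    = U+00E1 LATIN SMALL LETTER A WITH ACUTE
  Decomposed  := 'a'#$CC#$81; // 61 CC 81 = U+0061 + U+0301 COMBINING ACUTE ACCENT
  WriteLn(Precomposed);
  WriteLn(Decomposed);
  WriteLn(Precomposed = Decomposed); // FALSE: the byte sequences differ
end.
```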
[fpc-pascal] Lazarus Release Candidate 1 of 3.0
The Lazarus team is glad to announce the first release candidate of Lazarus 3.0. This release was built with FPC 3.2.2.

Here is the list of changes for Lazarus and Free Pascal:
http://wiki.lazarus.freepascal.org/Lazarus_3.0_release_notes
http://wiki.lazarus.freepascal.org/User_Changes_3.2.2

Here is the list of fixes for Lazarus 3.x:
https://gitlab.com/freepascal.org/lazarus/lazarus/-/commits/fixes_3_0/

The release is available for download on SourceForge:
http://sourceforge.net/projects/lazarus/files/

Choose your CPU, OS, distro and then the "Lazarus 3.0RC1" directory.

Checksums for the SourceForge files:
https://www.lazarus-ide.org/index.php?page=checksums#3_0RC1

Minimum requirements:
Windows: 2k, 32 or 64bit.
FreeBSD/Linux: gtk 2.24 for gtk2, qt4.5 for qt, qt5.6 for qt5, 32 or 64bit.
Mac OS X: Cocoa (64bit) 10.12, Carbon (32bit) 10.5 to 10.14, qt and qt5 (32 or 64bit).

The gitlab page:
https://gitlab.com/freepascal.org/lazarus/lazarus/-/tree/lazarus_3_0_RC1

For people who are blocked by SF, the Lazarus releases from SourceForge are mirrored at:
ftp://ftp.freepascal.org/pub/lazarus/releases/

== Why should everybody (including you) test the release candidate? ==

In the past weeks the Lazarus team has stabilized the 3.0 fixes branch. The resulting 3.0RC1 is now stable enough to be used by anyone for test purposes.

However, many of the fixes and new features that were committed since the release of 2.2.6 required changes to the code of existing features too. While we have tested those ourselves, there may still be problems that only occur with very specific configurations or one project in a million. Yes, it may be that you are the only person with a project that will not work in the new IDE. So if you do not test, we cannot fix it.

Please do not wait for the final release to test. It may be too late. Once the release is out we will have to be more selective about which fixes can be merged for further 3.x releases.
So it may be that we cannot merge the fix you require. And then you will miss out on all the new features.

== How to test ==

Download and install the 3.0 RC1.

- On Windows you can install it as a secondary install that will not affect your current install:
  http://wiki.lazarus.freepascal.org/Multiple_Lazarus#Installation_of_multiple_Lazarus
- On other platforms, if you install to a new location you need to use --primary-config-path

In either case you should make backups (including your primary config).

Open your project in the current Lazarus, and use "Publish Project" from the project menu. This creates a clean copy of your project. You can then open that copy in the RC1.

Please test:
- If you can edit forms in the designer
- Rename components / change properties in the Object Inspector / add new events
- Add components to a form / move components on a form
- Frames, if you use them
- If you can navigate the source code (e.g. jump to implementation)
- Auto completion in source code
- Compile, debug and run
- Anything else you use in your daily work

Mattias
Re: [fpc-pascal] Parse unicode scalar
Hi Ryan,

I've created the attached unit, which takes a code point and returns the UTF-8 char as a string. It's based on the Wikipedia article on UTF-8: UTF-8 encodes code points in one to four bytes, depending on the value of the code point, and the placeholder x bits in each byte pattern are replaced by the bits of the code point (the table is copied from Wikipedia).

uencoding.pas
Description: Binary data

Hope it's useful for you. If you improve the code pls let me know.

Best regards,
Jeroen

On 2 Jul 2023, at 15:30, Hairy Pixels via fpc-pascal wrote:
> I'm interested in parsing unicode scalars (I think they're called) to
> byte sized values but I'm not sure where to start. First thing I did
> was choose the unicode scalar U+1F496 ().
>
> Next I cheated and asked ChatGPT. :) Amazingly from my question it was
> able to tell me the scalar is comprised of these 4 bytes:
> 240 159 146 150
>
> I was able to correctly concatenate these characters and writeln
> printed the correct character.
>
> var
>   s: String;
> begin
>   s := char(240)+char(159)+char(146)+char(150);
>   writeln(s);
> end.
>
> The question is, how was 1F496 decomposed into 4 bytes?
>
> Regards,
> Ryan Joseph
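[To make the decomposition concrete: the attached unit isn't reproduced here, but a minimal sketch of the encoding rules from that Wikipedia table might look like this. The function name is my own, not necessarily the one in uencoding.pas, and it assumes a valid scalar value below U+110000.]

```pascal
// Distribute the code point's bits over 1..4 bytes, per the UTF-8 table:
//   up to U+007F  : 0xxxxxxx
//   up to U+07FF  : 110xxxxx 10xxxxxx
//   up to U+FFFF  : 1110xxxx 10xxxxxx 10xxxxxx
//   up to U+10FFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
function EncodeUTF8(CodePoint: Cardinal): string;
begin
  if CodePoint <= $7F then
    Result := Chr(CodePoint)
  else if CodePoint <= $7FF then
    Result := Chr($C0 or (CodePoint shr 6)) +
              Chr($80 or (CodePoint and $3F))
  else if CodePoint <= $FFFF then
    Result := Chr($E0 or (CodePoint shr 12)) +
              Chr($80 or ((CodePoint shr 6) and $3F)) +
              Chr($80 or (CodePoint and $3F))
  else
    Result := Chr($F0 or (CodePoint shr 18)) +
              Chr($80 or ((CodePoint shr 12) and $3F)) +
              Chr($80 or ((CodePoint shr 6) and $3F)) +
              Chr($80 or (CodePoint and $3F));
end;
```

Working through EncodeUTF8($1F496) by hand reproduces exactly the bytes 240 159 146 150 ($F0 $9F $92 $96) from the original post.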
Re: [fpc-pascal] Parse unicode scalar
On Mon, 3 Jul 2023 14:12:03 +0700, Hairy Pixels via fpc-pascal wrote:
>> On Jul 3, 2023, at 2:04 PM, Tomas Hajny via fpc-pascal wrote:
>>
>> No - in this case, the "header" is the highest bit of that byte
>> being 0.
>
> Oh it's the header BIT. Admittedly I don't understand how this
> function returns the highest bit using that case, which I think he
> was suggesting.

The first byte of a UTF-8 codepoint is in 0..127 or 192..247. The second, third and fourth bytes are in 128..191, so you can easily detect where a codepoint starts. And from the first byte you can derive the length of the codepoint. If you just want to skip over n codepoints, then the below function does the job:

> function UTF8CodepointSizeFast(p: PChar): integer;
> begin
>   case p^ of
>     #0..#191   : Result := 1;
>     #192..#223 : Result := 2;
>     #224..#239 : Result := 3;
>     #240..#247 : Result := 4;
>     // An optimization + prevents compiler warning about uninitialized Result.
>     else Result := 1;
>   end;
> end;

Mattias
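[Skipping over n codepoints with that size function could be sketched like this. UTF8SkipCodepoints is my own name, and the sketch assumes valid, NUL-terminated UTF-8.]

```pascal
// Advance a PChar by n codepoints using the first-byte ranges above.
// Assumes the buffer holds valid, NUL-terminated UTF-8.
function UTF8SkipCodepoints(p: PChar; n: Integer): PChar;
begin
  Result := p;
  while (n > 0) and (Result^ <> #0) do
  begin
    Inc(Result, UTF8CodepointSizeFast(Result)); // 1..4 bytes per codepoint
    Dec(n);
  end;
end;
```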
Re: [fpc-pascal] Parse unicode scalar
On Mon, 3 Jul 2023 17:18:56 +0700, Hairy Pixels via fpc-pascal wrote:
> [...]
>> First of all: Is it valid UTF-8 or do you have to check for broken
>> or malicious sequences?
>
> If they give the parser broken files that's their problem they need
> to fix? the user has control over the file so it's their
> responsibility I think.

The user's responsibility? - I recommend checking for malicious codes. ;)

>>> Right now I've just read the file into an AnsiString and indexing
>>> assuming a fixed character size, which breaks of course if non-1
>>> byte characters exist
>>
>> Sounds like UTF8CodepointToUnicode in unit LazUTF8 could be useful:
>>
>> function UTF8CodepointToUnicode(p: PChar; out CodepointLen: integer): Cardinal;
>
> Not sure how this works. You need to advance by character so the
> return value should be the byte location of the next character or
> something like that.

function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
// returns the number of codepoints
var
  CodePointLen: longint;
  CodePoint: longword;
begin
  Result:=0;
  while (ByteCount>0) do begin
    inc(Result);
    CodePoint:=UTF8CodepointToUnicode(p,CodePointLen);
    ...do something with the CodePoint...
    inc(p,CodePointLen);
    dec(ByteCount,CodePointLen);
  end;
end;

>>> I also need to know if I come across something like \u1F496 I need
>>> to convert that to a unicode character.
>>
>> I guess you know how to convert a hex to a dword.
>
> Is there anything better than StrToInt?

Good start.

> I wouldn't be able to do it myself though without that function.

Hex to dword. That's easy enough for ChatGPT.

>> function UnicodeToUTF8(CodePoint: cardinal): string; // UTF32 to UTF8
>> function UnicodeToUTF8(CodePoint: cardinal; Buf: PChar): integer; // UTF32 to UTF8
>
> Ok I think this is basically what the other programmer submitted and
> what ChatGPT tried to do.

Yes, no need to reinvent the wheel.
Mattias
Re: [fpc-pascal] Parse unicode scalar
> On Jul 4, 2023, at 1:15 AM, Mattias Gaertner via fpc-pascal wrote:
>
> function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
> // returns the number of codepoints
> var
>   CodePointLen: longint;
>   CodePoint: longword;
> begin
>   Result:=0;
>   while (ByteCount>0) do begin
>     inc(Result);
>     CodePoint:=UTF8CodepointToUnicode(p,CodePointLen);
>     ...do something with the CodePoint...
>     inc(p,CodePointLen);
>     dec(ByteCount,CodePointLen);
>   end;
> end;

Thanks, this looks right. I guess this is how we need to iterate over Unicode now.

Btw, why isn't there a for-loop we can use over Unicode strings? It seems like that should be supported out of the box. I had this same problem in Swift, where it's extremely confusing to merely iterate over a string and look at each character. Replacing characters will be tricky also, so we need some good library functions.

Swift is especially terrible because there's NO ANSI string, so even a 1-byte sequence needs all these confusing-as-hell functions to do any work with strings at all. A terrible experience, and slow.

Regards,
Ryan Joseph
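[A codepoint for..in isn't built in, but FPC's for..in works with any type that exposes GetEnumerator/MoveNext/Current, so one can be sketched. All type and method names here are hypothetical, and UTF8CodepointSizeFast is the helper quoted earlier in the thread; this is a sketch, not an RTL facility.]

```pascal
{$mode objfpc}{$H+}
{$modeswitch advancedrecords}
type
  // Yields each codepoint of a UTF-8 string as a 1..4-byte substring.
  TUTF8Enumerator = record
  private
    FStr: string;
    FPos: Integer; // 1-based byte index of the next codepoint
    FCurrent: string;
  public
    function MoveNext: Boolean;
    property Current: string read FCurrent;
  end;

  TUTF8Codepoints = record
    Str: string;
    function GetEnumerator: TUTF8Enumerator;
  end;

function TUTF8Codepoints.GetEnumerator: TUTF8Enumerator;
begin
  Result.FStr := Str;
  Result.FPos := 1;
  Result.FCurrent := '';
end;

function TUTF8Enumerator.MoveNext: Boolean;
var
  Len: Integer;
begin
  Result := FPos <= Length(FStr);
  if not Result then Exit;
  Len := UTF8CodepointSizeFast(@FStr[FPos]); // 1..4 bytes
  FCurrent := Copy(FStr, FPos, Len);
  Inc(FPos, Len);
end;

// Usage sketch:
//   var cp: string; u: TUTF8Codepoints;
//   u.Str := SomeUTF8Text;
//   for cp in u do
//     WriteLn(cp);
```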
[fpc-pascal] ShortString still relevant today?
I've been exploring the string types and I'm curious now: does the classic Pascal "ShortString" even make sense anymore on modern computers?

I'm running tests and I can't seem to find a way in which AnsiString overall performs worse than ShortString. Are there any examples where AnsiString is worse? I think if you passed strings around a lot, that would trigger the ref counting and InterlockedExchange (I saw this in my own code before and it unnerved me), but that's been hard to test.

Regards,
Ryan Joseph
Re: [fpc-pascal] Parse unicode scalar
On 7/4/23 04:03, Hairy Pixels via fpc-pascal wrote:
>> On Jul 4, 2023, at 1:15 AM, Mattias Gaertner via fpc-pascal wrote:
>>
>> function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
>> // returns the number of codepoints
>> [...]
>
> Thanks, this looks right. I guess this is how we need to iterate over
> unicode now. Btw, why isn't there a for-loop we can use over unicode
> strings? seems like that should be supported out of the box. I had this
> same problem in Swift also where it's extremely confusing to merely
> iterate over a string and look at each character. Replacing characters
> will be tricky also so we need some good library functions.

You're still confusing the Unicode terms. The above code iterates over Unicode code points, not "characters" in a UTF-8 encoded string. A Unicode code point is not a "character":

https://unicode.org/glossary/#character
https://unicode.org/glossary/#code_point

There are also graphemes, grapheme clusters and extended grapheme clusters - these terms can also be perceived as "characters":

https://unicode.org/glossary/#grapheme
https://unicode.org/glossary/#grapheme_cluster
https://unicode.org/glossary/#extended_grapheme_cluster

If you want to iterate over extended grapheme clusters, for example, there's an iterator (written by me) in the unit graphemebreakproperty.pp in the rtl-unicode package.

If you use the 'char' type in Pascal to iterate over a UTF-8 encoded string, you're iterating over Unicode code units (units! not code points! https://unicode.org/glossary/#code_unit). If you use the 'widechar' type in Pascal to iterate over a UnicodeString (which is a UTF-16 encoded string), you're also iterating over Unicode code units, this time in UTF-16 encoding.

If you want to iterate over Unicode code points (not units! not characters! not graphemes!) in a UTF-8 string, you need something like the ReadUTF8 function above. If you want to iterate over Unicode code points in a UTF-16 string, you need different code.

You need to understand all these terms and know exactly what you need to do. E.g. are you dealing with keyboard input, are you dealing with the low-level parts of text display, are you searching for something in the text, are you just passing strings around and letting the GUI deal with it? These are all different use cases, and they require careful understanding of what Unicode thing you need to iterate over.

Nikolay
Re: [fpc-pascal] ShortString still relevant today?
On 7/4/23 04:19, Hairy Pixels via fpc-pascal wrote:
> I've been exploring the string types and I'm curious now, does the
> classic Pascal "ShortString" even make sense anymore on modern
> computers? I'm running tests and I can't seem to find a way in which
> AnsiString overall performs worse than ShortString. Are there any
> examples where AnsiString is worse?

ShortString is mainly for compatibility with Turbo Pascal, not for performance, IMHO. Although the FPC compiler itself still uses ShortString for performance reasons (I think the main advantage is the avoidance of the implicit try..finally blocks needed for ansistrings). It might be interesting to benchmark the compiler with AnsiStrings instead of ShortStrings and see if there's a performance difference. But even if there is, a compiler is an extreme example. For 99% of programs, the performance impact of AnsiString is not an issue. I put {$H+} in almost all my new programs.

I'd say that in 99% of the legit use cases, ShortString is used and needed for compatibility with legacy code, not for performance. Switching legacy code to {$H+} doesn't always work and may need additional fixes. Old code does things like S[0] := x instead of SetLength(S, x), etc. It also does uglier things, like FillChar() or Move() directly to/from string memory, or saves ShortStrings to files as part of a record, etc.

Nikolay
Re: [fpc-pascal] Parse unicode scalar
On 7/4/23 07:40, Hairy Pixels via fpc-pascal wrote:
>> On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal wrote:
>>
>> For what grammar? What characters are allowed in a token? [...]
>> So, the tokenizer just works with UTF-8 like with any other 8-bit
>> code page.
>
> yes this works until you reach a non-ASCII ranged character and then
> the character index no longer matches the string 1 to 1. For example
> consider this was pascal:
>
> i := '';
>
> You can advance by index like:
>
> Inc(currentIndex);
> c := text[currentIndex];
>
> but once you hit the bear the offset is now wrong so you can't advance
> to the next character by doing +1.

But you just don't need to do this in order to tokenize Pascal. The beginning and the end of the string literal is the apostrophe, which is ASCII. The bear is a sequence of UTF-8 code units (opaque to the compiler) that will not be mistaken for an apostrophe or end of line, because they will have their high bit set. There's simply no need for a Pascal tokenizer to iterate over UTF-8 code points instead of code units.

Nikolay
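[That byte-level scan can be sketched as follows. The function name is my own, and the sketch deliberately ignores Pascal's doubled-apostrophe escape ('') to stay short; multi-byte UTF-8 sequences inside the literal are skipped over safely because every one of their bytes has the high bit set.]

```pascal
// Find the closing apostrophe of a string literal in a UTF-8 buffer,
// scanning byte by byte. Src[Start] is assumed to be the opening quote.
function ScanStringLiteral(const Src: AnsiString; Start: Integer): Integer;
var
  i: Integer;
begin
  i := Start + 1;
  while (i <= Length(Src)) and (Src[i] <> '''') do
    Inc(i); // bytes >= #128 can never equal the ASCII apostrophe
  Result := i; // index of the closing apostrophe (or past the end)
end;
```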
Re: [fpc-pascal] Parse unicode scalar
> On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal wrote:
>
> For what grammar? What characters are allowed in a token? For example,
> Free Pascal also has a parser/tokenizer, but since Pascal keywords are
> ASCII only, it doesn't need to understand Unicode characters, so it
> works on the byte (Pascal's char type) level (for UTF-8 files, this
> means UTF-8 Unicode code units). That's because UTF-8 has two nice
> properties:
>
> 1) ASCII characters are encoded as they are - by using bytes in the
> range #0..#127
>
> 2) non-ASCII characters will always use a sequence of bytes that are
> all in the range #128..#255 (they have their highest bit set), so they
> will never be misinterpreted as ASCII.
>
> So, the tokenizer just works with UTF-8 like with any other 8-bit code
> page.

Yes, this works until you reach a non-ASCII-ranged character, and then the character index no longer matches the string 1 to 1. For example, consider if this were Pascal:

i := '';

You can advance by index like:

Inc(currentIndex);
c := text[currentIndex];

but once you hit the bear, the offset is now wrong, so you can't advance to the next character by doing +1.

Regards,
Ryan Joseph
Re: [fpc-pascal] Parse unicode scalar
On 7/4/23 07:56, Hairy Pixels via fpc-pascal wrote:
>> On Jul 4, 2023, at 11:50 AM, Hairy Pixels wrote:
>>
>> You know you're right, with properly enclosed patterns you can
>> capture everything inside and it works. [...]
>
> Sorry I'm still curious even though it's not my current problem :) How
> can I make this program output the expected results:
> [...]
> The user doesn't know about unicode they just want to get an array of
> characters and not worry about all these little details. What can FPC
> do to solve this problem?

Depends on what you need, but I suppose in this case you want to count the number of extended grapheme clusters (a.k.a. "user perceived characters" - how many character-like things are displayed on the screen). You might be tempted to count the number of Unicode code points, but that's not the same, due to the existence of combining characters:

https://en.wikipedia.org/wiki/Combining_character

For extended grapheme clusters, there's an iterator in the graphemebreakproperty unit. I implemented this for the Unicode KVM and FreeVision, where it's needed for figuring out how many character blocks in the console are needed to display a certain string.

For the console or other GUIs that use fixed-width fonts, there's also the East Asian Width property: some characters (East Asian - Chinese, Japanese, Korean) take double the space. So, to figure out where to move the cursor, you need to take East Asian Width into account as well.

Nikolay
Re: [fpc-pascal] Parse unicode scalar
> On Jul 4, 2023, at 9:58 AM, Nikolay Nikolov via fpc-pascal wrote:
>
> You need to understand all these terms and know exactly what you need
> to do. E.g. are you dealing with keyboard input, are you dealing with
> the low level parts of text display, are you searching for something
> in the text, are you just passing strings around and letting the GUI
> deal with it? These are all different use cases, and they require
> careful understanding what Unicode thing you need to iterate over.

Thanks for trying to help, but this is more complicated than I thought and I don't have the patience for a deep dive right now :)

Unicode is complicated under the hood, but we should have some libraries to help, right? I mean, the user thinks of these things as "characters", be it "A" or a Unicode symbol, so we should be able to operate on that basis as well. Something like an iterator that returns the character (wide char) and byte offset would be a nice place to start.

I have a parser/tokenizer I want to update, so I'm trying to find tokens by advancing one character at a time. That's why I have a requirement to know which character is next in the file, and probably the byte offset also, so it can be referenced later.

Regards,
Ryan Joseph
Re: [fpc-pascal] Parse unicode scalar
On 7/4/23 07:17, Hairy Pixels via fpc-pascal wrote:
> [...]
> I have a parser/tokenizer I want to update so I'm trying to find
> tokens by advancing one character at a time. That's why I have a
> requirement to know which character is next in the file and probably
> the byte offset also so it can be referenced later.

For what grammar? What characters are allowed in a token? For example, Free Pascal also has a parser/tokenizer, but since Pascal keywords are ASCII only, it doesn't need to understand Unicode characters, so it works on the byte (Pascal's char type) level (for UTF-8 files, this means UTF-8 Unicode code units). That's because UTF-8 has two nice properties:

1) ASCII characters are encoded as they are - by using bytes in the range #0..#127

2) non-ASCII characters will always use a sequence of bytes that are all in the range #128..#255 (they have their highest bit set), so they will never be misinterpreted as ASCII.

So, the tokenizer just works with UTF-8 like with any other 8-bit code page.
Nikolay
Re: [fpc-pascal] Parse unicode scalar
> On Jul 4, 2023, at 11:50 AM, Hairy Pixels wrote:
>
> You know you're right: with properly enclosed patterns you can capture
> everything inside and it works. You won't know whether you had Unicode
> in your string or not, though, but that depends on what's being parsed
> and whether you care (I'm doing a TOML parser).

Sorry, I'm still curious even though it's not my current problem :) How can I make this program output the expected results?

var
  w: WideChar;
  a: array of WideChar;
begin
  for w in 'abc🐻' do
    a += [w];
  // Outputs 7 instead of 4
  writeln(Length(a));
end.

The user doesn't know about Unicode; they just want to get an array of characters and not worry about all these little details. What can FPC do to solve this problem?

Regards,
Ryan Joseph
Re: [fpc-pascal] ShortString still relevant today?
> On Jul 4, 2023, at 10:11 AM, Nikolay Nikolov via fpc-pascal wrote:
>
> ShortString is mainly for compatibility with Turbo Pascal, not for
> performance, IMHO. Although the FPC compiler itself still uses
> ShortString for performance reasons (I think the main advantage is the
> avoidance of the implicit try..finally blocks needed for AnsiStrings).
> It might be interesting to benchmark the compiler with AnsiStrings
> instead of ShortStrings and see if there's a performance difference.
> But even if there is, a compiler is an extreme example. For 99% of
> programs, the performance impact of AnsiString is not an issue. I put
> {$H+} in almost all my new programs. I'd say that in 99% of the
> legitimate use cases, ShortString is used and needed for compatibility
> with legacy code, not for performance. Switching legacy code to {$H+}
> doesn't always work and may need additional fixes. Old code does things
> like S[0] := x instead of SetLength(S, x), etc. It also does uglier
> things, like FillChar() or Move() directly to/from string memory, or
> saves ShortStrings to files as part of a record, etc.

One thing I can think of now is that adding an AnsiString to a record or class makes that type "managed", so it needs extra finalization when going out of scope. Static arrays need to finalize their members too, and the RTL has to do extra work in the list classes to ensure this happens, which bloats the generic container types.

Regards,
Ryan Joseph
Re: [fpc-pascal] Parse unicode scalar
> On Jul 4, 2023, at 11:45 AM, Nikolay Nikolov via fpc-pascal wrote:
>
> But you just don't need to do this in order to tokenize Pascal. The
> beginning and the end of the string literal is the apostrophe, which is
> ASCII. The bear is a sequence of UTF-8 code units (opaque to the
> compiler) that will not be mistaken for an apostrophe or end of line,
> because they will have their high bit set. There's simply no need for a
> Pascal tokenizer to iterate over UTF-8 code points, instead of code
> units.

You know you're right: with properly enclosed patterns you can capture everything inside and it works. You won't know whether you had Unicode in your string or not, though, but that depends on what's being parsed and whether you care (I'm doing a TOML parser). Maybe I can skip that part and just focus on decoding the Unicode scalars.

Regards,
Ryan Joseph
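[Editor's note: for the decoding step itself, here is a minimal, hedged sketch (not an existing RTL or LazUtils routine; the name is made up) that turns one UTF-8 sequence into a Unicode scalar value, using the same first-byte ranges as UTF8CodepointSizeFast from earlier in the thread. It assumes well-formed input.]

```pascal
{ Decodes the code point starting at p; returns its scalar value and sets
  Len to the number of bytes consumed. Assumes well-formed UTF-8: a
  production decoder must also validate the continuation bytes and reject
  overlong forms and surrogate ranges (U+D800..U+DFFF). }
function DecodeUTF8(p: PChar; out Len: Integer): Cardinal;
begin
  case p^ of
    #0..#127:                     { 0xxxxxxx }
      begin Len := 1; Result := Ord(p[0]); end;
    #192..#223:                   { 110xxxxx 10xxxxxx }
      begin Len := 2; Result := (Ord(p[0]) and $1F) shl 6
                              or (Ord(p[1]) and $3F); end;
    #224..#239:                   { 1110xxxx 10xxxxxx 10xxxxxx }
      begin Len := 3; Result := (Ord(p[0]) and $0F) shl 12
                              or (Ord(p[1]) and $3F) shl 6
                              or (Ord(p[2]) and $3F); end;
    #240..#247:                   { 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx }
      begin Len := 4; Result := (Ord(p[0]) and $07) shl 18
                              or (Ord(p[1]) and $3F) shl 12
                              or (Ord(p[2]) and $3F) shl 6
                              or (Ord(p[3]) and $3F); end;
    else                          { stray continuation byte }
      begin Len := 1; Result := $FFFD; end;
  end;
end;
```

For example, the four bytes $F0 $9F $90 $BB decode to $1F43B, the bear from earlier in the thread.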
Re: [fpc-pascal] Parse unicode scalar
On 7/4/23 08:08, Nikolay Nikolov wrote:
> On 7/4/23 07:56, Hairy Pixels via fpc-pascal wrote:
>>> On Jul 4, 2023, at 11:50 AM, Hairy Pixels wrote:
>>>
>>> You know you're right: with properly enclosed patterns you can capture
>>> everything inside and it works. You won't know whether you had Unicode
>>> in your string or not, though, but that depends on what's being parsed
>>> and whether you care (I'm doing a TOML parser).
>>
>> Sorry, I'm still curious even though it's not my current problem :) How
>> can I make this program output the expected results?
>>
>> var
>>   w: WideChar;
>>   a: array of WideChar;
>> begin
>>   for w in 'abc🐻' do
>>     a += [w];
>>   // Outputs 7 instead of 4
>>   writeln(Length(a));
>> end.
>>
>> The user doesn't know about Unicode; they just want to get an array of
>> characters and not worry about all these little details. What can FPC
>> do to solve this problem?
>
> Depends on what you need, but I suppose in this case you want to count
> the number of extended grapheme clusters (a.k.a. "user-perceived
> characters": how many character-like things are displayed on the
> screen). You might be tempted to count the number of Unicode code
> points, but that's not the same, due to the existence of combining
> characters:
>
> https://en.wikipedia.org/wiki/Combining_character
>
> For extended grapheme clusters, there's an iterator in the
> graphemebreakproperty unit. I implemented this for the Unicode KVM and
> FreeVision. There it's needed for figuring out how many character cells
> in the console will be needed to display a certain string. For the
> console, or other GUIs that use fixed-width fonts, there's also the
> East Asian Width property: some characters (East Asian: Chinese,
> Japanese, Korean) take double the space. So, to figure out where to
> move the cursor, you need to take East Asian Width into account as well.
For console apps that use the Unicode KVM video unit, I've introduced two functions for determining the display width of a Unicode string:

function ExtendedGraphemeClusterDisplayWidth(const EGC: UnicodeString): Integer;
{ Returns the number of display columns needed for the given extended grapheme cluster }

function StringDisplayWidth(const S: UnicodeString): Integer;
{ Returns the number of display columns needed for the given string }

Remember, the display width is different from the number of graphemes, due to East Asian double-width characters. And these work with UnicodeString, which is UTF-16, not UTF-8, but Free Pascal can convert between the two.

Nikolay
Re: [fpc-pascal] ShortString still relevant today?
Here is my test unit I'm playing with. It's crude, but can anyone suggest what other things I could test? I'm playing with a string pointer as well, to see how ref counting/finalization plays in. Making your own managed type using management operators is not tested, but I'm sure it will be terrible compared to everything else.

* test_short_string time: 143ms
* test_ansi_string time: 115ms
* test_mem_string time: 115ms
* test_short_string_record time: 165ms
* test_ansi_string_record time: 75ms
* test_mem_string_record time: 47ms
* test_short_string_mutate time: 203ms
* test_ansi_string_mutate time: 181ms

===

{$mode objfpc}{$H+}
program string_test;

uses
  SysUtils, DateUtils;

const
  ITERATIONS = 1000 * 1000;
  TEST_STRING = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit';

type
  TTestProc = procedure;

procedure test_mem_string;

  procedure do_pass(const s: PString; len: Integer);
  var
    c: Char;
    i: Integer;
  begin
    { c is assigned only to force the reads }
    for i := 1 to len do
      c := s^[i];
  end;

var
  s: PString;
  i, len: Integer;
begin
  for i := 0 to ITERATIONS - 1 do
  begin
    len := Length(TEST_STRING);
    { allocate room for the string descriptor, not the characters, and
      zero it so the assignment below doesn't try to release garbage }
    s := GetMem(SizeOf(AnsiString));
    FillChar(s^, SizeOf(AnsiString), 0);
    s^ := TEST_STRING;
    do_pass(s, len);
    s^ := ''; { release the string data before freeing its holder }
    FreeMem(s);
  end;
end;

procedure test_short_string;

  procedure do_pass(const s: ShortString);
  var
    c: Char;
    i: Integer;
  begin
    for i := 1 to Length(s) do
      c := s[i];
  end;

var
  s: ShortString;
  i: Integer;
begin
  for i := 0 to ITERATIONS - 1 do
  begin
    s := TEST_STRING;
    do_pass(s);
  end;
end;

procedure test_ansi_string;

  procedure ansi_string_pass(const s: AnsiString);
  var
    c: Char;
    i: Integer;
  begin
    for i := 1 to Length(s) do
      c := s[i];
  end;

var
  s: AnsiString;
  i: Integer;
begin
  for i := 0 to ITERATIONS - 1 do
  begin
    s := TEST_STRING;
    ansi_string_pass(s);
  end;
end;

procedure test_ansi_string_mutate;
var
  i, j: Integer;
  s1, s2: AnsiString;
begin
  for i := 0 to ITERATIONS - 1 do
  begin
    s1 := TEST_STRING;
    s2 := s1 + IntToStr(i);
    for j := 1 to Length(s2) do
      s2[j] := 'x';
  end;
end;

procedure test_short_string_mutate;
var
  i, j: Integer;
  s1, s2: ShortString;
begin
  for i := 0 to ITERATIONS - 1 do
  begin
    s1 := TEST_STRING;
    s2 := s1 + IntToStr(i);
    for j := 1 to Length(s2) do
      s2[j] := 'x';
  end;
end;

procedure test_short_string_record;
type
  TMyRecord = record
    a: ShortString;
    b: ShortString;
    c: ShortString;
  end;

  function do_pass(rec: TMyRecord): TMyRecord;
  begin
    Result := rec;
  end;

var
  i: Integer;
  r: TMyRecord;
begin
  for i := 0 to ITERATIONS - 1 do
  begin
    r.a := TEST_STRING;
    r.b := TEST_STRING;
    r.c := TEST_STRING;
    do_pass(r);
  end;
end;

procedure test_ansi_string_record;
type
  TMyRecord = record
    a: AnsiString;
    b: AnsiString;
    c: AnsiString;
  end;

  function do_pass(rec: TMyRecord): TMyRecord;
  begin
    Result := rec;
  end;

var
  i: Integer;
  r: TMyRecord;
begin
  for i := 0 to ITERATIONS - 1 do
  begin
    r.a := TEST_STRING;
    r.b := TEST_STRING;
    r.c := TEST_STRING;
    do_pass(r);
  end;
end;

procedure test_mem_string_record;
type
  TMyRecord = record
    a: PString;
    b: PString;
    c: PString;
  end;

  function do_pass(rec: TMyRecord): TMyRecord;
  begin
    Result := rec;
  end;

var
  i: Integer;
  r: TMyRecord;
begin
  for i := 0 to ITERATIONS - 1 do
  begin
    r.a := GetMem(SizeOf(AnsiString));
    r.b := GetMem(SizeOf(AnsiString));
    r.c := GetMem(SizeOf(AnsiString));
    { zero the cells so the assignments don't try to release garbage }
    FillChar(r.a^, SizeOf(AnsiString), 0);
    FillChar(r.b^, SizeOf(AnsiString), 0);
    FillChar(r.c^, SizeOf(AnsiString), 0);
    r.a^ := TEST_STRING;
    r.b^ := TEST_STRING;
    r.c^ := TEST_STRING;
    do_pass(r);
    r.a^ := ''; FreeMem(r.a);
    r.b^ := ''; FreeMem(r.b);
    r.c^ := ''; FreeMem(r.c);
  end;
end;

procedure run_test(name: String; test: TTestProc);
var
  startTime: TDateTime;
begin
  startTime := Now;
  test;
  writeln('* ', name, ' time: ', MilliSecondsBetween(Now, startTime), 'ms');
end;

begin
  run_test('test_short_string', @test_short_string);
  run_test('test_ansi_string', @test_ansi_string);
  run_test('test_mem_string', @test_mem_string);
  run_test('test_short_string_record', @test_short_string_record);
  run_test('test_ansi_string_record', @test_ansi_string_record);
  run_test('test_mem_string_record', @test_mem_string_record);
  run_test('test_short_string_mutate', @test_short_string_mutate);
  run_test('test_ansi_string_mutate', @test_ansi_string_mutate);
end.
Regards,
Ryan Joseph
Re: [fpc-pascal] Parse unicode scalar
On 7/4/23 07:45, Nikolay Nikolov wrote:
> On 7/4/23 07:40, Hairy Pixels via fpc-pascal wrote:
>>> On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal wrote:
>>>
>>> For what grammar? What characters are allowed in a token? For example,
>>> Free Pascal also has a parser/tokenizer, but since Pascal keywords are
>>> ASCII only, it doesn't need to understand Unicode characters, so it
>>> works on the byte level (Pascal's Char type; for UTF-8 files, this
>>> means UTF-8 code units). That's because UTF-8 has two nice properties:
>>>
>>> 1) ASCII characters are encoded as themselves, using bytes in the
>>> range #0..#127.
>>>
>>> 2) Non-ASCII characters always use a sequence of bytes that are all in
>>> the range #128..#255 (they have their highest bit set), so they will
>>> never be misinterpreted as ASCII.
>>>
>>> So the tokenizer just works with UTF-8 like with any other 8-bit code
>>> page.
>>
>> Yes, this works until you reach a non-ASCII character, and then the
>> character index no longer matches the string 1 to 1. For example,
>> consider this was Pascal:
>>
>> i := '🐻';
>>
>> You can advance by index like:
>>
>> Inc(currentIndex);
>> c := text[currentIndex];
>>
>> but once you hit the bear the offset is now wrong, so you can't advance
>> to the next character by doing +1.
>
> But you just don't need to do this in order to tokenize Pascal. The
> beginning and the end of the string literal is the apostrophe, which is
> ASCII. The bear is a sequence of UTF-8 code units (opaque to the
> compiler) that will not be mistaken for an apostrophe, or end of line,
> because they will have their high bit set. There's simply no need for a
> Pascal tokenizer to iterate over UTF-8 code points, instead of code
> units.

Sorry, the last sentence should read: "There's simply no need for a Pascal tokenizer to iterate over Unicode code points, instead of UTF-8 code units." Hope that makes it more clear and accurate.

Nikolay