Re: [fpc-pascal] Parse unicode scalar

Mattias Gaertner via fpc-pascal Mon, 03 Jul 2023 02:30:09 -0700

On Mon, 3 Jul 2023 15:27:10 +0700
Hairy Pixels via fpc-pascal <fpc-pascal@lists.freepascal.org> wrote:


>[...]
> I was just curious how ChatGPT's implementation compared to other
> programmers'.

Apparently the quality is often terrible. But it can be useful.

 
> What I'm really trying to do is improve a parser so it can read UTF-8
> files and decode unicode literals in the grammar.

First of all: Is it valid UTF-8 or do you have to check for broken or
malicious sequences?

 
> Right now I just read the file into an AnsiString and index it
> assuming a fixed character size, which of course breaks if multi-byte
> characters exist

Sounds like UTF8CodepointToUnicode in unit LazUTF8 could be useful:

function UTF8CodepointToUnicode(p: PChar; out CodepointLen: integer): Cardinal;
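
A minimal sketch of how that function could drive a scan over a UTF-8
buffer, advancing by the reported length instead of by one byte. It assumes
LazUtils (which ships with Lazarus, not with the FPC RTL) is on the unit
path. DumpCodepoints is a hypothetical helper name, and the code does no
validation of broken sequences; it simply trusts the length it gets back.

```pascal
program IterateCodepoints;
{$mode objfpc}{$H+}
uses
  SysUtils,
  LazUTF8; // from Lazarus' LazUtils package (assumed to be available)

// Hypothetical helper: prints every codepoint in S and returns how many
// codepoints were seen. S is assumed to hold UTF-8 encoded bytes.
function DumpCodepoints(const S: string): Integer;
var
  P, PEnd: PChar;
  Len: integer;
  CP: Cardinal;
begin
  Result := 0;
  if S = '' then
    Exit;
  P := PChar(S);
  PEnd := P + Length(S);
  while P < PEnd do
  begin
    CP := UTF8CodepointToUnicode(P, Len);
    if Len <= 0 then
      Break; // defensive: never loop without advancing
    WriteLn('U+', IntToHex(Int64(CP), 4), ' (', Len, ' byte(s))');
    Inc(P, Len); // step over the whole encoded sequence
    Inc(Result);
  end;
end;

begin
  // 'a', then U+00E4 (2 bytes), then U+1F496 (4 bytes), given as raw bytes
  DumpCodepoints('a' + #$C3#$A4 + #$F0#$9F#$92#$96);
end.
```

The parser position then stays a byte offset into the AnsiString; only the
step size changes from 1 to the length of the current codepoint.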

 
> I also need to know: if I come across something like \u1F496, I need
> to convert that to a Unicode character.

I guess you know how to convert a hex string to a DWord. Then use one of:

function UnicodeToUTF8(CodePoint: cardinal): string; // UTF32 to UTF8
function UnicodeToUTF8(CodePoint: cardinal; Buf: PChar): integer; // UTF32 to UTF8
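
As a rough sketch of the two steps together: read the hex digits after the
"\u" into a Cardinal, reject values that are not Unicode scalar values, then
hand the result to UnicodeToUTF8. ReadHexScalar and the limit of six hex
digits are assumptions made for illustration; the actual grammar may fix the
digit count or use a delimiter. LazUtils is again assumed to be on the unit
path.

```pascal
program DecodeEscape;
{$mode objfpc}{$H+}
uses
  SysUtils,
  LazUTF8; // from Lazarus' LazUtils package (assumed to be available)

// Hypothetical helper: reads up to MaxDigits hex digits starting at
// S[Index] and advances Index past them. Returns False if no digit was
// found or the value is not a Unicode scalar value.
function ReadHexScalar(const S: string; var Index: Integer;
  out CodePoint: Cardinal): Boolean;
const
  MaxDigits = 6; // assumption: enough for $10FFFF
var
  Digits: Integer;
  C: Char;
begin
  CodePoint := 0;
  Digits := 0;
  while (Index <= Length(S)) and (Digits < MaxDigits) do
  begin
    C := S[Index];
    case C of
      '0'..'9': CodePoint := CodePoint * 16 + Cardinal(Ord(C) - Ord('0'));
      'a'..'f': CodePoint := CodePoint * 16 + Cardinal(Ord(C) - Ord('a') + 10);
      'A'..'F': CodePoint := CodePoint * 16 + Cardinal(Ord(C) - Ord('A') + 10);
    else
      Break; // first non-hex character ends the escape
    end;
    Inc(Index);
    Inc(Digits);
  end;
  // Scalar values are 0..$10FFFF without the surrogate range $D800..$DFFF
  Result := (Digits > 0) and (CodePoint <= $10FFFF)
    and not ((CodePoint >= $D800) and (CodePoint <= $DFFF));
end;

var
  Src, Encoded: string;
  I: Integer;
  CP: Cardinal;
begin
  Src := '\u1F496';
  I := 3; // skip the '\u' prefix
  if ReadHexScalar(Src, I, CP) then
  begin
    Encoded := UnicodeToUTF8(CP); // UTF-32 value to UTF-8 bytes
    WriteLn('U+', IntToHex(Int64(CP), 4), ' -> ', Length(Encoded),
      ' UTF-8 byte(s)');
  end
  else
    WriteLn('invalid escape');
end.
```

The surrogate and range check is there because a \u escape can name any
number, while only scalar values have a valid UTF-8 encoding.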

Mattias