On Mon, 3 Jul 2023 15:27:10 +0700 Hairy Pixels via fpc-pascal <fpc-pascal@lists.freepascal.org> wrote:
> [...]
> I was just curious how ChatGPTs implementation compared to other
> programmer.

Apparently the quality is often terrible. But it can be useful.

> What I'm really trying to do is improve a parser so it can read UTF-8
> files and decode unicode literals in the grammar.

First of all: Is it valid UTF-8 or do you have to check for broken or
malicious sequences?

> Right now I've just read the file into an AnsiString and indexing
> assuming a fixed character size, which breaks of course if non-1 byte
> characters exist

Sounds like UTF8CodepointToUnicode in unit LazUTF8 could be useful:

function UTF8CodepointToUnicode(p: PChar; out CodepointLen: integer): Cardinal;

> I also need to know if I come across something like \u1F496 I need
> to convert that to a unicode character.

I guess you know how to convert a hex to a dword. Then

function UnicodeToUTF8(CodePoint: cardinal): string; // UTF32 to UTF8
function UnicodeToUTF8(CodePoint: cardinal; Buf: PChar): integer; // UTF32 to UTF8

Mattias
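
A rough sketch of how the two LazUTF8 calls mentioned above might fit together. It assumes Lazarus' LazUtils package is on the unit search path, that the input is already valid UTF-8 (no checking for broken or malicious sequences), and that the hex digits after \u have already been cut out of the source text. The program and variable names are made up for illustration.

```pascal
program Utf8Sketch;

{$mode objfpc}{$H+}

uses
  SysUtils,
  LazUTF8; // assumption: LazUtils (ships with Lazarus) is on the unit path

var
  S, HexDigits, Encoded: string;
  P: PChar;
  Len, I: Integer;
  CodePoint: Cardinal;
begin
  // Part 1: walk a UTF-8 string codepoint by codepoint instead of byte by byte.
  // The bytes are 'a', U+00E9 (C3 A9) and U+1F496 (F0 9F 92 96).
  S := 'a' + #$C3#$A9 + #$F0#$9F#$92#$96;
  P := PChar(S);
  while P^ <> #0 do
  begin
    CodePoint := UTF8CodepointToUnicode(P, Len);
    if Len <= 0 then
      Break; // defensive guard so the loop cannot stall
    WriteLn('U+', IntToHex(Int64(CodePoint), 4), ' uses ', Len, ' byte(s)');
    Inc(P, Len); // advance by the length of the codepoint just decoded
  end;

  // Part 2: turn the hex digits of an escape such as \u1F496 into UTF-8.
  HexDigits := '1F496'; // hypothetical: digits already extracted by the parser
  CodePoint := Cardinal(StrToInt('$' + HexDigits)); // '$' marks hex in Pascal
  Encoded := UnicodeToUTF8(CodePoint);

  // Print the encoded bytes in hex rather than relying on console encoding.
  Write('UTF-8 bytes:');
  for I := 1 to Length(Encoded) do
    Write(' ', IntToHex(Ord(Encoded[I]), 2));
  WriteLn;
end.
```

StrToInt raises EConvertError on malformed digits, so a real parser would probably want TryStrToInt or its own hex reader, plus a range check that the value is a valid codepoint before encoding it.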