Re: [fpc-pascal] Parse unicode scalar

Nikolay Nikolov via fpc-pascal Mon, 03 Jul 2023 21:45:40 -0700


On 7/4/23 07:40, Hairy Pixels via fpc-pascal wrote:

On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal 
<fpc-pascal@lists.freepascal.org> wrote:

For what grammar? What characters are allowed in a token? For example, Free 
Pascal also has a parser/tokenizer, but since Pascal keywords are ASCII only, 
it doesn't need to understand Unicode characters, so it works on the byte 
(Pascal's char type) level (for UTF-8 files, this means UTF-8 Unicode code 
units). That's because UTF-8 has two nice properties:

1) ASCII characters are encoded as they are - by using bytes in the range 
#0..#127

2) non-ASCII characters will always use a sequence of bytes that are all in 
the range #128..#255 (they have their highest bit set), so they will never be 
misinterpreted as ASCII.

So, the tokenizer just works with UTF-8 like with any other 8-bit code page.
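
A minimal sketch of property 2, assuming Free Pascal in objfpc mode. The 
helper AllBytesHighBitSet is a hypothetical name made up for this example, 
and the bear (U+1F43B) is written as explicit bytes so that the result does 
not depend on the encoding of the source file:

```pascal
program Utf8HighBit;
{$mode objfpc}{$H+}

{ Hypothetical helper: returns True if every byte (UTF-8 code unit) of S
  is in the range #128..#255, i.e. has its highest bit set. }
function AllBytesHighBitSet(const S: RawByteString): Boolean;
var
  i: Integer;
begin
  Result := True;
  for i := 1 to Length(S) do
    if Ord(S[i]) < 128 then
      Exit(False);
end;

const
  { U+1F43B encoded as UTF-8: four code units, none of them ASCII. }
  Bear = #$F0#$9F#$90#$BB;
begin
  WriteLn('code units: ', Length(Bear));
  WriteLn('all high bit set: ', AllBytesHighBitSet(Bear));
end.
```

None of the four bytes can compare equal to an apostrophe, a space, or any 
other ASCII character the tokenizer looks for.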

Yes, this works until you reach a non-ASCII character, and then the 
character index no longer matches the string 1 to 1. For example, suppose this 
were Pascal:

i := '🐻';

You can advance by index like:

  Inc(currentIndex);
  c := text[currentIndex];

but once you hit the bear the offset is now wrong so you can't advance to the 
next character by doing +1.

But you just don't need to do this in order to tokenize Pascal. The beginning 
and the end of the string literal are marked by the apostrophe, which is 
ASCII. The bear is a sequence of UTF-8 code units (opaque to the compiler) 
that will not be mistaken for an apostrophe or an end of line, because they 
all have their high bit set. There's simply no need for a Pascal tokenizer to 
iterate over UTF-8 code points instead of code units.
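
To illustrate, here is a rough sketch of scanning a string literal one code 
unit at a time. It is not how the FPC scanner is actually written. 
FindLiteralEnd is a hypothetical helper invented for this example, and it 
assumes the usual Pascal rule that a doubled apostrophe inside a literal 
stands for one apostrophe:

```pascal
program ScanLiteral;
{$mode objfpc}{$H+}

{ Hypothetical helper: Start is the index of the opening apostrophe in Src.
  Returns the index of the closing apostrophe, or 0 if the literal is not
  terminated before the end of the line or the end of the input. }
function FindLiteralEnd(const Src: RawByteString; Start: Integer): Integer;
var
  i: Integer;
begin
  Result := 0;
  i := Start + 1;
  while i <= Length(Src) do
  begin
    case Src[i] of
      '''':
        if (i < Length(Src)) and (Src[i + 1] = '''') then
          Inc(i)      { doubled apostrophe: skip the second one too }
        else
          Exit(i);    { closing apostrophe found }
      #10, #13:
        Exit(0);      { end of line inside the literal }
    end;
    { Any other byte, including every byte of the bear, is just skipped. }
    Inc(i);
  end;
end;

const
  { The line  i := '<bear>';  with the bear given as explicit UTF-8 bytes. }
  Line = 'i := '''#$F0#$9F#$90#$BB''';';
begin
  WriteLn(FindLiteralEnd(Line, 6));
end.
```

The loop advances by +1 through the four bytes of the bear without decoding 
them, and still stops on the correct closing apostrophe.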


Nikolay

_______________________________________________
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
