On 7/4/23 07:40, Hairy Pixels via fpc-pascal wrote:

On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal 
<fpc-pascal@lists.freepascal.org> wrote:

For what grammar? What characters are allowed in a token? For example, Free 
Pascal also has a parser/tokenizer, but since Pascal keywords are ASCII only, 
it doesn't need to understand Unicode characters, so it works on the byte 
(Pascal's char type) level (for UTF-8 files, this means UTF-8 Unicode code 
units). That's because UTF-8 has two nice properties:

1) ASCII characters are encoded as they are, using bytes in the range 
#0..#127

2) Non-ASCII characters always use a sequence of bytes that are all in 
the range #128..#255 (they have their highest bit set), so they will never be 
misinterpreted as ASCII.
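
To illustrate the two properties, here is a minimal sketch that dumps the bytes of a small UTF-8 string and classifies each one by its high bit. The names Utf8Bytes, Sample and IsAsciiByte are made up for this example, and the non-ASCII characters are written as explicit byte escapes so the result does not depend on the source file encoding.

```pascal
program Utf8Bytes;
{$mode objfpc}{$H+}

const
  // 'a', then U+00E9 (2 bytes), then U+1F43B (4 bytes), as raw UTF-8 bytes
  Sample = 'a'#$C3#$A9#$F0#$9F#$90#$BB;

// True when the byte is a complete ASCII character (high bit clear).
function IsAsciiByte(b: Byte): Boolean;
begin
  Result := b < $80;
end;

var
  s: string;
  i: Integer;
begin
  s := Sample;
  for i := 1 to Length(s) do
    if IsAsciiByte(Ord(s[i])) then
      WriteLn(i, ': $', HexStr(Ord(s[i]), 2), ' ASCII')
    else
      WriteLn(i, ': $', HexStr(Ord(s[i]), 2), ' part of a multi-byte sequence');
end.
```

Only the first byte should be reported as ASCII; every byte belonging to the two non-ASCII characters has its high bit set.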

So, the tokenizer just works with UTF-8 like with any other 8-bit code page.

Yes, this works until you reach a non-ASCII character, and then the 
byte index no longer maps one-to-one to characters. For example, consider this 
Pascal:

i := '🐻';

You can advance by index like:

  Inc(currentIndex);
  c := text[currentIndex];

But once you hit the bear, the offset is wrong, so you can't advance to the 
next character by adding 1.
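
As a rough sketch of the mismatch being described: the bear occupies four bytes, so stepping to the next code point means advancing by the sequence length encoded in the lead byte rather than by 1. Utf8SeqLen and CountCodePoints are hypothetical helpers written for this example, not routines from the FPC RTL, and the sketch assumes well-formed UTF-8 input.

```pascal
program Utf8Step;
{$mode objfpc}{$H+}

// Length in bytes of the UTF-8 sequence that starts with this lead byte.
function Utf8SeqLen(lead: Byte): Integer;
begin
  if lead < $80 then Result := 1                  // ASCII
  else if (lead and $E0) = $C0 then Result := 2
  else if (lead and $F0) = $E0 then Result := 3
  else if (lead and $F8) = $F0 then Result := 4
  else Result := 1;  // continuation or invalid byte: step one byte to resync
end;

// Counts code points by jumping from lead byte to lead byte.
function CountCodePoints(const s: string): Integer;
var
  i: Integer;
begin
  Result := 0;
  i := 1;
  while i <= Length(s) do
  begin
    Inc(i, Utf8SeqLen(Ord(s[i])));
    Inc(Result);
  end;
end;

const
  // i := '<bear>'; with the bear (U+1F43B) written as raw UTF-8 bytes
  Source = 'i := '''#$F0#$9F#$90#$BB''';';

begin
  WriteLn('bytes:       ', Length(Source));
  WriteLn('code points: ', CountCodePoints(Source));
end.
```

The byte count and the code point count differ by three here, which is exactly the offset drift described above.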

But you just don't need to do this in order to tokenize Pascal. The beginning 
and the end of the string literal are marked by the apostrophe, which is ASCII. 
The bear is a sequence of UTF-8 code units (opaque to the compiler) that will 
not be mistaken for an apostrophe or an end of line, because they all have 
their high bit set. There's simply no need for a Pascal tokenizer to iterate 
over UTF-8 code points instead of code units.
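
A minimal sketch of this idea, not the actual FPC scanner: the hypothetical function FindLiteralEnd below walks a string literal one byte at a time and looks only for ASCII delimiters. The bear's bytes are skipped like any other content byte, and no UTF-8 decoding takes place.

```pascal
program ByteTokenizer;
{$mode objfpc}{$H+}

// Given the index of an opening apostrophe, returns the index of the
// closing apostrophe, or 0 if the literal is not terminated on this line.
function FindLiteralEnd(const src: string; start: Integer): Integer;
var
  i: Integer;
begin
  Result := 0;
  i := start + 1;
  while i <= Length(src) do
  begin
    if src[i] = '''' then
    begin
      if (i < Length(src)) and (src[i + 1] = '''') then
        Inc(i, 2)        // doubled apostrophe: an escaped quote inside the literal
      else
        Exit(i);         // closing apostrophe found
    end
    else if src[i] in [#10, #13] then
      Exit(0)            // end of line before the closing apostrophe
    else
      Inc(i);            // any other byte, including bytes >= #128, is content
  end;
end;

const
  // i := '<bear>'; with the bear (U+1F43B) written as raw UTF-8 bytes
  Source = 'i := '''#$F0#$9F#$90#$BB''';';

var
  first, last: Integer;
begin
  first := Pos('''', Source);
  last := FindLiteralEnd(Source, first);
  WriteLn('literal spans bytes ', first, '..', last);
end.
```

The scan advances with a plain +1 on every content byte and still finds the right closing apostrophe, because none of the four bytes of the bear can compare equal to an apostrophe or a line break.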

Nikolay

_______________________________________________
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
