On Feb 13, 2018, at 9:58 AM, Alex Buckley <alex.buck...@oracle.com> wrote:
> I suspect the trickiest part of specifying raw string literals will be the
> lexer's modal behavior for Unicode escapes. As such, I am going to put the
> behavior under the microscope.
For an approach to this see:
In short: We define a so-called "preimage" for each token,
which is the unambiguously defined sequence of UTF-16
code units that translates to that token via \u substitution
and line terminator normalization.
For raw strings (only) the preimage of a token is significant.
The backticks of a raw string (both opening and closing)
are required to be their own preimage (no \u0060 allowed).
And the raw string body contents are the preimage of the
string token, not the normal token image.
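
To make the preimage idea concrete, here is a minimal Java sketch of it, not the actual javac lexer. All names in it (PreimageSketch, Unit, translate, isRawStringDelimiter, rawBody) are hypothetical and made up for illustration. It assumes the usual rule that a backslash is eligible to start a Unicode escape only when preceded by an even number of contiguous backslashes. For brevity it omits line terminator normalization and treats a malformed escape as literal text, where a real lexer would report an error.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: pairs each translated UTF-16 code unit with its preimage.
final class PreimageSketch {

    /** One translated code unit plus the exact source text it came from. */
    static final class Unit {
        final char translated;
        final String preimage;

        Unit(char translated, String preimage) {
            this.translated = translated;
            this.preimage = preimage;
        }

        /** True when the unit was written literally, not via a Unicode escape. */
        boolean isOwnPreimage() {
            return preimage.length() == 1 && preimage.charAt(0) == translated;
        }
    }

    /** Translates Unicode escapes while remembering the preimage of every unit. */
    static List<Unit> translate(String src) {
        List<Unit> out = new ArrayList<>();
        int backslashes = 0; // length of the run of backslashes just before position i
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '\\' && backslashes % 2 == 0) {
                int j = i + 1;
                while (j < src.length() && src.charAt(j) == 'u') {
                    j++; // one or more 'u' markers are allowed
                }
                if (j > i + 1 && j + 4 <= src.length() && isHex(src, j)) {
                    char value = (char) Integer.parseInt(src.substring(j, j + 4), 16);
                    out.add(new Unit(value, src.substring(i, j + 4)));
                    i = j + 4;
                    backslashes = 0; // simplification: an escape ends the run
                    continue;
                }
            }
            backslashes = (c == '\\') ? backslashes + 1 : 0;
            out.add(new Unit(c, String.valueOf(c)));
            i++;
        }
        return out;
    }

    private static boolean isHex(String s, int from) {
        for (int k = from; k < from + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    /** Preimage rule: a raw-string delimiter must be a literally written backtick. */
    static boolean isRawStringDelimiter(Unit u) {
        return u.translated == '`' && u.isOwnPreimage();
    }

    /** The raw string body is the concatenated preimages, not the translated units. */
    static String rawBody(List<Unit> units, int from, int to) {
        StringBuilder sb = new StringBuilder();
        for (Unit u : units.subList(from, to)) {
            sb.append(u.preimage);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Source text: backtick, 'a', an escape for 'A', 'b', backtick.
        List<Unit> units = translate("`a\\u0041b`");
        System.out.println(isRawStringDelimiter(units.get(0)));
        System.out.println(rawBody(units, 1, units.size() - 1));

        // An escaped backtick translates to a backtick but is not its own preimage.
        System.out.println(isRawStringDelimiter(translate("\\u0060x`").get(0)));
    }
}
```

In this sketch an ordinary token would be built from the translated units, while a raw string body is rebuilt from the preimage fields, so an escape inside the body survives as the six characters that were typed. An escaped backtick fails isRawStringDelimiter, which is how the sketch models "no \u0060 allowed".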
I think preimage is the trick we need here, and it settles
a number of questions, such as those you raised.
All of the tricky examples you raised are uniformly illegal
under the preimage rule for raw-string quotes.