Re: Raw string literals and Unicode escapes

Alex Buckley Wed, 14 Feb 2018 11:47:40 -0800

On 2/13/2018 2:11 PM, John Rose wrote:

On Feb 13, 2018, at 9:58 AM, Alex Buckley <alex.buck...@oracle.com
<mailto:alex.buck...@oracle.com>> wrote:


I suspect the trickiest part of specifying raw string literals will be
the lexer's modal behavior for Unicode escapes. As such, I am going to
put the behavior under the microscope.


For an approach to this see:
http://cr.openjdk.java.net/~jrose/jls/raw-string-pages-v4.pdf

In short:  We define a so-called "preimage" for each token,
which is the unambiguously defined sequence of UTF-16
code points that translate to that token via \u substitution
and line terminator normalization.

For raw strings (only) the preimage of a token is significant.
The backticks of a raw string (both opening and closing)
are required to be their own preimage (no \u0060 allowed).
And the raw string body contents are the preimage of the
string token, not the normal token image.

I think preimage is the trick we need here, and it settles
a number of questions, such as those you raised.
All of the tricky examples you raised are uniformly illegal,
under the preimage rule for raw-string quotes.

I agree that holding on to the preimage of each InputElement (JLS 3.5) is necessary because ` can legitimately appear in some kinds of InputElement as an ordinary InputCharacter (derived from either the RawInputCharacter ` or the UnicodeEscape \u0060):


1.  Comment

    // This Markdown processor treats ` specially.
    /* This Markdown processor treats \u0060 specially. */

2.  Token (and more specifically, StringLiteral)

    "Hi `Bob`"
    "Hi \u0060Bob\u0060"
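
To make the second case concrete, here is a minimal sketch (the class and field names are mine, purely for illustration) showing that today's lexer treats the two spellings of the ordinary string literal identically, because the escape has been translated before tokenization:

```java
public class BacktickEscapeDemo {
    // Spelled with the RawInputCharacter.
    static final String RAW = "Hi `Bob`";
    // Spelled with the UnicodeEscape; the lexer translates it before
    // the StringLiteral token is formed.
    static final String ESCAPED = "Hi \u0060Bob\u0060";

    public static void main(String[] args) {
        // Both literals denote the same sequence of InputCharacters.
        System.out.println(RAW.equals(ESCAPED)); // prints "true"
    }
}
```

Nothing in the token image records which spelling was used, which is why the preimage has to be kept separately.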

Only if the InputElement is a Token, and more specifically a RawStringLiteral, do we need to take the sequence of InputCharacters and LineTerminators that constitute its RawStringBody and replace that sequence with its preimage.
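
As a rough illustration of that bookkeeping (not proposed spec text, and not how javac is structured), the sketch below translates Unicode escapes while remembering the raw text behind each resulting character. The names PreimageSketch, Translated, translate, isOwnPreimage and preimageOf are hypothetical. It assumes the "even number of contiguous backslashes" eligibility rule of JLS 3.3, ignores line terminator normalization, and passes a malformed escape through as ordinary characters instead of rejecting it.

```java
import java.util.ArrayList;
import java.util.List;

public class PreimageSketch {

    /** A translated character together with the raw source text that produced it. */
    public static final class Translated {
        public final char ch;
        public final String preimage;

        Translated(char ch, String preimage) {
            this.ch = ch;
            this.preimage = preimage;
        }
    }

    private static boolean isHexDigit(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    /** Translates Unicode escapes in raw source text, keeping each preimage. */
    public static List<Translated> translate(String raw) {
        List<Translated> out = new ArrayList<>();
        int backslashes = 0; // contiguous raw backslashes immediately before index i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c == '\\' && backslashes % 2 == 0) {
                int j = i + 1;
                while (j < raw.length() && raw.charAt(j) == 'u') j++; // one or more 'u'
                boolean hex = j > i + 1 && j + 4 <= raw.length();
                for (int k = j; hex && k < j + 4; k++) hex = isHexDigit(raw.charAt(k));
                if (hex) {
                    char decoded = (char) Integer.parseInt(raw.substring(j, j + 4), 16);
                    out.add(new Translated(decoded, raw.substring(i, j + 4)));
                    i = j + 4;
                    backslashes = 0;
                    continue;
                }
            }
            out.add(new Translated(c, String.valueOf(c)));
            backslashes = (c == '\\') ? backslashes + 1 : 0;
            i++;
        }
        return out;
    }

    /** True if the character was written literally, i.e. not via an escape. */
    public static boolean isOwnPreimage(Translated t) {
        return t.preimage.length() == 1 && t.preimage.charAt(0) == t.ch;
    }

    /** Concatenates the preimages: the text a RawStringBody would carry. */
    public static String preimageOf(List<Translated> chars) {
        StringBuilder sb = new StringBuilder();
        for (Translated t : chars) sb.append(t.preimage);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Raw source text: backtick, a, an escape that decodes to b, c, backtick.
        List<Translated> chars = translate("`a\\u0062c`");
        List<Translated> body = chars.subList(1, chars.size() - 1);
        System.out.println(isOwnPreimage(chars.get(0)));
        System.out.println(preimageOf(body));
    }
}
```

Under the preimage rule, a delimiter for which isOwnPreimage is false (a backtick spelled \u0060) would not open or close a raw string, and the body would carry preimageOf(body), escapes untranslated, where a normal token carries the translated characters.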

I want to say something about the delimiters of the raw string literal now, but I'll do that in response to Jim's mail.


Alex
