On 2/13/2018 2:11 PM, John Rose wrote:
> On Feb 13, 2018, at 9:58 AM, Alex Buckley <alex.buck...@oracle.com> wrote:
>
>> I suspect the trickiest part of specifying raw string literals will be
>> the lexer's modal behavior for Unicode escapes. As such, I am going to
>> put the behavior under the microscope.
>
> For an approach to this see:
>
> In short:  We define a so-called "preimage" for each token,
> which is the unambiguously defined sequence of UTF-16
> code points that translate to that token via \u substitution
> and line terminator normalization.
>
> For raw strings (only) the preimage of a token is significant.
> The backticks of a raw string (both opening and closing)
> are required to be their own preimage (no \u0060 allowed).
> And the raw string body contents are the preimage of the
> string token, not the normal token image.
>
> I think preimage is the trick we need here, and it settles
> a number of questions, such as those you raised.
> All of the tricky examples you raised are uniformly illegal,
> under the preimage rule for raw-string quotes.

I agree that holding on to the preimage of each InputElement (JLS 3.5) is necessary because ` can legitimately appear in some kinds of InputElement as an ordinary InputCharacter (derived from either the RawInputCharacter ` or the UnicodeEscape \u0060):

1.  Comment

    // This Markdown processor treats ` specially.
    /* This Markdown processor treats \u0060 specially. */

2.  Token (and more specifically, StringLiteral)

    "Hi `Bob`"
    "Hi \u0060Bob\u0060"

Only if the InputElement is a Token, and more specifically a RawStringLiteral, do we need to take the sequence of InputCharacters and LineTerminators that constitute its RawStringBody and replace that sequence with its preimage.
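
To make the bookkeeping concrete, here is a minimal Java sketch of carrying a preimage alongside each translated character. It is only an illustration, not proposed spec text and not how javac works. The names PreimageSketch, Unit, translate, isLiteralBacktick and preimageOf are hypothetical. The sketch assumes the usual JLS 3.3 eligibility rule (a backslash preceded by an even number of backslashes), ignores line terminator normalization, and passes malformed escapes through instead of rejecting them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: pairs each translated character with the
// raw source text (its preimage) that produced it.
final class PreimageSketch {

    /** One translated character plus the raw text it was derived from. */
    static final class Unit {
        final char ch;
        final String preimage;

        Unit(char ch, String preimage) {
            this.ch = ch;
            this.preimage = preimage;
        }
    }

    /** Translates Unicode escapes and records the preimage of every character. */
    static List<Unit> translate(String raw) {
        List<Unit> out = new ArrayList<>();
        int backslashRun = 0; // contiguous backslashes immediately before position i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c == '\\' && backslashRun % 2 == 0) {
                int j = i + 1;
                while (j < raw.length() && raw.charAt(j) == 'u') {
                    j++; // one or more 'u' markers are allowed
                }
                if (j > i + 1 && j + 4 <= raw.length() && isHex4(raw, j)) {
                    char decoded = (char) Integer.parseInt(raw.substring(j, j + 4), 16);
                    out.add(new Unit(decoded, raw.substring(i, j + 4)));
                    i = j + 4;
                    backslashRun = 0; // simplification: an escaped backslash is not counted
                    continue;
                }
            }
            backslashRun = (c == '\\') ? backslashRun + 1 : 0;
            out.add(new Unit(c, String.valueOf(c)));
            i++;
        }
        return out;
    }

    /** True if the four characters starting at 'start' are ASCII hex digits. */
    private static boolean isHex4(String s, int start) {
        for (int k = start; k < start + 4; k++) {
            char h = s.charAt(k);
            boolean hex = (h >= '0' && h <= '9') || (h >= 'a' && h <= 'f') || (h >= 'A' && h <= 'F');
            if (!hex) {
                return false;
            }
        }
        return true;
    }

    /** A backtick may delimit a raw string only if it is its own preimage. */
    static boolean isLiteralBacktick(Unit u) {
        return u.ch == '`' && u.preimage.equals("`");
    }

    /** Joins the preimages; this is what a raw string body would keep. */
    static String preimageOf(List<Unit> units) {
        StringBuilder sb = new StringBuilder();
        for (Unit u : units) {
            sb.append(u.preimage);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Raw text: a literal backtick followed by an escaped one.
        for (Unit u : translate("`\\u0060")) {
            System.out.println(u.preimage + " -> literal backtick? " + isLiteralBacktick(u));
        }
    }
}
```

For an ordinary StringLiteral or Comment, a consumer would read only the translated character and never look at the preimage; for a RawStringLiteral it would test the delimiters with isLiteralBacktick and take preimageOf for the body.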

I want to say something about the delimiters of the raw string literal now, but I'll do that in response to Jim's mail.

