Re: Raw string literals and Unicode escapes

Jim Laskey Tue, 13 Feb 2018 14:20:07 -0800

10a. String s = `abc`;
10b. String s = \u0060abc`;

As it stands, both are legal. This decision has mostly been taken away from us 
because the lookahead for the previous token has already "consumed" the 
character. There is little hope of finding out which form the backtick was 
derived from. (That is not technically true in javac, since we can sift back 
through the input buffer; other tools may differ.) I'm going to ignore this 
remark in a second.


Choice: do we turn off escape processing on the first open backtick or the last 
open backtick? It doesn’t really matter as long as we do it before consuming 
the first non-backtick character.

Choice: do we turn on escape processing on the first close backtick or the last 
close backtick? It doesn’t matter as long as we do it before consuming the next 
non-backtick character. If we have an aborted close sequence (too few or too 
many backticks), then we have to turn escape processing off again.

What about embedding \u0060 in a raw string? If we treat it the same as a 
backtick, then the user is limited in the ways to express untranslated escapes. 
Note: We can always convert manually in the scanner by looking ahead for ‘\’, 
‘u’, ‘0’, ‘0’, ‘6’, ‘0’.
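
A minimal sketch of that manual lookahead, in Java. The class and method names 
are hypothetical and this is not javac's scanner code. It matches only the 
exact six-character spelling listed above; it ignores that the language also 
accepts repeated 'u' characters in an escape (e.g. \uu0060).

```java
class BacktickEscapeLookahead {
    // Hypothetical helper: returns true if the six characters starting at
    // pos are backslash, 'u', '0', '0', '6', '0', i.e. the untranslated
    // Unicode escape that would otherwise become a backtick.
    static boolean isEscapedBacktick(CharSequence src, int pos) {
        // The doubled backslash keeps the Java compiler from translating
        // this literal itself; at run time the string is six characters.
        String escape = "\\u0060";
        if (pos < 0 || pos + escape.length() > src.length()) {
            return false;
        }
        for (int i = 0; i < escape.length(); i++) {
            if (src.charAt(pos + i) != escape.charAt(i)) {
                return false;
            }
        }
        return true;
    }
}
```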

That all said, I think we should not allow \u0060 to represent a backtick in a 
raw string literal, ever. It complicates things unnecessarily and limits what 
the user can embed in the raw string.

So, change the scanner to

A) Peek back to make sure the first open backtick was exactly a backtick.
B) Turn off Unicode escapes immediately so that only backtick characters can be 
part of the delimiter.
C) Turn on Unicode escapes only after a valid closing delimiter is encountered.

Based on this, all your examples are illegal.
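
A rough sketch of steps A to C, in Java, working directly on the untranslated 
source text. The class name, method name and error messages are assumptions 
made for illustration; this is not the javac scanner. Because it never 
translates escapes, the "peek back" in A reduces to checking that the character 
at the start position is a literal backtick.

```java
class RawStringScanSketch {
    // Hypothetical helper: scans a raw string literal that starts at
    // src[start] and returns its body. src is the untranslated source text.
    static String scanRawString(String src, int start) {
        // A) The first opening character must be exactly a backtick, not a
        //    Unicode escape that would translate to one.
        if (start < 0 || start >= src.length() || src.charAt(start) != '`') {
            throw new IllegalArgumentException(
                "raw string must open with a literal backtick");
        }
        // B) Escapes are off from here on, so only literal backticks are
        //    counted as part of the opening delimiter.
        int open = start;
        while (open < src.length() && src.charAt(open) == '`') {
            open++;
        }
        int delimiterLength = open - start;

        int i = open;
        while (i < src.length()) {
            if (src.charAt(i) != '`') {
                i++;
                continue;
            }
            int runStart = i;
            while (i < src.length() && src.charAt(i) == '`') {
                i++;
            }
            if (i - runStart == delimiterLength) {
                // C) Valid closing delimiter: a real scanner would turn
                //    Unicode escape processing back on at this point.
                return src.substring(open, runStart);
            }
            // Too few or too many backticks: the run is part of the body.
        }
        throw new IllegalArgumentException("unterminated raw string literal");
    }
}
```

Under these assumptions 10a scans to the body abc, while 10b is rejected at 
step A because its first character is a backslash rather than a backtick.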

— Jim



> On Feb 13, 2018, at 1:58 PM, Alex Buckley <alex.buck...@oracle.com> wrote:
> 
> I suspect the trickiest part of specifying raw string literals will be the 
> lexer's modal behavior for Unicode escapes. As such, I am going to put the 
> behavior under the microscope. Here is what the JEP has to say:
> 
> -----
> Unicode escapes, in the form \uxxxx, are processed as part of character input 
> prior to interpretation by the lexer. To support the raw string literal as-is 
> requirement, Unicode escape processing is disabled when the lexer encounters 
> an opening backtick and reenabled when encountering a closing backtick.
> -----
> 
> I would like to assume that if the lexer comes across the six tokens \ u 0 0 
> 6 0  then it should interpret them as a Unicode escape representing a 
> backtick _and then continue as if consuming the tokens of a raw string 
> literal_. However, the mention of _an_ opening backtick and _a_ closing 
> backtick gave me pause, given that repeated backticks can serve as the 
> opening delimiter and the closing delimiter. For absolute clarity, let's 
> write out examples to confirm intent: (Jim, please confirm or deny as you see 
> fit!)
> 
> 1.  String s = \u0060`;
> 
> Illegal. The RHS is lexed as ``;   which is disallowed by the grammar.
> 
> 2.  String s = \u0060Hello\u0060;
> 
> Illegal. The RHS is lexed as `Hello\u0060;   and so on for the rest of the 
> compilation unit -- the six tokens \ u 0 0 6 0 are not treated as a Unicode 
> escape since we're lexing a raw string literal. And without a closing 
> delimiter before the end of the compilation unit, a compile-time error occurs.
> 
> 3a.  String s = \u0060Hello`;
> 
> Legal. The RHS is lexed as `Hello`;   which is well formed.
> 
> 3b.  String s = \u0060\u0060Hello`;
> 
> Depends! If you take the JEP literally, then just the Unicode escape which 
> serves as the first opening backtick ("_an_ opening backtick") is enough to 
> enter raw-string mode. That makes the code legal: the RHS is lexed as 
> `\u0060Hello`;   which is well formed. On the other hand, you might think 
> that we shouldn't enter raw-string mode until the lexer in traditional mode 
> has lexed the opening delimiter fully (i.e. ALL the opening backticks). Then, 
> the code in 3b is illegal, because the opening delimiter (``) and the closing 
> delimiter (`) are not symmetric.
> 
> I think we should take the JEP literally, so that 3b is legal. And then, some 
> more examples:
> 
> 4a.  String s = \u0060`Hello``;
> 
> Legal. The RHS is lexed as ``Hello``;   which is well formed.
> 
> 4b.  String s = \u0060\u0060Hello``;
> 
> Illegal. The RHS is lexed as `\u0060Hello``;   which is disallowed by the 
> grammar. A raw string literal containing 11 tokens is immediately followed by 
> a ` token and a ; token which are not expected.
> 
> 4c.  String s = \u0060\u0060Hello`\u0060;
> 
> Depends! If you take the JEP literally, where _a_ closing backtick is enough 
> to re-enable Unicode escape processing, then the RHS is lexed as 
> `\u0060Hello``;  which is illegal per 4b. On the other hand, if you think 
> that we shouldn't re-enter traditional mode until the lexer in raw-string 
> mode has lexed the closing delimiter fully (i.e. ALL the closing backticks), 
> then presumably you think analogously about the opening delimiter, so the RHS 
> would be lexed as ``Hello`\u0060;   which is illegal per 2 (no closing 
> delimiter `` before the end of the compilation unit).
> 
> 5.  String s = \u0060`Hello`\u0060;
> 
> I put this here because it looks nice. It hits the same issues as 3b and 4c.
> 
> Alex
