Re: Raw string literals and Unicode escapes

Maurizio Cimadamore Tue, 27 Feb 2018 02:56:57 -0800


On 27/02/18 08:16, fo...@univ-mlv.fr wrote:

Hi John,
see below.

----- Original Message -----

From: "John Rose" <john.r.r...@oracle.com>
To: "Remi Forax" <fo...@univ-mlv.fr>
Cc: "amber-spec-experts" <amber-spec-experts@openjdk.java.net>
Sent: Monday, 26 February 2018 21:17:13
Subject: Re: Raw string literals and Unicode escapes
On Feb 26, 2018, at 10:43 AM, Alex Buckley <alex.buck...@oracle.com> wrote:

On 2/25/2018 4:19 AM, Remi Forax wrote:

I'm late to the game, but why not use the same system as Perl, PHP, and
Ruby to solve the Lts [1], i.e.
you have a sequence that says this is the start of a raw string (%Q,
qq, m), then a character (from a predefined list), the raw string, and at
the end of the raw string the same character as at the beginning (or its
mirror).

For example, with 'raw' as the prefix for a raw string:
raw`this is a raw string`
raw'this is another raw string'
raw[yet another raw string]

See "Choice of Delimiters" in the "Alternatives" section of the JEP.

The JEP doesn't clearly call out the goal of *no* escapes in the bulk
of the raw string, but that requirement (which we have adopted)
affects the choice of quotes in a decisive manner.  Let me try to
lay out the "string physics" that underlie this decision.

*Any* single-character end-quote will have a significant probability
of showing up inside the bulk of a (randomly selected) raw string.

How significant?  Well, let's say conservatively that raw strings
can have all possible characters, but the end-quote sequence
only shows up one out of a hundred times, per character position,
in raw strings.  If you are using a series of ten-character raw
strings (to say nothing of bigger ones), you have about a 10%
chance for any given raw string to contain an inconvenient
end-quote.

That percentage is significant, especially given that in some
cases strings will be longer and quote characters will be more
common, both factors increasing the failure rate beyond 10%.
But even a 0.1% failure rate is noticeable to users, making a
feature feel unreliable.

This generalizes to any fixed multi-character end-quote, with a
reduction of probability exponential in the length of the end-quote,
but still with a non-zero probability, of occurring in the bulk of
a randomly selected string.  A two-character end-quote might
have a probability of 10^-4, and that means you have a more
modest but still significant chance of failure of 10% across a
suite of 100 random 10-character strings, or for one random
1000-character string.
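
The arithmetic in the paragraphs above can be checked with a minimal sketch. It assumes, as the message does, that each position in a string independently matches the end-quote with a fixed probability; the class and method names are hypothetical, chosen only for this illustration.

```java
class EndQuoteOdds {
    // Probability that at least one of `positions` independent positions
    // matches the end-quote, when each one matches with probability p.
    static double collisionProbability(double p, long positions) {
        return 1.0 - Math.pow(1.0 - p, positions);
    }

    public static void main(String[] args) {
        // Single-character end-quote, p = 1/100, one 10-character string.
        System.out.println(collisionProbability(1e-2, 10));
        // Two-character end-quote, p = 10^-4, one 1000-character string
        // (999 possible starting positions).
        System.out.println(collisionProbability(1e-4, 999));
        // Two-character end-quote, 100 strings of 10 characters
        // (9 possible starting positions each).
        System.out.println(collisionProbability(1e-4, 100 * 9));
    }
}
```

Under these assumptions all three figures land in the 8% to 10% range, so "about 10%" is the linear approximation (probability times number of positions) rather than an exact value.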

Any *finite choice* of end-quotes has the same problem, with
a non-zero probability that decreases (but does not vanish)
with the number of available end-quotes.  The only way to
break out of the box is to allow the user an unlimited range
of successively "stronger" end-quotes (i.e., less likely ones).
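
To make the finite-choice limitation concrete, here is a small sketch. The three candidate end-quotes are taken from the raw`...`, raw'...' and raw[...] examples earlier in the thread; the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Optional;

class FiniteDelimiters {
    // Returns the first candidate end-quote that does not occur in the body,
    // or empty if every candidate occurs (the body cannot be fenced).
    static Optional<String> pick(List<String> candidates, String body) {
        return candidates.stream()
                .filter(c -> !body.contains(c))
                .findFirst();
    }

    public static void main(String[] args) {
        List<String> candidates = List.of("`", "'", "]");
        // Contains ' and ] but no backtick, so the backtick still works.
        System.out.println(pick(candidates, "it's a[0]"));
        // Contains all three candidates, so no end-quote is left.
        System.out.println(pick(candidates, "`it's` a[0]"));
    }
}
```

Adding more candidates to the list makes the empty result rarer, but a body that mentions every candidate can always be constructed.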

(Randomly selected raw strings are easy to model, although
the numbers used above are an approximation to a binomial
distribution.  In fact, though, strings which show up non-randomly
in real code are *more* likely to mention end-quotes, since their
contents are somehow correlated to the enclosing language.)

You can easily demonstrate this issue by nesting Java code
which uses raw quotes inside of a containing raw quote.  An
easy first test of a proposed quoting mechanism is, "will it
nest?"  If not, then the quoting mechanism does not meet
a key requirement for raw quotes.

This key requirement is unconstrained pasting *without* fixups
(escape sequences embedded in the bulk of the quote).
Anything else, with some epsilon probability of requiring escapes,
is not truly raw, just "mostly raw".

In the case you propose, Remi, the probability of having an
un-quotable bulk string is quite high, since all of the end-quotes
are single characters.

Only a convention with an end-quote of arbitrary length is strong
enough to "fence in" arbitrary raw strings.  The simplest possible
such convention is to allow replication of a single character to
serve as the end-quote.  This decision toward simplicity
influences other features in Java raw strings, including the
decision to use a new character and to disallow certain
edge cases, notably null strings.
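
A replicated-character convention can be sketched as follows. This is only an illustration under stated assumptions, not the JEP's specification: it picks a fence one backtick longer than the longest run inside the body (one simple sufficient choice), and it does not handle the edge cases mentioned above, such as an empty body or a body that begins or ends with a backtick. The class and method names are hypothetical.

```java
class BacktickFence {
    // Length of the longest run of consecutive backticks in the body.
    static int longestBacktickRun(String body) {
        int longest = 0;
        int current = 0;
        for (int i = 0; i < body.length(); i++) {
            current = (body.charAt(i) == '`') ? current + 1 : 0;
            longest = Math.max(longest, current);
        }
        return longest;
    }

    // Wraps the body in a fence one backtick longer than any run inside it.
    // Requires Java 11 or later for String.repeat.
    static String fence(String body) {
        String quote = "`".repeat(longestBacktickRun(body) + 1);
        return quote + body + quote;
    }

    public static void main(String[] args) {
        // A body that itself contains a raw string needs a longer fence.
        System.out.println(fence("String s = `a`;"));
    }
}
```

Because the fence length adapts to the body, the "will it nest?" test passes for bodies that embed shorter fences, with no escapes inside the body.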

— John


I understand your point, but I disagree with your analysis.
My own experience is that raw strings follow what I call the 'embedded
languages' hypothesis,
i.e. for any application, there is a length such that all raw strings longer
than this length contain only embedded programming languages.
So beyond this length, instead of the probability of seeing a given character
being virtually 1, you have the opposite effect, because programming languages
(a human construct) are very regular in the set of chars they use. So you do
not need a repetition of a character to avoid a statistical effect that does
not occur. Being able to choose the escape character is enough.

W/o diving too deep into the repeated vs. 'single but customizable'
choice, I'm also a bit suspicious of the fact that John's analysis
conservatively assumes that a snippet of text embedded in a raw string
is a random sequence of characters, in the true sense. This, to me, just
seems the wrong assumption - by definition something truly random has
high entropy and something with high entropy is usually associated with
low information content - which is just not compatible with the use case
of 'pasting in a code snippet' (example: it's highly likely that the
prefix 'cla' will be followed by 'ss' in a Java-like snippet). I would
expect the entropy of the embedded snippet to be quite low compared to
the assumption made here, which greatly affects the probability
calculations. For the analysis to be correct, it should take into
account the _frequency_ with which a given delimiter can appear in the
various kinds of snippets that could be pasted in (and there's one such
frequency for each snippet kind) - or we're at risk of overestimating
(if we pick a delimiter symbol whose frequency is, in reality, really
low), or underestimating (if we pick a symbol that, conversely, occurs
very frequently).
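
A per-kind frequency of this sort could be measured with a sketch like the one below. The two sample snippets are made up for illustration and are not measurements; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DelimiterFrequency {
    // Fraction of the snippet's characters that equal the delimiter.
    static double frequency(String snippet, char delimiter) {
        if (snippet.isEmpty()) {
            return 0.0;
        }
        long hits = snippet.chars().filter(c -> c == delimiter).count();
        return (double) hits / snippet.length();
    }

    public static void main(String[] args) {
        // Hypothetical samples, one per snippet kind.
        Map<String, String> samples = new LinkedHashMap<>();
        samples.put("java", "String s = \"it's\"; char c = 'x';");
        samples.put("markdown", "Use `ls -l` to list files.");
        for (Map.Entry<String, String> e : samples.entrySet()) {
            System.out.println(e.getKey()
                    + " backtick=" + frequency(e.getValue(), '`')
                    + " quote=" + frequency(e.getValue(), '\''));
        }
    }
}
```

Run over a real corpus for each snippet kind, such counts would replace the uniform one-in-a-hundred figure with an observed frequency per delimiter.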


Maurizio

P.S. I expect IDE vendors will quickly supply useful "stretchy quotes"
which will resize themselves to contain whatever users throw into
the raw string body.  At that point backticks will feel like magic tokens
that never accidentally match raw string bodies.

regards,
Rémi
