TL;DR: Good framework; must also account for the rectangle extraction rule (RER). A unified escape sublanguage (ESL) is highly desirable, and I propose adding <\ > and <\ LT WS*> as escapes for space and for null string. The existing \ char is OK, and should be "fattened" as a separate feature. I note some issues with <\ u X X X X>.
On Apr 28, 2019, at 1:32 PM, Brian Goetz <brian.go...@oracle.com> wrote:

> - Opening delimiter
> - Closing delimiter
> - Escape characters, if any
> - Escape sublanguages, if any

Yes, this is a useful way to break down the syntax.

You left out padding conventions as a degree of freedom. Padding conventions give the programmer detailed control over the format of the program by associating non-payload characters with the string literal. Whitespace rectangle extraction is the only padding convention we are discussing, plus occasional suggestions that we remove horizontal space in one-line fat strings.

If we denote today's escape sublanguage as ESL and the rectangle extraction rule as RER, then today's literals are:

    ThinString=SL[open=close=", escape=\, esl=ESL, pc=none]

Tomorrow's fat strings will be something like:

    FatString=SL[open=close=""", escape=\, esl=ESL, pc=RER]

Another aspect of defining a string literal is the *phasing* of the different features. I think we have good consensus that padding should be stripped *before* escape interpretation, so that escaped characters are not mistaken for padding characters.

> I bring this up not because I want to talk about raw-ness now (getting the
> hint?), but because I want to keep all the variations of string literals as
> lightly-varying projections of the same basic feature.

Understanding the variations is important. It also gives me hope that we could parlay this framework, later on, into something strong.

<digression>
In the future (not now) we might add a parameterized range of these schemes:

    StrongString<N>=SL[open=close=F(N), escape=G(N), esl=ESL, pc=RER]

for some functions F, G that enumerate quote and escape tokens. This would be a strong quoting scheme that could (with care) allow any given payload string S to be embedded without the need for escapes, by choosing an N for which F(N) and G(N) do not occur in S.
</digression>

Getting back to today, I want to talk about escapes.
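To make the phasing point concrete, here is a minimal Java sketch; it is not part of any proposal. The names PhasingSketch, stripPadding, and unescape are hypothetical. stripPadding is only a crude stand-in for the RER (it removes the common run of leading spaces), and unescape covers a small subset of the ESL plus the proposed <\ space>. The point is only that the two phase orders can produce different payloads from the same body.

```java
final class PhasingSketch {
    // Hypothetical stand-in for the RER: drop the common run of leading
    // spaces from every line. The real rule is more subtle than this.
    static String stripPadding(String body) {
        String[] lines = body.split("\n", -1);
        int min = Integer.MAX_VALUE;
        for (String line : lines) {
            int n = 0;
            while (n < line.length() && line.charAt(n) == ' ') n++;
            min = Math.min(min, n);
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) out.append('\n');
            out.append(lines[i].substring(min));
        }
        return out.toString();
    }

    // Small subset of the ESL: \n, \t, \\, and the proposed <\ space>.
    // Any other escape is passed through untouched.
    static String unescape(String body) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '\\' && i + 1 < body.length()) {
                char e = body.charAt(++i);
                switch (e) {
                    case 'n':  out.append('\n'); break;
                    case 't':  out.append('\t'); break;
                    case ' ':  out.append(' ');  break;
                    case '\\': out.append('\\'); break;
                    default:   out.append('\\').append(e);
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For a two-line body whose lines are <space space \ space a> and <space space \ space b>, unescape(stripPadding(body)) keeps one leading space on each line, because the escaped space is protected while padding is stripped. In the other order, stripPadding(unescape(body)) sees three ordinary spaces per line and removes them all.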
First, I'll remind us all that the RER is part of fat strings and that therefore the newline and space characters are no longer just passive string body characters, but rather play a role in the string syntax. This means that the ESL needs to be upgraded so that occurrences of spaces and newlines which otherwise would play a role in syntax can be escaped.

I think this at a minimum means that the ESL needs to add support for the two-character escape sequence <\ space>. There is already an escape sequence for a line terminator; it is <\ n>. A similar point holds for <\ t>. These three escapes (one new, two old) are enough to allow a programmer to tell the RER to stay away from a particular bit of white-space.

(Note that if the RER were to happen *after* escape processing, we'd be in a pickle: There'd be no way to use the existing ESL to control the RER, and we'd have to put some sort of extra control feature into the RER itself, or settle for an uncontrollable RER.)

> It has come up, for example, that we might treat \<newline> differently in ML
> strings as in classic strings,

My own suggestions in this vein have nothing to do with making a new ESL but with extending the old one so it works well with fat strings.

> but I would prefer it we could not tinker with the escape language in
> nonuniform ways -- as this minimizes the variations between the various
> sub-features.

I agree that we should have only one ESL; there's no reason to have different "dialects" of it in different types of strings. So <\ space> should be added to the ESL, not because it's particularly useful for thin strings, but because it escapes otherwise strippable padding in fat strings.

Here's an interesting feature of the JLS: It defines a uniform ESL for both string and character literals. This means that <\ '> can occur in both kinds of literals, even though it is only needed for character literals. Same point in reverse for <\ ">.
Since the ESL is uniform, if *one* kind of literal needs a particular escape sequence, then *all* the literals have it. (See where I'm going?) Now, the upcoming feature of fat strings includes a padding convention, ergo the common ESL needs a way to escape the now-syntactic padding characters.

About <\ LT> (an escaped LineTerminator), a similar point holds: Sure it's useful only in string literals with line terminators, but if there is a legitimate reason to add extra control over LTs, then <\ LT> gets bundled into the common escape sublanguage of the JLS.

There are two interesting questions about positioning <\ LT> as an escape sequence:

1. What does <\ LT> mean, if it is legal and not just an alias for <\ n>?
2. Is <\ LT> allowed in a thin string, given that (currently) the thin string syntax rejects LT?

For 1. I'm already on record as proposing that <\ LT WS*> is an escape sequence for the null string. (WS is horizontal whitespace.)

For 2., if we say "no" then we seem to come close to forking the ESL, which Brian and I want to avoid. A thin string body is a sequence of regular non-LT chars plus escape sequences, except <\ LT>. A fat string body can include <\ LT> as well as other escape sequences.

But that is not really a fork of the ESL. The difference between fat and thin strings is a structural constraint on their bodies, before escape processing: A fat string can contain LT in its pre-escape-processed body, and so in fact can contain <\ LT>. A thin string cannot contain LT at all, so the presence of <\ LT> in the ESL is moot for a thin string. (Also moot for a char literal.)

The parsing of a string literal (either kind) consists of gathering an escaped string body while looking for the close-quote. The close-quote interrupts the body and terminates the string. For the case of a thin string, an LT also interrupts the body, but causes parsing to fail.
So we could answer "no" to 2 and keep a unified ESL, simply by asserting that thin string tokens never contain LT, while fat string tokens contain LT (always? different question).

We could also answer "yes" to 2, and I think it's worth a discussion. What I'm suggesting here is that thin strings be allowed to contain *escaped* LTs in a new version of the JLS (one that also contains fat strings). The pre-escape-processed body of either kind of string can contain escaped LTs, and fat strings can *also* contain *unescaped* LTs.

Example:

    var ts = "hel\
      lo\
      ";
    assert ts == "hello";
    var fs = """
      hel\
      lo\
      """;
    assert fs == "hello";

In the latter case, the RER strips most or all of the whitespace. In any case <\ LT WS*> sops up the rest.

The reason we are discussing <\ LT> is that there are plenty of reasons why programmers would wish to control the format of their programs by breaking up long logical lines into shorter physical lines. Such use cases are not specific to payloads with or without newlines. If your payload has newlines, use a fat string *and* break up long logical lines into shorter physical ones. If your payload has no newlines (maybe it's a very long hex number), then use a thin string, and break it up.

The RER of fat strings (which I like!) prompts the discussion of breaking up logical lines into physical ones, more than thin strings do. After all, with thin strings, when you break one line into two lines, it's a given that you are going to write two literals, and then the + sign (for concatenation) adds no additional overhead. The break-up sequence is something like <" LT WS + ">.

But if you have a large MLS with a few very long logical lines, suddenly you have an invidious choice between keeping your nice rectangle, or disrupting it totally by adding <" LT WS + ">. Breaking a long line in this case drops you off a syntax cliff.
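As a rough illustration of <\ LT WS*> acting as an escape for the null string, here is a small Java sketch. The class and method names (LineContinuationSketch, joinLines) are made up for this note, and the code assumes it runs on a string body after any padding removal, leaving every other escape for the regular ESL pass.

```java
final class LineContinuationSketch {
    // Removes every <\ LT WS*> sequence: a backslash, one line terminator
    // (LF, CR, or CRLF), then any run of spaces or tabs.
    // Other escapes, including an escaped backslash, are copied through as-is.
    static String joinLines(String body) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        int n = body.length();
        while (i < n) {
            char c = body.charAt(i);
            if (c == '\\' && i + 1 < n) {
                char next = body.charAt(i + 1);
                if (next == '\n' || next == '\r') {
                    i += 2;
                    // Treat CRLF as a single line terminator.
                    if (next == '\r' && i < n && body.charAt(i) == '\n') i++;
                    // Sop up the horizontal whitespace that follows.
                    while (i < n && (body.charAt(i) == ' ' || body.charAt(i) == '\t')) i++;
                } else {
                    // Some other escape: keep both characters for the ESL pass.
                    out.append(c).append(next);
                    i += 2;
                }
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

Applied to the body of the ts example (hel, backslash, newline, indentation, lo, backslash, newline, indentation), joinLines yields hello. Scanning escapes pairwise, rather than using a plain search for backslash-newline, keeps an escaped backslash at the end of a line from being misread as a line continuation.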
Supporting <\ LT WS*> lets you down easy, by breaking the logical lines without disrupting the enclosing padding of the rectangle extraction rule.

> Soliciting discussion on the pros and cons of keeping \ as our escape
> character.

Well, \ makes a very fine escape character, except for particular payloads when it doesn't. Any payload which is a program in some little language that uses \ for escaping is going to get confusing very fast. Nobody wants to count a train of escapes, and layers of escaping cause escape trains to lengthen fast (doubling with each layer). Regular expressions are the poster child, and I'll just pretend that they are the key use case, since they are the worst-behaved.

Fattening \ to \\\ helps a little with REs. But it would make long trains even longer, with the result that you would need even more help keeping count. The eye can only count a small number of repeated characters at a glance.

    var re = "\\\\\\[";  //train wreck for /\\\[/
    assert ('\\'+"[").matches(re);

A non-repeating escape is much easier on the eye. Choosing at random, I'll suggest <\ -> as a fattened escape sequence, with the standard ESL from the JLS (as amended with <\ space> etc). As long as that particular pair of characters is rare in REs (and other similar venues), there won't be any long trains of backslashes.

    var re = *"\\\[";
    assert ('\\'+"[").matches(re);
    var s6 = *"\-\- \-" \";
    assert s6 == '\\'+"- \" "+'\\';

The star shows that I'm talking about some non-standard string syntax:

    FatEscString=SL[open=*", close=", escape=\-, esl=ESL, pc=none]

I think it would be reasonable to fatten escapes as a separate feature, but not in tandem with the current multi-line string proposal.

<digression>
Straw man, separate from the MLS proposal. If a string literal (either fat or thin) is immediately preceded by <\ ->, the body of the string uses that sequence for its escapes instead of \. The ESL is unchanged.
If stronger escapes are also desired, the feature can be extended simply by allowing any number of - characters, e.g. \--"x\-y\z" and \--"\--n" (for "x\\-y\\z" and "\n").
</digression>

We are leaving \uXXXX escapes out of the accounting. This is understandable, because they are not a regular part of the ESL, and hard to treat as part of it. But we should try. In particular, we can and should find a way to treat most or all of the \uXXXX escapes *in a string body* as being expanded as part of the ESL, rather than a pre-pass. This will make \uXXXX escapes more complicated, but it may profitably simplify their effect on the user model.

One idea is simple: In the body of a string, any \uXXXX which doesn't denote a controlling part of the string syntax (quote or backslash) is collected into the string body as an unexpanded character sequence <\ u X X X X>. This sequence is then supported by the ESL. The effect is that padding removal (rectangle extraction) happens before \u replacement *in a string body*.

A second idea could be adopted either with the first or separately: As a structural constraint on string bodies, unicode sequences which would expand to whitespace, quote, or backslash are forbidden.

And here's a draconian one: Forbid <\ u X X X X> where the code point is 007F or lower. That would blow up some stupid test cases and puzzlers; user code that does this should be fixed. If we can't do this everywhere, do it inside string bodies.

We may be limited by backward compatibility on the application of these ideas to thin strings, but they should be considered at least for fat strings.

There are two benefits to taming \uXXXX:

1. Fewer puzzlers involving hidden syntax (\ " etc.)
2. The processing of \uXXXX for string bodies can be documented and aligned with an "unescape" method on String, which is useful in its own right.
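The draconian rule is easy to state as a check over a raw string body. The following Java sketch is illustrative only: UnicodeEscapeCheck and firstAsciiUnicodeEscape are hypothetical names, and the scan assumes the body has not yet had any \uXXXX replacement applied. It skips escapes pairwise so that an escaped backslash followed by u is not mistaken for a unicode escape, and it tolerates the repeated u that the JLS permits.

```java
final class UnicodeEscapeCheck {
    // Returns the index of the first <\ u X X X X> whose code point is
    // 007F or lower, or -1 if the body contains none.
    static int firstAsciiUnicodeEscape(String body) {
        int i = 0;
        int n = body.length();
        while (i < n) {
            if (body.charAt(i) != '\\') {
                i++;
                continue;
            }
            if (i + 1 < n && body.charAt(i + 1) == 'u') {
                int j = i + 1;
                while (j < n && body.charAt(j) == 'u') j++; // \uuuu0041 is legal
                if (j + 4 <= n) {
                    int cp = 0;
                    boolean wellFormed = true;
                    for (int k = j; k < j + 4; k++) {
                        int d = Character.digit(body.charAt(k), 16);
                        if (d < 0) { wellFormed = false; break; }
                        cp = cp * 16 + d;
                    }
                    if (wellFormed && cp <= 0x7F) return i;
                }
                i = j; // malformed or non-ASCII escape: keep scanning
            } else {
                i += 2; // some other escape, e.g. an escaped backslash
            }
        }
        return -1;
    }
}
```

A compiler or an "unescape" method could reject the body whenever the result is not -1. The second idea (forbidding only whitespace, quote, and backslash) would use the same scan with a narrower test on the decoded code point.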