On Apr 25, 2019, at 8:55 AM, Brian Goetz <brian.go...@oracle.com> wrote:
>
> A few more questions have been raised:
>
>  - Do we do alignment before, or after, escape processing?
>  - What about single-line “fat” strings?
>  - What is the effect of text on the first line on alignment?
>  - What about opt-out?
>  - What about \<newline> ?

TL;DR:
 - Before, and watch out for \u00XX,
 - Disallow,
 - Disallow,
 - (see next)
 - Support \LineTerminator as both explicit layout control and opt-out (no \-).

> Suggested answers:
>
> 1. Escape processing. If alignment is about removing _incidental
> indentation_, it seems hard to believe that a \t escape is intended to be
> incidental; this feels like payload, not envelope. Which suggests to me that
> we should be doing alignment on the escaped string, and then doing escape
> processing.

I agree with this; I think it is much more intuitive to make sure that any escaped thing is classified as payload. Basically, if it doesn't look like whitespace, it won't be treated as the envelope of the rectangle but rather as payload inside the rectangle. What could be simpler?

At first I thought this might make implementation and specification more complex, but actually it makes it simpler. Here's why: If you treat rectangle extraction as a process of grabbing a bunch of escape-sequence-laden payload, you can treat the expansion of escape sequences as a pure library function, a mapping from String to String, where LineTerminator shows up as \n (\u000A). I think Jim may already favor this approach?

Making a clean separation between rectangle extraction (first) and escape sequence expansion (second) may also clarify the opt-out question; see below.

One confounding factor we've hesitated to touch is the status of \uXXXX escapes, which look the same as \OOO escapes to most users but are completely different in order of processing. We could make our lives simpler with respect to \uXXXX escapes if we were to modify the rules for them inside of fat strings, so that (somehow) they were always interpreted as payload, and not as envelope. (We can't modify the rules inside of plain strings, sadly.)

The JLS warns about \uXXXX escapes aliasing to surprising syntax characters, in 3.3 (\u005c = \), 3.10.4 (\u0027 = '), and 3.10.5 (\u0022 = ").
The net result is that you can obfuscate your Java program horribly if you use any of those unicode escapes. With the rectangle extraction feature of fat strings, the list grows to include \u0020 and other whitespace. As a matter of style, programmers should scrupulously avoid unicode escapes for lexically significant code points.

(Some puzzlers: What role can the unicode escape \u000A play in Java program text today? Hint: It's not a LineTerminator. What could it mean in a multi-line string? Same questions about \u000D? How should those characters interact with rectangle extraction?)

At this point we could consider going farther, and make a mechanically checked guarantee against puzzlers in fat strings. I'm not sure about this, but I want to put a proposal out there FTR:

Limitation on \uXXXX escapes: Inside of fat strings, any unicode escape sequence (which is necessarily of the form \u*XXXX: a backslash, a repeated u, then four hex digits) is forbidden to specify a hexadecimal number in the range of 0000 to 001F inclusive. (Reduced limitation: U-escapes must not alias to characters significant to the envelope, which are those in "\"\\ \t", quote+backslash+space+tab.)

Effect: All remaining \uXXXX escapes are safe to retain during rectangle extraction and can be interpreted along with other string escapes in the same post-pass. In particular, a String library method can handle such escapes along with other C-like string escapes.

Processing \u at the same time and in the same method as other escapes seems like a win to me, independently of the exclusion of puzzlers. This extra win made me speak up, in fact.

Also, a coordinated limitation on fat string delimiters: The opening triple-quote of a fat string must not be derived from a unicode escape, which would have been of the form \u0022, \uu0022, etc.

> For 2/3, here’s a radical suggestion.
> Our theory is, a “fat” string is one
> that is co-mingled with the indentation of the surrounding code, and one
> which we usually wish the compiler to disentangle for us. By this
> interpretation, fat single-line strings make no sense, so let’s ban them, and
> similarly, text on the first line similarly makes little sense, so let’s ban
> that too. In other words, fat strings (with the possible exception of the
> trailing delimiter) must exist within a “Kevin Rectangle.”

Yep. Put rectangle extraction front and center.

Alternative theory for 2: Allow single-line fat strings. Perform analogous "line extraction" on them, by removing all unescaped whitespace after the open quote and before the close quote. This is like rectangle extraction, but in one dimension.

> For 4 (opt out), I think it is OK to allow a self-stripping escape on the
> first line (e.g., \-), which expands to nothing, but suppresses stripping.
> This effectively becomes a “here doc”.

I agree with the desire for a clear opt-out. Here's a question we should answer: When a user opts out of 2D layout with rectangle extraction, what should we call the alternative? Surely it comes with more intensive control from the user. Maybe that leads to odder-looking code, but maybe also it leads to code which the user has "beautified" in some way apart from rectangle extraction. I'd like to think of this opt-out scenario not just negatively ("don't auto-strip that white space") but positively ("I want to organize the form of my program more freely"). Not sure if that's possible, but read on.

Anyway, I think an ad hoc escape \- at the front of the string is not such a clear win, and if we tweak the rules we can gain more than just a single dead-end quasi-escape.

> For 5, since \<newline> is not valid today, we don’t have to decide this now,
> we can add it later if desired.

(This is more accurately called \LineTerminator, since escapes are processed after <newline> has been tokenized.)

It's true we can defer this, but let's look at combining it with the opt-out feature and see if we like what we get.

Thesis: The opt-out feature, which asks for all leading blanks (and bracketing newlines) to be retained, is a special case of intensified user control over 2D program layout. Such intensified user control over 2D layout very often (in languages we all know, like makefiles and shell) includes breaking of long lines, using escape sequences or other special control over the envelope (as opposed to payload). The user is taking more control over a complex payload, not just giving up on the rectangle rule.

Proposal: Allow newlines to be marked (somehow) as non-payload, so users can have more intense control over program layout without "leaking" newlines used for layout into their payloads (string body characters).

If we frame this feature as an escape sequence, which marks newlines for elision, then it can be rolled into the escape processing pass. If (see above) escape processing comes *after* rectangle extraction, then newline control could potentially co-exist with rectangle extraction, depending on the presence or absence of an opt-out condition. I think that could be a bonus, although that could be misused also.

There is a range of possible rules for the opt-out from rectangle extraction, all with slightly different outcomes:

 - Opt out if the string body contains \LineTerminator anywhere.
 - Opt out if the string body contains \n or \r anywhere.
 - Opt out if the string body contains \n or \r or \LineTerminator anywhere.
 - Any of the previous rules, applied only between the open triple-quote and first LineTerminator.
 - Opt out if any visible character (not whitespace) occurs between the open triple-quote and first LineTerminator.
 - Allow any single escape sequence, possibly accompanied by whitespace, between the open triple-quote and first LineTerminator, and opt out if that occurs.

(As you can see, the opt-out rule can be more or less specific, and can either co-exist with arbitrary "stuff" appearing after the open-quote, or with restrictions that allow only an opt-out to occur in the privileged position.)

Specific proposal: The sequence \ LineTerminator followed by any number of unescaped spaces and tabs is elided. This happens during escape processing, which means after rectangle extraction. Rectangle extraction is inhibited (opted out) by the presence of any escape sequence between the open triple-quote and the first following LineTerminator.

Optionally: Other than whitespace and escape sequences, nothing is allowed between the open triple-quote and the first following LineTerminator.

If rectangle extraction occurs, and escape processing encounters \ LineTerminator sequences, then additional leading whitespace is stripped. The escape sequence is ignorant of whether any leading whitespace (or none) was removed during rectangle extraction (if it occurred).

Such two-step removal seems complicated but is easy to justify: The rectangle extraction isolates a visible block of source code from the containing context, and then the escape sequences do their work. If rectangle extraction is opted out of, the escape sequences would do the same work anyway.

I think a set of decisions like this would hang together nicely and give users very good control over the layout of their programs. The resulting programs would (barring intentional obfuscation) read clearly, in both rectangular layouts and more ad hoc free-flowing formats.
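
To make the two-pass order in the specific proposal concrete, here is a minimal Java sketch. It is an illustration only, under several assumptions of mine: every name in it (FatStringSketch, extractRectangle, expandEscapes, process) is hypothetical and not part of any JDK API or of the proposal; LineTerminator is assumed to be already normalized to \n; only a handful of escapes are handled and \uXXXX is left out; the margin is the smallest run of literal leading spaces and tabs over the non-blank lines; and the closing delimiter line is ignored.

```java
/** Hypothetical sketch of "rectangle extraction first, escape processing second". */
final class FatStringSketch {

    /** Counts literal (unescaped) leading spaces and tabs. */
    static int leadingBlanks(String line) {
        int n = 0;
        while (n < line.length() && (line.charAt(n) == ' ' || line.charAt(n) == '\t')) n++;
        return n;
    }

    /** Pass 1: strip the common margin. Escapes such as \t are still two raw chars here, so they count as payload. */
    static String extractRectangle(String rawBody) {
        String[] lines = rawBody.split("\n", -1);
        int margin = Integer.MAX_VALUE;
        for (String line : lines) {
            int blanks = leadingBlanks(line);
            if (blanks < line.length()) margin = Math.min(margin, blanks); // blank lines do not constrain the margin
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) out.append('\n');
            String line = lines[i];
            if (leadingBlanks(line) < line.length()) out.append(line.substring(margin));
            // assumption: whitespace-only lines become empty
        }
        return out.toString();
    }

    /** Pass 2: a pure String-to-String escape expansion, including \LineTerminator elision. */
    static String expandEscapes(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i++);
            if (c != '\\' || i == s.length()) { out.append(c); continue; }
            char e = s.charAt(i++);
            switch (e) {
                case 'n': out.append('\n'); break;
                case 't': out.append('\t'); break;
                case '\\': out.append('\\'); break;
                case '"': out.append('"'); break;
                case '\n': // \LineTerminator: drop it and any following unescaped spaces and tabs
                    while (i < s.length() && (s.charAt(i) == ' ' || s.charAt(i) == '\t')) i++;
                    break;
                default: throw new IllegalArgumentException("escape not handled in this sketch: \\" + e);
            }
        }
        return out.toString();
    }

    /**
     * firstLine is the raw text between the open triple-quote and the first LineTerminator;
     * rawBody is everything after that LineTerminator. Any escape on firstLine opts out.
     */
    static String process(String firstLine, String rawBody) {
        boolean optOut = firstLine.indexOf('\\') >= 0;
        if (optOut) return expandEscapes(firstLine + "\n" + rawBody); // nothing is stripped up front
        return expandEscapes(extractRectangle(rawBody));
    }

    public static void main(String[] args) {
        System.out.println(process("", "    one \\\n        two\n    three"));
        System.out.println(process("\\", "    a\n      b"));
    }
}
```

The second call in main shows the opt-out case: the backslash on the first line suppresses rectangle extraction, so the deeper indentation of the second body line survives, while the \LineTerminator escape still removes the blanks that immediately follow it.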