As Reinier pointed out on amber-dev, regex strings may routinely contain escaped meta-characters (+, *, brackets, etc.), so the embedded \- and \+ story has an obvious conflict: a raw regex that needs a literal plus would write \+, which is exactly the proposed "raw off" toggle. While these are not the only possible characters for such “shift” operators, his point that this might be overkill is a good one. So let’s look at options for denoting raw-ness.

 - Just make triple-quote strings always raw as well as multi-line-capable; regexes and friends would use TQ strings even though they are single line (Scala, Kotlin)
 - Letter prefix, such as R"…" (C++, Rust)
 - Symbol prefix, such as @"…" (C#), or \"…" (suggestive of “distributing” the escaping across the string)
 - Embedded escape sequence that switches to raw mode, but can’t be switched back: "\+raw string", "\{raw}raw string"

Data from Google suggests that, in their code base, on the order of 5% of candidates for multi-line strings use some escape sequences (Kevin/Liam, can you verify?). This suggests to me that the “just use TQ” approach is vaguely workable, but likely to be error-prone: 5% is infrequent enough that people will write \t when they mean tab, discover the mistake at runtime, and then have to go back and add a .escape() call.

(Of these, my current favorite is using the backslash: "cooked", """cooked and ML-capable""", \"raw", \"""raw and ML capable""". The use of \ suggests “the backslashes have been pre-added for you”, building on existing associations with backslash.)

Are there other credible candidates that I’ve missed?

> On Jan 2, 2019, at 2:00 PM, Jim Laskey <james.las...@oracle.com> wrote:
>
>> http://cr.openjdk.java.net/~jlaskey/Strings/RTL2/index.html
>> http://cr.openjdk.java.net/~jlaskey/Strings/RTL2.pdf
>>
>> First of all, I would like to apologize for leading us down the garden path
>> re Java Raw String Literals. I jumped into this feature fully enamoured with
>> the JavaScript equivalent and, "why can't we have this in Java?" As the
>> proposal evolved, it became clear that what we came up with was not a good
>> Java solution. I underestimated the concern that the original proposal was
>> too left field and did not fit into Java very well. It's somewhat ironic
>> that the backtick looks like a thorn.
>>
>> So, let's start the new year with a structured approach to the enhanced
>> string literal design. Brian gave a summary of why the old design fails.
>> Starting with this summary, Brian and I talked out a series of critical
>> decision points that should be given thought, if not answers, before we
>> propose a new design. As an exercise, I supplemented these points and
>> created a series of small decision trees (a full on decision tree would be
>> complex and not very helpful.) I found these trees good intuition pumps for
>> getting the design at least 80% there. Hopefully, this exercise will help
>> you in the same way.
>>
>> Even the label Raw String Literal put the emphasis on the wrong part of the
>> feature. What developers really want is multi-line strings. They want to be
>> able to paste alien source into their Java programs with as little fuss as
>> possible.
>>
>> String raw-ness (not translating escapes) is a tangential aspect, that may
>> or may not be needed to implement multi-line strings. Yes, the regex and
>> Windows file path arguments in JEP 326 are still valid, but this aspect
>> needs to be separated from the main part of the design. Further in the
>> discussion, we'll see that raw-ness is really a many-headed hydra, best
>> slain one head at a time.
>>
>> We have to be honest. We know Java's primary market. Sure we want to embed
>> Java in Java for writing tests. Sure there is JavaScript and CSS in web
>> pages. Nevertheless, most uses of multi-line will be for non-complex
>> grammars. Specifically, grammars that don't require special handling of
>> multi-character delimiter sequences. If you can accept this, then the
>> solution set is much smaller.
>>
>> This is an easy one. Familiarity is key to feature education. Radical
>> wandering off with new syntax is not helpful to anyone but bloggers and
>> authors.
>>
>> If you buy into the familiarity argument, then double quote is really the only
>> choice for a delimiter. Double quote already indicates a string literal.
>> Single quote indicates a character. We don’t want to gratuitously burn
>> unused symbols like backtick. Backslash works for regex but maybe not for
>> others. Combinations and nonces just introduce new noise when our original
>> goal was to reduce noise and complexity.
>>
>> Other languages avoid delimiter escape sequences by doubling up. Example,
>> "abc""def" -> abc"def. This concept is unfamiliar to Java developers, so why
>> change now? Escape sequences are what we know.
>>
>> Language designers got very nervous when I suggested infinite delimiter
>> sequences in the original proposal; lexically sacrilegious. I felt strongly
>> that it was easy to explain and only 1 in 1M developers would ever use more
>> than 4-5 character delimiter sequences. In round two, I have come to agree.
>> This was taking on more complexity than is really warranted, for a use case
>> that doesn’t come along very often. I suggest we only need single and triple
>> double quotes. A single double quote works today, so no argument there.
>> Double double quotes means empty string, no problem. Triple double quotes
>> are only necessary to avoid having to escape quotes in alien source.
>>
>> String json = """
>>     {
>>         "name": "Jean Smith",
>>         "age": 32,
>>         "location": "San Jose"
>>     }
>>     """;
>>
>> versus
>>
>> String json = "
>>     {
>>         \"name\": \"Jean Smith\",
>>         \"age\": 32,
>>         \"location\": \"San Jose\"
>>     }
>>     ";
>>
>> This second case is where we wandered off the tracks with raw-ness. We
>> assumed raw-ness is necessary to avoid all the backslashes. Most cases can
>> be handled with triple double quotes.
>>
>> Okay, so why not more combinations? Simply because, most of the time they
>> are not needed. On the rare occasion we do have nested triple double quotes,
>> we can then use escape sequences.
>>
>> String nestedJSON = """
>>     \"\"\"
>>         {
>>             "name": "Jean Smith",
>>             "age": 32,
>>             "location": "San Jose"
>>         }
>>     \"\"\";
>>     """;
>>
>> or better yet, you only have to escape every third double quote
>>
>> String nestedJSON = """
>>     \"""
>>         {
>>             "name": "Jean Smith",
>>             "age": 32,
>>             "location": "San Jose"
>>         }
>>     \""";
>>     """;
>>
>> Not so evil and it's familiar.
>>
>> Meaning, you can only use single quotes for simple strings and triple quotes
>> for multi-line strings. I don't have a strong opinion other than it seems
>> like an unneeded restriction. The only argument I've heard has been for
>> better error recovery when missing a close delimiter during parsing. My
>> counter for that argument is that if you are processing multi-line strings
>> then you can easily track the first newline after the opening delimiter and
>> recover from there. I implemented that recovery in javac and it worked out well.
>>
>> Cooked (translated escape sequences) should be the default. Why should a
>> multi-line string be different than a simple string? We have a solution for
>> embedding double quote. Single quotes don't require escaping. Tabs and
>> newlines can exist as is. Unicode characters can be either an escape
>> sequence or the unicode character. So the only problem case is backslash. I
>> would argue that the rare backslash can be escaped. If not, then the
>> developer can use the raw-ness solution.
>>
>> If we don't translate newlines, then source is not transferable across
>> platforms. That is, a source from one platform may not execute the same way
>> on another platform. Translating consistently guarantees execution
>> consistency. As a note, programming languages that didn't translate newlines
>> in multi-line string literals typically regretted it later (Python.)
>>
>> With the original Raw String Literal proposal, there was concern about
>> leading and trailing nested delimiters.
>> If we default to cooked strings, then we can use \".
>>
>> These questions have been answered numerous times and fall into the realm of
>> library support. Same arguments as before, same outcome.
>>
>> To summarize the bold paths at this point:
>> - multi-line strings are an extension of traditional simple strings
>> - newlines in a string are no longer an error and the string can extend
>>   across several lines
>> - error recovery can pick up at the first newline after the opening
>>   delimiter
>> - multi-line strings process escape sequences (including unicode) in
>>   the same way as simple strings
>> - multiple double quotes are handled with escape sequences
>> - triple double quote delimiter is introduced to avoid escaping simple
>>   double quote sequences
>>
>> Generally, I think this is very much in the traditional Java spirit.
>>
>> Now, let's move on to the lesser but more interesting issue. As I stated
>> above, raw-ness is a multi-headed beast. Raw-ness involves turning off
>> the translation of
>> - escape sequences
>> - unicode escapes
>> - delimiter sequences
>> - escape sequence prefix (backslash)
>> - tabs and newlines (control characters in general)
>>
>> Sometimes we need all of the translations, sometimes few and sometimes none.
>> In the multi-line discussion above, we see we don't need raw as much as we
>> might have expected. Maybe for occasional backslashes, as in regex and
>> Windows paths strings.
>>
>> The original Raw String Literal proposal suggested that raw-ness was a
>> property of the whole string literal and thus we proposed an alternate
>> delimiter syntax just to emphasize that fact. If we accept the bold path of
>> multi-line discussion above, then alternate delimiter is out. This leaves
>> prefixing as the best option to bless a string literal with raw-ness.
>>
>> At this point, I would like to suggest an alternate, maybe progressive way
>> to think of raw-ness.
>> Since the original proposal, I have been thinking of
>> raw-ness as a state of processing the literal. State is certainly obvious in
>> the scanner implementation, why not raise that to the language level? If it
>> is a state then we should be able to enter and leave that state in some way.
>> Escape sequences are an obvious way of transitioning translation in the
>> string. \- and \+ are available and not currently recognized as valid escape
>> sequences, why not \- and \+ to toggle escape processing?
>>
>> String a = "cooked \-raw\+ cooked";  // cooked raw cooked - a little odd but not so much so
>> String b = "abc\-\\\\\+def";         // abc\\\\def - struggling
>> String c = "\-abc\\\\def";           // abc\\\\def - more readable as an inner prefix
>> String d = "abc\-\-def\+\+ghi";      // abc\-def\+ghi - raw on "\-" is "\" and "-", raw off "\+" is "\" and "+"
>> String e = """\-"abc"\+""";          // "abc" - \- and \+ act as no-ops of sorts
>>
>> Comparing property vs state:
>>
>> Runtime.getRuntime().exec(R""" "C:\Program Files\foo" bar""".strip());
>> Runtime.getRuntime().exec("""\-"C:\Program Files\foo" bar""");
>>
>> System.out.println("this".matches(R"\w\w\w\w"));
>> System.out.println("this".matches("\-\w\w\w\w"));
>>
>> String html = R"""
>>     <html>
>>         <body>
>>             <p>Hello World.</p>
>>         </body>
>>     </html>
>>     """.align();
>> String html = """\-
>>     <html>
>>         <body>
>>             <p>Hello World.</p>
>>         </body>
>>     </html>
>>     """.align();
>>
>> String nested = """
>>     String EXAMPLE_TEST = "This is my small example "
>>         + "string which I'm going to "
>>         + "use for pattern matching.";
>>     """ +
>>     R"""
>>     System.out.println(EXAMPLE_TEST.replaceAll("\\s+", "\t"));
>>     """;
>> String nested = """
>>     String EXAMPLE_TEST = "This is my small example "
>>         + "string which I'm going to "
>>         + "use for pattern matching.";
>>     \-
>>     System.out.println(EXAMPLE_TEST.replaceAll("\\s+", "\t"));
>>     \+
>>     """;
>>
>> Hopefully, this is a good starting point for discussion.
>> As before, I'm pragmatic about which direction we go, so feel free to comment.
>>
>> Cheers,
>>
>> -- Jim
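
To make the state-based reading of \- and \+ concrete, here is a minimal Java sketch of how a scanner might decode a literal body under that proposal. Everything named here is hypothetical: RawToggleSketch and decode are made up for illustration, only a handful of cooked escapes are handled (no unicode or octal escapes), and treating \+ in the cooked state as a literal backslash-plus is an assumption read off example d above, not settled semantics.

```java
final class RawToggleSketch {

    // Decodes the text between the delimiters of a literal, treating \- as
    // "raw on" and \+ as "raw off". Hypothetical semantics, inferred from
    // the examples in the proposal.
    static String decode(String body) {
        StringBuilder out = new StringBuilder();
        boolean raw = false;
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c != '\\' || i + 1 == body.length()) {
                out.append(c);              // ordinary character, or a trailing backslash
                continue;
            }
            char next = body.charAt(i + 1);
            if (raw) {
                if (next == '+') {
                    raw = false;            // \+ leaves the raw state and emits nothing
                    i++;
                } else {
                    out.append(c);          // any other backslash is just a backslash
                }
                continue;
            }
            i++;                            // cooked state: consume the escape pair
            switch (next) {
                case '-':  raw = true;        break;  // \- enters the raw state
                case '+':  out.append("\\+"); break;  // assumption, per example d
                case 'n':  out.append('\n');  break;
                case 't':  out.append('\t');  break;
                case '\\': out.append('\\');  break;
                case '"':  out.append('"');   break;
                default:
                    throw new IllegalArgumentException("unsupported escape: \\" + next);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Today's Java must double each backslash; the trailing comment shows
        // the literal body as it would be written under the proposal.
        System.out.println(decode("cooked \\-raw\\+ cooked"));  // body: cooked \-raw\+ cooked
        System.out.println(decode("abc\\-\\-def\\+\\+ghi"));     // body: abc\-\-def\+\+ghi
        String regex = decode("\\-\\d+\\+3");                    // body: \-\d+\+3
        System.out.println(regex + " matches 1+3? " + "1+3".matches(regex));
    }
}
```

The last call is Reinier's conflict in miniature: a raw regex intended as \d+\+3 loses its escaped plus because \+ ends the raw state, so the decoded pattern is \d+3, which matches "13" but not "1+3".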