One feature common to many programming languages that Rust lacks is "raw" 
string literals. Specifically, these are string literals that don't interpret 
backslash-escapes. There are three obvious applications at the moment: regular 
expressions, Windows file paths, and format!() strings that want to embed { and 
} chars. I'm sure there are more as well, such as large string literals that 
contain things like HTML text.

I took a look at three programming languages to see what solutions they had: D, 
C++11, and Python. I've reproduced their syntax below, plus one more custom 
syntax, along with pros & cons. I'm hoping we can come up with a syntax that 
makes sense for Rust.

## Python syntax:

Python supports an "r" or "R" prefix on any string literal (both "short" 
strings, delimited with one quote character, and "long" strings, delimited with 
three). The "r" or "R" prefix denotes a "raw string", and has the effect of 
disabling backslash-escapes within the string, for the most part. It actually 
gets a bit weird: if a sequence of backslashes of an odd length occurs prior to 
a quote (of the appropriate quote type for the string), then the quote is 
considered to be escaped, but the backslashes are left in the string. This 
means r"foo\"" evaluates to the string `foo\"`, and similarly r"foo\\\"" is 
`foo\\\"`, but r"foo\\" is merely the string `foo\\`.
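
To make the odd-backslash rule concrete, here's a small Python check of the three literals above. The variable names are mine, purely for illustration.

```python
# Python raw strings keep every backslash verbatim, but a backslash
# still stops the quote that follows it from ending the literal.
a = r"foo\""      # one backslash, then a quote: both end up in the string
b = r"foo\\\""    # three backslashes, then a quote: all four are kept
c = r"foo\\"      # two backslashes: the next quote closes the literal

for value in (a, b, c):
    print(value, len(value))
# foo\" 5
# foo\\\" 7
# foo\\ 5
```

Note that r"foo\" on its own is a syntax error, since the backslash "escapes" the only quote that could close the literal.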

Pros:
* Simple syntax
* Allows for embedding the closing quote character in the raw string

Cons:
* Handling of backslashes is very bizarre, and the closing quote character can 
only be embedded if you want to have a backslash before it.

## C++11 syntax:

C++11 allows for raw strings using a sequence of the form R"seq(raw text)seq". 
In this construct, `seq` is any sequence of zero to 16 characters, excluding 
space, (, ), \, and the whitespace control characters (\t, \v, \f, \n, \r). The 
simplest form looks like R"(raw text)", which allows for anything in the raw 
text except for the sequence `)"`. 
The addition of the delimiter sequence allows for constructing a raw string 
containing any sequence at all (as the delimiter sequence can be adjusted based 
on the represented text).
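
Here's a rough sketch, in Python, of how a lexer might pull the contents out of a C++11-style raw string. The function name `parse_cpp11_raw` is hypothetical, and this is only my reading of the rules, not how any real compiler does it. It also ignores the standard's 16-character cap on the delimiter.

```python
def parse_cpp11_raw(literal: str) -> str:
    """Return the raw text of a literal shaped like R"seq(raw text)seq"."""
    if not literal.startswith('R"'):
        raise ValueError('expected the literal to start with R"')
    open_paren = literal.find("(", 2)
    if open_paren == -1:
        raise ValueError("missing ( after the delimiter sequence")
    seq = literal[2:open_paren]
    # Reject delimiter characters that would make the literal ambiguous.
    if any(ch in ' )\\\t\v\f\n\r' for ch in seq):
        raise ValueError("illegal character in delimiter sequence")
    closer = ")" + seq + '"'
    # The raw text ends at the first occurrence of )seq"
    end = literal.find(closer, open_paren + 1)
    if end == -1:
        raise ValueError("unterminated raw string")
    if end + len(closer) != len(literal):
        raise ValueError("unexpected characters after the literal")
    return literal[open_paren + 1:end]


print(parse_cpp11_raw('R"(raw text)"'))   # raw text
print(parse_cpp11_raw('R"x(a)"b)x"'))     # a)"b
```

The second call shows why the delimiter matters: with `x` as the delimiter, the sequence `)"` can sit inside the raw text without ending it.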

Pros:
* Allows for embedding any character at all (representable in the source file 
encoding), including the closing quote.
* Reasonably straightforward

Cons:
* Syntax is slightly complicated

## D syntax:

D supports three different forms of raw strings. The first two are similar, 
being r"raw text" and `raw text`. Besides the choice of delimiters, they behave 
identically, in that the raw text may contain anything except for the 
appropriate quote character. The third syntax is a slightly more complicated 
form of C++11's syntax, and is called a delimited string. It takes two forms.

The first looks like q"(raw text)" where the ( may be any non-identifier 
non-whitespace character. If the character is one of [(<{ then it is a "nesting 
delimiter", and the close delimiter must be the matching ])>} character, 
otherwise the close delimiter is the same as the open. Furthermore, nesting 
delimiters do exactly what their name says: they nest. If the nesting delimiter 
is (), then any ( in the raw text must be balanced with a ) in the raw text. In 
other words, q"(foo(bar))" evaluates to "foo(bar)", but q"(foo(bar)" and 
q"(foobar))" are both illegal.
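
The nesting rule is easier to see as a depth counter. Below is a minimal Python sketch of the single-character form only; `parse_d_delimited` and `NESTING` are names I made up, and the error handling is a guess at what "illegal" would mean in practice.

```python
NESTING = {"(": ")", "[": "]", "<": ">", "{": "}"}


def parse_d_delimited(literal: str) -> str:
    """Return the raw text of a D-style q"(...)" or q"/.../" literal."""
    if len(literal) < 5 or not literal.startswith('q"') or not literal.endswith('"'):
        raise ValueError("not a delimited string")
    open_ch = literal[2]
    if open_ch.isspace() or open_ch.isalnum() or open_ch == "_":
        raise ValueError("this sketch only handles single-character delimiters")
    body = literal[3:-1]  # raw text followed by the closing delimiter
    close_ch = NESTING.get(open_ch, open_ch)
    if open_ch not in NESTING:
        # Non-nesting: the first occurrence of the delimiter must be the last char.
        end = body.find(close_ch)
        if end == -1 or end != len(body) - 1:
            raise ValueError("delimiter missing or closed early")
        return body[:end]
    depth = 1
    for i, ch in enumerate(body):
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            depth -= 1
            if depth == 0:
                if i != len(body) - 1:
                    raise ValueError("unbalanced: delimiter closed early")
                return body[:i]
    raise ValueError("unbalanced: missing closing delimiter")


print(parse_d_delimited('q"(foo(bar))"'))  # foo(bar)
```

Feeding it q"(foo(bar)" or q"(foobar))" raises ValueError, matching the two illegal cases above.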

The second uses any identifier as the delimiter. In this case, the identifier 
must immediately be followed by a newline, and in order to close the string, 
the close delimiter must be preceded by a newline. This looks like

q"delim
this is some raw text
delim"

It's essentially a heredoc. Note that the first newline is not part of the 
string, but the final newline is, so this evaluates to "this is some raw 
text\n".
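
As a sanity check on the newline handling, here's a small Python sketch of the identifier form. The name `parse_d_heredoc` is hypothetical, and I'm using Python's `str.isidentifier` as a stand-in for D's identifier rules, which is only an approximation.

```python
def parse_d_heredoc(literal: str) -> str:
    """Return the raw text of a D-style heredoc: q"delim, newline, text, newline, delim"."""
    if not literal.startswith('q"'):
        raise ValueError('expected the literal to start with q"')
    first_newline = literal.find("\n")
    if first_newline == -1:
        raise ValueError("the opening delimiter must be followed by a newline")
    delim = literal[2:first_newline]
    if not delim.isidentifier():
        raise ValueError("the delimiter must be an identifier")
    closer = "\n" + delim + '"'
    # Search from the first newline so an empty body is accepted too.
    end = literal.find(closer, first_newline)
    if end == -1:
        raise ValueError("unterminated heredoc")
    if end + len(closer) != len(literal):
        raise ValueError("unexpected characters after the literal")
    # Drop the newline after the opening delimiter, keep the one before the closer.
    return literal[first_newline + 1:end + 1]


source = 'q"delim\nthis is some raw text\ndelim"'
print(repr(parse_d_heredoc(source)))  # 'this is some raw text\n'
```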

Pros:
* Flexible
* Allows for constructing a raw string that contains any desired sequence of 
characters (representable in the source file's encoding)

Cons:
* Overly complicated

## Custom syntax

There's another approach that none of these three languages take, which is to 
merely allow for doubling up the quote character in order to embed a quote. 
This would look like R"raw string literal ""with embedded quotes"".", which 
becomes `raw string literal "with embedded quotes".` (trailing period included).
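
Decoding this form is a single pass over the body. Here's a minimal Python sketch; the function name `parse_doubled_quote_raw` is mine, and since this syntax is only a proposal, the exact rules are an assumption.

```python
def parse_doubled_quote_raw(literal: str) -> str:
    """Decode R"..." where a doubled quote stands for one literal quote."""
    if len(literal) < 3 or not literal.startswith('R"') or not literal.endswith('"'):
        raise ValueError("not a raw string literal")
    body = literal[2:-1]
    out = []
    i = 0
    while i < len(body):
        if body[i] != '"':
            out.append(body[i])
            i += 1
        elif i + 1 < len(body) and body[i + 1] == '"':
            out.append('"')  # a doubled quote collapses to a single quote
            i += 2
        else:
            raise ValueError("a lone quote would end the literal early")
    return "".join(out)


print(parse_doubled_quote_raw('R"raw string literal ""with embedded quotes""."'))
# raw string literal "with embedded quotes".
```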

Pros:
* Very simple
* Allows for embedding the close quote character, and therefore, any character 
(representable in the source file encoding)

Cons:
* Slightly odd to read

## Conclusion

Of the three existing syntaxes examined here, I think C++11's is the best. It 
ties with D's syntax for being the most powerful, but is simpler than D's. The 
custom syntax is just as powerful though. The benefit of the C++11 syntax over 
the custom syntax is that it's slightly easier to read, as the raw text has a 
one-to-one mapping with the resulting string. The custom syntax is a bit more 
confusing to read, especially if you want to add multiple quotes. As a 
pathological case, let's try representing a Python triple-quoted docstring 
using both syntaxes:

C++11: R"("""this is a python docstring""")"
Custom: R"""""""this is a python docstring"""""""
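
For anyone who wants to try other inputs, here's a small Python sketch that produces both encodings from a given piece of text. The function names are hypothetical, padding the C++11 delimiter with `x` is just one arbitrary way to pick one, and the sketch ignores the 16-character cap on C++11 delimiters.

```python
def encode_cpp11(text: str) -> str:
    """Wrap text as R"seq(text)seq", growing seq until )seq" is absent from text."""
    seq = ""
    while ")" + seq + '"' in text:
        seq += "x"
    return 'R"' + seq + "(" + text + ")" + seq + '"'


def encode_custom(text: str) -> str:
    """Wrap text as R"...", doubling every embedded quote."""
    return 'R"' + text.replace('"', '""') + '"'


doc = '"""this is a python docstring"""'
print("C++11: ", encode_cpp11(doc))
print("Custom:", encode_custom(doc))
```

The C++11 form leaves the text untouched and only adds fixed-size wrapping, while the custom form grows by one character per embedded quote.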

Based on this examination, I'm leaning towards saying Rust should support 
C++11's raw string literal syntax.

I welcome any comments, criticisms, or suggestions.

-Kevin
_______________________________________________
Rust-dev mailing list
Rust-dev@mozilla.org
https://mail.mozilla.org/listinfo/rust-dev
