Re: Raw String Literal Library Support

Jim Laskey Tue, 20 Mar 2018 06:38:02 -0700

Summary.

A. Line support.


- Supporting a mix of line terminators `\n|\r\n|\r` is already a 
well-established pattern in language parsers, in the JDK (e.g., see 
java.nio.file.FileChannelLinesSpliterator) and in regex (e.g., see `\R`). The 
performance difference between checking one terminator vs. all three is 
negligible.

- Yes, Stream<String> stream = 
Pattern.compile("\n|\r\n|\r").splitAsStream(string); is very useful 
(Spliterators rule), but it is cumbersome for what is expected to be a common 
use case. Only so-so streamy. :-)

- BufferedReader.lines() vs. String.lines() is a tricky discussion. It comes 
down to whether the new line is a terminator or a separator.  In the I/O case, 
it seems terminator is the right answer. A well-formed text file will have a 
new line at the end of every line.  However, I think you’ll find that when 
people work with multi-line strings they think of new line as a separator. 
Hence the common use of split("\n") and "".split("\n").length == 1. 
Indentation, the position of the closing delimiter and margin trimming make 
that last line very fluid.

What clinches the deal is that 
string.lines().collect(joining("\n")).equals(string). I’ll ensure the 
difference between the two versions of lines() is well javadocumented.

- The current Spliterator implementation makes 
String.lines().toArray(String[]::new) an order of magnitude faster than 
split(`\n|\r\n|\r`). That’s why I implemented it for margin management. Faster 
still if no collection/array is constructed.

BTW: split(`\R`) is 2x-3x faster than split(`\n|\r\n|\r`). Nice.
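
A minimal sketch of the comparison above, assuming String.lines() as it 
shipped in JDK 11 (which may differ in detail from the draft discussed here). 
The class and method names are made up for illustration. The round trip is 
only checked for a `\n`-separated string without a trailing terminator.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Illustrative only: LineSplitSketch and its methods are hypothetical names.
class LineSplitSketch {
    // The three terminators discussed above, as a regex alternation.
    private static final Pattern TERMINATOR = Pattern.compile("\n|\r\n|\r");

    // The "cumbersome" form: split through a compiled Pattern.
    static List<String> viaPattern(String s) {
        return TERMINATOR.splitAsStream(s).collect(Collectors.toList());
    }

    // The proposed convenience: String.lines() (JDK 11 and later).
    static List<String> viaLines(String s) {
        return s.lines().collect(Collectors.toList());
    }

    // Treats the new line as a separator when putting the lines back together.
    static String rejoin(String s) {
        return s.lines().collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        String mixed = "abc\ndef\r\nghi\rjkl";
        System.out.println(viaPattern(mixed)); // [abc, def, ghi, jkl]
        System.out.println(viaLines(mixed));   // [abc, def, ghi, jkl]

        String text = "abc\ndef\nghi";
        System.out.println(rejoin(text).equals(text)); // true
        System.out.println("".split("\n").length);     // 1
    }
}
```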

B. Additions to basic trim methods.

- Revamped to become strip, stripLeading, stripTrailing using 
Character.isWhitespace(codepoint) as the test (optimized using ch == ' ' || ch 
== '\t' || Character.isWhitespace(ch)).

- No strong feeling about it, but String.trim() could be recommended for 
deprecation.
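
To make the whitespace test concrete, here is a hand-rolled sketch of the 
three strip variants built on the test described above. It is not the JDK 
implementation; the class name and helper methods are assumptions for 
illustration, and it works on code points rather than on the internal string 
representation.

```java
// Illustrative only: StripSketch is a hypothetical class, not JDK code.
class StripSketch {
    // Fast path for space and tab, then the general Character.isWhitespace test.
    static boolean isStripWhitespace(int cp) {
        return cp == ' ' || cp == '\t' || Character.isWhitespace(cp);
    }

    // Removes whitespace code points from the beginning of the string.
    static String stripLeading(String s) {
        int start = 0;
        while (start < s.length()) {
            int cp = s.codePointAt(start);
            if (!isStripWhitespace(cp)) break;
            start += Character.charCount(cp);
        }
        return s.substring(start);
    }

    // Removes whitespace code points from the end of the string.
    static String stripTrailing(String s) {
        int end = s.length();
        while (end > 0) {
            int cp = s.codePointBefore(end);
            if (!isStripWhitespace(cp)) break;
            end -= Character.charCount(cp);
        }
        return s.substring(0, end);
    }

    static String strip(String s) {
        return stripTrailing(stripLeading(s));
    }

    public static void main(String[] args) {
        String s = "  \u2028abc  ";
        // trim() only removes characters <= ' ', so U+2028 survives.
        System.out.println(s.trim().length());  // 4
        // The isWhitespace-based test also removes U+2028.
        System.out.println(strip(s).length());  // 3
    }
}
```

Note that non-breaking space (U+00A0) is not whitespace under this test, so 
it is left in place.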

C. Margin management.

- String.trimMarkers() as a default to String.trimMarkers("|", "|") is 
reasonable.  Will put it in the CSR for broader discussion.

- Re use of patterns. I think the Stream<String> lines() method will make it 
easy enough to create custom trim margin lambdas.
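
As a rough illustration of such a custom lambda, the sketch below rebuilds 
the trimMarkers behaviour from the quoted proposal on top of lines(). It 
assumes JDK 11 (String.lines() and String.strip()); the class and method are 
hypothetical helpers, and the removal of empty first and last lines described 
in the proposal is left out.

```java
import java.util.stream.Collectors;

// Illustrative only: MarginSketch.trimMarkers is a hypothetical static helper.
class MarginSketch {
    static String trimMarkers(String text, String left, String right) {
        return text.lines()
                   // Each line is first stripped of surrounding whitespace.
                   .map(String::strip)
                   // Drop the left marker if the line starts with it.
                   .map(line -> line.startsWith(left)
                           ? line.substring(left.length()) : line)
                   // Drop the right marker if the line ends with it.
                   .map(line -> line.endsWith(right)
                           ? line.substring(0, line.length() - right.length()) : line)
                   // Line separators are normalized to \n.
                   .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        String boxed = "|This is a line|\n"
                     + "        |This is a line|\n"
                     + "        |This is a line|";
        System.out.println(trimMarkers(boxed, "|", "|"));
    }
}
```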

D. Escape management.

- Good
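
For readers who want to experiment before anything lands, the unescape() 
behaviour described in section D of the quoted proposal can be approximated 
by hand. This is only a sketch under stated assumptions: the class name is 
made up, octal escapes are not handled, and IllegalArgumentException stands 
in for the proposed MalformedEscapeException.

```java
// Illustrative only: UnescapeSketch is a hypothetical class, not a JDK API.
class UnescapeSketch {
    // Translates simple backslash escapes and four-hex-digit Unicode escapes.
    // Octal escapes are NOT handled in this sketch.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i++);
            // Ordinary characters, and a lone trailing backslash, pass through.
            if (c != '\\' || i == s.length()) {
                out.append(c);
                continue;
            }
            char e = s.charAt(i++);
            switch (e) {
                case 'b': out.append('\b'); break;
                case 't': out.append('\t'); break;
                case 'n': out.append('\n'); break;
                case 'f': out.append('\f'); break;
                case 'r': out.append('\r'); break;
                case '"': out.append('"'); break;
                case '\'': out.append('\''); break;
                case '\\': out.append('\\'); break;
                case 'u':
                    if (i + 4 > s.length()) {
                        throw new IllegalArgumentException("truncated Unicode escape");
                    }
                    // parseInt throws NumberFormatException on non-hex input.
                    out.append((char) Integer.parseInt(s.substring(i, i + 4), 16));
                    i += 4;
                    break;
                default:
                    throw new IllegalArgumentException("unsupported escape: \\" + e);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes keep the escapes uninterpreted in the literal.
        System.out.println(unescape("abc\\u2022def\\nghi"));
    }
}
```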

Cheers,

— Jim




> On Mar 13, 2018, at 10:47 AM, Jim Laskey <james.las...@oracle.com> wrote:
> 
> With the announcement of JEP 326 Raw String Literals, we would like to open 
> up a discussion with regards to RSL library support. Below are several 
> implemented String methods that are believed to be appropriate. Please 
> comment on those mentioned below including recommending alternate names or 
> signatures. Additional methods can be considered if warranted, but as always, 
> the bar for inclusion in String is high.
> 
> You should keep a couple things in mind when reviewing these methods.
> 
> Methods should be applicable to all strings, not just Raw String Literals.
> 
> The number of additional methods should be minimized, not adding every 
> possible method.
> 
> Don't put any emphasis on performance. That is a separate discussion.
> 
> Cheers,
> 
> -- Jim
> 
> A. Line support.
> 
> public Stream<String> lines()
> Returns a stream of substrings extracted from this string partitioned by line 
> terminators. Internally, the stream is implemented using a Spliterator that 
> extracts one line at a time. The line terminators recognized are \n, \r\n and 
> \r. This method provides versatility for the developer working with 
> multi-line strings.
>     Example:
> 
>        String string = "abc\ndef\nghi";
>        Stream<String> stream = string.lines();
>        List<String> list = stream.collect(Collectors.toList());
> 
>     Result:
> 
>     [abc, def, ghi]
> 
> 
>     Example:
> 
>        String string = "abc\ndef\nghi";
>        String[] array = string.lines().toArray(String[]::new);
> 
>     Result:
> 
>     [Ljava.lang.String;@33e5ccce // [abc, def, ghi]
> 
> 
>     Example:
> 
>        String string = "abc\ndef\r\nghi\rjkl";
>        String platformString =
>            string.lines().collect(joining(System.lineSeparator()));
> 
>     Result:
> 
>     abc
>     def
>     ghi
>     jkl
> 
> 
>     Example:
> 
>        String string = " abc  \n   def  \n ghi   ";
>        String trimmedString =
>             string.lines().map(s -> s.trim()).collect(joining("\n"));
> 
>     Result:
> 
>     abc
>     def
>     ghi
> 
> 
>     Example:
> 
>        String table = `First Name      Surname        Phone
>                        Al              Albert         555-1111
>                        Bob             Roberts        555-2222
>                        Cal             Calvin         555-3333
>                       `;
> 
>        // Extract headers
>        String firstLine = table.lines().findFirst().orElse("");
>        List<String> headings = List.of(firstLine.trim().split(`\s{2,}`));
> 
>        // Build stream of maps
>        Stream<Map<String, String>> stream =
>            table.lines().skip(1)
>                 .map(line -> line.trim())
>                 .filter(line -> !line.isEmpty())
>                 .map(line -> line.split(`\s{2,}`))
>                 .map(columns -> {
>                     List<String> values = List.of(columns);
>                     return IntStream.range(0, headings.size()).boxed()
>                                     .collect(toMap(headings::get, values::get));
>                 });
> 
>        // print all "First Name"
>        stream.map(row -> row.get("First Name"))
>              .forEach(name -> System.out.println(name));
> 
>     Result:
> 
>     Al
>     Bob
>     Cal
> 
> B. Additions to basic trim methods.
> 
> In addition to margin methods trimIndent and trimMarkers described below in 
> Section C, it would be worth introducing trimLeft and trimRight to augment 
> the longstanding trim method. A key question is how trimLeft and trimRight 
> should detect whitespace, because different definitions of whitespace exist 
> in the library. 
> 
> trim itself uses the simple test less than or equal to the space character, a 
> fast test but not Unicode friendly. 
> 
> Character.isWhitespace(codepoint) returns true if codepoint is one of the 
> following:
> 
>   SPACE_SEPARATOR.
>   LINE_SEPARATOR.
>   PARAGRAPH_SEPARATOR.
>   '\t',     U+0009 HORIZONTAL TABULATION.
>   '\n',     U+000A LINE FEED.
>   '\u000B', U+000B VERTICAL TABULATION.
>   '\f',     U+000C FORM FEED.
>   '\r',     U+000D CARRIAGE RETURN.
>   '\u001C', U+001C FILE SEPARATOR.
>   '\u001D', U+001D GROUP SEPARATOR.
>   '\u001E', U+001E RECORD SEPARATOR.
>   '\u001F', U+001F UNIT SEPARATOR.
>   ' ',      U+0020 SPACE.
> (Note that non-breaking space (\u00A0) is excluded.) 
> 
> Character.isSpaceChar(codepoint) returns true if codepoint is one of the 
> following:
> 
>   SPACE_SEPARATOR.
>   LINE_SEPARATOR.
>   PARAGRAPH_SEPARATOR.
>   ' ',      U+0020 SPACE.
>   '\u00A0', U+00A0 NON-BREAKING SPACE.
> That sets up several kinds of whitespace; trim's whitespace (TWS), Character 
> whitespace (CWS) and the union of the two (UWS). TWS is a fast test. CWS is a 
> slow test. UWS is fast for Latin1 and slow-ish for UTF-16. 
> 
> We are recommending that trimLeft and trimRight use UWS, leave trim alone to 
> avoid breaking the world and then possibly introduce trimWhitespace that uses 
> UWS.
> 
> public String trim() 
> Removes characters less than or equal to space from the beginning and end of 
> the string. No change except spec clarification and links to the new trim 
> methods.
>    Examples:
>        "".trim();              // ""
>        "   ".trim();           // ""
>        "  abc  ".trim();       // "abc"
>        "  \u2028abc  ".trim(); // "\u2028abc"
> public String trimWhitespace() 
> Removes whitespace from the beginning and end of the string.
>     Examples:
> 
>        "".trimWhitespace();              // ""
>        "   ".trimWhitespace();           // ""
>        "  abc  ".trimWhitespace();       // "abc"
>        "  \u2028abc  ".trimWhitespace(); // "abc"
> public String trimLeft()
> Removes whitespace from the beginning of the string.
>     Examples:
> 
>        "".trimLeft();        // ""
>        "   ".trimLeft();     // ""
>        "  abc  ".trimLeft(); // "abc  "
> public String trimRight()
> Removes whitespace from the end of the string.
>     Examples:
> 
>        "".trimRight();        // ""
>        "   ".trimRight();     // ""
>        "  abc  ".trimRight(); // "  abc"
> 
> C. Margin management.
> 
> With introduction of multi-line Raw String Literals, developers will have to 
> deal with the extraneous spacing introduced by indenting and formatting 
> string bodies. 
> 
> Note that for all the methods in this group, if the first line is empty then 
> it is removed and if the last is empty then it is removed. This removal 
> provides a means for developers that use delimiters on separate lines to 
> bracket string bodies. Also note that all line separators are replaced with 
> \n.
> 
> public String trimIndent()
> This method determines a representative line in the string body that has a 
> non-whitespace character closest to the left margin. Once that line has been 
> determined, the number of leading whitespaces is tallied to produce a minimal 
> indent amount. Consequently, the result of the method is a string with the 
> minimal indent amount removed from each line. The first line is unaffected 
> since it is preceded by the open delimiter. The type of whitespace used 
> (spaces or tabs) does not affect the result as long as the developer is 
> consistent with the whitespace used.
>     Example:
> 
>        String x = `
>                   This is a line
>                      This is a line
>                          This is a line
>                      This is a line
>                   This is a line
>                   `.trimIndent();
> 
>     Result:
> 
>     This is a line
>         This is a line
>             This is a line
>         This is a line
>     This is a line
> public String trimMarkers(String leftMarker, String rightMarker)
> Each line of the multi-line string is first trimmed. If the trimmed line 
> contains the leftMarker at the beginning of the string then it is removed. 
> Finally, if the line contains the rightMarker at the end of line, it is 
> removed.
>     Example:
> 
>         String x = `|This is a line|
>                     |This is a line|
>                     |This is a line|`.trimMarkers("|", "|");
>     Result:
> 
>     This is a line
>     This is a line
>     This is a line
> 
>     Example:
> 
>         String x = `>> This is a line
>                     >> This is a line
>                     >> This is a line`.trimMarkers(">> ", "");
>     Result:
> 
>     This is a line
>     This is a line
>     This is a line
> 
> D. Escape management.
> 
> Since Raw String Literals do not interpret Unicode escapes (\unnnn) or escape 
> sequences (\n, \b, etc), we need to provide a scheme for developers who just 
> want multi-line strings but still have escape sequences interpreted.
> 
> public String unescape() throws MalformedEscapeException
> Translates each Unicode escape or escape sequence in the string into the 
> character represented by the escape. @jls 3.3, 3.10.6
>     Example:
> 
>         `abc\u2022def\nghi`.unescape();
> 
>     Result:
> 
>     abc•def
>     ghi
> public String unescape(EscapeType... escape) throws MalformedEscapeException
> Selectively translates Unicode escapes or escape sequences based on the 
> escape type flags provided.
>       public enum EscapeType {
>            /** Backslash escape sequences based on section 3.10.6 of the
>             * <cite>The Java&trade; Language Specification</cite>.
>             * This includes sequences for backspace, horizontal tab,
>             * line feed, form feed, carriage return, double quote,
>             * single quote, backslash and octal escape sequences.
>             */
>            BACKSLASH, //
> 
>            /** Unicode sequences based on section 3.3 of the
>             * <cite>The Java&trade; Language Specification</cite>.
>             * This includes sequences in the form {@code \u005Cunnnn}.
>             */
>            UNICODE
>        }
> 
> 
>     Example:
> 
>         `abc\u2022def\nghi`.unescape(EscapeType.BACKSLASH);
> 
>     Result:
> 
>     abc\u2022def
>     ghi
> 
> 
>     Example:
> 
>         `abc\u2022def\nghi`.unescape(EscapeType.UNICODE);
> 
>     Result:
> 
>     abc•def\nghi
> Conversely, there are circumstances where the inverse is required.
> 
> public String escape()
> Translates each quote, backslash, non-graphic character or non-ASCII 
> character into a Unicode escape or escape sequence. The method is equivalent 
> to escape(BACKSLASH, UNICODE).
>     Example:
> 
>         `abc•def
>         ghi`.escape();
> 
>     Result:
> 
>     abc\u2022def\nghi
> public String escape(EscapeType... escape)
> Selectively translates each quote, backslash, non-graphic character or 
> non-ASCII character into a Unicode escape or escape sequence based on the 
> escape type flags provided.
>     Example:
> 
>         `abc•def
>         ghi`.escape(EscapeType.BACKSLASH);
> 
>     Result:
> 
>     abc•def\nghi
> 
> 
>     Example:
> 
>         `abc•def
>         ghi`.escape(EscapeType.UNICODE);
> 
>     Result:
> 
>     abc\u2022def
>     ghi
>
