Re: Q: Lonely surrogates and unicode regexps

André Bargull Wed, 28 Jan 2015 08:03:46 -0800

On 1/28/2015 3:36 PM, Marja Hölttä wrote:

Based on Ex1, looks like the input string is not read as a sequence of code points when we try to
find a match for \1. So it's mostly read as a sequence of code points except when it's not. :/

Yep, back references are matched as a sequence of code units. The first link I've posted points to the relevant method in java.util.regex.Pattern. I've got no idea why it's implemented that way; for example, when you enable case-insensitive matching, back references are no longer matched as a sequence of code units:


---
import java.util.Arrays;
import java.util.regex.Pattern;

// UNICODE_CASE only has an effect in combination with CASE_INSENSITIVE.
int[] flags = { 0, Pattern.CASE_INSENSITIVE, Pattern.UNICODE_CASE,
        Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE };

// Prints true, false, true, false
Arrays.stream(flags).mapToObj(f -> Pattern.compile("foo(.+)bar\\1", f))
        .map(p -> p.matcher("foo\uD834bar\uD834\uDC00").find())
        .forEach(System.out::println);
---
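
The difference between the two comparison strategies can be sketched without the regex engine. The helper names below (matchesByChar, matchesByCodePoint) are made up for illustration and are not the actual java.util.regex.Pattern code; the sketch only assumes that a back reference compares the captured region against the input either char by char or code point by code point.

```java
class BackRefSketch {
    // Hypothetical helper: compares the captured region [start, end) with the
    // input at position pos, one UTF-16 code unit (Java char) at a time.
    static boolean matchesByChar(String s, int start, int end, int pos) {
        int len = end - start;
        if (pos + len > s.length()) return false;
        for (int k = 0; k < len; k++) {
            if (s.charAt(start + k) != s.charAt(pos + k)) return false;
        }
        return true;
    }

    // Hypothetical helper: same comparison, but one code point at a time, so a
    // surrogate pair in the input is never split.
    static boolean matchesByCodePoint(String s, int start, int end, int pos) {
        int i = start, j = pos;
        while (i < end) {
            if (j >= s.length()) return false;
            int c1 = s.codePointAt(i);
            int c2 = s.codePointAt(j);
            if (c1 != c2) return false;
            i += Character.charCount(c1);
            j += Character.charCount(c2);
        }
        return true;
    }

    public static void main(String[] args) {
        String input = "foo\uD834bar\uD834\uDC00";
        // Group 1 captures the lone high surrogate at [3, 4); the back
        // reference is tried at index 7, right after "bar".
        System.out.println(matchesByChar(input, 3, 4, 7));      // true
        System.out.println(matchesByCodePoint(input, 3, 4, 7)); // false
    }
}
```

Compared as chars, the lone U+D834 equals the first half of the pair at index 7. Compared as code points, U+D834 is set against U+1D000 and the match fails, which is consistent with the true/false pattern above.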


On Wed, Jan 28, 2015 at 3:11 PM, André Bargull wrote:

    On 1/28/2015 2:51 PM, André Bargull wrote:

            For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk
            1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works:

            foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so,
            generally, lonely surrogates match /./.

            Backreferences are allowed to consume the leading surrogate of a valid
            surrogate pair:

            Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1

            But surprisingly:

            Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$

            ... So Ex2 works as if the input string was converted to UTF-32 before
            matching, but Ex1 works as if it was def not. Idk what's the correct mental
            model where both Ex1 and Ex2 would make sense.


        java.util.regex.Pattern matches back references by comparing (Java) chars [1], but reads
        patterns as a sequence of code points [2]. That should help to explain the differences
        between ex1 and ex2.

        [1] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l4890
        [2] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l1671


    Err, the part about how patterns are read is not important here. What I should have written is
    that the input string is (also) read as a sequence of code points [3]. So in ex2 `\uD834\uDC00`
    is read as a single code point (and not split into \uD834 and \uDC00 during backtracking).

    [3] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l3773
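
A small sketch of why ex2 can fail once the input is read as code points, using only standard String and Character methods (the class name is made up for illustration). Measured in chars, the ex2 input splits into two identical halves, but the split point falls in the middle of the surrogate pair, and a matcher that advances by code points would never stop there.

```java
class Ex2Sketch {
    // The ex2 input: low surrogate, "foobar", a valid pair, "foobar", high surrogate.
    static final String INPUT = "\uDC00foobar\uD834\uDC00foobar\uD834";

    public static void main(String[] args) {
        int mid = INPUT.length() / 2;

        // 16 chars, but only 15 code points: the valid pair counts once.
        System.out.println(INPUT.length());                          // 16
        System.out.println(INPUT.codePointCount(0, INPUT.length())); // 15

        // As chars, the two halves are identical...
        System.out.println(INPUT.substring(0, mid).equals(INPUT.substring(mid))); // true

        // ...but the split index sits between the high and low surrogate of the pair.
        System.out.println(
                Character.isSurrogatePair(INPUT.charAt(mid - 1), INPUT.charAt(mid))); // true
    }
}
```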

_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
