Based on Ex1, it looks like the input string is not read as a sequence of
code points when we try to find a match for \1. So it's mostly read as a
sequence of code points, except when it's not. :/
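
For anyone who wants to reproduce the two examples quoted below, here is a
minimal sketch. The class and method names are made up for illustration, and
it assumes Ex1 was checked with find() (so the trailing \uDC00 may be left
unconsumed) and Ex2 with matches(). It also prints the char and code point
counts, plus a check that Ex2 is an exact repetition when viewed as Java
chars, which is why the missing match is surprising. Per the thread, Ex1
matched and Ex2 did not on the JDKs tried there; other versions may differ.

```java
import java.util.regex.Pattern;

public class SurrogateBackrefDemo {
    // Ex1 input: a lone lead surrogate, then "bar", then a full surrogate pair.
    static final String EX1 = "foo\uD834bar\uD834\uDC00";
    // Ex2 input: lone trail surrogate + "foobar" + pair + "foobar" + lone lead.
    static final String EX2 = "\uDC00foobar\uD834\uDC00foobar\uD834";

    // Assumption: the thread's "matches" for Ex1 means an unanchored find().
    static boolean ex1Found() {
        return Pattern.compile("foo(.+)bar\\1", Pattern.UNICODE_CHARACTER_CLASS)
                .matcher(EX1).find();
    }

    // The Ex2 pattern is anchored, so matches() is used on the whole input.
    static boolean ex2Matches() {
        return Pattern.compile("^(.+)\\1$", Pattern.UNICODE_CHARACTER_CLASS)
                .matcher(EX2).matches();
    }

    // True when the first half of Ex2 equals the second half char by char.
    static boolean ex2HalvesEqualAsChars() {
        int half = EX2.length() / 2;
        return EX2.substring(0, half).equals(EX2.substring(half));
    }

    // Number of code points; a well-formed surrogate pair counts as one.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println("Ex1 chars/code points: "
                + EX1.length() + "/" + codePoints(EX1));
        System.out.println("Ex2 chars/code points: "
                + EX2.length() + "/" + codePoints(EX2));
        System.out.println("Ex2 halves equal as chars: " + ex2HalvesEqualAsChars());
        System.out.println("Ex1 find():    " + ex1Found());
        System.out.println("Ex2 matches(): " + ex2Matches());
    }
}
```

If the group in Ex2 were allowed to end between \uD834 and \uDC00, the two
halves would be identical and ^(.+)\1$ would succeed, so the failure suggests
that `.` never stops in the middle of a pair.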

On Wed, Jan 28, 2015 at 3:11 PM, André Bargull <andre.barg...@udo.edu>
wrote:

> On 1/28/2015 2:51 PM, André Bargull wrote:
>
>>> For reference, here's how Java (tried w/ Oracle 1.8.0_31 and OpenJDK
>>> 1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works:
>>>
>>> foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so,
>>> generally, lonely surrogates match /./.
>>>
>>> Backreferences are allowed to consume the leading surrogate of a valid
>>> surrogate pair:
>>>
>>> Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1
>>>
>>> But surprisingly:
>>>
>>> Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$
>>>
>>> ... So Ex2 works as if the input string was converted to UTF-32 before
>>> matching, but Ex1 works as if it definitely was not. I don't know what the
>>> correct mental model is where both Ex1 and Ex2 would make sense.
>>>
>>
>> java.util.regex.Pattern matches backreferences by comparing (Java) chars
>> [1], but reads patterns as a sequence of code points [2]. That should help
>> to explain the differences between Ex1 and Ex2.
>>
>> [1] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l4890
>> [2] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l1671
>>
>
> Err, the part about how patterns are read is not important here. What I
> should have written is that the input string is (also) read as a sequence
> of code points [3]. So in ex2 `\uD834\uDC00` is read as a single code point
> (and not split into \uD834 and \uDC00 during backtracking).
>
> [3] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l3773
>
_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
