Based on Ex1, it looks like the input string is not read as a sequence of code points when we try to find a match for \1. So it's mostly read as a sequence of code points, except when it's not. :/

On Wed, Jan 28, 2015 at 3:11 PM, André Bargull <andre.barg...@udo.edu> wrote:

> On 1/28/2015 2:51 PM, André Bargull wrote:
>
>>> For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk
>>> 1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works:
>>>
>>> foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so,
>>> generally, lonely surrogates match /./.
>>>
>>> Backreferences are allowed to consume the leading surrogate of a valid
>>> surrogate pair:
>>>
>>> Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1
>>>
>>> But surprisingly:
>>>
>>> Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$
>>>
>>> ... So Ex2 works as if the input string was converted to UTF-32 before
>>> matching, but Ex1 works as if it was def not. Idk what's the correct
>>> mental model where both Ex1 and Ex2 would make sense.
>>
>> java.util.regex.Pattern matches back references by comparing (Java) chars
>> [1], but reads patterns as a sequence of code points [2]. That should help
>> to explain the differences between ex1 and ex2.
>>
>> [1] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l4890
>> [2] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l1671
>
> Err, the part about how patterns are read is not important here. What I
> should have written is that the input string is (also) read as a sequence
> of code points [3]. So in ex2 `\uD834\uDC00` is read as a single code point
> (and not split into \uD834 and \uDC00 during backtracking).
>
> [3] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l3773
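
For anyone who wants to re-run the quoted examples, here is a minimal sketch. The class name SurrogateBackrefDemo and the find() helper are made up for illustration. The thread doesn't say which Matcher method was used, so using Matcher.find() here is an assumption. The results for Ex1 and Ex2 are printed rather than asserted, since they were only reported for the JDK versions named above and might differ elsewhere.

```java
import java.util.regex.Pattern;

public class SurrogateBackrefDemo {
    // Hypothetical helper: true if the regex finds a match anywhere in the input.
    // Using find() (not matches()) is an assumption; the thread does not say.
    static boolean find(String regex, String input) {
        return Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(input)
                      .find();
    }

    public static void main(String[] args) {
        // Ex1 input: a lone high surrogate, then a valid surrogate pair at the end.
        String ex1 = "foo\uD834bar\uD834\uDC00";
        // Ex2 input: lone low surrogate, a valid pair in the middle, lone high at the end.
        String ex2 = "\uDC00foobar\uD834\uDC00foobar\uD834";

        // The two views of the same string: Java chars (UTF-16 code units)
        // versus code points. A valid surrogate pair is 2 chars but 1 code point.
        System.out.println("ex1 chars:       " + ex1.length());
        System.out.println("ex1 code points: " + ex1.codePointCount(0, ex1.length()));

        // Lone surrogate against '.': reported to match in the thread.
        System.out.println("lone surrogate:  " + find("^foo.bar$", "foo\uD834bar"));

        // Ex1 and Ex2 from the thread; compare the output with the reported results.
        System.out.println("Ex1: " + find("foo(.+)bar\\1", ex1));
        System.out.println("Ex2: " + find("^(.+)\\1$", ex2));
    }
}
```

If the explanation quoted above holds, the asymmetry comes from the two views printed first: the group (.+) advances through the input by code point, while the backreference \1 is compared char by char.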

_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss