I think the cleanest mental model is one in which UTF-16 or UTF-8 strings are interpreted as if they had been transformed into UTF-32.
While that is generally feasible, it often represents a cost in performance which is not acceptable in practice. So you see various approaches that involve some deviation from that mental model.

Mark <https://google.com/+MarkDavis>

*— Il meglio è l’inimico del bene —*

On Wed, Jan 28, 2015 at 2:15 PM, Marja Hölttä <ma...@chromium.org> wrote:

> For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk
> 1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works:
>
> foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so,
> generally, lonely surrogates match /./.
>
> Backreferences are allowed to consume the leading surrogate of a valid
> surrogate pair:
>
> Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1
>
> But surprisingly:
>
> Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$
>
> ... So Ex2 works as if the input string was converted to UTF-32 before
> matching, but Ex1 works as if it was def not. Idk what's the correct mental
> model where both Ex1 and Ex2 would make sense.
>
>
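
To make the "as if transformed into UTF-32" model concrete, here is a minimal Java sketch of the two ways of looking at the Ex2 input. The class and method names (`Utf32View`, `codePoints`, `codeUnits`, `isDoubled`) are hypothetical and chosen only for illustration. This does not claim to show how java.util.regex works internally. It only compares the code-point view with the raw UTF-16 code-unit view, using the standard `String.codePoints()` and `String.chars()` methods.

```java
import java.util.Arrays;

final class Utf32View {
    // "As if UTF-32" view: valid surrogate pairs are combined into one code point,
    // while unpaired (lonely) surrogates pass through as their own elements.
    static int[] codePoints(String s) {
        return s.codePoints().toArray();
    }

    // Raw view: one element per 16-bit UTF-16 code unit.
    static int[] codeUnits(String s) {
        return s.chars().toArray();
    }

    // True if the sequence is some non-empty prefix repeated exactly twice,
    // which is what ^(.+)\1$ asks for when applied to the whole input.
    static boolean isDoubled(int[] units) {
        if (units.length == 0 || units.length % 2 != 0) {
            return false;
        }
        int half = units.length / 2;
        return Arrays.equals(
                Arrays.copyOfRange(units, 0, half),
                Arrays.copyOfRange(units, half, units.length));
    }

    public static void main(String[] args) {
        // The Ex2 input from the quoted message.
        String ex2 = "\uDC00foobar\uD834\uDC00foobar\uD834";
        System.out.println("code units:  " + codeUnits(ex2).length
                + ", doubled: " + isDoubled(codeUnits(ex2)));
        System.out.println("code points: " + codePoints(ex2).length
                + ", doubled: " + isDoubled(codePoints(ex2)));
    }
}
```

In the code-unit view the Ex2 input is 16 units long and its two halves are identical. In the code-point view the middle \uD834\uDC00 fuses into the single code point U+1D000, leaving 15 code points that cannot be split into two equal halves, which is consistent with Ex2 not matching under the UTF-32 model.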
_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss