Re: RL1.1 Hex Notation

Xueming Shen Thu, 27 Jan 2011 15:16:03 -0800

I ran

    import java.util.regex.Pattern;

    public class HexNotationTest {
        public static void main(String[] args) {
            // Text is U+1033C written as a surrogate pair in the first
            // eight cases, and two reversed (unpaired) surrogates in the
            // last two.
            test("\uD800\uDF3C", "^\\x{1033c}$");
            test("\uD800\uDF3C", "^\\xF0\\x90\\x8C\\xBC$");
            test("\uD800\uDF3C", "^\\x{D800}\\x{DF3c}+$");
            test("\uD800\uDF3C", "^[\\x{D800}\\x{DF3c}]+$");
            test("\uD800\uDF3C", "^\\xF0\\x90\\x8C\\xBC$");
            test("\uD800\uDF3C", "^[\\xF0\\x90\\x8C\\xBC]+$");
            test("\uD800\uDF3C", "^[\\x{D800}\\x{DF3C}]+$");
            test("\uD800\uDF3C", "^[\\x{DF3C}\\x{D800}]+$");
            test("\uDF3C\uD800", "^[\\x{D800}\\x{DF3C}]+$");
            test("\uDF3C\uD800", "^[\\x{DF3C}\\x{D800}]+$");
        }

        // Prints whether the whole text matches the pattern.
        static void test(String text, String pattern) {
            System.out.println(Pattern.matches(pattern, text));
        }
    }

It yields

true
false
false
false
false
false
false
false
true
true

The difference from your expected results is at

        test("\uD800\uDF3C", "^[\\x{D800}\\x{DF3C}]+$");
        test("\uD800\uDF3C", "^[\\x{DF3C}\\x{D800}]+$");

You can have an unpaired surrogate in a Java String, but if you have a
paired one you can't say you want them to be two separate "unpaired"
surrogates.
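
To make the paired/unpaired distinction concrete, here is a minimal sketch
that uses only the String and Character APIs. The class name SurrogateView
and the helper codePointsOf are made up for illustration. It shows that a
high surrogate followed by a low surrogate is read as one code point, while
the same two chars in reverse order are read as two.

```java
import java.util.ArrayList;
import java.util.List;

public class SurrogateView {
    // Lists the code points of s as upper-case hex strings.
    static List<String> codePointsOf(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(Integer.toHexString(cp).toUpperCase());
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        String paired = "\uD800\uDF3C";   // high surrogate, then low
        String reversed = "\uDF3C\uD800"; // low surrogate, then high
        System.out.println(codePointsOf(paired));   // [1033C]
        System.out.println(codePointsOf(reversed)); // [DF3C, D800]
    }
}
```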

Pretty close, right? Sure, you would need the \x{...} patch :-) which I'm
preparing at http://cr.openjdk.java.net/~sherman/7014645/

Yes, the [\\uhhhh\\ullll] pair inside a class is tricky: the implementation
can't tell if you want paired or unpaired, and the current implementation
treats them as a paired surrogate, i.e. a supplementary character. If you
also want to match the "unpaired" surrogates in a string, an alternative is
to write them as a union, [[\\uhhhh][\\ullll]\\uhhhh\\ullll], for example

        test("\uD800\uDF3C", "^[[\\uD800][\\uDF3C]\\uD800\\uDF3C]+$");
        test("\uDF3C\uD800", "^[[\\uD800][\\uDF3C]\\uD800\\uDF3C]+$");
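
A self-contained way to try the union form is sketched below. It assumes a
java.util.regex implementation that pairs adjacent \uhhhh\ullll escapes into
a supplementary character as described above; the class name UnionClassCheck
and the helper matches are hypothetical. Under that assumption both calls
should report a match.

```java
import java.util.regex.Pattern;

public class UnionClassCheck {
    // Class accepting the lone high surrogate, the lone low surrogate,
    // and the supplementary character written as the two escapes together.
    static final String UNION = "^[[\\uD800][\\uDF3C]\\uD800\\uDF3C]+$";

    // Returns whether the whole text matches the union class.
    static boolean matches(String text) {
        return Pattern.matches(UNION, text);
    }

    public static void main(String[] args) {
        System.out.println(matches("\uD800\uDF3C")); // surrogates in pair order
        System.out.println(matches("\uDF3C\uD800")); // two unpaired surrogates
    }
}
```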

I assume if I can have this \x{...} in (7), we all agree we are done with RL1.1? :-)


-Sherman

On 01/27/2011 12:48 PM, Tom Christiansen wrote:

on 7 with the following output. I modified your test case "slightly"
since it appears the UnicodeSet class in our normalizer package does
not have the size(), replace it with a normal hashset.

Does that mean the following now works?

     1. a+b matches "[" + a + b + "]+"
     2. b+a matches "[" + a + b + "]+"
     3. a+b matches "[" + b + a + "]+"
     4. b+a matches "[" + b + a + "]+"

When a and b take on every Unicode code point, meaning
from U+0000 up to  U+10FFFF?  If they do not, then one
is not specifying Unicode code points.
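
The four checks above could be sketched in Java roughly as follows,
assuming a java.util.regex that accepts the braced \x{h...h} notation
discussed in this thread. The class ClassOrderCheck and its helpers are
hypothetical names, and the sketch tries only two sample pairs rather than
every code point.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class ClassOrderCheck {
    // Writes a code point in braced hex notation, e.g. 0x61 gives \x{61}.
    static String hex(int cp) {
        return String.format("\\x{%X}", cp);
    }

    // Builds the string made of code point first followed by second.
    static String str(int first, int second) {
        return new StringBuilder()
                .appendCodePoint(first)
                .appendCodePoint(second)
                .toString();
    }

    // Returns the results of checks 1 to 4, in order, for the pair (a, b).
    static boolean[] check(int a, int b) {
        String ab = "[" + hex(a) + hex(b) + "]+";
        String ba = "[" + hex(b) + hex(a) + "]+";
        return new boolean[] {
            Pattern.matches(ab, str(a, b)),
            Pattern.matches(ab, str(b, a)),
            Pattern.matches(ba, str(a, b)),
            Pattern.matches(ba, str(b, a))
        };
    }

    public static void main(String[] args) {
        // An ordinary pair: a BMP letter and a supplementary character.
        System.out.println(Arrays.toString(check('a', 0x1033C)));
        // The surrogate pair that the rest of this thread is about.
        System.out.println(Arrays.toString(check(0xD800, 0xDF3C)));
    }
}
```

The surrogate pair is the interesting case: the string built from D800
followed by DF3C is, to Java, the single code point U+1033C.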

Please correct me if I am wrong, but I believe the following
code showing how logical code points are *never* mistaken with
their serialization representations is conforming behaviour--and
that results other than these would indicate nonconforming behavior:

     $ perl -le 'print "\x{1033c}" =~ /^\x{1033c}$/ || 0'
     1
     $ perl -le 'print "\xF0\x90\x8C\xBC" =~ /^\x{1033c}$/ || 0'
     0
     $ perl -le 'print "\x{D800}\x{DF3C}" =~ /^\xF0\x90\x8C\xBC$/ || 0'
     0

     $ perl -le 'print "\x{1033c}" =~ /^\x{D800}\x{DF3c}+$/ || 0'
     0
     $ perl -le 'print "\x{1033c}" =~ /^[\x{D800}\x{DF3c}]+$/ || 0'
     0
     $ perl -le 'print "\x{1033c}" =~ /^\xF0\x90\x8C\xBC$/ || 0'
     0
     $ perl -le 'print "\x{1033c}" =~ /^[\xF0\x90\x8C\xBC]+$/ || 0'
     0

     $ perl -le 'print "\x{D800}\x{DF3C}" =~ /^[\x{D800}\x{DF3C}]+$/ || 0'
     1
     $ perl -le 'print "\x{D800}\x{DF3C}" =~ /^[\x{DF3C}\x{D800}]+$/ || 0'
     1
     $ perl -le 'print "\x{DF3C}\x{D800}" =~ /^[\x{D800}\x{DF3C}]+$/ || 0'
     1
     $ perl -le 'print "\x{DF3C}\x{D800}" =~ /^[\x{DF3C}\x{D800}]+$/ || 0'
     1

Can Java do that yet?  If not, then \uXXXX does not meet RL1.1, and one
appears to need \x{} or its equivalent to do so--with the proviso from
the top of this message that it must not be double evaluated for meta
characters: \x{} must always be a literal code point of that number
without regard to reinterpretation as UTF-16 or as pattern syntax.
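
For reference, the three views of U+1033C that the one-liners above keep
apart (the code point itself, its UTF-8 bytes, and its UTF-16 code units)
can be printed from Java with a short sketch. The class name Serializations
and the helper hexBytes are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class Serializations {
    // Formats bytes as space-separated upper-case hex.
    static String hexBytes(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1033C));
        // The logical code point.
        System.out.println(Integer.toHexString(s.codePointAt(0)));           // 1033c
        // Its UTF-8 serialization.
        System.out.println(hexBytes(s.getBytes(StandardCharsets.UTF_8)));    // F0 90 8C BC
        // Its UTF-16 serialization (big-endian, no byte order mark).
        System.out.println(hexBytes(s.getBytes(StandardCharsets.UTF_16BE))); // D8 00 DF 3C
    }
}
```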

I'm sorry if this is too terse.  I do not mean to be in the least bit
confrontational!  I apologize in advance if it sounds that way; I really do
not intend it.  It is possible that I have a different way of looking at
regexes than Java folks have historically considered them.  Even if so,
I believe my way of looking at them accords with tr18's RL1.1 in both
its letter and its spirit, and that Java's current way fails to meet
that requirement in either sense.

--tom

     #!/bin/sh
     # expected results: 1 0 0 0 0 0 0 1 1 1 1
     perl -le 'print "\x{1033c}" =~ /^\x{1033c}$/ || 0'
     perl -le 'print "\xF0\x90\x8C\xBC" =~ /^\x{1033c}$/ || 0'
     perl -le 'print "\x{D800}\x{DF3C}" =~ /^\xF0\x90\x8C\xBC$/ || 0'
     perl -le 'print "\x{1033c}" =~ /^\x{D800}\x{DF3c}+$/ || 0'
     perl -le 'print "\x{1033c}" =~ /^[\x{D800}\x{DF3c}]+$/ || 0'
     perl -le 'print "\x{1033c}" =~ /^\xF0\x90\x8C\xBC$/ || 0'
     perl -le 'print "\x{1033c}" =~ /^[\xF0\x90\x8C\xBC]+$/ || 0'
     perl -le 'print "\x{D800}\x{DF3C}" =~ /^[\x{D800}\x{DF3C}]+$/ || 0'
     perl -le 'print "\x{D800}\x{DF3C}" =~ /^[\x{DF3C}\x{D800}]+$/ || 0'
     perl -le 'print "\x{DF3C}\x{D800}" =~ /^[\x{D800}\x{DF3C}]+$/ || 0'
     perl -le 'print "\x{DF3C}\x{D800}" =~ /^[\x{DF3C}\x{D800}]+$/ || 0'
