On 01/26/2011 11:50 AM, Mark Davis ☕ wrote:
> I guess you are asking for something like?

I'm not asking for that. What I'm saying is that as far as I can tell, there is no way in Java to meet the terms of RL1.1, because there is not a way to use hex numbers in any syntax for values above FFFF to indicate literals. That is, if you supply "abc\\uD800\\uDC00def" then regex fails.

The code was my attempt to try to get something to work even using separate surrogates (which was not the intent of RL1.1), but even that failed. Maybe there is another way to do it?

Mark
//

Oh, I see the problem. I have obviously been working on JDK 7 for too long and forgot that the latest release is still 6 :-( There is indeed a bug in the previous implementation, which I fixed in 7 a long time ago (I mentioned this in one of the early emails but was not specific, my apologies); it should probably be backported to a 6 update release ASAP. The test case runs well on 7 (the "failures" in LITERALS are expected) and produces the output below. I modified your test case "slightly": the UnicodeSet class in our normalizer package does not appear to have size(), so I replaced it with a normal hash set.
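
For reference, here is a minimal sketch of writing a supplementary code point with hex digits in a pattern. It assumes a Java 7 or later runtime, where a \uD800\uDC00 escape pair is read as a single code point and the \x{h...h} form is also accepted. The class and method names are made up for illustration and are not part of the test case.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original test case.
public class SupplementaryEscapes {

    // Returns true if the pattern matches "a" + codePoint + "b" in full.
    public static boolean matchesAround(String pattern, int codePoint) {
        String target = "a" + new String(Character.toChars(codePoint)) + "b";
        return Pattern.compile(pattern).matcher(target).matches();
    }

    public static void main(String[] args) {
        int cp = 0x10000; // U+10000, encoded in UTF-16 as D800 DC00

        // Surrogate pair written as two \\u escapes, inline and in a character class.
        System.out.println(matchesAround("a\\uD800\\uDC00b", cp));
        System.out.println(matchesAround("a[\\uD800\\uDC00]b", cp));

        // Code point escape added in Java 7, inline and in a character class.
        System.out.println(matchesAround("a\\x{10000}b", cp));
        System.out.println(matchesAround("a[\\x{10000}]b", cp));
    }
}
```

On 7 or later all four lines should print true; on 6 the \x{...} form is not available at all.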

-Sherman

------------------------------------------------------------------
LITERALS Failures: 18
    set: [9, 10, 11, 12, 13, 32, 35, 36, 40, 41, 42, 43, 63, 91, 92, 94, 123, 124]
    example1: a    b
    exampleN: a|b
INLINE Failures: 0
    set: []
    example1: null
    exampleN: null
INRANGE Failures: 0
    set: []
    example1: null
    exampleN: null
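
The 18 LITERALS failures are expected. The code points are tab, LF, VT, FF, CR and space (whitespace, which Pattern.COMMENTS drops from the pattern), # (which starts a comment under COMMENTS), and the metacharacters $ ( ) * + ? [ \ ^ { |. A small sketch of that behavior follows; the class and method names are illustrative only.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original test case.
public class LiteralFailures {

    // Same check as the test case: compile with COMMENTS, treat any exception as a failure.
    public static boolean matchesWithComments(String pattern, String target) {
        try {
            return Pattern.compile(pattern, Pattern.COMMENTS).matcher(target).matches();
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Whitespace in the pattern is dropped under COMMENTS, so "a b" behaves like "ab".
        System.out.println(matchesWithComments("a b", "a b")); // false
        System.out.println(matchesWithComments("a b", "ab"));  // true

        // '#' starts a comment that runs to the end of the line, leaving just "a".
        System.out.println(matchesWithComments("a#b", "a#b")); // false

        // '|' is alternation: the pattern matches "a" or "b", not "a|b".
        System.out.println(matchesWithComments("a|b", "a|b")); // false

        // An unclosed '(' is a syntax error, which is caught and counted as a failure.
        System.out.println(matchesWithComments("a(b", "a(b")); // false

        // '.' is a metacharacter too, but it also matches itself, so 46 is not in the set.
        System.out.println(matchesWithComments("a.b", "a.b")); // true
    }
}
```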

-----------------------------------------------------------------------
import java.util.regex.*;
import java.util.*;
import sun.text.normalizer.*;

public class TestRegex2 {

    public static void main(String[] args) {

        System.out.println("Check patterns for Unicodeset");

        for (int i = 0; i <= 0x10FFFF; ++i) {
            // The goal is to make a regex with hex digits, and have it match the corresponding character.
            // We check two different environments: inline ("aXb") and in a range ("a[X]b").


            String s = new StringBuilder().appendCodePoint(i).toString();
            String hexPattern = i <= 0xFFFF ? "\\u" + Utility.hex(i,4)
                    : "\\u" + Utility.hex(Character.toChars(i)[0],4) + "\\u" + Utility.hex(Character.toChars(i)[1],4);

            String target = "a" + s + "b";

            Failures.LITERALS.checkMatch(i, "a" + s + "b", target);
            Failures.INLINE.checkMatch(i, "a" + hexPattern + "b", target);
            Failures.INRANGE.checkMatch(i, "a[" + hexPattern + "]b", target);
        }

        Failures.LITERALS.showFailures();
        Failures.INLINE.showFailures();
        Failures.INRANGE.showFailures();
    }


    static enum Failures {

        LITERALS, INLINE, INRANGE;

        Set<Integer> failureSet = new LinkedHashSet<Integer>();
        String firstSampleFailure;
        String lastSampleFailure;

        void checkMatch(int codePoint, String pattern, String target) {

            if (!matches(pattern, target)) {
                failureSet.add(codePoint);
                if (firstSampleFailure == null) {
                    firstSampleFailure = pattern;
                }
                lastSampleFailure = pattern;
            }
        }

        boolean matches(String hexPattern, String target) {
            try {
                // use COMMENTS to get the 'worst case'
                return Pattern.compile(hexPattern, Pattern.COMMENTS).matcher(target).matches();
            } catch (Exception e) {
                return false;
            }
        }

        void showFailures() {
            System.out.format(this + " Failures: %s\n\tset: %s\n\texample1: %s\n\texampleN: %s\n",
                    failureSet.size(), failureSet, firstSampleFailure, lastSampleFailure);
        }

    }
}
