Re: RL1.1 Hex Notation

Xueming Shen Tue, 25 Jan 2011 17:49:43 -0800

Hi Mark,

I guess you are asking for something like this?


        char[] cc = Character.toChars(0x12345);
        Matcher m = Pattern.compile("["
                                     + "\\u" + HEX(cc[0])
                                     + "\\u" + HEX(cc[1])
                                     + "]").matcher("");

        System.out.println("find=" + m.reset("abc[" + new String(cc) + "]efg").find());


where HEX should be something like the following, so that each value is formatted as four hex digits (nnnn).

    static String HEX(char c) {
        StringBuilder sb = new StringBuilder();
        Formatter fm = new Formatter(sb);
        fm.format("%04x", (int)c);
        return sb.toString();
    }

It looks a little tedious, and you will probably also have to distinguish BMP from supplementary characters to decide whether to feed in one UTF-16 hex value or a pair. It is just to show that you can still use the Java Unicode escape to embed the hex values of the UTF-16 code units instead of the "character itself". Does it qualify for RL1.1?
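
As a rough sketch of the BMP-versus-supplementary step described above: the class name HexEscapeDemo and the helper toRegexEscape are made up for illustration and are not part of any JDK API. The helper emits one four-digit escape per UTF-16 code unit, so a BMP code point yields one escape and a supplementary code point yields a pair. Whether the regex engine then treats that pair as a single supplementary character depends on the implementation, which is exactly the question in this thread.

```java
import java.util.regex.Pattern;

class HexEscapeDemo {
    // Hypothetical helper: builds a backslash-u escape for each UTF-16 code unit
    // of the code point (one unit for BMP, a surrogate pair for supplementary).
    static String toRegexEscape(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            sb.append("\\u").append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int cp = 0x12345;
        String escaped = toRegexEscape(cp);
        String target = "abc[" + new String(Character.toChars(cp)) + "]efg";
        // The result of find() is printed rather than assumed, since it is
        // the behavior being tested.
        boolean found = Pattern.compile("[" + escaped + "]").matcher(target).find();
        System.out.println(escaped + " find=" + found);
    }
}
```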

Sure, \x{...} looks more straightforward and convenient. As I said in our previous email exchange, I totally agree it would be a nice enhancement for Java RegEx.

-Sherman

On 1-25-2011 17:00 05:00 PM, Mark Davis ☕ wrote:

The goal of the clause is to have a mechanism for using hex values for character literals. That is, you should be able to take a code point from 0 to 10FFFF, get a hex value for that, embed it in some syntax, and concatenate it into a pattern, and have it work as a literal.
For example:

    String pattern = first_part + "\\x{" + hex(myCodePoint) + "}"
            + second_part; // for *some* hex notation
    ...
    Matcher m = Pattern.compile(pattern, Pattern.COMMENTS).matcher(target);
    ...
As far as I can tell, Java really doesn't supply that capability for non-BMP, because the \u notation doesn't work above FFFF, except insofar as the preprocessor maps a surrogate pair in hex to literals, which happen all to work because they aren't syntax characters.
What you can do with Java is:

   1. embed the character itself, not the hex representation, which
      works some of the time (fails for 18 characters; syntax
      characters, as expected).
   2. in constant expressions only, utilize the Java preprocessor
      (with \u.... or \u....\u....).
   3. for BMP characters, use "\u" + hex(myCodePoint,4)
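
A minimal sketch of the three options above, using a BMP character so that all of them apply. The class EmbedOptionsDemo and its helpers literal and bmpEscape are hypothetical names for illustration, and the sketch does not use Pattern.COMMENTS, so it does not reproduce the 'worst case' test. The last check shows the kind of failure item 1 refers to: a syntax character embedded as a literal is read as syntax.

```java
import java.util.regex.Pattern;

class EmbedOptionsDemo {
    // Option 1 (hypothetical helper): the character itself, as a string.
    static String literal(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Option 3 (hypothetical helper): a four-digit hex escape built at run time,
    // valid for BMP code points only.
    static String bmpEscape(int codePoint) {
        if (codePoint < 0 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("not a BMP code point");
        }
        return "\\u" + String.format("%04X", codePoint);
    }

    public static void main(String[] args) {
        int cp = 0x00E9;
        String target = "a" + literal(cp) + "b";

        // Option 1: embed the character itself.
        System.out.println(Pattern.matches("a" + literal(cp) + "b", target));
        // Option 2: constant expression; the compiler translates the escape
        // before the regex engine ever sees the string.
        System.out.println(Pattern.matches("a\u00E9b", target));
        // Option 3: hex escape concatenated at run time, parsed by the regex engine.
        System.out.println(Pattern.matches("a" + bmpEscape(cp) + "b", target));

        // A syntax character embedded literally acts as alternation here,
        // so the pattern does not match the three-character target.
        System.out.println(Pattern.matches("a" + literal('|') + "b", "a|b"));
    }
}
```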

Here is a quick and dirty test; let me know if I've missed something.

*Output:*

LITERALS Failures: 18

        set: [\u0009-\u000D\ #\$(-+?\[\\\^\{|]

        example1: ab

        exampleN: a|b

INLINE Failures: 1048576

        set: [\U00010000-\U0010FFFF]

        example1: a\uD800\uDC00b

        exampleN: a\uDBFF\uDFFFb

INRANGE Failures: 1048576

        set: [\U00010000-\U0010FFFF]

        example1: a[\uD800\uDC00]b

        exampleN: a[\uDBFF\uDFFF]b


*Code:*

    public void TestRegex() {
        logln("Check patterns for Unicodeset");

        for (int i = 0; i <= 0x10FFFF; ++i) {
            // The goal is to make a regex with hex digits, and have it match the corresponding character
            // We check two different environments: inline ("aXb") and in a range ("a[X]b")
            String s = new StringBuilder().appendCodePoint(i).toString();

            String hexPattern = i <= 0xFFFF
                    ? "\\u" + Utility.hex(i, 4)
                    : "\\u" + Utility.hex(Character.toChars(i)[0], 4)
                            + "\\u" + Utility.hex(Character.toChars(i)[1], 4);
            String target = "a" + s + "b";

            Failures.LITERALS.checkMatch(i, "a" + s + "b", target);
            Failures.INLINE.checkMatch(i, "a" + hexPattern + "b", target);
            Failures.INRANGE.checkMatch(i, "a[" + hexPattern + "]b", target);
        }

        Failures.LITERALS.showFailures();
        Failures.INLINE.showFailures();
        Failures.INRANGE.showFailures();
    }


    enum Failures {
        LITERALS, INLINE, INRANGE;

        UnicodeSet failureSet = new UnicodeSet();
        String firstSampleFailure;
        String lastSampleFailure;

        void checkMatch(int codePoint, String pattern, String target) {
            if (!matches(pattern, target)) {
                failureSet.add(codePoint);
                if (firstSampleFailure == null) {
                    firstSampleFailure = pattern;
                }
                lastSampleFailure = pattern;
            }
        }

        boolean matches(String hexPattern, String target) {
            try {
                // use COMMENTS to get the 'worst case'
                return Pattern.compile(hexPattern, Pattern.COMMENTS).matcher(target).matches();
            } catch (Exception e) {
                return false;
            }
        }

        void showFailures() {
            System.out.format(this + " Failures: %s\n\tset:%s\n\texample1: %s\n\texampleN: %s\n",
                    failureSet.size(), failureSet, firstSampleFailure, lastSampleFailure);
        }
    }
