Hi Mark,

I guess you are asking for something like this?

        char[] cc = Character.toChars(0x12345);
        Matcher m = Pattern.compile("["
                                     + "\\u" + HEX(cc[0])
                                     + "\\u" + HEX(cc[1])
                                     + "]").matcher("");
        System.out.println("find=" + m.reset("abc[" + new String(cc) + "]efg").find());

where HEX should be something like the following, so that each value comes out in the four-digit nnnn form:

    static String HEX(char c) {
        StringBuilder sb = new StringBuilder();
        Formatter fm = new Formatter(sb);
        fm.format("%04x", (int)c);
        return sb.toString();
    }

It looks a little tedious, and you will probably also have to distinguish BMP from supplementary code points to decide whether to feed in one UTF-16 hex value or a pair. I just want to show that you can still use the Java Unicode escape to embed the hex values of the UTF-16 code units instead of the "character itself". Does
it qualify for RL1.1?
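
The BMP-versus-supplementary distinction can be folded into a single helper. This is only a minimal sketch: the class name RegexHexEscape and the method toRegexEscape are hypothetical, and whether the resulting pair of escapes is matched as one supplementary character depends on the regex engine version in use, so the program prints the result rather than assuming it.

```java
import java.util.regex.Pattern;

class RegexHexEscape {
    // Hypothetical helper: emits one four-digit escape for a BMP code point
    // and two (high surrogate, low surrogate) for a supplementary one.
    static String toRegexEscape(int codePoint) {
        StringBuilder sb = new StringBuilder();
        // Character.toChars returns one char for BMP, two for supplementary.
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int cp = 0x12345;
        String escaped = toRegexEscape(cp);
        System.out.println("escape=" + escaped);

        // Same check as above: does a class built from the escapes find the character?
        String target = "abc[" + new String(Character.toChars(cp)) + "]efg";
        boolean found = Pattern.compile("[" + escaped + "]").matcher(target).find();
        System.out.println("find=" + found);
    }
}
```

Looping over Character.toChars avoids a separate branch for the two cases, since the array length already encodes which case applies.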

Sure, \x{...} looks more straightforward and convenient. As I said in our previous email
exchange, I totally agree it would be a nice enhancement for Java RegEx.

-Sherman

On 1-25-2011 17:00 05:00 PM, Mark Davis ☕ wrote:
The goal of the clause is to have a mechanism for using hex values for character literals. That is, you should be able to take a code point from 0 to 10FFFF, get a hex value for that, embed it in some syntax, and concatenate it into a pattern, and have it work as a literal.

For example:

    String pattern = first_part + "\\x{" + hex(myCodePoint) + "}"
            + second_part; // for *some* hex notation
    ...
    Matcher m = Pattern.compile(pattern, Pattern.COMMENTS).matcher(target);
    ...
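
For comparison, here is a self-contained sketch of that pattern-building step. It assumes a runtime whose java.util.regex.Pattern accepts the \x{h...h} notation (Java 7 and later do; earlier releases do not), and the names HexBraceLiteral, hexLiteral and matchesAsLiteral are made up for illustration.

```java
import java.util.regex.Pattern;

class HexBraceLiteral {
    // Hypothetical helper: wraps a code point's hex value in the brace notation.
    static String hexLiteral(int codePoint) {
        return "\\x{" + Integer.toHexString(codePoint) + "}";
    }

    // Builds "a<hex literal>b" and checks it against "a<character>b",
    // using COMMENTS so that whitespace and '#' are the hard cases.
    static boolean matchesAsLiteral(int codePoint) {
        String target = "a" + new String(Character.toChars(codePoint)) + "b";
        String pattern = "a" + hexLiteral(codePoint) + "b";
        return Pattern.compile(pattern, Pattern.COMMENTS).matcher(target).matches();
    }

    public static void main(String[] args) {
        int[] samples = {0x23, 0x7C, 0x12345, 0x10FFFF};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %s%n", cp, matchesAsLiteral(cp));
        }
    }
}
```

The point of the design is that the caller never needs to know whether the code point is a syntax character or lies outside the BMP; the same concatenation covers both.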


As far as I can tell, Java really doesn't supply that capability for non-BMP, because the \u notation doesn't work above FFFF, except insofar as the preprocessor maps a surrogate pair in hex to literals, which all happen to work because they aren't syntax characters.
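
To make the preprocessor point concrete, the sketch below contrasts an escape that the Java compiler resolves with one that is left for the regex engine. The class and constant names are hypothetical; only standard String methods are used.

```java
class EscapeLevels {
    // Resolved by the compiler: the string already holds the two surrogate code units.
    static final String COMPILED = "\uD808\uDF45";

    // Left for the regex engine: the string holds twelve ASCII characters
    // (backslash, 'u', four hex digits, twice).
    static final String DEFERRED = "\\uD808\\uDF45";

    public static void main(String[] args) {
        System.out.println("compiled length = " + COMPILED.length());
        System.out.println("deferred length = " + DEFERRED.length());
        System.out.println("compiled code point = U+"
                + Integer.toHexString(COMPILED.codePointAt(0)).toUpperCase());
    }
}
```

Only the second form can be assembled at run time from a computed hex value, which is why the first form is limited to constant expressions.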

What you can do with Java is:

   1. embed the character itself, not the hex representation, which
      works some of the time (fails for 18 characters; syntax
      characters, as expected).
   2. in constant expressions only, utilize the Java preprocessor (with
      \u.... or \u....\u....).
   3. for BMP characters, use "\u" + hex(myCodePoint,4).

Here is a quick and dirty test; let me know if I've missed something.

*Output:*

LITERALS Failures: 18
        set: [\u0009-\u000D\ #\$(-+?\[\\\^\{|]
        example1: ab
        exampleN: a|b

INLINE Failures: 1048576
        set: [\U00010000-\U0010FFFF]
        example1: a\uD800\uDC00b
        exampleN: a\uDBFF\uDFFFb

INRANGE Failures: 1048576
        set: [\U00010000-\U0010FFFF]
        example1: a[\uD800\uDC00]b
        exampleN: a[\uDBFF\uDFFF]b


*Code:*

    public void TestRegex() {
        logln("Check patterns for Unicodeset");

        for (int i = 0; i <= 0x10FFFF; ++i) {
            // The goal is to make a regex with hex digits, and have it match the corresponding character
            // We check two different environments: inline ("aXb") and in a range ("a[X]b")
            String s = new StringBuilder().appendCodePoint(i).toString();

            String hexPattern = i <= 0xFFFF
                    ? "\\u" + Utility.hex(i, 4)
                    : "\\u" + Utility.hex(Character.toChars(i)[0], 4)
                            + "\\u" + Utility.hex(Character.toChars(i)[1], 4);

            String target = "a" + s + "b";

            Failures.LITERALS.checkMatch(i, "a" + s + "b", target);
            Failures.INLINE.checkMatch(i, "a" + hexPattern + "b", target);
            Failures.INRANGE.checkMatch(i, "a[" + hexPattern + "]b", target);
        }

        Failures.LITERALS.showFailures();
        Failures.INLINE.showFailures();
        Failures.INRANGE.showFailures();
    }


    enum Failures {
        LITERALS, INLINE, INRANGE;

        UnicodeSet failureSet = new UnicodeSet();
        String firstSampleFailure;
        String lastSampleFailure;

        void checkMatch(int codePoint, String pattern, String target) {
            if (!matches(pattern, target)) {
                failureSet.add(codePoint);
                if (firstSampleFailure == null) {
                    firstSampleFailure = pattern;
                }
                lastSampleFailure = pattern;
            }
        }

        boolean matches(String hexPattern, String target) {
            try {
                // use COMMENTS to get the 'worst case'
                return Pattern.compile(hexPattern, Pattern.COMMENTS).matcher(target).matches();
            } catch (Exception e) {
                return false;
            }
        }

        void showFailures() {
            System.out.format(this + " Failures: %s\n\tset: %s\n\texample1: %s\n\texampleN: %s\n",
                    failureSet.size(), failureSet, firstSampleFailure, lastSampleFailure);
        }
    }

