Hi Mark,
I guess you are asking for something like this?
char[] cc = Character.toChars(0x12345);
Matcher m = Pattern.compile("["
+ "\\u" + HEX(cc[0])
+ "\\u" + HEX(cc[1])
+ "]").matcher("");
System.out.println("find=" + m.reset("abc[" + new String(cc) +
"]efg").find());
where HEX is something like the following, so that each value comes out as a four-digit nnnn:
static String HEX(char c) {
StringBuilder sb = new StringBuilder();
Formatter fm = new Formatter(sb);
fm.format("%04x", (int)c);
return sb.toString();
}
It looks a little tedious, and you will probably also have to distinguish
BMP from supplementary code points to decide whether to feed in one UTF-16
hex value or a pair. It is just to show that you can still use the Java
Unicode escape to embed the hex values of the UTF-16 code units instead of
the "character itself". Does it qualify for RL1.1?
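The BMP-versus-supplementary branching described above could be wrapped in a single helper. This is only a minimal sketch and not part of the original exchange: the class name Utf16Escape and the method toUnicodeEscape are made up for illustration, and it uses String.format in place of the Formatter-based HEX.

```java
import java.util.regex.Pattern;

public class Utf16Escape {
    // Hypothetical helper: emits one backslash-u escape for a BMP code point,
    // or two (high surrogate, then low surrogate) for a supplementary one.
    static String toUnicodeEscape(int codePoint) {
        StringBuilder sb = new StringBuilder();
        // Character.toChars returns one UTF-16 unit for BMP, two otherwise.
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A BMP syntax character, embedded by hex value rather than literally.
        String pattern = "a" + toUnicodeEscape('|') + "b";
        System.out.println(pattern);
        System.out.println(
            Pattern.compile(pattern, Pattern.COMMENTS).matcher("a|b").matches());

        // A supplementary code point comes out as a surrogate pair of escapes.
        System.out.println(toUnicodeEscape(0x12345));
    }
}
```

Whether the two-escape form is then treated as one code point inside the pattern depends on the regex implementation, so the sketch only checks the BMP case against a matcher.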
Sure, \x{...} looks more straightforward and convenient. As I said in our
previous email exchange, I totally agree it would be a nice enhancement
for Java RegEx.
-Sherman
On 1-25-2011 05:00 PM, Mark Davis ☕ wrote:
The goal of the clause is to have a mechanism for using hex values for
character literals. That is, you should be able to take a code point
from 0 to 10FFFF, get a hex value for that, embed it in some syntax,
and concatenate it into a pattern, and have it work as a literal.
For example:
String pattern = first_part + "\\x{" + hex(myCodePoint) + "}" +
second_part; // for *some* hex notation
...
Matcher m = Pattern.compile(pattern,
Pattern.COMMENTS).matcher(target);
...
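A sketch of how that concatenation could look with a concrete hex notation. It assumes a regex engine that accepts \x{h...h}; java.util.regex does so from Java 7 onward, which postdates this thread. The class name HexBracePattern and the helpers hex and matchesAsLiteral are illustrative names, and the first and second parts are assumed to be plain letters.

```java
import java.util.regex.Pattern;

public class HexBracePattern {
    // Hypothetical stand-in for the hex(myCodePoint) call in the example above.
    static String hex(int codePoint) {
        return Integer.toHexString(codePoint);
    }

    // Builds firstPart + \x{hex} + secondPart and matches it against the
    // same text with the actual character in the middle.
    // Assumes firstPart and secondPart contain no regex syntax characters.
    static boolean matchesAsLiteral(String firstPart, int codePoint, String secondPart) {
        String pattern = firstPart + "\\x{" + hex(codePoint) + "}" + secondPart;
        String target = firstPart + new String(Character.toChars(codePoint)) + secondPart;
        return Pattern.compile(pattern, Pattern.COMMENTS).matcher(target).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesAsLiteral("a", '|', "b"));      // syntax character
        System.out.println(matchesAsLiteral("a", '#', "b"));      // comment starter under COMMENTS
        System.out.println(matchesAsLiteral("a", 0x12345, "b"));  // supplementary code point
    }
}
```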
As far as I can tell, Java really doesn't supply that capability for
non-BMP, because the \u notation doesn't work above FFFF, except
insofar as the preprocessor maps a surrogate pair in hex to literals,
which all happen to work because they aren't syntax characters.
What you can do with Java is:
1. embed the character itself, not the hex representation, which
works some of the time (fails for 18 characters; syntax
characters, as expected).
2. in constant expressions only, utilize the Java preprocessor (with
\u.... or \u....\u....).
3. for BMP characters, use "\u" + hex(myCodePoint,4)
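To make option 1 concrete, here is a small sketch of why embedding the character itself only works some of the time. The class LiteralEmbedding and the method literalMatches are illustrative names, not from the test further down; the check mirrors the "a" + s + "b" case under Pattern.COMMENTS.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class LiteralEmbedding {
    // Embeds the character itself between "a" and "b" and uses the same
    // string as both pattern and target, compiled with COMMENTS.
    static boolean literalMatches(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        String text = "a" + s + "b";
        try {
            return Pattern.compile(text, Pattern.COMMENTS).matcher(text).matches();
        } catch (PatternSyntaxException e) {
            // For example an unclosed "(" or "[" makes the pattern invalid.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(literalMatches('x'));      // ordinary letter
        System.out.println(literalMatches(0x12345));  // supplementary, not a syntax character
        System.out.println(literalMatches('#'));      // starts a comment under COMMENTS
        System.out.println(literalMatches('|'));      // alternation operator
        System.out.println(literalMatches('('));      // unclosed group
    }
}
```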
Here is a quick and dirty test; let me know if I've missed something.
*Output:*
LITERALS Failures: 18
set: [\u0009-\u000D\ #\$(-+?\[\\\^\{|]
example1: ab
exampleN: a|b
INLINE Failures: 1048576
set: [\U00010000-\U0010FFFF]
example1: a\uD800\uDC00b
exampleN: a\uDBFF\uDFFFb
INRANGE Failures: 1048576
set: [\U00010000-\U0010FFFF]
example1: a[\uD800\uDC00]b
exampleN: a[\uDBFF\uDFFF]b
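The INLINE and INRANGE counts and their sample patterns line up with the size and end points of the supplementary range. A short sketch of that arithmetic, with illustrative names (SupplementaryRange, supplementaryCount, surrogates) that do not appear in the test itself:

```java
public class SupplementaryRange {
    // Number of code points from U+10000 through U+10FFFF inclusive.
    static int supplementaryCount() {
        return Character.MAX_CODE_POINT - Character.MIN_SUPPLEMENTARY_CODE_POINT + 1;
    }

    // Hex of the two UTF-16 code units for a supplementary code point.
    // Assumes codePoint is above 0xFFFF; a BMP value has only one unit.
    static String surrogates(int codePoint) {
        char[] units = Character.toChars(codePoint);
        return String.format("%04X %04X", (int) units[0], (int) units[1]);
    }

    public static void main(String[] args) {
        System.out.println(supplementaryCount());  // size of the failure set
        System.out.println(surrogates(0x10000));   // units behind example1
        System.out.println(surrogates(0x10FFFF));  // units behind exampleN
    }
}
```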
*Code:*
public void TestRegex() {
logln("Check patterns for Unicodeset");
for (int i = 0; i <= 0x10FFFF; ++i) {
// The goal is to make a regex with hex digits, and have it match the
// corresponding character.
// We check two different environments: inline ("aXb") and in a range
// ("a[X]b").
String s = new StringBuilder().appendCodePoint(i).toString();
String hexPattern = i <= 0xFFFF ? "\\u" + Utility.hex(i,4)
: "\\u" + Utility.hex(Character.toChars(i)[0],4) +
"\\u" + Utility.hex(Character.toChars(i)[1],4);
String target = "a" + s + "b";
Failures.LITERALS.checkMatch(i, "a" + s + "b", target);
Failures.INLINE.checkMatch(i, "a" + hexPattern + "b", target);
Failures.INRANGE.checkMatch(i, "a[" + hexPattern + "]b",
target);
}
Failures.LITERALS.showFailures();
Failures.INLINE.showFailures();
Failures.INRANGE.showFailures();
}
enum Failures {
LITERALS, INLINE, INRANGE;
UnicodeSet failureSet = new UnicodeSet();
String firstSampleFailure;
String lastSampleFailure;
void checkMatch(int codePoint, String pattern, String target) {
if (!matches(pattern, target)) {
failureSet.add(codePoint);
if (firstSampleFailure == null) {
firstSampleFailure = pattern;
}
lastSampleFailure = pattern;
}
}
boolean matches(String hexPattern, String target) {
try {
// use COMMENTS to get the 'worst case'
return Pattern.compile(hexPattern,
Pattern.COMMENTS).matcher(target).matches();
} catch (Exception e) {
return false;
}
}
void showFailures() {
System.out.format(this + " Failures: %s\n\tset: %s\n\texample1: %s\n\texampleN: %s\n",
failureSet.size(), failureSet, firstSampleFailure, lastSampleFailure);
}
}