On 01/26/2011 11:50 AM, Mark Davis ☕ wrote:
> I guess you are asking for something like?
I'm not asking for that. What I'm saying is that, as far as I can tell,
there is no way in Java to meet the terms of RL1.1, because no syntax
accepts hex numbers as literals for values above FFFF. That is, if you
supply "abc\\uD800\\uDC00def", the regex fails.
The code was my attempt to try to get something to work even using
separate surrogates (which was not the intent of RL1.1), but even that
failed. Maybe there is another way to do it?
Mark
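The check described above can be reproduced in isolation. What follows is a minimal sketch, not code from this thread: the class SupplementaryEscapeCheck and its methods are hypothetical names, and the brace form of the hex escape is an assumption taken from the java.util.regex.Pattern documentation for Java 7 and later, so it would be expected to throw PatternSyntaxException on older releases. On the JDK 6 release discussed here, the surrogate-pair check would be expected to report false.

```java
import java.util.regex.Pattern;

public class SupplementaryEscapeCheck {
    // Build regex-level hex escapes for one code point: a single backslash-u
    // escape in the BMP, or two of them (high then low surrogate) above U+FFFF.
    public static String surrogateEscapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    // Code-point form of the hex escape. Assumption: only Java 7+ accepts it.
    public static String braceEscape(int codePoint) {
        return "\\x{" + Integer.toHexString(codePoint) + "}";
    }

    // Does "abc<escape>def" match the same text containing the real character?
    public static boolean matches(String escape, int codePoint) {
        String target = "abc" + new String(Character.toChars(codePoint)) + "def";
        return Pattern.compile("abc" + escape + "def").matcher(target).matches();
    }

    public static void main(String[] args) {
        int cp = 0x10000; // encoded in UTF-16 as the pair D800 DC00
        System.out.println(surrogateEscapes(cp));
        System.out.println("surrogate escapes match: " + matches(surrogateEscapes(cp), cp));
        System.out.println("brace escape matches:    " + matches(braceEscape(cp), cp));
    }
}
```

Building the escapes with String.format keeps the sketch free of the internal sun.text.normalizer.Utility helper that the full test case relies on.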
//
Oh, I see the problem. Obviously I have been working on jdk7 too long and
forgot that the latest release is still 6 :-( There is indeed a bug in the
previous implementation, which I fixed in 7 a long time ago (I mentioned
this in one of the early emails but was not specific, my apologies). It
should probably be backported to a 6 update release asap. The test case runs
well on 7 (the "failures" in literals are expected) with the following
output. I modified your test case "slightly": the UnicodeSet class in our
normalizer package does not appear to have size(), so I replaced it with a
normal hash set.
-Sherman
------------------------------------------------------------------
LITERALS Failures: 18
set: [9, 10, 11, 12, 13, 32, 35, 36, 40, 41, 42, 43, 63, 91, 92,
94, 123, 124]
example1: a b
exampleN: a|b
INLINE Failures: 0
set: []
example1: null
exampleN: null
INRANGE Failures: 0
set: []
example1: null
exampleN: null
-----------------------------------------------------------------------
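The 18 LITERALS failures above are the ASCII whitespace characters and regex metacharacters: under Pattern.COMMENTS whitespace is ignored and # starts a comment, while characters such as ( or | are parsed as syntax rather than matched. A minimal sketch of that effect follows; the class LiteralUnderComments and its methods are hypothetical names, and Pattern.quote is shown only as one possible way to force a literal reading. Whether quoting rescues every one of the 18 code points is not verified here.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class LiteralUnderComments {
    // Same idea as the LITERALS case: the raw character goes straight into the pattern.
    public static boolean rawMatches(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return compilesAndMatches("a" + s + "b", "a" + s + "b");
    }

    // Variant: quote the character first so the parser treats it as a literal.
    public static boolean quotedMatches(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return compilesAndMatches("a" + Pattern.quote(s) + "b", "a" + s + "b");
    }

    static boolean compilesAndMatches(String pattern, String target) {
        try {
            return Pattern.compile(pattern, Pattern.COMMENTS).matcher(target).matches();
        } catch (PatternSyntaxException e) {
            return false; // for example, an unbalanced "("
        }
    }

    public static void main(String[] args) {
        // One ordinary letter and three of the code points from the failure set
        for (int cp : new int[] {'x', '|', '(', '#'}) {
            System.out.println(cp + " raw=" + rawMatches(cp) + " quoted=" + quotedMatches(cp));
        }
    }
}
```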
import java.util.regex.*;
import java.util.*;
import sun.text.normalizer.*;
public class TestRegex2 {
public static void main(String[] args) {
System.out.println("Check patterns for Unicodeset");
for (int i = 0; i <= 0x10FFFF; ++i) {
// The goal is to make a regex with hex digits, and have it match
// the corresponding character.
// We check two different environments: inline ("aXb") and in a
// range ("a[X]b").
String s = new StringBuilder().appendCodePoint(i).toString();
String hexPattern = i <= 0xFFFF ? "\\u" + Utility.hex(i,4)
: "\\u" + Utility.hex(Character.toChars(i)[0],4) +
"\\u" + Utility.hex(Character.toChars(i)[1],4);
String target = "a" + s + "b";
Failures.LITERALS.checkMatch(i, "a" + s + "b", target);
Failures.INLINE.checkMatch(i, "a" + hexPattern + "b", target);
Failures.INRANGE.checkMatch(i, "a[" + hexPattern + "]b", target);
}
Failures.LITERALS.showFailures();
Failures.INLINE.showFailures();
Failures.INRANGE.showFailures();
}
static enum Failures {
LITERALS, INLINE, INRANGE;
Set<Integer> failureSet = new LinkedHashSet<Integer>();
String firstSampleFailure;
String lastSampleFailure;
void checkMatch(int codePoint, String pattern, String target) {
if (!matches(pattern, target)) {
failureSet.add(codePoint);
if (firstSampleFailure == null) {
firstSampleFailure = pattern;
}
lastSampleFailure = pattern;
}
}
boolean matches(String hexPattern, String target) {
try {
// use COMMENTS to get the 'worst case'
return Pattern.compile(hexPattern,
Pattern.COMMENTS).matcher(target).matches();
} catch (Exception e) {
return false;
}
}
void showFailures() {
// The format string must stay on one line: a string literal cannot span lines.
System.out.format(this + " Failures: %s\n\tset: %s\n\texample1: %s\n\texampleN: %s\n",
failureSet.size(), failureSet, firstSampleFailure, lastSampleFailure);
}
}