The Javadocs for regexp 1.2 state that \w matches a word character, that is, an
alphanumeric character or "_". The implementation does not behave this way, as the
following shows:
try {
RE reTest = new RE("\\w");
System.out.println(reTest.match("a"));
System.out.println(reTest.match("1"));
System.out.println(reTest.match("!"));
System.out.println(reTest.match("_"));
} catch (Exception e) { e.printStackTrace(); }
This block of code outputs the following:
true
true
false
false
Notice that the final match of "\w" on "_" fails. Similarly, the match for a word
boundary is incorrect:
try {
RE reTest = new RE("reg\\b");
System.out.println(reTest.match("reg exp"));
System.out.println(reTest.match("reg_exp"));
} catch (Exception e) { e.printStackTrace(); }
Displays:
true
true
Attached is a patch to RE.java that treats "_" as an alphanumeric character. With
this patch, the final match in each of the two examples above is flipped: "\w"
matched against "_" returns true, and "reg\b" matched against "reg_exp" returns
false. I'd like to suggest that this patch be integrated into the next release of
regexp.
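
To make the intended semantics concrete, here is a minimal standalone sketch of the rule the patch applies. It does not use regexp at all. The names WordCharSketch, isWordChar, and isWordBoundary are hypothetical and do not exist in RE.java; only the boolean logic mirrors the patched code.

```java
// Standalone sketch of the patched rule; none of these names exist in RE.java.
class WordCharSketch {

    // A word character is a letter, a digit, or '_' (the patched definition of \w).
    static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // A word boundary (\b) sits at idx when exactly one of the neighbouring
    // characters is a word character. As in RE.java, the start and end of
    // the input are treated as '\n', which is not a word character.
    static boolean isWordBoundary(String s, int idx) {
        char cLast = (idx == 0) ? '\n' : s.charAt(idx - 1);
        char cNext = (idx >= s.length()) ? '\n' : s.charAt(idx);
        return isWordChar(cLast) != isWordChar(cNext);
    }

    public static void main(String[] args) {
        System.out.println(isWordChar('a'));              // true
        System.out.println(isWordChar('1'));              // true
        System.out.println(isWordChar('!'));              // false
        System.out.println(isWordChar('_'));              // true
        System.out.println(isWordBoundary("reg exp", 3)); // true
        System.out.println(isWordBoundary("reg_exp", 3)); // false
    }
}
```

Under this rule there is no boundary between "g" and "_" in "reg_exp", which is why "reg\b" should not match there.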
--------------------
% diff -ub RE.java.orig RE.java >patchfile.txt
% cat patchfile.txt
--- RE.java.orig Fri May 11 09:17:00 2001
+++ RE.java Fri May 11 10:39:29 2001
@@ -1048,7 +1048,9 @@
{
char cLast = ((idx == getParenStart(0)) ? '\n' : search.charAt(idx - 1));
char cNext = ((search.isEnd(idx)) ? '\n' : search.charAt(idx));
- if ((Character.isLetterOrDigit(cLast) == Character.isLetterOrDigit(cNext)) == (opdata == E_BOUND))
+ boolean bLast = Character.isLetterOrDigit(cLast) || cLast == '_';
+ boolean bNext = Character.isLetterOrDigit(cNext) || cNext == '_';
+ if ((bLast == bNext) == (opdata == E_BOUND))
{
return -1;
}
@@ -1074,7 +1076,8 @@
{
case E_ALNUM:
case E_NALNUM:
- if (!(Character.isLetterOrDigit(search.charAt(idx)) == (opdata == E_ALNUM)))
+ char ch = search.charAt(idx);
+ if (!((Character.isLetterOrDigit(ch) || ch =='_') == (opdata == E_ALNUM)))
{
return -1;
}
@@ -1178,7 +1181,8 @@
switch (opdata)
{
case POSIX_CLASS_ALNUM:
- if (!Character.isLetterOrDigit(search.charAt(idx)))
+ char ch = search.charAt(idx);
+ if (!(Character.isLetterOrDigit(ch) || ch == '_'))
{
return -1;
}
--------------------
-- Eric