The Javadocs for regexp 1.2 state that \w matches a word character, i.e. an
alphanumeric character or "_".  The implementation does not behave this way, as
seen by running the following:

try {
    RE reTest = new RE("\\w");
    System.out.println(reTest.match("a"));
    System.out.println(reTest.match("1"));
    System.out.println(reTest.match("!"));
    System.out.println(reTest.match("_"));
} catch (Exception e) { e.printStackTrace(); }


This block of code outputs the following:

true
true
false
false


Notice that the final match of "\w" on "_" fails.  The match for a word boundary
is incorrect for the same reason: "_" is not treated as a word character, so a
boundary is reported between "g" and "_":

try {
    RE reTest = new RE("reg\\b");
    System.out.println(reTest.match("reg exp"));
    System.out.println(reTest.match("reg_exp"));
} catch (Exception e) { e.printStackTrace(); }


Displays:

true
true


Attached is a patch to RE.java that treats "_" as an alphanumeric character.  With
this patch, the final result in each of the two examples above is flipped:  "\w"
matched against "_" returns true, and "reg\b" matched against "reg_exp" returns
false.  I'd like to suggest that this patch be integrated into the next release of
regexp.
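
For readers who want to check the intended behaviour without building regexp, here is a minimal standalone sketch of the character test and boundary test that the patch introduces. The class and method names (WordCharSketch, isWordChar, isWordBoundary) are made up for illustration and do not exist in RE.java. The sketch also assumes the match starts at index 0, whereas the real code compares against getParenStart(0).

```java
class WordCharSketch {
    // Hypothetical helper: a "word character" as the Javadocs describe \w,
    // i.e. a letter, a digit, or an underscore.
    static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // Hypothetical helper: true if there is a word boundary just before
    // position idx in s. As in the patched code, the start and end of the
    // input are treated as if they were '\n'.
    static boolean isWordBoundary(String s, int idx) {
        char last = (idx == 0) ? '\n' : s.charAt(idx - 1);
        char next = (idx >= s.length()) ? '\n' : s.charAt(idx);
        return isWordChar(last) != isWordChar(next);
    }

    public static void main(String[] args) {
        System.out.println(isWordChar('a'));              // true
        System.out.println(isWordChar('1'));              // true
        System.out.println(isWordChar('!'));              // false
        System.out.println(isWordChar('_'));              // true
        System.out.println(isWordBoundary("reg exp", 3)); // true
        System.out.println(isWordBoundary("reg_exp", 3)); // false
    }
}
```

The last line of each group corresponds to the case that currently gives the wrong answer in regexp 1.2.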


--------------------
% diff -ub RE.java.orig RE.java >patchfile.txt
% cat patchfile.txt
--- RE.java.orig        Fri May 11 09:17:00 2001
+++ RE.java     Fri May 11 10:39:29 2001
@@ -1048,7 +1048,9 @@
                             {
                                 char cLast = ((idx == getParenStart(0)) ? '\n' : search.charAt(idx - 1));
                                 char cNext = ((search.isEnd(idx)) ? '\n' : search.charAt(idx));
-                                if ((Character.isLetterOrDigit(cLast) == Character.isLetterOrDigit(cNext)) == (opdata == E_BOUND))
+                                boolean bLast = Character.isLetterOrDigit(cLast) || cLast == '_';
+                                boolean bNext = Character.isLetterOrDigit(cNext) || cNext == '_';
+                                if ((bLast == bNext) == (opdata == E_BOUND))
                                 {
                                     return -1;
                                 }
@@ -1074,7 +1076,8 @@
                             {
                                 case E_ALNUM:
                                 case E_NALNUM:
-                                    if (!(Character.isLetterOrDigit(search.charAt(idx)) == (opdata == E_ALNUM)))
+                                    char ch = search.charAt(idx);
+                                    if (!((Character.isLetterOrDigit(ch) || ch =='_') == (opdata == E_ALNUM)))
                                     {
                                         return -1;
                                     }
@@ -1178,7 +1181,8 @@
                         switch (opdata)
                         {
                             case POSIX_CLASS_ALNUM:
-                                if (!Character.isLetterOrDigit(search.charAt(idx)))
+                                char ch = search.charAt(idx);
+                                if (!(Character.isLetterOrDigit(ch) || ch == '_'))
                                 {
                                     return -1;
                                 }
--------------------

 -- Eric

