Hi all,

I have started using the regexp package because it is nice and lightweight, but found that I could not use clustering, i.e. (?:...) groups that group a subexpression without creating a backreference. I think it might be a Perl extension to regular expressions, but I found it was easy to implement in this package.

This allows the following style of matching:

    (?:\w+(?:\s\w+)+)
    Mary had a little lamb

This will match, with the only paren (0) returning the full string.

A better example is domain names (simplified here; I am not sure it complies with the relevant RFC):

    ([a-zA-Z0-9]+(?:\.[a-zA-Z0-9]+)*)
    www.test.com
    jakarta.apache.org

Both will match, with paren 0 holding the full string.

Now take the above expression and add the protocol:

    (?:\w+://)?([a-zA-Z0-9]+(?:\.[a-zA-Z0-9]+)*)
    http://www.test.com

    Paren 0 = http://www.test.com
    Paren 1 = www.test.com

Anyway, there are 14 new tests (#149 to #162) in RETest.txt that demonstrate this.

regards,
Michael

p.s. I think I stripped a bunch of spaces from the ends of lines, so there are a bunch of extra lines in the patch. I am not very familiar with using diff :)
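To illustrate the paren numbering from the protocol example in the message above, here is a minimal sketch. It is an assumption-laden stand-in: it uses java.util.regex from the JDK rather than the regexp package itself, on the grounds that java.util.regex accepts the same (?:...) syntax. The class name ClusterDemo and the helper parens() are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: java.util.regex stands in for the patched regexp package.
class ClusterDemo {
    // Hypothetical helper: returns paren 0..n of the first match,
    // or an empty array when the pattern does not match at all.
    static String[] parens(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String url = "http://www.test.com";

        // Protocol and dotted suffix are clustered: only the host is captured.
        String clustered = "(?:\\w+://)?([a-zA-Z0-9]+(?:\\.[a-zA-Z0-9]+)*)";
        // Same expression with plain grouping: every paren creates a backref.
        String grouped = "(\\w+://)?([a-zA-Z0-9]+(\\.[a-zA-Z0-9]+)*)";

        System.out.println(String.join(" | ", parens(clustered, url)));
        // prints: http://www.test.com | www.test.com
        System.out.println(String.join(" | ", parens(grouped, url)));
        // prints: http://www.test.com | http:// | www.test.com | .com
    }
}
```

With clustering, the host name stays at paren 1 whether or not the optional protocol group is present in the expression, which is the point of the third example.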
? bin
? Clustering.patch
? Clustering2.patch
? build/run-tests.sh
Index: docs/RETest.txt
===================================================================
RCS file: /home/cvspublic/jakarta-regexp/docs/RETest.txt,v
retrieving revision 1.1
diff -r1.1 RETest.txt
886a887,980
>
> #149
> (?:a)
> a
> YES
> a
>
> #150
> (?:a)
> aa
> YES
> a
>
> #151
> (?:\w)
> abc
> YES
> a
>
> #152
> (?:\w\s\w)+
> a b c
> YES
> a b
>
> #153
> (a\w)(?:,(a\w))+
> ab,ac,ad
> YES
> ab,ac,ad
> ab
> ad
>
> #154
> z(\w\s+(?:\w\s+\w)+)z
> za b bc cd dz
> YES
> za b bc cd dz
> a b bc cd d
>
> #155
> (([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*
> http://www.test.com
> YES
> http://www.test.com
> http://
> http
> .com
>
> #156
> ((?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*
> ftp://www.test.com
> YES
> ftp://www.test.com
> ftp://
> .com
>
> #157
> (([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*
> htTp://www.test.com
> YES
> htTp://www.test.com
> htTp://
> htTp
>
> #158
> (?:([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*
> FTP://www.test.com
> YES
> FTP://www.test.com
> FTP
> .com
>
> #159
> ^(?:([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*$
> http://.www.test.com
> NO
>
> #160
> ^(?:(?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*$
> FtP://www.test.com
> YES
> FtP://www.test.com
>
> #161
> ^(?:(?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*$
> FtTP://www.test.com
> NO
>
> #162
> ^(?:(?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*$
> www.test.com
> YES
> www.test.com
Index: src/java/org/apache/regexp/RE.java
===================================================================
RCS file: /home/cvspublic/jakarta-regexp/src/java/org/apache/regexp/RE.java,v
retrieving revision 1.6
diff -r1.6 RE.java
176,186c176,186
< * [:alnum:] Alphanumeric characters.
< * [:alpha:] Alphabetic characters.
< * [:blank:] Space and tab characters.
< * [:cntrl:] Control characters.
< * [:digit:] Numeric characters.
< * [:graph:] Characters that are printable and are also visible. (A space is printable, but not visible, while an `a' is both.)
< * [:lower:] Lower-case alphabetic characters.
< * [:print:] Printable characters (characters that are not control characters.)
< * [:punct:] Punctuation characters (characters that are not letter, digits, control characters, or space characters).
< * [:space:] Space characters (such as space, tab, and formfeed, to name a few).
< * [:upper:] Upper-case alphabetic characters.
---
> * [:alnum:] Alphanumeric characters.
> * [:alpha:] Alphabetic characters.
> * [:blank:] Space and tab characters.
> * [:cntrl:] Control characters.
> * [:digit:] Numeric characters.
> * [:graph:] Characters that are printable and are also visible. (A space is printable, but not visible, while an `a' is both.)
> * [:lower:] Lower-case alphabetic characters.
> * [:print:] Printable characters (characters that are not control characters.)
> * [:punct:] Punctuation characters (characters that are not letter, digits, control characters, or space characters).
> * [:space:] Space characters (such as space, tab, and formfeed, to name a few).
> * [:upper:] Upper-case alphabetic characters.
188c188
< *
---
> *
199c199
< *
---
> *
254a255
> * (?:A) Used for subexpression clustering (just like grouping but no backrefs)
399a401
>     static final char OP_OPEN_CLUSTER = '<'; // opening cluster
400a403
>     static final char OP_CLOSE_CLUSTER = '>'; // closing cluster
421c424
<     static final char POSIX_CLASS_ALPHA = 'a'; // Alphabetics
---
>     static final char POSIX_CLASS_ALPHA = 'a'; // Alphabetics
947a951,955
>
>     case OP_OPEN_CLUSTER:
>     case OP_CLOSE_CLUSTER:
>         // starting or ending the matching of a subexpression which has no backref.
>         return matchNodes( next, maxNode, idx );
Index: src/java/org/apache/regexp/RECompiler.java
===================================================================
RCS file: /home/cvspublic/jakarta-regexp/src/java/org/apache/regexp/RECompiler.java,v
retrieving revision 1.2
diff -r1.2 RECompiler.java
1191c1191
<     boolean paren = false;
---
>     int paren = -1;
1196,1198c1196,1208
<         idx++;
<         paren = true;
<         ret = node(RE.OP_OPEN, parens++);
---
>         // if its a cluster ( rather than a proper subexpression ie with backrefs )
>         if ( idx + 2 < len && pattern.charAt( idx + 1 ) == '?' && pattern.charAt( idx + 2 ) == ':' )
>         {
>             paren = 2;
>             idx += 3;
>             ret = node( RE.OP_OPEN_CLUSTER, 0 );
>         }
>         else
>         {
>             paren = 1;
>             idx++;
>             ret = node(RE.OP_OPEN, parens++);
>         }
1223c1233
<     if (paren)
---
>     if ( paren > 0 )
1233c1243,1250
<         end = node(RE.OP_CLOSE, closeParens);
---
>         if ( paren == 1 )
>         {
>             end = node(RE.OP_CLOSE, closeParens);
>         }
>         else
>         {
>             end = node( RE.OP_CLOSE_CLUSTER, 0 );
>         }
Index: src/java/org/apache/regexp/RETest.java
===================================================================
RCS file: /home/cvspublic/jakarta-regexp/src/java/org/apache/regexp/RETest.java,v
retrieving revision 1.2
diff -r1.2 RETest.java
58c58
< */
---
> */
89,90c89,90
<         //new RETest(arg);
<         test();
---
>         new RETest(arg);
>         //test();
Index: xdocs/RETest.txt
===================================================================
RCS file: /home/cvspublic/jakarta-regexp/xdocs/RETest.txt,v
retrieving revision 1.1
diff -r1.1 RETest.txt
886a887,980
>
> #149
> (?:a)
> a
> YES
> a
>
> #150
> (?:a)
> aa
> YES
> a
>
> #151
> (?:\w)
> abc
> YES
> a
>
> #152
> (?:\w\s\w)+
> a b c
> YES
> a b
>
> #153
> (a\w)(?:,(a\w))+
> ab,ac,ad
> YES
> ab,ac,ad
> ab
> ad
>
> #154
> z(\w\s+(?:\w\s+\w)+)z
> za b bc cd dz
> YES
> za b bc cd dz
> a b bc cd d
>
> #155
> (([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*
> http://www.test.com
> YES
> http://www.test.com
> http://
> http
> .com
>
> #156
> ((?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*
> ftp://www.test.com
> YES
> ftp://www.test.com
> ftp://
> .com
>
> #157
> (([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*
> htTp://www.test.com
> YES
> htTp://www.test.com
> htTp://
> htTp
>
> #158
> (?:([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*
> FTP://www.test.com
> YES
> FTP://www.test.com
> FTP
> .com
>
> #159
> ^(?:([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*$
> http://.www.test.com
> NO
>
> #160
> ^(?:(?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*$
> FtP://www.test.com
> YES
> FtP://www.test.com
>
> #161
> ^(?:(?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*$
> FtTP://www.test.com
> NO
>
> #162
> ^(?:(?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*$
> www.test.com
> YES
> www.test.com
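
The RECompiler.java hunk in the patch tells a cluster from a capturing group by looking two characters past the opening paren for "?:", and only bumps the paren counter for real groups. The sketch below isolates that one check so it can be tried on the test patterns. It is not code from the package: the class ClusterScan and the method countCapturing() are hypothetical, and the scanner is deliberately simplified (it skips escaped characters and [...] classes, but does not handle a literal ']' placed first in a class).

```java
// Hypothetical stand-alone scanner, not part of the regexp package.
class ClusterScan {
    // Counts the parens that would receive a backreference number,
    // using the same "(?:" lookahead test as the patch.
    static int countCapturing(String pattern) {
        int count = 0;
        boolean inClass = false;
        int len = pattern.length();
        for (int idx = 0; idx < len; idx++) {
            char c = pattern.charAt(idx);
            if (c == '\\') {
                idx++;              // skip the escaped character
                continue;
            }
            if (inClass) {
                if (c == ']') {
                    inClass = false;
                }
                continue;           // parens inside [...] are literals
            }
            if (c == '[') {
                inClass = true;
                continue;
            }
            if (c == '(') {
                boolean cluster = idx + 2 < len
                        && pattern.charAt(idx + 1) == '?'
                        && pattern.charAt(idx + 2) == ':';
                if (!cluster) {
                    count++;        // a proper subexpression with a backref
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Patterns taken from tests #149, #153 and #160 in the patch.
        System.out.println(countCapturing("(?:a)"));
        System.out.println(countCapturing("(a\\w)(?:,(a\\w))+"));
        System.out.println(countCapturing(
            "^(?:(?:[hH][tT]{2}[pP]|[fF][tT][pP]):\\/\\/)?[a-zA-Z0-9\\-]+(?:\\.[a-zA-Z0-9\\-]+)*$"));
    }
}
```

The counts line up with the number of paren lines listed after the full match in the corresponding tests: none for #149, two for #153, none for #160.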