dfs         01/05/17 12:00:58

Modified:    src/java/org/apache/oro/text/regex package.html
Log:
Added description of supported Perl5 regular expression syntax from
the old OROMatcher user's guide. This will be moved into a new user's
guide.

Revision  Changes    Path
1.2       +130 -1    jakarta-oro/src/java/org/apache/oro/text/regex/package.html

Index: package.html
===================================================================
RCS file: /home/cvs/jakarta-oro/src/java/org/apache/oro/text/regex/package.html,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -r1.1 -r1.2
--- package.html 2000/07/23 23:08:54 1.1
+++ package.html 2001/05/17 19:00:55 1.2
@@ -1,7 +1,136 @@
-<!-- $Id: package.html,v 1.1 2000/07/23 23:08:54 jon Exp $ -->
+<!-- $Id: package.html,v 1.2 2001/05/17 19:00:55 dfs Exp $ -->
 <body>
 This package used to be the OROMatcher library and provides both
 generic regular expression interfaces and Perl5 regular expression
 compatible implementation classes.
+
+<p>
+<em>Note: The following information will be moved into the user's guide.</em>
+</p>
+
+<h1> Perl5 regular expressions </h1>
+</a>
+<p>
+Here we summarize the syntax of Perl5.003 regular expressions, all of
+which is supported by the Perl5 classes in this package. However, for
+a definitive reference, you should consult the
+<a href="http://www.perl.org/CPAN/doc/manual/html/pod/perlre.html">
+<code>perlre</code> man page </a>
+that accompanies the Perl5 distribution and also the book
+<em> Programming Perl, 2nd Edition </em> from O'Reilly & Associates.
+We are working toward implementing the features added after Perl5.003
+up to and including Perl 5.6. Please remember, we only guarantee
+support for Perl5.003 expressions in version 2.0.
+
+<p>
+<ul>
+<li> Alternatives separated by |
+<li> Quantified atoms
+  <dl compact>
+  <dt> {n,m} <dd> Match at least n but not more than m times.
+  <dt> {n,} <dd> Match at least n times.
+  <dt> {n} <dd> Match exactly n times.
+  <dt> * <dd> Match 0 or more times.
+  <dt> + <dd> Match 1 or more times.
+  <dt> ? <dd> Match 0 or 1 times.
+  </dl>
+ <li> Atoms
+  <ul>
+  <li> regular expression within parentheses
+  <li> a . matches everything except \n
+  <li> a ^ is a null token matching the beginning of a string or line
+       (i.e., the position right after a newline or right before
+       the beginning of a string)
+  <li> a $ is a null token matching the end of a string or line
+       (i.e., the position right before a newline or right after
+       the end of a string)
+  <li> Character classes (e.g., [abcd]) and ranges (e.g. [a-z])
+    <ul>
+    <li> Special backslashed characters work within a character
+         class (except for backreferences and boundaries).
+    <li> \b is backspace inside a character class
+    </ul>
+  <li> Special backslashed characters
+    <dl compact>
+    <dt> \b <dd> null token matching a word boundary (\w on one side
+                 and \W on the other)
+    <dt> \B <dd> null token matching a boundary that isn't a
+                 word boundary
+    <dt> \A <dd> Match only at beginning of string
+    <dt> \Z <dd> Match only at end of string (or before newline
+                 at the end)
+    <dt> \n <dd> newline
+    <dt> \r <dd> carriage return
+    <dt> \t <dd> tab
+    <dt> \f <dd> formfeed
+    <dt> \d <dd> digit [0-9]
+    <dt> \D <dd> non-digit [^0-9]
+    <dt> \w <dd> word character [0-9a-z_A-Z]
+    <dt> \W <dd> a non-word character [^0-9a-z_A-Z]
+    <dt> \s <dd> a whitespace character [ \t\n\r\f]
+    <dt> \S <dd> a non-whitespace character [^ \t\n\r\f]
+    <dt> \xnn <dd> hexadecimal representation of character
+    <dt> \cD <dd> matches the corresponding control character
+    <dt> \nn or \nnn <dd> octal representation of character
+                 unless a backreference. a
+    <dt> \1, \2, \3, etc. <dd> match whatever the first, second,
+         third, etc. parenthesized group matched. This is called a
+         backreference. If there is no corresponding group, the
+         number is interpreted as an octal representation of a character.
+    <dt> \0 <dd> matches null character
+    <dt> Any other backslashed character matches itself
+    </dl>
+  </ul>
+ <li> Expressions within parentheses are matched as subpattern groups
+      and saved for use by certain methods.
+ </ul>
+
+<p>
+By default, a quantified subpattern is <em> greedy </em>.
+In other words it matches as many times as possible without causing
+the rest of the pattern not to match. To change the quantifiers
+to match the minimum number of times possible, without
+causing the rest of the pattern not to match, you may use
+a "?" right after the quantifier.
+
+<dl compact>
+<dt> *? <dd> Match 0 or more times
+<dt> +? <dd> Match 1 or more times
+<dt> ?? <dd> Match 0 or 1 time
+<dt> {n}? <dd> Match exactly n times
+<dt> {n,}? <dd> Match at least n times
+<dt> {n,m}? <dd> Match at least n but not more than m times
+</dl>
+
+<p>
+<b> Perl5 extended regular expressions </b> are fully supported.
+
+<dl compact>
+<dt> (?#text) <dd> An embedded comment causing text to be ignored.
+<dt> (?:regexp) <dd> Groups things like "()" but doesn't cause the
+                     group match to be saved.
+<dt> (?=regexp) <dd>
+                     A zero-width positive lookahead assertion. For
+                     example, \w+(?=\s) matches a word followed by
+                     whitespace, without including whitespace in the
+                     MatchResult.
+
+<dt> (?!regexp) <dd>
+                     A zero-width negative lookahead assertion. For
+                     example foo(?!bar) matches any occurrence of
+                     "foo" that isn't followed by "bar". Remember
+                     that this is a zero-width assertion, which means
+                     that a(?!b)d will match ad because a is followed
+                     by a character that is not b (the d) and a d
+                     follows the zero-width assertion.
+
+
+<dt> (?imsx) <dd> One or more embedded pattern-match modifiers.
+                  i enables case insensitivity, m enables multiline
+                  treatment of the input, s enables single line treatment
+                  of the input, and x enables extended whitespace comments.
+</ul>
+
+
 </body>
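
The greedy and minimal quantifiers, lookahead assertions, embedded modifiers, and backreferences described in the added documentation can be tried out with a small sketch like the one below. It is an illustration only: it uses the JDK's java.util.regex classes as a stand-in rather than the Perl5 classes in this package, on the assumption that both follow Perl5 semantics for these particular constructs. The class name QuantifierSketch and the helper firstMatch are hypothetical names introduced here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: java.util.regex is used as a stand-in for the
// Perl5 classes of this package (an assumption, not the ORO API).
public class QuantifierSketch {

    // Returns the first substring of input matched by regex, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<b>one</b><b>two</b>";

        // Greedy: .* takes as much as it can while still letting </b> match.
        System.out.println(firstMatch("<b>.*</b>", html));   // <b>one</b><b>two</b>

        // Minimal: .*? takes as little as it can while still letting </b> match.
        System.out.println(firstMatch("<b>.*?</b>", html));  // <b>one</b>

        // Positive lookahead: the whitespace is required but not part of the match.
        System.out.println(firstMatch("\\w+(?=\\s)", "hello world"));  // hello

        // Negative lookahead: skips the "foo" in "foobar", matches the one in "foobaz".
        System.out.println(firstMatch("foo(?!bar)", "foobar foobaz")); // foo

        // Zero-width: a(?!b)d matches "ad" because the lookahead consumes nothing.
        System.out.println(firstMatch("a(?!b)d", "ad"));               // ad

        // Embedded modifier: (?i) enables case-insensitive matching.
        System.out.println(firstMatch("(?i)perl", "PERL5"));           // PERL

        // Backreference: \1 must match the same text the first group matched.
        System.out.println(firstMatch("(\\w)\\1", "hello"));           // ll
    }
}
```

With the Perl5 classes themselves, the same patterns would be compiled and matched through this package's compiler and matcher classes rather than through java.util.regex.Pattern.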
