In this message I cover only those errors made in the final section ("Comparison to Perl 5") of:

    http://download.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

I really hope no one is offended by this. I don't mean to be a
nitpicker. Technical errors in the documentation should be very, very
easy to correct, since no code change is required.

========================================

> Comparison to Perl 5

> The Pattern engine performs traditional NFA-based matching
> with ordered alternation as occurs in Perl 5.

> Perl constructs not supported by this class:

>   * The conditional constructs (?{X}) and (?(condition)X|Y),

That should instead read:

    The conditional constructs (?(condition)X) and (?(condition)X|Y),

>   * The embedded code constructs (?{code}) and (??{code}),

>   * The embedded comment syntax (?#comment), and

>   * The preprocessing operations \l \u, \L, and \U.

Those entries are correct as far as I can see; Java supports none
of them.

There is quite a bit missing from the list of Perl constructs
unsupported by this class.

    * Perl regex escapes: \x{...}, \R, \h, \H, \v, \V, \X,
      \N, \N{...}, \K, and recently \o{...}.
      [NB: My rewrite library covers the top row.]

    * Relative buffers like \g{-2} for $-2, or the \g{NAME}
      alias for a named backref \k<NAME>.

    * The branch-reset operator: (?|...)

    * Buffer recursion (?0) (?1) (?&NAME) etc. to allow recursive
      regexes; e.g. \((?:[^()]*+|(?0))*\) matches nested parens.

    * Non-executing definition-only blocks via (?(DEFINE)...) to
      allow the separation of execution from declaration. See
      the post-sig example.

    * Backtracking control verbs like (*MARK:NAME), (*FAIL), (*SKIP)

========================================

> Constructs supported by this class but not by Perl:

>   * Possessive quantifiers, which greedily match as much as
>     they can and do not back off, even when doing so would
>     allow the overall match to succeed.

This is not true. Perl understands the same possessive quantifiers
that Java does.

>   * Character-class union and intersection as described above.

True. In Perl you have to use lookahead assertions to effect the
same end.
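
To make the union/intersection point concrete, here is a minimal Java
sketch (not taken from the Pattern documentation). It tests the same
condition, "a lowercase ASCII letter that is not a vowel", once with
Java's && class intersection and once with the lookahead form that a
Perl pattern would need. The class and method names are made up for
illustration.

```java
import java.util.regex.Pattern;

class ClassIntersectionDemo {
    // Java-only syntax: [a-z] intersected with "anything but a vowel".
    static final Pattern INTERSECTION =
        Pattern.compile("[a-z&&[^aeiou]]+");

    // Portable form: a lookahead restricts each character to [a-z]
    // before the negated class consumes it. Perl accepts this as well.
    static final Pattern LOOKAHEAD =
        Pattern.compile("(?:(?=[a-z])[^aeiou])+");

    static boolean viaIntersection(String s) {
        return INTERSECTION.matcher(s).matches();
    }

    static boolean viaLookahead(String s) {
        return LOOKAHEAD.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"rhythm", "regex", "XYZ"}) {
            System.out.println(s + ": intersection=" + viaIntersection(s)
                + " lookahead=" + viaLookahead(s));
        }
    }
}
```

Both forms should agree on every input; the lookahead version is just
wordier, which is the price paid when the intersection syntax is not
available.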

> Notable differences from Perl:

I would certainly put these two at the very front of this section:

    * Perl's charclass shortcuts all work **VERY DIFFERENTLY** from
      Java's, including \w \W \s \S \d \D \b \B.
      [NOTE: my rewrite library fixes this.]

    * Perl supports all official Unicode properties, and follows all
      strong recommendations in tr18, whereas Java does neither.

>   * In Perl, \1 through \9 are always interpreted as back
>     references; a backslash-escaped number greater than 9 is
>     treated as a back reference if at least that many
>     subexpressions exist, otherwise it is interpreted, if
>     possible, as an octal escape. In this class octal escapes
>     must always begin with a zero. In this class, \1 through \9
>     are always interpreted as back references, and a larger
>     number is accepted as a back reference if at least that
>     many subexpressions exist at that point in the regular
>     expression, otherwise the parser will drop digits until the
>     number is smaller or equal to the existing number of groups
>     or it is one digit.

I think it more important to state that Perl does not require a
leading 0, and so \377 is the octal escape for 0xFF. BTW, the new
\o{...} is unambiguously an octal escape just as \g{...} is
unambiguously a backref group.

>   * Perl uses the g flag to request a match that resumes where
>     the last match left off. This functionality is provided
>     implicitly by the Matcher class: Repeated invocations of
>     the find method will resume where the last match left off,
>     unless the matcher is reset.

I wish there were mention that the Matcher.matches() method adds
implicit boundaries, while Perl does not. Russ Cox's strategy for RE
(re)names that method matches_exactly(), to better express what it
does and clear up confusion.

>   * In Perl, embedded flags at the top level of an expression
>     affect the whole expression. In this class, embedded flags
>     always take effect at the point at which they appear,
>     whether they are at the top level or within a group; in the
>     latter case, flags are restored at the end of the group
>     just as in Perl.

>   * Perl is forgiving about malformed matching constructs, as
>     in the expression *a, as well as dangling brackets, as in
>     the expression abc], and treats them as literals. This
>     class also accepts dangling brackets but is strict about
>     dangling metacharacters like +, ? and *, and will throw a
>     PatternSyntaxException if it encounters them.

This is incorrect; Perl is not forgiving about malformed matching
constructs like the one cited above:

    % perl -e '/*a/'
    Quantifier follows nothing in regex; marked by <-- HERE in m/* <-- HERE a/ at -e line 1.

Perl also supports user-defined character name aliases for \N{...}
and user-defined character properties for \p{...} and \P{...}, but
Java supports neither. Java doesn't even support character names at
all that I can see, and Java definitely doesn't support the full
complement of character properties as defined by the Unicode
Character Database; Perl does. I believe that Java does not support
named character sequences, which are part of what it takes to
support Unicode 6.0 as they are new to that release.

There may be more than this, but it's what came immediately to mind.

Hope this helps!!

--tom

PS: Here's an example of using (?(DEFINE)...) to completely parse
    an RFC 5322 email address, including nested comments. Notice
    how much like a BNF grammar this now becomes. It's a Perl 5
    thing that we backported from Perl 6: very clean, even beautiful.

    $rfc5322 = qr{

       (?(DEFINE)

         (?<address>         (?&mailbox) | (?&group))
         (?<mailbox>         (?&name_addr) | (?&addr_spec))
         (?<name_addr>       (?&display_name)? (?&angle_addr))
         (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
         (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
         (?<display_name>    (?&phrase))
         (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)

         (?<addr_spec>       (?&local_part) \@ (?&domain))
         (?<local_part>      (?&dot_atom) | (?&quoted_string))
         (?<domain>          (?&dot_atom) | (?&domain_literal))
         (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)? \] (?&CFWS)?)
         (?<dcontent>        (?&dtext) | (?&quoted_pair))
         (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

         (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+-/=?^_`{|}~])
         (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
         (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
         (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)

         (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
         (?<quoted_pair>     \\ (?&text))

         (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
         (?<qcontent>        (?&qtext) | (?&quoted_pair))
         (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))* (?&FWS)? (?&DQUOTE) (?&CFWS)?)

         (?<word>            (?&atom) | (?&quoted_string))
         (?<phrase>          (?&word)+)

         # Folding white space
         (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
         (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
         (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
         (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
         (?<CFWS>            (?: (?&FWS)? (?&comment))* (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

         # No whitespace control
         (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

         (?<ALPHA>           [A-Za-z])
         (?<DIGIT>           [0-9])
         (?<CRLF>            \x0d \x0a)
         (?<DQUOTE>          ")
         (?<WSP>             [\x20\x09])
       )

       (?&address)

    }x;
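
Going back to the Matcher.matches() remark above, here is a minimal
Java sketch of the difference. It is only an illustration under
assumed names (the class and its helper methods are hypothetical):
matches() succeeds only if the whole input matches, as if the pattern
were anchored at both ends, while find() looks for a match anywhere
and, when called repeatedly, resumes where the last match left off,
much like Perl's g flag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVsFindDemo {
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // matches(): the entire input must match the pattern.
    static boolean wholeInput(String s) {
        return DIGITS.matcher(s).matches();
    }

    // find(): a match anywhere in the input is enough,
    // which is how an unanchored Perl match behaves.
    static boolean anywhere(String s) {
        return DIGITS.matcher(s).find();
    }

    // Repeated find() calls on one Matcher continue after the
    // previous match, so this counts non-overlapping matches.
    static int countMatches(String s) {
        Matcher m = DIGITS.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(wholeInput("abc 123"));
        System.out.println(anywhere("abc 123"));
        System.out.println(countMatches("10 20 30"));
    }
}
```

Code ported from Perl's unanchored m// usually wants find(), not
matches().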