In this message I cover only those errors made in the final
section ("Comparison to Perl 5") of:
http://download.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
I really hope no one is offended by this. I don't mean to be
a nitpicker. Technical errors in the documentation should be
very very easy to correct, since no code change is required.
========================================
Comparison to Perl 5
The Pattern engine performs traditional NFA-based matching
with ordered alternation as occurs in Perl 5.
Perl constructs not supported by this class:
* The conditional constructs (?{X}) and (?(condition)X|Y),
That should instead read:
The conditional constructs (?(condition)X) and (?(condition)X|Y),
* The embedded code constructs (?{code}) and (??{code}),
* The embedded comment syntax (?#comment), and
* The preprocessing operations \l \u, \L, and \U.
That is no longer true, as Java supports those now.
There is quite a bit missing from the list of Perl constructs
unsupported by this class.
* Perl regex escapes: \x{...}, \R, \h, \H, \v, \V,
\X, \N, \N{...}, \K, and recently \o{...}.
[NB: My rewrite library covers the top row.]
* Relative buffers like \g{-2} for $-2, or the \g{NAME}
alias for a named backref \k<NAME>.
* The branch-reset operator: (?|...)
* Buffer recursion (?0) (?1) (?&NAME) etc to allow
recursive regexes; e.g. \((?:[^()]*+|(?0))*\) matches
nested parens.
* Non-executing definition-only blocks via (?(DEFINE)...)
to allow the separation of execution from declaration.
See post-sig example.
* Backtracking control verbs like (*MARK:NAME), (*FAIL), (*SKIP)
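
For anyone who wants to check the list above against their own JDK, here is a minimal Java sketch. It only tests whether Pattern.compile accepts a pattern at all. The class name UnsupportedPerlConstructs and the helper compiles() are made up for illustration, and only three of the constructs are probed, because several of the escapes listed earlier behave differently from one Java release to the next.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class UnsupportedPerlConstructs {
    // Hypothetical helper: true if java.util.regex accepts the pattern,
    // false if it throws PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] perlOnly = {
            "(?|(a)|(b))",              // branch reset
            "\\((?:[^()]*+|(?0))*\\)",  // recursion into the whole pattern
            "(?(DEFINE)(?<x>a))(?&x)",  // definition-only block plus named call
        };
        for (String regex : perlOnly) {
            System.out.println(regex + " compiles: " + compiles(regex));
        }
    }
}
```

Each of these is valid Perl 5.10+ syntax according to the list above; the sketch just reports what the Java parser makes of the same text.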
========================================
Constructs supported by this class but not by Perl:
* Possessive quantifiers, which greedily match as much as
they can and do not back off, even when doing so would
allow the overall match to succeed.
This is not true. Perl understands the same possessive
quantifiers that Java does.
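
To make the "do not back off" wording concrete, here is a small Java sketch comparing a greedy and a possessive quantifier on the same input. The class and method names are invented for the example; the two patterns differ only in the extra +.

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // Greedy: a* first takes every 'a', then gives one back so the final 'a' can match.
    static boolean greedyMatches(String s) {
        return Pattern.matches("a*a", s);
    }

    // Possessive: a*+ takes every 'a' and never gives any back,
    // so nothing is left for the final 'a'.
    static boolean possessiveMatches(String s) {
        return Pattern.matches("a*+a", s);
    }

    public static void main(String[] args) {
        System.out.println("greedy     a*a  on aaa: " + greedyMatches("aaa"));
        System.out.println("possessive a*+a on aaa: " + possessiveMatches("aaa"));
    }
}
```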
* Character-class union and intersection as described above.
True. In Perl you have to use lookahead assertions to effect
the same end.
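
As a sketch of that equivalence, the following Java snippet matches "a lowercase ASCII letter that is not a vowel" two ways: once with Java's && intersection and once with the lookahead formulation one would use in Perl. The class name and the sample class are assumptions chosen for illustration, and both patterns are run through java.util.regex here.

```java
import java.util.regex.Pattern;

class IntersectionDemo {
    // Java-only syntax: intersection of [a-z] with "not a vowel".
    static final Pattern INTERSECTION = Pattern.compile("[a-z&&[^aeiou]]");

    // Portable formulation: assert [a-z] with a lookahead, then consume a non-vowel.
    static final Pattern LOOKAHEAD = Pattern.compile("(?=[a-z])[^aeiou]");

    static boolean viaIntersection(String s) {
        return INTERSECTION.matcher(s).matches();
    }

    static boolean viaLookahead(String s) {
        return LOOKAHEAD.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"b", "e", "B", "1"}) {
            System.out.println(s + ": " + viaIntersection(s) + " / " + viaLookahead(s));
        }
    }
}
```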
Notable differences from Perl:
I would certainly put these two in the very front of this section:
* Perl's charclass shortcuts all work **VERY DIFFERENTLY** from
Java's, including \w \W \s \S \d \D \b \B. [NOTE: my rewrite
library fixes this.]
* Perl supports all official Unicode properties, and follows
all strong recommendations in tr18, whereas Java does neither.
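
One way to see the charclass difference from the Java side is the sketch below. It assumes Java 7 or later, where the Pattern.UNICODE_CHARACTER_CLASS flag exists, and uses two sample characters picked for the example: U+0663 (ARABIC-INDIC DIGIT THREE) and U+00E9 (e with acute). The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class ShortcutDemo {
    // Default Java behaviour: \d and \w cover ASCII only.
    static boolean matchesDefault(String regex, String s) {
        return Pattern.compile(regex).matcher(s).matches();
    }

    // With UNICODE_CHARACTER_CLASS the shortcuts follow the Unicode definitions.
    static boolean matchesUnicode(String regex, String s) {
        return Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS).matcher(s).matches();
    }

    public static void main(String[] args) {
        String arabicThree = "\u0663";
        String eAcute = "\u00E9";
        System.out.println("\\d default: " + matchesDefault("\\d", arabicThree));
        System.out.println("\\d unicode: " + matchesUnicode("\\d", arabicThree));
        System.out.println("\\w default: " + matchesDefault("\\w", eAcute));
        System.out.println("\\w unicode: " + matchesUnicode("\\w", eAcute));
    }
}
```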
* In Perl, \1 through \9 are always interpreted as back
references; a backslash-escaped number greater than 9 is
treated as a back reference if at least that many
subexpressions exist, otherwise it is interpreted, if
possible, as an octal escape. In this class octal escapes
must always begin with a zero. In this class, \1 through \9
are always interpreted as back references, and a larger
number is accepted as a back reference if at least that
many subexpressions exist at that point in the regular
expression, otherwise the parser will drop digits until the
number is smaller or equal to the existing number of groups
or it is one digit.
I think it more important to state that Perl does not require a leading
0, so in Perl \377 is the octal escape for 0xFF. BTW, the new \o{...} is
unambiguously an octal escape, just as \g{...} is unambiguously a backref group.
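
For the Java half of that comparison, a minimal sketch: the leading zero is what makes the escape octal, so \0377 is the spelling that matches U+00FF. The class and method names are invented for the example. The sketch deliberately does not run \377 without the zero, since the quoted text says Java treats that as a back reference attempt.

```java
import java.util.regex.Pattern;

class OctalEscapeDemo {
    // True if the given regex matches the single character U+00FF.
    static boolean matchesFF(String regex) {
        return Pattern.matches(regex, "\u00FF");
    }

    public static void main(String[] args) {
        // Octal 377 = 3*64 + 7*8 + 7 = 255 = 0xFF.
        System.out.println("\\0377 matches U+00FF: " + matchesFF("\\0377"));
        System.out.println("\\xFF  matches U+00FF: " + matchesFF("\\xFF"));
    }
}
```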
* Perl uses the g flag to request a match that resumes where
the last match left off. This functionality is provided
implicitly by the Matcher class: Repeated invocations of
the find method will resume where the last match left off,
unless the matcher is reset.
I wish there were mention that the Matcher.matches() method adds
implicit boundaries, while Perl does not.
Russ Cox's strategy for RE (re)names that method matches_exactly(),
to better express what it does and clear up confusion.
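
Both points can be seen in a few lines of Java. This is only a sketch, with a made-up class name and sample input: findAll() shows repeated find() calls resuming where the previous match ended, much like a Perl //g loop, and the other two methods show that matches() succeeds only when the pattern covers the entire input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindVsMatches {
    // Collect every match; each find() resumes after the previous match.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // matches() behaves as if the pattern were anchored at both ends.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // find() succeeds if the pattern matches anywhere in the input.
    static boolean findsAnywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "a1b22c333";
        System.out.println(findAll("\\d+", input));
        System.out.println("matches(): " + matchesWhole("\\d+", input));
        System.out.println("find():    " + findsAnywhere("\\d+", input));
    }
}
```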
* In Perl, embedded flags at the top level of an expression
affect the whole expression. In this class, embedded flags
always take effect at the point at which they appear,
whether they are at the top level or within a group; in the
latter case, flags are restored at the end of the group
just as in Perl.
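
A short Java sketch of the behaviour described for this class, with illustrative names: the flag in a(?i)b applies only from the point where it appears, and a flag set inside a group is restored when the group closes.

```java
import java.util.regex.Pattern;

class EmbeddedFlagDemo {
    // (?i) appears after 'a', so only 'b' is matched case-insensitively.
    static boolean midPattern(String s) {
        return Pattern.matches("a(?i)b", s);
    }

    // (?i) is set inside the group, so it is switched off again before 'b'.
    static boolean insideGroup(String s) {
        return Pattern.matches("(?:(?i)a)b", s);
    }

    public static void main(String[] args) {
        System.out.println("a(?i)b     vs aB: " + midPattern("aB"));
        System.out.println("a(?i)b     vs AB: " + midPattern("AB"));
        System.out.println("(?:(?i)a)b vs Ab: " + insideGroup("Ab"));
        System.out.println("(?:(?i)a)b vs AB: " + insideGroup("AB"));
    }
}
```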
* Perl is forgiving about malformed matching constructs, as
in the expression *a, as well as dangling brackets, as in
the expression abc], and treats them as literals. This
class also accepts dangling brackets but is strict about
dangling metacharacters like +, ? and *, and will throw a
PatternSyntaxException if it encounters them.
This is incorrect; Perl is not forgiving about malformed matching
constructs like the one cited above:
% perl -e '/*a/'
Quantifier follows nothing in regex; marked by <-- HERE in m/* <-- HERE a/
at -e line 1.
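
The Java side of that paragraph can be checked with the sketch below (class and helper names are hypothetical). A leading quantifier is rejected with a PatternSyntaxException, while a dangling closing bracket is accepted and matched literally.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class DanglingDemo {
    // Hypothetical helper: true if Pattern.compile accepts the pattern.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String regex : new String[] {"*a", "+a", "?a", "abc]"}) {
            System.out.println(regex + " compiles: " + compiles(regex));
        }
        // The stray ']' is treated as an ordinary character.
        System.out.println("abc] matches itself: " + Pattern.matches("abc]", "abc]"));
    }
}
```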
Perl also supports user-defined character name aliases for
\N{...} and user-defined character properties for \p{...} and
\P{...}, but Java supports neither. Java doesn't even support
character names at all that I can see, and Java definitely
doesn't support the full complement of character properties as
defined by the Unicode Character Database; Perl does.
I believe that Java does not support named character sequences,
which are part of what it takes to support Unicode 6.0 as they
are new to that release.
There may be more than this, but it's what came immediately
to mind.
Hope this helps!!
--tom
PS: Here's an example of using (?(DEFINE)...) to completely parse
an RFC 5322 email address, including nested comments. Notice
how much like a BNF grammar this now becomes. It's a Perl 5
thing that we backported from Perl 6: very clean, even beautiful.
$rfc5322 = qr{
(?(DEFINE)
(?<address> (?&mailbox) | (?&group))
(?<mailbox> (?&name_addr) | (?&addr_spec))
(?<name_addr> (?&display_name)? (?&angle_addr))
(?<angle_addr> (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
(?<group> (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ;
(?&CFWS)?)
(?<display_name> (?&phrase))
(?<mailbox_list> (?&mailbox) (?: , (?&mailbox))*)
(?<addr_spec> (?&local_part) \@ (?&domain))
(?<local_part> (?&dot_atom) | (?&quoted_string))
(?<domain> (?&dot_atom) | (?&domain_literal))
(?<domain_literal> (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
\] (?&CFWS)?)
(?<dcontent> (?&dtext) | (?&quoted_pair))
(?<dtext> (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])
(?<atext> (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+\-/=?^_`{|}~])
(?<atom> (?&CFWS)? (?&atext)+ (?&CFWS)?)
(?<dot_atom> (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
(?<dot_atom_text> (?&atext)+ (?: \. (?&atext)+)*)
(?<text> [\x01-\x09\x0b\x0c\x0e-\x7f])
(?<quoted_pair> \\ (?&text))
(?<qtext> (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
(?<qcontent> (?&qtext) | (?&quoted_pair))
(?<quoted_string> (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
(?&FWS)? (?&DQUOTE) (?&CFWS)?)
(?<word> (?&atom) | (?&quoted_string))
(?<phrase> (?&word)+)
# Folding white space
(?<FWS> (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
(?<ctext> (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
(?<ccontent> (?&ctext) | (?&quoted_pair) | (?&comment))
(?<comment> \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
(?<CFWS> (?: (?&FWS)? (?&comment))*
(?: (?:(?&FWS)? (?&comment)) | (?&FWS)))
# No whitespace control
(?<NO_WS_CTL> [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])
(?<ALPHA> [A-Za-z])
(?<DIGIT> [0-9])
(?<CRLF> \x0d \x0a)
(?<DQUOTE> ")
(?<WSP> [\x20\x09])
)
(?&address)
}x;