Sherman,

The comparison to Perl 5 in the Java Pattern class documentation needs to be corrected. However, I would not recommend as long a laundry list of missing features from either side as the following email might imply. I'm just trying to be complete, but in doing so, it produces a list that I think is too unruly for inclusion. Part of that, however, may be because I have included a lot of auxiliary information and examples to show you what I mean. Those of course don't need to go in the javadoc.
My minimal suggested change would be to bring it into alignment with the current production release of Perl instead of one from the previous millennium -- and in some cases, from much older still. Whether you choose 5.12 or 5.14, you should clearly state *which* version of Perl you're comparing yourself with: it is the lack of a reference version number that caused this to become so false.

Sherman, you do a much better job than I do of patching javadoc in a way consistent in tone and texture, so I am comfortable leaving this to your discretion.

I hope this helps. If there's anything more I can do to help, please do not hesitate to ask. Thank you for all your work; I am quite enthusiastic about all of this.

--tom

> Comparison to Perl 5

This was applicable to 2000's Perl 5.6 release, and also to a much older version of the Java Pattern class. Both have advanced beyond what the comparison claims.

> The Pattern engine performs traditional NFA-based matching with
> ordered alternation as occurs in Perl 5.

Although I agree that Perl and Java use the same sort of matcher, I'm not sure it is accurate to call it a traditional NFA matcher. Both are recursive backtracking matchers, necessitated by the backref support. The difference between these two algorithms is well explained in Russ Cox's paper on "Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)"

    http://swtch.com/~rsc/regexp/regexp1.html

The Cox paper shows how pathological patterns cause a recursive backtracking algorithm to degrade exponentially with respect to input length, and how that does not occur under a traditional NFA.
It is easy to demonstrate this issue from the command line:

    $ time perl -le 'print(("a" x 19) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
    2.803u 0.000s 0:02.80
    $ time perl -le 'print(("a" x 20) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
    4.077u 0.002s 0:04.08
    $ time perl -le 'print(("a" x 21) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
    6.039u 0.003s 0:06.04
    $ time perl -le 'print(("a" x 22) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
    8.756u 0.000s 0:08.76

In contrast, if you swap in Cox's RE2 library (this is a CPAN module) in place of Perl's default regex engine, that all disappears:

    $ time perl -Mre::engine::RE2 -le 'print(("a" x 19) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
    0.001u 0.003s 0:00.00
    $ time perl -Mre::engine::RE2 -le 'print(("a" x 50) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
    0.002u 0.000s 0:00.00
    $ time perl -Mre::engine::RE2 -le 'print(("a" x 500) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
    0.001u 0.002s 0:00.00
    $ time perl -Mre::engine::RE2 -le 'print(("a" x 5000) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
    0.001u 0.000s 0:00.00

That's because Cox is using a traditional NFA, but Perl (by default) and Java (always) are both using a recursive backtracker variant of the same. Read Cox; he explains it more clearly than I have.

> Perl constructs not supported by this class:
>   The conditional constructs (?{X}) and (?(condition)X|Y),
>   The embedded code constructs (?{code}) and (??{code}),
>   The embedded comment syntax (?#comment), and
>   The preprocessing operations \l, \u, \L, and \U.

Well, yes, but \l, \u, \L, and \U are string-interpolation things: they don't happen in the regex compiler; likewise \Q. If you pass a string with \Q or \U in it to the regex compiler without going through double-quote interpolation, such as when you read it from a file, then those do not happen.

Here are other things that are missing.
Perl release numbers follow the convention that odd numbers are developer releases and even numbers are production releases. I shall therefore only mention even-numbered releases.

== Since the Perl 5.6 release of 2000, Perl also supports these constructs not supported by the Java Pattern class:

* Unicode grapheme clusters via \X.
* Unicode named characters (the Name property) using the \N{NAME} escape via the charnames pragma. This includes those from NameAliases.txt.
* ALL Unicode properties supported by whatever version of the UCD is current at the time of release, not just those from UnicodeData.txt; see http://unicode.org/reports/tr44/#Property_Index for the current list, or the perluniprops manpage on perl 5.12 or better.
* Loose matching of property names and values, including the full names plus all those defined by The Unicode Standard as valid aliases/shortcuts for the same; see also PropertyAliases.txt and PropValueAliases.txt.
* User-defined \p{PROP} properties: you get to make up your own property names and definitions for use in regexes. This tailoring is quite useful.
* Full Unicode casefolding (multichar folds), not just simple casefolding where all folds are to a single code point alone.

== Since the Perl 5.8 release of 2002, Perl also supports these constructs not supported by the Java Pattern class:

* Custom user-defined named characters via \N{NAME}.

== Since the Perl 5.10 release of 2007, Perl also supports these constructs not supported by the Java Pattern class:

* Horizontal Unicode whitespace via \h and \H.
* Vertical Unicode whitespace via \v and \V.
* Any Unicode linebreak sequence via \R.
* The \K "keep this" escape to not include anything to its left in what gets matched; works like a variable-width lookbehind, which is otherwise disallowed.
* The \g{GROUP} notation for backrefs, including normal \g{1}, relative \g{-1}, and named \g{NAME}. This allows you to avoid octal ambiguity and makes for more robustly embeddable patterns.
* The branch-reset operator, (?| (.)(.) | (.)(.) | (.)(.) ), which causes group numbering to restart at each | branch.
* Multiple named groups by the same name: (?<NAME>...) ... (?<NAME>...). After the match, both of those are accessible.
* Recursive patterns through buffer recursion. For example, to match nested parens:

      \((?:[^()]*+|(?0))*\)

  Yes, Perl patterns are now equivalent to recursive-descent parsing, a quantum leap forward. See also the DEFINE block two items below.
* Backtracking control verbs like (*SKIP) and (*MARK).
* Definition-only groups via (?(DEFINE)...) for later execution via (?&NAME), like a regex subroutine:

      (?x)
      (?<NAME>(?&NAME_PAT))
      (?<ADDR>(?&ADDRESS_PAT))
      (?(DEFINE)
          (?<NAME_PAT>....)
          (?<ADDRESS_PAT>....)
      )

  This lets you separate declaration from execution, reuse named abstractions, etc. It is extremely powerful and extremely useful.

Note that it was this release in which Perl gained:

* Named groups via (?<NAME>...) and \k<NAME>.
* Possessive matches via ++, *+, etc.

== Since the Perl 5.12 release of 2010, Perl also supports these constructs not supported by the Java Pattern class:

* The new \N escape to always mean [^\n], even under (?s) matching. This is without braces; with braces it is of course a Unicode named character or sequence.
* The \X escape, supported since 5.6, has tracked the Unicode standard and therefore with this release now matches an extended grapheme cluster per UAX#29.

== Since the Perl 5.14 release of 2011, Perl also supports these constructs not supported by the Java Pattern class:

* The new-to-Unicode-6.0 "named sequences" via \N{NAME}. See NamedSequences.txt.
* The \o{...} octal escape, which guarantees that you never have any \1-style ambiguities between backref \g{10} and octal \o{10}, and also lets you abut an octally specified code point number against other unrelated digits without mistakenly incorporating them into the octal.
BTW, here is which Perl release tracked which Unicode release:

    Perl      Unicode
    version   version
    5.6       3.0.0
    5.8       3.2.0
    5.8.1     4.0.0
    5.8.9     5.1.0
    5.12      5.2.0
    5.14      6.0.0

(I've obviously omitted lots of intermediate releases.)

> Constructs supported by this class but not by Perl:
>   Possessive quantifiers, which greedily match as much as they can
>   and do not back off, even when doing so would allow the overall
>   match to succeed.

Perl has been able to do that since the 5.10 release of 2007.

>   Character-class union and intersection as described above.

This is kinda true and kinda not; in the core regex library, we implement this not by using the Unicode syntax, but rather with either lookaheads or user-defined character properties. To get the full Unicode syntax requires the Unicode::Regex::Set module, which is not part of the core regex engine.

Speaking of which, Perl has quite a few modules that implement various portions of The Unicode Standard, especially the annexes:

    Unicode::Casing          - Perl extension to override system case changing functions
    Unicode::Collate         - Unicode Collation Algorithm
    Unicode::Collate::Locale - Linguistic tailoring for DUCET via Unicode::Collate
    Unicode::GCString        - String as Sequence of UAX #29 Grapheme Clusters
    Unicode::LineBreak       - UAX #14 Unicode Line Breaking Algorithm
    Unicode::Normalize       - Unicode Normalization Forms
    Unicode::Regex::Set      - Subtraction and Intersection of Character Sets in Unicode Regular Expressions
    Unicode::Stringprep      - Preparation of Internationalized Strings (RFC 3454)
    Unicode::UCD             - Unicode character database
    Unicode::Unihan          - The Unihan Data Base

Many of those I use daily. Some of these could arguably be incorporated into the core regex engine. But as even today there are still issues involving canonical matching, it's perhaps good that they are decoupled.
> Notable differences from Perl:
>
>   In Perl, \1 through \9 are always interpreted as back references; a
>   backslash-escaped number greater than 9 is treated as a back
>   reference if at least that many subexpressions exist, otherwise it is
>   interpreted, if possible, as an octal escape. In this class octal
>   escapes must always begin with a zero. In this class, \1 through \9
>   are always interpreted as back references, and a larger number is
>   accepted as a back reference if at least that many subexpressions
>   exist at that point in the regular expression, otherwise the parser
>   will drop digits until the number is smaller or equal to the existing
>   number of groups or it is one digit.

This is still true for reasons of backwards compatibility, but new code should always use constructs like \g{10} for the numbered group and \o{10} for the octal code point number to remove all doubt.

>   Perl uses the g flag to request a match that resumes where the last
>   match left off. This functionality is provided implicitly by the
>   Matcher class: Repeated invocations of the find method will resume
>   where the last match left off, unless the matcher is reset.
>
>   In Perl, embedded flags at the top level of an expression affect the
>   whole expression. In this class, embedded flags always take effect at
>   the point at which they appear, whether they are at the top level or
>   within a group; in the latter case, flags are restored at the end of
>   the group just as in Perl.
>
>   Perl is forgiving about malformed matching constructs, as in the
>   expression *a, as well as dangling brackets, as in the expression
>   abc], and treats them as literals. This class also accepts dangling
>   brackets but is strict about dangling metacharacters like +, ? and *,
>   and will throw a PatternSyntaxException if it encounters them.
While there are indeed regex languages that work that way, Perl is thankfully not one of them:

    $ perl -le 'print if /*a/'
    Quantifier follows nothing in regex; marked by <-- HERE in m/* <-- HERE a/ at -e line 1.

    % perl -le 'print if /?/'
    Quantifier follows nothing in regex; marked by <-- HERE in m/? <-- HERE / at -e line 1.

    % perl -le 'print if /+/'
    Quantifier follows nothing in regex; marked by <-- HERE in m/+ <-- HERE / at -e line 1.

    $ perl -le 'print if /[abc/'
    Unmatched [ in regex; marked by <-- HERE in m/[ <-- HERE abc/ at -e line 1.

The only release that I can find where something like *a was ever accepted by Perl is 1987's initial Perl 1.0 release:

    $ perl1 -e 'print "match\n" if "*a" =~ /*a/;'
    match

Which is going on being a quarter-century out of date! I don't believe there has been a release of Perl since Java has even existed that accepted such things. Please don't cite things from more than 20 years ago. :(