Re: j.u.r.Pattern documentation errors

Xueming Shen Sun, 23 Jan 2011 23:25:07 -0800

Thanks Tom.

That part of the doc definitely needs a revisit. It was written before 2002
(probably against Perl 5.6) and has not been touched since; a lot of it is no
longer true given the latest 5.12.


-Sherman

On 1-23-2011 02:14 PM, Tom Christiansen wrote:

In this message I cover only those errors made in the final
section ("Comparison to Perl 5") of:

     http://download.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

I really hope no one is offended by this.  I don't mean to be
a nitpicker.  Technical errors in the documentation should be
very very easy to correct, since no code change is required.

========================================

     Comparison to Perl 5
     The Pattern engine performs traditional NFA-based matching
     with ordered alternation as occurs in Perl 5.
     Perl constructs not supported by this class:
     * The conditional constructs (?{X}) and (?(condition)X|Y),

That should instead read:

     The conditional constructs (?(condition)X) and (?(condition)X|Y),

     * The embedded code constructs (?{code}) and (??{code}),
     * The embedded comment syntax (?#comment), and
     * The preprocessing operations \l \u, \L, and \U.

That is no longer true, as Java supports those now.

There is quite a bit missing from the list of Perl constructs
unsupported by this class.

     * Perl regex escapes: \x{...}, \R, \h, \H, \v, \V,
        \X, \N, \N{...}, \K, and recently \o{...}.
        [NB: My rewrite library covers the top row.]

     * Relative buffers like \g{-2} for $-2, or the \g{NAME}
       alias for a named backref \k<NAME>.

     * The branch-reset operator: (?|...)

     * Buffer recursion (?0) (?1) (?&NAME) etc to allow
       recursive regexes; e.g. \((?:[^()]*+|(?0))*\) matches
       nested parens.

     * Non-executing definition-only blocks via (?(DEFINE)...)
       to allow the separation of execution from declaration.
       See post-sig example.

     * Backtracking control verbs like (*MARK:NAME), (*FAIL), (*SKIP)

========================================

     Constructs supported by this class but not by Perl:
     * Possessive quantifiers, which greedily match as much as
       they can and do not back off, even when doing so would
       allow the overall match to succeed.

This is not true.  Perl understands the same possessive
quantifiers that Java does.
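For readers unfamiliar with the construct, here is a minimal Java sketch of what a possessive quantifier does. The class and method names are illustrative only. The same a*+a pattern is also accepted by Perl 5.10 and later, which is the point being made above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class PossessiveDemo {
    // Greedy: a* first takes every 'a', then backs off one so the final 'a' can match.
    static boolean greedy(String s) {
        return Pattern.matches("a*a", s);
    }

    // Possessive: a*+ takes every 'a' and never gives one back,
    // so the trailing 'a' has nothing left to match.
    static boolean possessive(String s) {
        return Pattern.matches("a*+a", s);
    }

    public static void main(String[] args) {
        System.out.println(greedy("aaa"));      // true
        System.out.println(possessive("aaa"));  // false
    }
}
```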

     * Character-class union and intersection as described above.

True.  In Perl you have to use lookahead assertions to effect
the same end.
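As a rough illustration of the two approaches, the sketch below matches runs of lowercase ASCII consonants twice: once with Java's class intersection, and once with the lookahead idiom one would use in Perl. Both patterns are run under Java here, and the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class ClassIntersectionDemo {
    // Java-only syntax: letters a-z intersected with "not a vowel".
    static final Pattern INTERSECTION = Pattern.compile("[a-z&&[^aeiou]]+");

    // Portable idiom: a negative lookahead filters each character of the class.
    static final Pattern LOOKAHEAD = Pattern.compile("(?:(?![aeiou])[a-z])+");

    static boolean viaIntersection(String s) {
        return INTERSECTION.matcher(s).matches();
    }

    static boolean viaLookahead(String s) {
        return LOOKAHEAD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(viaIntersection("rhythm") + " " + viaLookahead("rhythm")); // true true
        System.out.println(viaIntersection("rhyme") + " " + viaLookahead("rhyme"));   // false false
    }
}
```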

     Notable differences from Perl:

I would certainly put these two in the very front of this section:

     * Perl's charclass shortcuts all work **VERY DIFFERENTLY** from
       Java's, including \w \W \s \S \d \D \b \B.  [NOTE: my rewrite
       library fixes this.]

     * Perl supports all official Unicode properties, and follows
       all strong recommendations in tr18, whereas Java does neither.
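To make the first point concrete, here is a small sketch of how Java's default \d and \w treat non-ASCII input. It assumes Java 7 or later, where the Pattern.UNICODE_CHARACTER_CLASS flag is available to opt in to Unicode-aware shortcuts. The class and method names are illustrative only.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; assumes Java 7+ for UNICODE_CHARACTER_CLASS.
class ShortcutDemo {
    // Default behaviour: the shortcuts are ASCII-only.
    static boolean asciiOnly(String regex, String s) {
        return Pattern.compile(regex).matcher(s).matches();
    }

    // Opt-in behaviour: the shortcuts follow Unicode properties.
    static boolean unicodeAware(String regex, String s) {
        return Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s).matches();
    }

    public static void main(String[] args) {
        String arabicThree = "\u0663"; // U+0663 ARABIC-INDIC DIGIT THREE
        String eAcute = "\u00e9";      // U+00E9 LATIN SMALL LETTER E WITH ACUTE

        System.out.println(asciiOnly("\\d", arabicThree));    // false
        System.out.println(unicodeAware("\\d", arabicThree)); // true
        System.out.println(asciiOnly("\\w", eAcute));         // false
        System.out.println(unicodeAware("\\w", eAcute));      // true
    }
}
```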

     * In Perl, \1 through \9 are always interpreted as back
       references; a backslash-escaped number greater than 9 is
       treated as a back reference if at least that many
       subexpressions exist, otherwise it is interpreted, if
       possible, as an octal escape. In this class octal escapes
       must always begin with a zero. In this class, \1 through \9
       are always interpreted as back references, and a larger
       number is accepted as a back reference if at least that
       many subexpressions exist at that point in the regular
       expression, otherwise the parser will drop digits until the
       number is smaller or equal to the existing number of groups
       or it is one digit.

I think it is more important to state that Perl does not require a leading
0, so \377 is the octal escape for 0xFF.  BTW, the new \o{...} is unambiguously
an octal escape, just as \g{...} is unambiguously a backref group.
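On the Java side, the leading zero looks like this in practice. This is just a sketch with made-up names; note that the regex is written as "\\0377" in Java source so that the regex engine, not the Java compiler, interprets the octal escape.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class OctalEscapeDemo {
    // In java.util.regex an octal escape must start with a zero: \0377 is U+00FF.
    static boolean matchesOctal377(String s) {
        return Pattern.matches("\\0377", s);
    }

    // \0101 is octal 101, decimal 65, the letter 'A'.
    static boolean matchesOctal101(String s) {
        return Pattern.matches("\\0101", s);
    }

    public static void main(String[] args) {
        System.out.println(matchesOctal377("\u00FF")); // true
        System.out.println(matchesOctal101("A"));      // true
    }
}
```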

     * Perl uses the g flag to request a match that resumes where
       the last match left off. This functionality is provided
       implicitly by the Matcher class: Repeated invocations of
       the find method will resume where the last match left off,
       unless the matcher is reset.

I wish there were mention that the Matcher.matches() method adds
implicit boundaries, while Perl does not.

Russ Cox's strategy for RE (re)names that method matches_exactly(),
to better express what it does and clear up confusion.
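The difference is easy to see side by side. The sketch below (class and method names invented for the example) shows matches() behaving as if the pattern were anchored at both ends, while repeated find() calls walk through the input much like Perl's /g.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class MatchVsFindDemo {
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // matches() succeeds only if the entire input is matched.
    static boolean wholeInput(String s) {
        return DIGITS.matcher(s).matches();
    }

    // Each find() resumes where the previous match ended.
    static List<String> allMatches(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = DIGITS.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(wholeInput("a1b22c333")); // false
        System.out.println(wholeInput("123"));       // true
        System.out.println(allMatches("a1b22c333")); // [1, 22, 333]
    }
}
```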

     * In Perl, embedded flags at the top level of an expression
       affect the whole expression. In this class, embedded flags
       always take effect at the point at which they appear,
       whether they are at the top level or within a group; in the
       latter case, flags are restored at the end of the group
       just as in Perl.
     * Perl is forgiving about malformed matching constructs, as
       in the expression *a, as well as dangling brackets, as in
       the expression abc], and treats them as literals. This
       class also accepts dangling brackets but is strict about
       dangling metacharacters like +, ? and *, and will throw a
       PatternSyntaxException if it encounters them.

This is incorrect; Perl is not forgiving about malformed matching
constructs like the one cited above:

     % perl -e '/*a/'
     Quantifier follows nothing in regex; marked by <-- HERE in m/* <-- HERE a/ at -e line 1.
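For comparison, the Java half of the two quoted bullets (embedded flag scope and strictness about dangling metacharacters) can be checked with a short sketch like this one. The class name and the compiles helper are made up for illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; the names are illustrative only.
class FlagsAndStrictnessDemo {
    // Returns true if the pattern compiles, false on PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // (?i) takes effect from the point where it appears.
        System.out.println(Pattern.matches("a(?i)b", "aB"));     // true
        System.out.println(Pattern.matches("a(?i)b", "Ab"));     // false

        // Inside a group, the flag is restored when the group ends.
        System.out.println(Pattern.matches("(?:(?i)a)b", "Ab")); // true
        System.out.println(Pattern.matches("(?:(?i)a)b", "AB")); // false

        // A dangling quantifier is rejected; a dangling ] is a literal.
        System.out.println(compiles("*a"));                      // false
        System.out.println(Pattern.matches("abc]", "abc]"));     // true
    }
}
```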

Perl also supports user-defined character name aliases for
\N{...} and user-defined character properties for \p{...} and
\P{...}, but Java supports neither.  Java doesn't even support
character names at all that I can see, and Java definitely
doesn't support the full complement of character properties as
defined by the Unicode Character Database; Perl does.

I believe that Java does not support named character sequences,
which are part of what it takes to support Unicode 6.0 as they
are new to that release.

There may be more than this, but it's what came immediately
to mind.

Hope this helps!!

--tom

PS: Here's an example of using (?(DEFINE)...) to completely parse
     an RFC 5322 email address, including nested comments. Notice
     how much like a BNF grammar this now becomes. It's a Perl 5
     thing that we backported from Perl 6: very clean, even beautiful.

     $rfc5322 = qr{
        (?(DEFINE)
         (?<address>          (?&mailbox) | (?&group))
         (?<mailbox>          (?&name_addr) | (?&addr_spec))
         (?<name_addr>        (?&display_name)? (?&angle_addr))
         (?<angle_addr>       (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
         (?<group>            (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ;
                              (?&CFWS)?)
         (?<display_name>     (?&phrase))
         (?<mailbox_list>     (?&mailbox) (?: , (?&mailbox))*)

         (?<addr_spec>        (?&local_part) \@ (?&domain))
         (?<local_part>       (?&dot_atom) | (?&quoted_string))
         (?<domain>           (?&dot_atom) | (?&domain_literal))
         (?<domain_literal>   (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                       \] (?&CFWS)?)
         (?<dcontent>         (?&dtext) | (?&quoted_pair))
         (?<dtext>            (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

         (?<atext>            (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+\-/=?^_`{|}~])
         (?<atom>             (?&CFWS)? (?&atext)+ (?&CFWS)?)
         (?<dot_atom>         (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
         (?<dot_atom_text>    (?&atext)+ (?: \. (?&atext)+)*)

         (?<text>             [\x01-\x09\x0b\x0c\x0e-\x7f])
         (?<quoted_pair>      \\ (?&text))

         (?<qtext>            (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
         (?<qcontent>         (?&qtext) | (?&quoted_pair))
         (?<quoted_string>    (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                              (?&FWS)? (?&DQUOTE) (?&CFWS)?)

         (?<word>             (?&atom) | (?&quoted_string))
         (?<phrase>           (?&word)+)

         # Folding white space
         (?<FWS>              (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
         (?<ctext>            (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
         (?<ccontent>         (?&ctext) | (?&quoted_pair) | (?&comment))
         (?<comment>          \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
         (?<CFWS>             (?: (?&FWS)? (?&comment))*
                             (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

         # No whitespace control
         (?<NO_WS_CTL>        [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

         (?<ALPHA>            [A-Za-z])
         (?<DIGIT>            [0-9])
         (?<CRLF>             \x0d \x0a)
         (?<DQUOTE>           ")
         (?<WSP>              [\x20\x09])
        )

        (?&address)

     }x;
