Sergey,

The way ORO's matches() function works in this case coincides with the way the perl regex engine works. While the foo|foot example may seem odd to someone new to regular expressions, it is a well-understood consequence of perl's matching mechanics. The simple rule about perl's regex that may help is: perl's alternations are not greedy. They are tried in order, left to right, and the first alternative that lets the overall match succeed wins. Knowing this, and understanding the basic workings of traditional NFA engines, should help explain why foo|foot will match "foo" in "football" and not "foot" as one might expect. Other types of engines (DFA or POSIX NFA) will always return the longest of the leftmost matches, in this case "foot".
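If you want to see the difference for yourself, here is a minimal sketch. It is written in Python only because its re module is, like perl and ORO, a backtracking engine with ordered alternation; it is not ORO code. The leftmost_longest helper is hypothetical: it just emulates what a POSIX-style engine would report by trying each alternative separately and keeping the longest hit.

```python
import re

def first_alternative(pattern, text):
    """Return what a backtracking (perl-style) engine matches at the start of text."""
    m = re.match(pattern, text)
    return m.group(0) if m else None

def leftmost_longest(alternatives, text):
    """Hypothetical helper: emulate POSIX leftmost-longest semantics by
    trying every alternative at the start of text and keeping the longest."""
    hits = []
    for alt in alternatives:
        m = re.match(alt, text)
        if m:
            hits.append(m.group(0))
    return max(hits, key=len) if hits else None

print(first_alternative("foo|foot", "football"))      # foo  (first alternative wins)
print(first_alternative("foot|foo", "football"))      # foot (order changed the result)
print(leftmost_longest(["foo", "foot"], "football"))  # foot (order is irrelevant)
```

The first two calls differ only in the order of the alternatives, which is exactly the behavior under discussion.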
A traditional NFA engine such as perl's starts with the first element of the regular expression and tries to match it against the first character of the text. If you have the expression \d{3}|\d{3}\.\d{2} and the text "123.45", perl will look at \d{3} first and see if it can match it starting at "1" in "123". For a match to succeed, at least one permutation of the regular expression must match the text. What's different about a traditional NFA is that the first permutation that matches is good enough. Knowing this can be pretty valuable, since you can craft your expressions so that the permutations which would match fastest are tried first. In many cases a fine-tuned perl regex can outperform a DFA regex, which keeps track of all matches so far until it finds the longest.

Another important thing to note is that all quantifiers (like * and ?) are greedy by default. With this in mind, one can achieve a greedy alternation in a traditional NFA by using the ? quantifier. If you had an expression like he(ll|llo) and the text "hello", you could match "hello" by rewriting the expression as he(ll(o)?). However, I would still rewrite the expression as he(llo|ll), as long as I understand that it translates to "Match hello if you can; if not, try to match hell". This is not the same as "Match the longest of either hell or hello". If you think about it, that last sentence is semantically equivalent to "Match the longest of either hello or hell", which is also the order perl expects if you want that interpretation.

DFA and POSIX engines will try to match the text to the regular expression. So in your case, they'll take "1" in "123.45" and try to match it against \d{3}|\d{3}\.\d{2}. The result will always be the same, no matter how you order your alternations, as the longest match wins. These are clearly two different approaches to matching.

This said, asking for ORO's matcher to have greedy alternations would be asking for a completely different flavor of regex engine inside ORO.
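The expressions above can be tried out with a short sketch. Again this is Python's re module standing in for a perl-style backtracking engine, not ORO itself, and match_at_start is just an illustrative helper name.

```python
import re

def match_at_start(pattern, text):
    """Return the text matched at the start of the input, or None."""
    m = re.match(pattern, text)
    return m.group(0) if m else None

# Ordered alternation: the first alternative that succeeds is good enough.
print(match_at_start(r"\d{3}|\d{3}\.\d{2}", "123.45"))  # 123
# Putting the longer alternative first yields the longer match.
print(match_at_start(r"\d{3}\.\d{2}|\d{3}", "123.45"))  # 123.45

# Same idea with the he(ll|llo) example.
print(match_at_start(r"he(ll|llo)", "hello"))   # hell
print(match_at_start(r"he(llo|ll)", "hello"))   # hello
# A greedy ? quantifier gives the "longest" behavior without reordering.
print(match_at_start(r"he(ll(o)?)", "hello"))   # hello
# The reordered form still falls back to the shorter alternative.
print(match_at_start(r"he(llo|ll)", "hell"))    # hell
```

Note that only the order of the alternatives, or the use of a greedy quantifier, changes what is matched; the set of strings each pattern can match is the same.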
Finally, making this type of change would break many applications which currently rely on perl's regex semantics.

Regards,
-Rob

On Fri, 13 May 2005 23:11:27 +0400, Sergey Samokhodkin wrote
> Hello Daniel!
>
> Friday, May 6, 2005, 11:16:54 PM, you wrote:
>
> DFS> .....
> DFS> The heart of the matter seems to be a difference in expectations. I
>
> Of course, but isn't my expectation *natural*?
>
> DFS> understand why you could expect matches() to behave that way. However,
> DFS> its documentation explains that it's not the same as ^pattern$. I'll
>
> In fact, it only states the difference without any real explanation.
> Let me cite:
> > matches() literally looks for an exact match according to the rules
> > of Perl5 expression matching. Therefore, if you have a pattern
> > foo|foot and are matching the input foot it will not produce an exact match
>
> How "therefore"???
> Anyone who finds it clear (esp. Kevin Markey), please guess which of
> the following is true:
>
> /foot?/ matches "foot"
> /foot?/ matches "foo"
> /foot??/ matches "foot"
> /foot??/ matches "foo"
>
> DFS> matches() tests whether or not a pattern matches the input it is given.
> DFS> This means that the matching process must start at the beginning of
> DFS> the input and stop at the end of the input. If the matching process stops
> DFS> before the end of the input, then there's no match. The method answers
> DFS> the question "Is this input character sequence a member of the set of all
> DFS> the character sequences matched by this pattern?"
>
> Ooops!
> The matching set for "foo|foot" is {"foo","foot"}.
> The matching set for "foot|foo" is ***the same***. Order doesn't
> matter in sets.
>
> DFS> It may make more sense thinking about it this way. matches() returns true
> DFS> if and only if S =~ m/(P)/ is true and $1 equals S. For example:
>
> DFS> sub matches(@) {
> DFS>     my ($pat, $str) = @_;
> DFS>     $str =~ m/($pat)/;
> DFS>     return ($str eq $1);
> DFS> }
>
> DFS> printf "%d\n%d\n", matches("foo|foot", "foo"), matches("foo|foot", "foot");
>
> Yes, that's it. Something like that had to be in the docs.
>
> DFS> In my opinion, the important thing is for the behavior to be documented.
> DFS> If it's not sufficiently clear, then we ought to make it more clear.
> DFS> Documentation patches are welcome.
>
> DFS> Now, one can argue that we should add a validate() method specifically
> DFS> for input validation with the behavior you expected. My opinion
>
> I'd say that the best method would be the "matches()" itself (see my
> first remark).
> Otherwise the question is closed.
> Thanks a lot for your patience.
>
> DFS> daniel
>
> --
> Best regards,
> Sergey
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]