On Thu, Oct 18, 2012 at 2:51 AM, Matthew Kerwin <[email protected]> wrote:
> Tangentially, it just occurred to me that ruby's regular expression
> engine does the same thing that  javascript's does, when globally
> replacing /X*$/ .

This behavior is common to most regexp engines (at least I don't
know of any that does _not_ behave like this).  Any regular expression
of the form X* can match the empty string, anywhere in the input.

irb(main):022:0> "####".scan /\w*/
=> ["", "", "", "", ""]
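To see where those empty matches sit, here is a small sketch that records the start offset of each match. The helper name scan_with_offsets is made up for illustration; it only wraps String#scan and Regexp.last_match.

```ruby
# Hypothetical helper: return [offset, text] for every match scan finds.
def scan_with_offsets(pattern, s)
  found = []
  s.scan(pattern) do
    m = Regexp.last_match
    found << [m.begin(0), m[0]]
  end
  found
end

# No word characters at all: one empty match per position, including the end.
p scan_with_offsets(/\w*/, "####")

# A non-empty match first, then empty matches at each remaining position.
p scan_with_offsets(/\w*/, "ab##")
```

The second call shows that an empty match is still reported right after a non-empty one, which is what matters once the expression is anchored at the end.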

And when you anchor a portion of the expression at the end and that
portion contains repetition, you need to make sure that its
characters are not eaten by other parts of the regexp.

"naive" approach:

irb(main):026:0> %w{aaa aab abb bbb}.each {|s| /.*(b*)\z/ =~ s; printf "%p: 1:%p\n", s, $1}
"aaa": 1:""
"aab": 1:""
"abb": 1:""
"bbb": 1:""
=> ["aaa", "aab", "abb", "bbb"]
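A quick way to see why the group stays empty is to capture the leading .* as well. This is only a sketch; the extra parentheses around .* are added for illustration and are not part of the expression above.

```ruby
# Capture the greedy prefix too, so we can see which part took the b's.
m = /(.*)(b*)\z/.match("abb")

p m[1]  # the greedy prefix
p m[2]  # what is left over for (b*)
```

The greedy .* takes the whole string, and since b* is happy with nothing, the engine never has to backtrack.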

Working approaches:

1. reduce greed

irb(main):027:0> %w{aaa aab abb bbb}.each {|s| /.*?(b*)\z/ =~ s;
printf "%p: 1:%p\n", s, $1}
"aaa": 1:""
"aab": 1:"b"
"abb": 1:"bb"
"bbb": 1:"bbb"
=> ["aaa", "aab", "abb", "bbb"]

2. negative lookbehind

irb(main):028:0> %w{aaa aab abb bbb}.each {|s| /.*(?<!b)(b*)\z/ =~ s;
printf "%p: 1:%p\n", s, $1}
"aaa": 1:""
"aab": 1:"b"
"abb": 1:"bb"
"bbb": 1:"bbb"
=> ["aaa", "aab", "abb", "bbb"]
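To compare the two approaches side by side, the following sketch captures the prefix as well. The extra group around the prefix and the variable names lazy and lookbehind are added for illustration only.

```ruby
lazy       = /(.*?)(b*)\z/       # prefix takes as little as possible
lookbehind = /(.*)(?<!b)(b*)\z/  # prefix is greedy but may not end in "b"

%w{aaa aab abb bbb}.each do |s|
  # captures returns [prefix, trailing b's] for each expression
  printf "%p: lazy %p, lookbehind %p\n",
         s, lazy.match(s).captures, lookbehind.match(s).captures
end
```

For these inputs both expressions split the string at the same place; they just reach that split from opposite directions.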

Note, though, the special case where the regexp has only a single
alternative, consisting of just the repetition anchored at the end:

irb(main):045:0> for b in body; for pre in segm; for post in segm;
s="#{pre}#{b}#{post}"; printf "%p -> %p\n",s,s[/#*\z/]; end end end
"" -> ""
"#" -> "#"
"##" -> "##"
"#" -> "#"
"##" -> "##"
"###" -> "###"
"##" -> "##"
"###" -> "###"
"####" -> "####"
"foo" -> ""
"foo#" -> "#"
"foo##" -> "##"
"#foo" -> ""
"#foo#" -> "#"
"#foo##" -> "##"
"##foo" -> ""
"##foo#" -> "#"
"##foo##" -> "##"
=> ["", "foo"]

Here, the simple expression works since the # characters are not eaten
by any other portion of the regexp.
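The loop above relies on body and segm, which are not defined in the part of the session shown. The sketch below fills them in as an assumption: body is taken from the loop's return value, and segm is inferred from the printed inputs. It collects the results in a hash instead of printing them.

```ruby
body = ["", "foo"]      # assumed: matches the return value of the loop above
segm = ["", "#", "##"]  # assumed: inferred from the inputs that were printed

results = {}
body.each do |b|
  segm.each do |pre|
    segm.each do |post|
      s = "#{pre}#{b}#{post}"
      # s[regexp] returns the first match, here the trailing run of #
      results[s] = s[/#*\z/]
    end
  end
end

results.each { |s, tail| printf "%p -> %p\n", s, tail }
```

Some inputs (such as "#") are generated more than once, so the hash holds fewer entries than the loop printed lines.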

>  It arose when someone wanted to replace any number
> (or none) of a character at the start and end of a string with exactly
> one of that character.
>
> irb(main):001:0> 'foo'.gsub(/\A#*|#*\Z/, '#')
> => "#foo#"
> irb(main):002:0> '#foo'.gsub(/\A#*|#*\Z/, '#')
> => "#foo#"
> irb(main):003:0> '##foo'.gsub(/\A#*|#*\Z/, '#')
> => "#foo#"
> irb(main):004:0> 'foo#'.gsub(/\A#*|#*\Z/, '#')
> => "#foo##"
> irb(main):005:0> 'foo##'.gsub(/\A#*|#*\Z/, '#')
> => "#foo##"
> irb(main):006:0> '##foo##'.gsub(/\A#*|#*\Z/, '#')
> => "#foo##"

If a single regexp is to be used in this case, negative lookbehind is a
viable option, since this alternative has no preceding part that we
could make non-greedy:

irb(main):044:0> for b in body; for pre in segm; for post in segm;
s="#{pre}#{b}#{post}"; printf "%p -> %p\n",s,s.gsub(/\A#*|(?<!#)#*\z/,
'#'); end end end
"" -> "#"
"#" -> "#"
"##" -> "#"
"#" -> "#"
"##" -> "#"
"###" -> "#"
"##" -> "#"
"###" -> "#"
"####" -> "#"
"foo" -> "#foo#"
"foo#" -> "#foo#"
"foo##" -> "#foo#"
"#foo" -> "#foo#"
"#foo#" -> "#foo#"
"#foo##" -> "#foo#"
"##foo" -> "#foo#"
"##foo#" -> "#foo#"
"##foo##" -> "#foo#"
=> ["", "foo"]
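To make the difference between the two expressions visible, here is a sketch that lists every match gsub would replace. The helper name match_offsets is made up for illustration; it only uses String#gsub with a block and Regexp.last_match.

```ruby
# Hypothetical helper: return [offset, text] for each match gsub visits.
def match_offsets(pattern, s)
  offsets = []
  s.gsub(pattern) do
    m = Regexp.last_match
    offsets << [m.begin(0), m[0]]
    '#'
  end
  offsets
end

original   = /\A#*|#*\Z/        # expression from the quoted message
lookbehind = /\A#*|(?<!#)#*\z/  # expression used in the loop above

# The original finds a second, empty match at the very end of the string.
p match_offsets(original, 'foo#')

# The lookbehind rejects that empty match because a # precedes it.
p match_offsets(lookbehind, 'foo#')

p 'foo#'.gsub(lookbehind, '#')
```

The extra empty match at the end is what produces the doubled # in the quoted output.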

> I blogged about it here:
> http://matthew.kerwin.net.au/blog/20110608_javascript_global_regexp

Turns out with Oniguruma there *is* a way to do it with a single
regexp.  In fact any regexp engine with lookbehind will do.

Reference: http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt

Kind regards

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
