On Sun, Feb 12, 2012 at 2:10 AM, Aaron Patterson
<[email protected]> wrote:

Hi there!

> It's possible to count the number of captures in a given regexp:
>
>    def count_captures regexp
>      Regexp.union([regexp, //]).match('').captures.length
>    end
>

Yeah, Active Support had that some time ago. I think it was used in
routing, but it got removed sometime in the middle of writing the AS guide.
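
For reference, the trick works because Regexp.union preserves the groups
of its operands, and the // alternative guarantees there is always a
match against the empty string, so MatchData reports one capture slot per
group of the original regexp:

    count_captures(/foo/)        # => 0
    count_captures(/(foo)bar/)   # => 1
    count_captures(/(a(b))/)     # => 2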


> With that information, you can calculate offsets.


Which offsets do you have in mind here?

Counting groups seems like a good idea; I think we could get rid of the
ugly _25 named captures.

Since individual inflection rules have well-formed regexps (or strings), we
can count their groups. Therefore, if we add one extra group enclosing each
whole regexp, we can keep track of which group index corresponds to which
rule. That'd be much cleaner!
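
For illustration, a sketch of that bookkeeping (made-up rules and variable
names, reusing count_captures from Aaron's snippet above):

    rules = [/(ax|test)is$/i, /(octop|vir)us$/i, /s$/i]

    # Remember, for each rule, the index its wrapping group will get
    # in the combined regexp.
    offsets = []
    index   = 1
    rules.each do |rule|
      offsets << index
      index += 1 + count_captures(rule) # wrapping group + the rule's own groups
    end

    union = Regexp.union(rules.map { |rule| /(#{rule})/ })

    md = union.match("octopus")
    md && offsets.find_index { |i| md[i] } # => 1, the second rule matched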



> > This is the proof-of-concept: https://gist.github.com/1798985.
>
> Seems good, but you don't need a *args to Regexp.union. :-)
>

Oh, yes :).
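
Indeed, Regexp.union accepts the array directly:

    Regexp.union(/a/, /b/)   # => /(?-mix:a)|(?-mix:b)/
    Regexp.union([/a/, /b/]) # => /(?-mix:a)|(?-mix:b)/, same thing, no splat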


> If you can rule out backreferences, it seems possible you could build a
> state machine that would perform in linear time to the string length.
>

The situation now is:

* Inflections can have arbitrary regexps (modulo backreferences, if we
disallow them). A priori, they can use any regexp feature, and they are
typically case-insensitive.

* Inflections have to be matched in order.

Given that, and the fast regexp engine in 1.9, alternation seems to be a
better and essentially cost-free approach, because what we are talking
about here can be implemented quickly (cost == development cost).

Wouldn't such a state machine be a rewrite of the regexp engine itself?


> It seems though, with your algorithm, as the string gets longer, the
> original implementation is faster:
>
>  https://gist.github.com/1805359
>
> Do we know what kind of data we're dealing with?  Somewhere between
> strings of length 10 and strings of length 100, the original code
> outperforms this version.  If we're only dealing with shorter strings,
> maybe this newer code would be better.
>

Yes, I think this approach is worthwhile, because in practice I believe
Rails and Rails applications singularize and pluralize English words.


>  name = md.names.detect {|n| md[n]}
>
> The length of `names` increases as the number of rules is added to the
> system.  Which means in the case of a match, this algorithm will still
> be O(n)


Yes, yes, but that O(n) seems unavoidable, and it still outperforms the
original algorithm at the sizes I've tested (I don't mean it doesn't with
larger sets of inflections, only that I measured the typical size, which is
what most people use).
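
To make the O(n) concrete, a minimal reconstruction of that lookup,
assuming the _0, _1, ... naming scheme mentioned above (one named group
per rule):

    regexp = /(?<_0>zombies$)|(?<_1>octopus$)/
    md     = regexp.match("zombies")

    # Linear scan: in the worst case every name is visited.
    name = md.names.detect { |n| md[n] } # => "_0"
    name[1..-1].to_i                     # => 0, index of the matching rule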



> where n is the number of rules defined.  We can see this by
> testing strings that match first and strings that match last:
>
>  https://gist.github.com/1805536
>
> I'm using AS 3.2.1, so "zombies" will be best case, and a bunch of x's
> will be worst.
>

Yes.

But we are not really interested in arbitrary strings. If we give a 5x
boost to the vast majority of users, who apply inflections to English words
of ordinary length, and some minority using long strings gets some penalty,
that's fine, I think.
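
For anyone who wants to reproduce the extremes, something in the spirit of
the gists above (measurements left to the reader; numbers vary per machine):

    require 'benchmark'
    require 'active_support/inflector'

    best  = "zombies"  # best case, per Aaron's note above
    worst = "x" * 100  # worst case: falls through every rule

    [best, worst].each do |word|
      secs = Benchmark.realtime do
        10_000.times { ActiveSupport::Inflector.singularize(word) }
      end
      puts "%-12s %.4fs" % [word[0, 12], secs]
    end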
