On Sun, Feb 12, 2012 at 03:14:34AM +0100, Xavier Noria wrote:
> On Sun, Feb 12, 2012 at 2:10 AM, Aaron Patterson
> <[email protected]> wrote:
> > Hi there!
> >
> > It's possible to count the number of captures in a given regexp:
> >
> >     def count_captures regexp
> >       Regexp.union([regexp, //]).match('').captures.length
> >     end
>
> Yeah, Active Support had that some time ago. I think it was used in
> routing, but it got removed sometime in the middle of writing the AS
> guide.
>
> > With that information, you can calculate offsets.
>
> Which offsets do you have in mind here?
>
> Counting groups seems a good idea, we could get rid of the ugly _25
> named captures I think.
>
> Since individual inflection rules have well-formed regexps (or
> strings), we can count their groups. Therefore, if we add one extra
> group enclosing the whole regexp we can keep track of which group
> index corresponds to which rule. That'd be much cleaner!
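For concreteness, that enclosing-group bookkeeping could look roughly
like this. This is only a sketch, not the code from the gist below; it
assumes the rules are an ordered list of [regexp, replacement] pairs:

    # Count the captures in a regexp (the trick quoted above).
    def count_captures regexp
      Regexp.union([regexp, //]).match('').captures.length
    end

    # Wrap each rule in one extra group and remember the capture index
    # where that rule's groups start.
    def build_combined rules
      offsets = []
      index   = 0
      wrapped = rules.map do |regexp, _|
        offsets << index
        index += 1 + count_captures(regexp)  # wrapping group + the rule's own
        Regexp.new("(#{regexp.source})", regexp.options)
      end
      [Regexp.union(wrapped), offsets]
    end

    def apply rules, combined, offsets, word
      md = combined.match(word)
      return word unless md

      # The wrapping group that captured tells us which rule fired.
      i = offsets.index { |offset| md.captures[offset] }
      regexp, replacement = rules[i]
      word.sub(regexp, replacement)
    end

    rules = [[/(ax|test)is$/i, '\1es'], [/s$/i, 's'], [/$/, 's']]
    combined, offsets = build_combined(rules)
    apply(rules, combined, offsets, "axis") # => "axes"

Rule precedence and per-rule options are glossed over here; the point
is only that the offsets let us map a match back to a rule without
named captures.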
> > This is the proof-of-concept: https://gist.github.com/1798985.
> >
> > Seems good, but you don't need a *args to Regexp.union. :-)
>
> Oh, yes :).
>
> > If you can rule out backreferences, it seems possible you could
> > build a state machine that would perform in linear time to the
> > string length.
>
> The situation now is:
>
> * Inflections can have arbitrary regexps (modulo backreferences, if
>   we disallow them). A priori, they can use any regexp feature, and
>   are typically case-insensitive.
>
> * Inflections have to be matched in order.
>
> Given that and the fast regexp machine in 1.9, alternation seems to
> be a better and cost-free approach, because what we are talking about
> here is implemented quickly (cost == development cost).
>
> Wouldn't such a state machine be a rewrite of the very regexp engine?

Ya, but if we're going to put arbitrary restrictions on the type of
matches people can do (i.e. no backreferences), you may as well use an
engine that can execute in O(n) time (n = string length). Otherwise,
what's the point?

> > It seems though, with your algorithm, as the string gets longer,
> > the original implementation is faster:
> >
> > https://gist.github.com/1805359
> >
> > Do we know what kind of data we're dealing with? Somewhere between
> > strings of length 10 and strings of length 100, the original code
> > outperforms this version. If we're only dealing with shorter
> > strings, maybe this newer code would be better.
>
> Yes, I think this approach is worthwhile because in practice I
> believe Rails and Rails applications singularize and pluralize
> English words.
>
> >     name = md.names.detect {|n| md[n]}
> >
> > The length of `names` increases as rules are added to the system.
> > Which means in the case of a match, this algorithm will still be
> > O(n).
>
> Yes, yes, but that O(n) seems unavoidable, and it still outperforms
> the original algorithm with the sizes I've tested (I don't mean it
> does not with larger sets of inflections, only that I measured the
> typical size, which is what most people use).

How did you determine the length of the strings that most people use?
I think seeing this data would be useful in speeding up this code.
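For reference, the named-capture bookkeeping quoted above works roughly
like this. A sketch under assumptions, not the actual code from the
gist: each rule gets wrapped in a named group ("_0", "_1", ...), and
md.names.detect is the linear scan over rule names we're discussing:

    # Wrap rule i in a named group "_i" and union everything into one regexp.
    def build_named rules
      wrapped = rules.each_with_index.map do |(regexp, _), i|
        Regexp.new("(?<_#{i}>#{regexp.source})", regexp.options)
      end
      Regexp.union(wrapped)
    end

    def apply_named rules, combined, word
      md = combined.match(word)
      return word unless md

      # O(number of rules): scan the group names for the one that captured.
      name = md.names.detect { |n| md[n] }
      regexp, replacement = rules[name[1..-1].to_i]
      word.sub(regexp, replacement)
    end

So the scan over names grows with the rule set, but there is only one
call into the regexp engine per word instead of one per rule, which is
the trade-off the benchmarks above are measuring.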
-- 
Aaron Patterson
http://tenderlovemaking.com/
