A little while ago I developed a Raku program to look up file ownership in
a Github CODEOWNERS file.  My first attempt worked by converting the
glob-like CODEOWNERS patterns into regexes, eg:

app/views/**/*.rb --> /^ 'app/views/' [<-[/]>+]* % '/' '/' <-[/]>* '.rb' $/

Surprisingly, this approach was so slow as to be a non-starter, no matter
how I tweaked it.  So, I wrote another implementation that uses a recursive
subroutine to compare path components against the CODEOWNERS patterns.
Originally it was a series of multi subs, but I recast it as a single sub
when I thought that might make it faster:

sub matches(@strs, @patterns) {
    if @patterns ~~ ('',) {
        True;
    } elsif @patterns ~~ ('**', **) {
        ?grep { matches @strs[$_..*], @patterns[1..*] }, @strs.keys;
    } elsif @strs && @patterns {
        my $str = @strs[0];
        my $pattern = @patterns[0];
        do if $pattern.contains('*') {
            my @words = $pattern.split('*');
            $str.starts-with(@words[0]) && $str.ends-with(@words[*-1]) &&
sub has-words($start, [$word, *@rest]) {
                my $n = $str.index($word, $start);
                $n.defined && (!@rest || has-words($n + $word.chars, @rest))
            }(0, @words);
        } else {
            $str eq $pattern;
        } && matches @strs[1..*], @patterns[1..*];
    } else {
        @strs == @patterns == 0;
    }
}

Instead of using a regex to match the asterisk wildcard pattern, I split
the pattern into asterisk-delimited chunks, then check that the test string
starts with the first chunk, that it ends with the last one, and that the
chunks occur in the string in order and without overlapping.  (Real
CODEOWNERS patterns can have a question mark wildcard, but my company
doesn't use any and so I've omitted that case here for simplicity.)

This seemed to work well enough for me.  It could look up a single filename
in our ≈900 line CODEOWNERS file in about a second.  More than that to look
up all of the files in a modest-sized pull request, but still tolerable.
Then I decided I wanted to share my work with the rest of my company.  The
high-level language of choice here is Ruby, so I translated the above code
into it and tried it out.

I expected rough parity, but the Ruby version *killed* the original Raku
version.  Checking the ownership of a hundred files takes the Ruby version
about a quarter of a second; Raku takes *ten times as long.*

I was flabbergasted.  Although it seemed futile, I looked for ways to speed
up the Raku version.  Surprisingly, I found one right away.  I've rarely
employed smart-matching with arrays as I do with @patterns in the first two
branches above, but to make sure the Raku version had exactly the same
logic as in Ruby, I rewrote them:

if @patterns ~~ ('',) --> if @patterns == 1 && @patterns[0] eq ''
if @patterns ~~ ('**', **) --> if @patterns && @patterns[0] eq '**'

This change cuts the execution time in *half*.  Apparently list
smart-matching is *really, really expensive!*

So that was a big win, but Raku still took five times as long as the Ruby
version.  I couldn't see anything else obvious to try in the matches
subroutine, so I looked at the main loop over the CODEOWNERS lines.  Cut
down a bit, it looks like this:

for $repo-path.add('CODEOWNERS').lines.kv -> $n, $line {
    my ($pattern, @owners) = $line.words;
    my @patterns = $pattern.split('/');

    for %paths.kv -> $file, @path {
        if matches(@path, @patterns) {
            # record who owns this file
        }
    }
}

I tried skipping the bulk of the work entirely by changing if
matches(@path, @patterns) to if False.  Of course this shortens the
execution time by a lot...but it still takes twice as long as the Ruby
version that does all of the matching work.

So, yeah.  This was a very disappointing exercise.  I'm posting it here to
share my experience and to vent a bit, but if anyone can see anything I
missed that could significantly pare down the execution time, I'm all ears.

Reply via email to