Re: Is this a regex bug?

Bruce Gray Tue, 20 Jun 2023 13:33:07 -0700

> On Jun 19, 2023, at 18:50, ToddAndMargo via perl6-users 
> <perl6-users@perl.org> wrote:
> 
> Hi All,
> 
> Fedora 37
> RakudoPkgFedora37-2023.05.01.x86_64.rpm
> https://github.com/nxadm/rakudo-pkg/releases
> 
> The `/$0$1 $2/` is not coming out correct.
> Is this a bug or did I do something wrong?

--snip--
# Is this a bug or did I do something wrong?

You did something wrong.

You said "FOO is not coming out correct" without telling us what "correct"
result you were expecting or trying to obtain.
So I am guessing at which aspect of the output displeased or surprised you.

1. If the issue is that $x contains more than just "$0$1 $2",
the reason is that you are missing `.* $` at the end of your regex.

I was initially mystified by the unexpectedly-long result in `$x`;
I detected the source of the problem by changing your substitution
of /$0$1 $2/ to /aaa $0 bbb $1 ccc $2 ddd / for examination.
Just add `.* $` to the end of your pattern, and the s/// will destroy
everything after the match, as it did to the part before the match.
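
To illustrate point 1, here is a minimal sketch of the difference. The sample string and the variable names `$line`, `$short`, and `$full` are made up for this illustration; they are not from your post, and the pattern is simplified from yours.

```raku
# Hypothetical one-line sample, not the original input.
my $line = 'junk <a href="wine-8.6.rpm">wine-8.6.rpm</a> 19-Apr-2023';

# Without a trailing `.* $`, the match stops at '</a>',
# so the text after the match survives the substitution.
my $short = $line;
$short ~~ s/ ^ .*? '">' (<-[<]>+) '</a>' /$0/;

# With `.* $`, the match extends to the end of the string,
# so the substitution replaces everything with the capture.
my $full = $line;
$full ~~ s/ ^ .*? '">' (<-[<]>+) '</a>' .* $ /$0/;

say $short;   # wine-8.6.rpm 19-Apr-2023
say $full;    # wine-8.6.rpm
```

The leading `^ .*?` is what removes the text before the match; the trailing `.* $` does the same job for the text after it.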

2. If the issue is that $1 contains `i686` and $2 contains `x86_64`
(which would be a surprising thing to want, and is better handled outside of a
regex),
then the reason is that your regex specifies `a href=` *second* in each line,
while the actual data has the `a href=` occurring *first*. So, the pattern is
matching across lines.

I (still guessing) have coded a quick parse the way I might have done it, plus
solutions to variants of (2.) above.

BTW:
    * I am unclear whether the destructive effect of `$x ~~ s///` is needed for your intentions.
        If you are just trying to parse, `.match` seems better.
    * You included the input you are trying to match against, which was critical in understanding as much as I did.
        Did you alter the input data, though?
        Specifically, does your actual input lack newlines between the "lines" of data, the way your `$x` lacks them?
    * If your input data is line-oriented, and you can reliably break it up into .lines(), then pre-segmenting it
        and running a regex on each individual line is a better way of keeping the regex speedy
        and not accidentally spanning lines.
    * You have HTML in your input data.
        The reflexive advice every time this comes up is "Do not parse HTML with regex; use an HTML parser".
        (We all parse HTML with regex anyway, on occasion; regexes are just so easy to reach for.)
        To encourage you to consider the "correct" approach, I have appended a solution in just 6 SLOC.
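
A minimal sketch of the line-by-line idea from the list above, assuming the input really does have one entry per line. The two-line sample and the names `$listing` and `@names` are hypothetical, and the pattern is cut down to just the anchor tag.

```raku
# Hypothetical sample input, one entry per line.
my $listing = q:to/END/;
    <a href="wike-2.0.1-1.fc38.noarch.rpm">wike-2.0.1-1.fc38.noarch.rpm</a> 27-Apr-2023 01:53  143K
    <a href="wine-8.6-1.fc38.i686.rpm">wine-8.6-1.fc38.i686.rpm</a> 19-Apr-2023 21:48  11K
    END

my @names;
for $listing.lines -> $line {
    # .match leaves $line untouched and gives back a false value when nothing matches.
    my $m = $line.match(/ '<a href="' (<-["]>+) '">' (<-[<]>+) '</a>' /);
    # $m[0] is the href, $m[1] is the link text.
    @names.push: $m[1].Str if $m;
}
.say for @names;
```

Because each regex only ever sees one line, a pattern cannot start on one entry and finish on the next.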

-- 
Hope this helps,
Bruce Gray (Util of PerlMonks)

### Code:
    # This version of `$x` *does* embed newlines. The regex works either way.
    # See https://docs.raku.org/language/quoting#Heredocs:_:to
    my Str $x = q:to/END_OF_X/;
        <a href="wike-2.0.1-1.fc38.noarch.rpm">wike-2.0.1-1.fc38.noarch.rpm</a> 27-Apr-2023 01:53  143K
        <a href="wine-8.6-1.fc38.i686.rpm">wine-8.6-1.fc38.i686.rpm</a> 19-Apr-2023 21:48  11K
        <a href="wine-8.6-1.fc38.x86_64.rpm">wine-8.6-1.fc38.x86_64.rpm</a>                 19-Apr-2023 21:48     11K
        <a href="wine-alsa-8.6-1.fc38.i686.rpm">wine-alsa-8.6-1.fc38.i686.rpm</a>  19-Apr-2023 21:48  223K

        END_OF_X

    # Parse all the input, not just the `wine` entries.
    my $fedora_updates_re = rx{
        :i
        '<a href="'   (<-["]>+?)   '">'     # <a href="wine-8.6-1.fc38.i686.rpm">
                      (<-[<]>+?)            #          wine-8.6-1.fc38.i686.rpm
        '</a>'                              # </a>
        \s* \d\d?\-\w\w\w\-\d\d\d\d         # 27-Apr-2023
        \s+ \d\d\:\d\d                      # 01:53
        \s+ \d+\w+                          # 143K
        \s*
    };

    my @parsed;
    for $x.match(:g, $fedora_updates_re) {
        push @parsed, %( href => .[0].Str,
                         name => .[1].Str );
    }

    [eq] .<href name> or warn "??? Expected name and href to match: {.raku}" for @parsed;

    say 'Resulting Array of Hashes, but only the `wine` entries:';
    .say for @parsed.grep( *.<href>.starts-with('wine') );

    say 'Recreating cross-line effect of the original regex:';
    for @parsed.rotor(2 => -1) -> ( $this_line, $next_line ) {
        if $this_line<name>.starts-with('wine') {
            say $this_line<name>, "\t", $next_line<href>;
        }
    }

    say 'Original s/// with \$1 captured as part of \$0, and name/href reversed in the regex:';
    # Substitution seems like the wrong tool for this job, but this reduces $x to just the initial capture.
    $x ~~ s:i/
      ^
      .*?
        '<a href="'  (wine <-["]>+)  '">'
                     (wine <-[<]>+)  '</a>'
      .*
      $
    /$0 $1/;
    say $x;

### Output:
    Resulting Array of Hashes, but only the `wine` entries:
    {href => wine-8.6-1.fc38.i686.rpm, name => wine-8.6-1.fc38.i686.rpm}
    {href => wine-8.6-1.fc38.x86_64.rpm, name => wine-8.6-1.fc38.x86_64.rpm}
    {href => wine-alsa-8.6-1.fc38.i686.rpm, name => wine-alsa-8.6-1.fc38.i686.rpm}
    Recreating cross-line effect of the original regex:
    wine-8.6-1.fc38.i686.rpm    wine-8.6-1.fc38.x86_64.rpm
    wine-8.6-1.fc38.x86_64.rpm  wine-alsa-8.6-1.fc38.i686.rpm
    Original s/// with \$1 captured as part of \$0, and name/href reversed in the regex:
    wine-8.6-1.fc38.i686.rpm wine-8.6-1.fc38.i686.rpm


### Code using a HTML parser:
    use DOM::Tiny;
    # After curl -o w.html https://mirrors.aliyun.com/fedora/updates/38/Everything/x86_64/Packages/w/
    my @parsed = DOM::Tiny.parse( 'w.html'.IO.slurp ).find( 'tbody > tr' ).map: -> $tr {
        %(
            href => $tr.at( 'td[class="link"] > a' ).attr('href'),
            name => $tr.at( 'td[class="link"] > a' ).text(:trim),
            size => $tr.at( 'td[class="size"]'     ).text(:trim),
            date => $tr.at( 'td[class="date"]'     ).text(:trim),
        );
    }
    .say for @parsed.grep( *.<name>.starts-with('wine') );

### Output:
    {date => 2023-04-20 05:48, href => wine-8.6-1.fc38.i686.rpm, name => wine-8.6-1.fc38.i686.rpm, size => 10.5 KB}
    {date => 2023-04-20 05:48, href => wine-8.6-1.fc38.x86_64.rpm, name => wine-8.6-1.fc38.x86_64.rpm, size => 10.8 KB}
    {date => 2023-04-20 05:48, href => wine-alsa-8.6-1.fc38.i686.rpm, name => wine-alsa-8.6-1.fc38.i686.rpm, size => 222.7 KB}
    {date => 2023-04-20 05:48, href => wine-alsa-8.6-1.fc38.x86_64.rpm, name => wine-alsa-8.6-1.fc38.x86_64.rpm, size => 226.8 KB}