> On Jun 19, 2023, at 18:50, ToddAndMargo via perl6-users
> <perl6-users@perl.org> wrote:
>
> Hi All,
>
> Fedora 37
> RakudoPkgFedora37-2023.05.01.x86_64.rpm
> https://github.com/nxadm/rakudo-pkg/releases
>
> The `/$0$1 $2/` is not coming out correct.
> Is this a bug or did I do something wrong?
--snip--
# Is this a bug or did I do something wrong?
You did something wrong.
You said "FOO is not coming out correct", without telling us what "correct" you
were expecting, or trying to obtain.
So, I am guessing which aspect of the output you are
displeased or surprised by.
1. If the issue is that $x contains more than just "$0$1 $2",
the reason is that you are missing `.* $` at the end of your regex.
I was initially mystified by the unexpectedly-long result in `$x`;
I detected the source of the problem by changing your substitution
of /$0$1 $2/ to /aaa $0 bbb $1 ccc $2 ddd / for examination.
Just add `.* $` to the end of your pattern, and the s/// will destroy
everything after the match,
as it did to the part before the match.
2. If the issue is that $1 contains `i686` and $2 contains `x86_64`
(which would be a surprising thing to want, and is better handled outside of a
regex),
then the reason is that your regex specifies `a href=` *second* in each line,
while the actual data has the `a href=` occurring *first*. So, the pattern is
matching across lines.
Still guessing, I have coded a quick parse the way I might have done it, plus
solutions to variants of (2.) above.
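
To illustrate point (1.) in isolation, here is a minimal sketch. The one-line
`$sample` and the variable names are hypothetical stand-ins, not your data or
your regex. It shows that s/// replaces only the matched span, so any text
after the match survives unless the pattern ends in `.* $`, and it uses the
same literal-marker trick I described for examining captures.

```raku
# Hypothetical one-line sample; not the original poster's data.
my Str $sample = 'junk <a href="wine-8.6-1.fc38.i686.rpm">wine</a> 19-Apr-2023 21:48 11K';

# Without a trailing `.* $`, s/// replaces only the matched span,
# so the text after the match is left in place.
my $loose = $sample;
$loose ~~ s/ ^ .*? 'href="' (<-["]>+) '"' /$0/;

# With `.* $`, the match runs to the end of the string,
# so the replacement is all that is left.
my $tight = $sample;
$tight ~~ s/ ^ .*? 'href="' (<-["]>+) '"' .* $ /$0/;

# Debugging aid: literal markers show where each capture begins and ends.
my $marked = $sample;
$marked ~~ s/ ^ .*? 'href="' (<-["]>+) '"' .* $ /aaa $0 bbb/;

say $loose;
say $tight;
say $marked;
```

Comparing the first two lines of output shows the leftover tail; the third
shows the capture bracketed by the markers.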
BTW:
* I am unclear if the destructive effect of `$x ~~ s///` is needed for your
intentions.
If you are just trying to parse, `.match` seems better.
* You included the input you are trying to match against, which was critical
in understanding as much as I did.
Did you alter the input data, though?
Specifically, does your actual input lack newlines between the "lines"
of data, the way your `$x` lacks them?
* If your input data is line-oriented, and you can reliably break it up
into .lines(), then pre-segmenting it that way
and running the regex on each individual line keeps the regex fast
and prevents it from accidentally spanning lines.
* You have HTML in your input data.
The reflexive advice every time this comes up is "Do not parse HTML
with regex; use an HTML parser".
(We all parse HTML with regex anyway, on occasion; regex are just so
easy to reach for.)
To encourage you to consider the "correct" approach, I have appended a
solution in just 6 SLOC.
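
The `.match` and `.lines()` points above can be sketched together as follows.
This is only an illustration: the two-line `$listing` is a hypothetical sample
(with each entry on a single line, which may not be how your real input looks),
and `@hrefs` is a name I made up for this example.

```raku
# Hypothetical two-line sample; not the original poster's data.
my Str $listing = q:to/END/;
<a href="wine-8.6-1.fc38.i686.rpm">wine-8.6-1.fc38.i686.rpm</a> 19-Apr-2023 21:48 11K
<a href="wine-8.6-1.fc38.x86_64.rpm">wine-8.6-1.fc38.x86_64.rpm</a> 19-Apr-2023 21:48 11K
END

my @hrefs;
for $listing.lines -> $line {
    # .match only inspects $line; unlike s///, it does not modify anything.
    my $m = $line.match( / '<a href="' (<-["]>+) '">' / );
    # Each match is confined to one line, so captures cannot span lines.
    @hrefs.push( $m[0].Str ) if $m;
}
.say for @hrefs;
```

Because each line is matched on its own, a capture from one entry can never
pick up text from the next entry, and `$listing` is left intact afterwards.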
--
Hope this helps,
Bruce Gray (Util of PerlMonks)
### Code:
# This version of `$x` *does* embed newlines. The regex works either way.
# See https://docs.raku.org/language/quoting#Heredocs:_:to
my Str $x = q:to/END_OF_X/;
<a href="wike-2.0.1-1.fc38.noarch.rpm">wike-2.0.1-1.fc38.noarch.rpm</a>
27-Apr-2023 01:53 143K
<a href="wine-8.6-1.fc38.i686.rpm">wine-8.6-1.fc38.i686.rpm</a>
19-Apr-2023 21:48 11K
<a href="wine-8.6-1.fc38.x86_64.rpm">wine-8.6-1.fc38.x86_64.rpm</a>
19-Apr-2023 21:48 11K
<a href="wine-alsa-8.6-1.fc38.i686.rpm">wine-alsa-8.6-1.fc38.i686.rpm</a>
19-Apr-2023 21:48 223K
END_OF_X
# Parse all the input, not just the `wine` entries.
my $fedora_updates_re = rx{
:i
'<a href="' (<-["]>+?) '">' # <a href="wine-8.6-1.fc38.i686.rpm">
(<-[<]>+?) # wine-8.6-1.fc38.i686.rpm
'</a>' # </a>
\s* \d\d?\-\w\w\w\-\d\d\d\d # 27-Apr-2023
\s+ \d\d\:\d\d # 01:53
\s+ \d+\w+ # 143K
\s*
};
my @parsed;
for $x.match(:g, $fedora_updates_re) {
push @parsed, %( href => .[0].Str,
name => .[1].Str );
}
[eq] .<href name> or warn "??? Expected name and href to match: {.raku}"
for @parsed;
say 'Resulting Array of Hashes, but only the `wine` entries:';
.say for @parsed.grep( *.<href>.starts-with('wine') );
say 'Recreating cross-line effect of the original regex:';
for @parsed.rotor(2 => -1) -> ( $this_line, $next_line ) {
if $this_line<name>.starts-with('wine') {
say $this_line<name>, "\t", $next_line<href>;
}
}
say 'Original s/// with \$1 captured as part of \$0, and name/href reversed in the regex:';
# Substitution seems like the wrong tool for this job, but this reduces $x
# to just the initial capture.
$x ~~ s:i/
^
.*?
'<a href="' (wine <-["]>+) '">'
(wine <-[<]>+) '</a>'
.*
$
/$0 $1/;
say $x;
### Output:
Resulting Array of Hashes, but only the `wine` entries:
{href => wine-8.6-1.fc38.i686.rpm, name => wine-8.6-1.fc38.i686.rpm}
{href => wine-8.6-1.fc38.x86_64.rpm, name => wine-8.6-1.fc38.x86_64.rpm}
{href => wine-alsa-8.6-1.fc38.i686.rpm, name => wine-alsa-8.6-1.fc38.i686.rpm}
Recreating cross-line effect of the original regex:
wine-8.6-1.fc38.i686.rpm wine-8.6-1.fc38.x86_64.rpm
wine-8.6-1.fc38.x86_64.rpm wine-alsa-8.6-1.fc38.i686.rpm
Original s/// with \$1 captured as part of \$0, and name/href reversed in the regex:
wine-8.6-1.fc38.i686.rpm wine-8.6-1.fc38.i686.rpm
### Code using an HTML parser:
use DOM::Tiny;
# After curl -o w.html https://mirrors.aliyun.com/fedora/updates/38/Everything/x86_64/Packages/w/
my @parsed = DOM::Tiny.parse( 'w.html'.IO.slurp ).find( 'tbody > tr' ).map:
-> $tr {
%(
href => $tr.at( 'td[class="link"] > a' ).attr('href'),
name => $tr.at( 'td[class="link"] > a' ).text(:trim),
size => $tr.at( 'td[class="size"]' ).text(:trim),
date => $tr.at( 'td[class="date"]' ).text(:trim),
);
}
.say for @parsed.grep( *.<name>.starts-with('wine') );
### Output:
{date => 2023-04-20 05:48, href => wine-8.6-1.fc38.i686.rpm, name => wine-8.6-1.fc38.i686.rpm, size => 10.5 KB}
{date => 2023-04-20 05:48, href => wine-8.6-1.fc38.x86_64.rpm, name => wine-8.6-1.fc38.x86_64.rpm, size => 10.8 KB}
{date => 2023-04-20 05:48, href => wine-alsa-8.6-1.fc38.i686.rpm, name => wine-alsa-8.6-1.fc38.i686.rpm, size => 222.7 KB}
{date => 2023-04-20 05:48, href => wine-alsa-8.6-1.fc38.x86_64.rpm, name => wine-alsa-8.6-1.fc38.x86_64.rpm, size => 226.8 KB}