Re: Is this a regex bug?

ToddAndMargo via perl6-users Tue, 20 Jun 2023 15:45:17 -0700

On 6/20/23 13:32, Bruce Gray wrote:

On Jun 19, 2023, at 18:50, ToddAndMargo via perl6-users <perl6-users@perl.org> 
wrote:

Hi All,

Fedora 37
RakudoPkgFedora37-2023.05.01.x86_64.rpm
https://github.com/nxadm/rakudo-pkg/releases

The `/$0$1 $2/` is not coming out correct.
Is this a bug or did I do something wrong?


--snip--
# Is this a bug or did I do something wrong?

You did something wrong.

You said "FOO is not coming out correct", without telling us what "correct" you 
were expecting, or trying to obtain.
So, I am guessing what particular aspect of the output you are 
displeased/surprised by.

1. If the issue is that $x contains more than just "$0$1 $2",
the reason is that you are missing `.* $` at the end of your regex.

I was initially mystified by the unexpectedly-long result in `$x`;
I detected the source of the problem by changing your substitution
of /$0$1 $2/ to /aaa $0 bbb $1 ccc $2 ddd / for examination.
Just add `.* $` to the end of your pattern, and the s/// will destroy 
everything after the match,
as it did to the part before the match.
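
A minimal sketch of that change, reusing the sample string and pattern from the original post. It assumes the anchors in the data read `a href` with a space, since the pattern matches that text literally; the only change to the substitution is the appended `.* $`.

```raku
# Sample data from the post ("a href" written with a space, as the pattern requires).
my Str $x = Q[<a href="wike-2.0.1-1.fc38.noarch.rpm">wike-2.0.1-1.fc38.noarch.rpm</a>27-Apr-2023 01:53 143K] ~
            Q[<a href="wine-8.6-1.fc38.i686.rpm">wine-8.6-1.fc38.i686.rpm</a> 19-Apr-2023 21:48 11K] ~
            Q[<a href="wine-8.6-1.fc38.x86_64.rpm">wine-8.6-1.fc38.x86_64.rpm</a>19-Apr-2023 21:48 11K] ~
            Q[<a href="wine-alsa-8.6-1.fc38.i686.rpm">wine-alsa-8.6-1.fc38.i686.rpm</a>19-Apr-2023 21:48 223K] ~
            "\n\n";

# Same pattern as in the post, plus `.* $` so the match also covers
# everything from the last capture to the end of the string.
$x ~~ s:i/ .*? ("wine") (.*?) $(Q[">]) .*? $(Q[a href="]) (.*?) ( $(Q[">]) ) .* $ /$0$1 $2/;

print "$x\n";
# prints: wine-8.6-1.fc38.i686.rpm wine-8.6-1.fc38.x86_64.rpm
```

Because `.` in a Raku regex also matches newlines, `.* $` swallows the trailing "\n\n" as well, so nothing after the replacement text survives.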

2. If the issue is that $1 contains `i686` and $2 contains `x86_64`
(which would be a surprising thing to want, and is better handled outside of a 
regex),
then the reason is that your regex specifies `a href=` *second* in each line,
while the actual data has the `a href=` occurring *first*. So, the pattern is 
matching across lines.

I (still guessing) have coded a quick parsing the way I might have done, and 
solutions to variants of (2.) above.

BTW:
     * I am unclear if the destructive effect of `$x ~~ s///` is needed for 
your intentions.
         If you are just trying to parse, `.match` seems better.
     * You included the input you are trying to match against, which was 
critical in understanding as much as I did.
         Did you alter the input data, though?
         Specifically, does your actual input lack newlines between the "lines" 
of data, the way your `$x` lacks them?
     * If your input data is line-oriented, and you can reliably break it up 
into .lines(), then doing that pre-segmenting
          and running a regex on each individual line is a better way of 
efficiently keeping the regex speedy
         and not accidentally spanning lines.
     * You have HTML in your input data.
         The reflexive advice every time this comes up is "Do not parse HTML with 
regex; use a HTML parser".
         (We all parse HTML with regex anyway, on occasion; regex are just so 
easy to reach for.)
         To encourage you to consider the "correct" approach, I have appended a 
solution in just 6 SLOC.
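
A sketch of the line-by-line, non-destructive approach from the notes above. It is not the snipped solution. It assumes the real input has one directory entry per line (the posted `$x` has no newlines between entries) and that only file names starting with `wine-` are of interest; `$listing` and `@found` are hypothetical names.

```raku
# Hypothetical input: the sample entries from the post, one per line.
my $listing = (
    Q[<a href="wike-2.0.1-1.fc38.noarch.rpm">wike-2.0.1-1.fc38.noarch.rpm</a> 27-Apr-2023 01:53 143K],
    Q[<a href="wine-8.6-1.fc38.i686.rpm">wine-8.6-1.fc38.i686.rpm</a> 19-Apr-2023 21:48 11K],
    Q[<a href="wine-8.6-1.fc38.x86_64.rpm">wine-8.6-1.fc38.x86_64.rpm</a> 19-Apr-2023 21:48 11K],
    Q[<a href="wine-alsa-8.6-1.fc38.i686.rpm">wine-alsa-8.6-1.fc38.i686.rpm</a> 19-Apr-2023 21:48 223K],
).join("\n");

my @found;
for $listing.lines -> $line {
    # .match does not modify $line; the capture is the href value.
    my $m = $line.match(/ 'a href="' ( <-["]>+ ) '">' /) or next;
    my $file = ~$m[0];
    @found.push($file) if $file.starts-with('wine-');
}

.say for @found;
```

Because each regex sees only one line, a pattern cannot accidentally span two entries, and the original text stays intact for any later processing.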



Hi Bruce,

Thank you for the technical writing feedback and
the Raku tips!

I used to use "for $WebPage.lines -> $Line {...}",
but it is a lot easier to use regex.   (I do ADORE
the .lines  method though.)


Let me try writing the question a little bit better.

The code in question:

    $ curl -L http://vpaste.net/pxRm6 -o -
    <RegexTest.pl6>
    #!/bin/raku

    print "\n";

    my Str $x = Q[<a href="wike-2.0.1-1.fc38.noarch.rpm">wike-2.0.1-1.fc38.noarch.rpm</a>27-Apr-2023 01:53 143K] ~
                Q[<a href="wine-8.6-1.fc38.i686.rpm">wine-8.6-1.fc38.i686.rpm</a> 19-Apr-2023 21:48 11K] ~
                Q[<a href="wine-8.6-1.fc38.x86_64.rpm">wine-8.6-1.fc38.x86_64.rpm</a>19-Apr-2023 21:48 11K] ~
                Q[<a href="wine-alsa-8.6-1.fc38.i686.rpm">wine-alsa-8.6-1.fc38.i686.rpm</a>19-Apr-2023 21:48 223K] ~
                "\n\n";

    $x~~s:i/ .*? ("wine") (.*?) $(Q[">] ) .*? $( Q[a href="] )(.*?) ( $(Q[">] ) ) /$0$1 $2/;

    print "0 = <$0>\n1 = <$1>\n2 = <$2>\n\n";
    print "$x\n\n";
</RegexTest.pl6>


Correct results of:
    print "0 = <$0>\n1 = <$1>\n2 = <$2>\n\n";
    0 = <wine>
    1 = <-8.6-1.fc38.i686.rpm>
    2 = <wine-8.6-1.fc38.x86_64.rpm>

The above is correct and what I wanted.  And shows that
$0, $1, and $2 contain the correct values I wanted.


Incorrect result of:

$x~~s:i/ .*? ("wine") (.*?) $(Q[">] ) .*? $( Q[a href="] )(.*?) ( $(Q[">] ) ) /$0$1 $2/;

wine-8.6-1.fc38.i686.rpm wine-8.6-1.fc38.x86_64.rpmwine-8.6-1.fc38.x86_64.rpm</a>19-Apr-2023 21:48 11K<a href="wine-alsa-8.6-1.fc38.i686.rpm">wine-alsa-8.6-1.fc38.i686.rpm</a>19-Apr-2023 21:48 223K



Expected result of the above:
   wine-8.6-1.fc38.i686.rpm wine-8.6-1.fc38.x86_64.rpm


If you are wondering what I am up to with all this,
Wine-staging 7 and 8 under Fedora do not print
from Lotus Approach or Lotus Smart Suite.  The bug
has been reported to and is currently being worked
on by the Wine folks.  Wine has since asked if I
would test Wine-Staging 8.10.  But if I
spin it myself I trash my rpm database.  So I am
waiting on my request to Fedora to add wine-staging
8.10 to the repos.

Now I could check DNF to see if 8.10 is in the
repos, but I have to remove my exclude from dnf.conf
and if I forget to set it back, I risk ruining my
functioning install of wine-staging 6.x.

So I wrote a program to
1) test my installed version of Wine-staging
2) test the current version of Wine-staging in the
   standard fedora repo
3) test the version in the fedora updates repo.

The result looks like this:

https://imgur.com/dointsEl.png

-T


--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Computers are like air conditioners.
They malfunction when you open windows
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
