Re: Regexes from XS

Marvin Humphrey Fri, 19 May 2006 22:28:41 -0700

On May 19, 2006, at 4:53 AM, [EMAIL PROTECTED] wrote:

:It seems the PMOP stores some flags which affect how pregcomp()
:behaves.  In this case, it appears that pregcomp() needs to know that
:UTF-8 is in effect.  Comments elsewhere in tkGlue.c indicate that any
:string coming from Tk will be UTF-8.

Yes, you probably want to avoid that if you don't need utf8.

I assume that this means the *pattern* will be interpreted as if it was in UTF-8, which would most likely happen within the scope of a 'use utf8;' pragma (hence your spelling of 'utf8'). The behavior of the regex against the string to be matched has to be determined too late for pregcomp(), since it is based on the value of each scalar's SvUTF8 flag. IOW, regardless of the conditions under which the REGEXP* struct was prepared, it has to be ready to deal with either UTF-8 scalars or byte-oriented scalars.
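
A small Perl-level illustration of that last point, as a sketch rather than anything authoritative: one compiled qr// object is applied to a byte-oriented scalar and to a copy upgraded with utf8::upgrade(), so the flag on each target is only known at match time. The variable names are made up for the example.

```perl
use strict;
use warnings;

# One compiled pattern, two targets holding the same characters.
my $re = qr/caf\x{e9}/;

my $bytes = "caf\xe9";    # byte-oriented scalar, UTF8 flag off
my $chars = "caf\xe9";
utf8::upgrade($chars);    # same characters, UTF8 flag on

my $bytes_flag = utf8::is_utf8($bytes) ? 1 : 0;
my $chars_flag = utf8::is_utf8($chars) ? 1 : 0;
my $bytes_hit  = ( $bytes =~ $re ) ? 1 : 0;
my $chars_hit  = ( $chars =~ $re ) ? 1 : 0;

print "bytes: flag=$bytes_flag match=$bytes_hit\n";
print "chars: flag=$chars_flag match=$chars_hit\n";
```

Both targets match, even though only one of them carries the UTF8 flag.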

It makes my head hurt to consider what would happen if a UTF8 scalar had to be interpolated into a pattern outside the scope of a utf8 pragma.


    $foo =~ qr/stuff$a_utf8_string/;

I imagine you'd have to perform scalar concatenation, then set the UTF-8 flag on the PMOP based on the value of the UTF8 flag of the concatenated string.

I'm starting to understand why this stuff isn't in the official public API. :)

In my case, what I'll really need to do is retrieve a precompiled regular expression from within a passed-in qr// construct. That's another headache. Maybe it means I don't have to worry about pregcomp() at all, though.

:I'm not
:sure under what circumstances it would be useful to set minend to
:something other than 0, but maybe for the tokenizer it should be 1.

I think this is there to handle the avoidance of infinite zero-length
matches - after a zero-length match at position p, the next match
must end *after* position p. This allows correct behaviour for things
like:
  "abc" =~ /( | . )/xg;
returns: ("", "a", "", "b", "", "c", "").

Yes. Snooping the code of pp_match in pp_hot.c, I see that the variable minmatch starts out as 0, but if global matching is in effect, it can be reset to something else on subsequent loops.

When doing //g matches, supply 0 for the first call; for subsequent calls
supply (I think) $+[0] + (matchlen == 0 ? 1 : 0).
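
The zero-length rule is easy to watch from Perl space. This is only a sketch of the behaviour described above, using the same pattern without the /x whitespace; @tokens and @ends are names invented for the example.

```perl
use strict;
use warnings;

my $text = "abc";
my ( @tokens, @ends );

# The empty alternative is tried first, so every other match is zero-length.
# After a zero-length match at position p, the next match must end after p.
while ( $text =~ /(|.)/g ) {
    push @tokens, $1;
    push @ends,   $+[0];
}

print join( ",", map { "'$_'" } @tokens ), "\n";   # '','a','','b','','c',''
print "@ends\n";                                   # 0 1 1 2 2 3 3
```

Each end offset appears twice: once for the empty match and once for the one-character match that is forced to follow it.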


I've decided that I can simplify my tokenizing algo:

sub tokenize {
    my ( $token_re, $source_text ) = @_;

    # accumulate token start_offsets and end_offsets
    my ( @starts, @ends );
    1 while (
        $source_text =~ m/$token_re/g
        and push @starts, $-[0]
        and push @ends, $+[0]
    );

    # add the new tokens to the batch
    add_many_tokens( $source_text, \@starts, \@ends );
}

Unfortunately, experimenting with this uncovered a bug in my algo. @+, @-, and pos() all give answers in terms of characters if the scalar which matched was marked with SvUTF8. But I always need @starts and @ends measured in *bytes*.
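
For comparison, here is a pure-Perl sketch of one way to get byte offsets without touching the regexp struct: re-encode the prefix up to each character offset and take its length. byte_offsets() is a hypothetical helper, not part of any API, and it assumes the bytes wanted are those of the UTF-8 encoding of the text. It rescans the prefix for every token, so it is quadratic and only meant to show the discrepancy.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: byte-based start and end offsets of each token.
sub byte_offsets {
    my ( $token_re, $text ) = @_;
    my ( @starts, @ends );
    while ( $text =~ /$token_re/g ) {
        # @- and @+ count characters; re-encode the prefix to count bytes.
        push @starts, length( encode_utf8( substr( $text, 0, $-[0] ) ) );
        push @ends,   length( encode_utf8( substr( $text, 0, $+[0] ) ) );
    }
    return ( \@starts, \@ends );
}

# 9 characters, 12 bytes as UTF-8: "caf", e-acute, space, smiley, space, "ok".
my $text = "caf\x{e9} \x{263a} ok";
my ( $starts, $ends ) = byte_offsets( qr/\w+/, $text );

print "starts: @$starts\n";    # starts: 0 10
print "ends:   @$ends\n";      # ends:   5 12
```

In characters the same two tokens sit at 0..4 and 7..9, which is what @- and @+ report directly.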

At the C level, I can get at that information using the startp and endp members of the regexp struct. Unfortunately, that's a deeper violation of the private API than I intended. There are two levels of hackery here: there's naughty, and then there's evil. Using pregcomp() and pregexec() is naughty. Using rx->endp[0] is evil.

Oh well. I'm in too deep to quit now. At least I'm learning a lot. See below for a demo app that manages to successfully match once. I haven't figured out how to turn on global matching though, and I definitely need that for the tokenizer.

:One of these days I'll figure out what a "screaming" SV is, but it's
:clear from the Tk example that it can simply be the SV that to which
:strarg belongs.

This is the Boyer-Moore optimisation. Off the top of my head, it is
applied only when you 'study()' the target string - this upgrades the
string to a different type (SVt_PVBM) which adds a structure giving
a frequency table and linked lists of occurrences of each character in
the target string. This is useful only when you have one string to which
you plan to apply many patterns.

Interesting. I wonder why Tk bothers with it, then, since it looks like the matching from Tk is all one-shot and the SV* gets discarded.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


#!/usr/bin/perl
use strict;
use warnings;
use Inline C => <<'END_C';

SV*
regex_once(SV *regex_ref, SV *text) {
    SV *regex_sv;
    regexp *rx;
    char *stringarg, *stringbeg, *stringend;
    MAGIC *mg = NULL;

    if (!SvROK(regex_ref)) croak("not a ref");
    regex_sv = SvRV(regex_ref);
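    if (!SvMAGICAL(regex_sv)) croak("No magic");
    mg       = mg_find(regex_sv, PERL_MAGIC_qr);
    /* mg_find returns NULL when the SV carries no qr magic */
    if (mg == NULL) croak("not a qr// construct");
    rx       = (REGEXP*)mg->mg_obj;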

    stringbeg = SvPV_nolen(text);
    stringarg = stringbeg;
    stringend = SvEND(text);

    /* endp[0] is only meaningful after a successful match */
    if (!pregexec(rx, stringarg, stringend, stringbeg, 1, text, 1))
        croak("no match");

    return newSVpv(stringbeg, rx->endp[0]);
}

SV*
regex_many(SV *either, SV *text) {
    PMOP *pm;
    SV *regex_sv;
    regexp *rx;
    char *stringarg, *stringbeg, *stringend;
    MAGIC *mg = NULL;
    int safety = 0;

    /* Newz zeroes the struct, so the flags start from a known state */
    Newz(1, pm, 1, PMOP);
    pm->op_pmflags |= PMf_GLOBAL;

    if (SvROK(either)) {
        regex_sv = SvRV(either);
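        if (!SvMAGICAL(regex_sv)) croak("No magic");
        mg       = mg_find(regex_sv, PERL_MAGIC_qr);
        /* mg_find returns NULL when the SV carries no qr magic */
        if (mg == NULL) croak("not a qr// construct");
        rx       = (REGEXP*)mg->mg_obj;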
    }
    else {
        if (!SvPOK(either)) croak("need a pattern");
        rx = pregcomp(SvPVX(either), SvEND(either), pm);
    }

    stringbeg = SvPV_nolen(text);
    stringarg = stringbeg;
    stringend = SvEND(text);

    while (pregexec(rx, stringarg, stringend, stringbeg, 1, text, 1)) {
        stringarg = stringbeg + rx->endp[0];
        fprintf(stderr, "%d\n", safety);
        if (safety++ > 10)
            break;
    }

    return newSVpv(stringbeg, rx->endp[0]);
}

END_C

my $regex = qr/../;
my $string = join '', 'a' .. 'z';
my $matched = regex_once($regex, $string);
print "Matched once: $matched\n";

$matched = regex_many('..', $string);
print "match_many with pattern: $matched\n";

$matched = regex_many($regex, $string);
print "match_many with qr// construct: $matched\n";
