On May 19, 2006, at 4:53 AM, [EMAIL PROTECTED] wrote:

:It seems the PMOP stores some flags which affect how pregcomp()
:behaves.  In this case, it appears that pregcomp() needs to know that
:UTF-8 is in effect.  Comments elsewhere in tkGlue.c indicate that any
:string coming from Tk will be UTF-8.

Yes, you probably want to avoid that if you don't need utf8.

I assume that this means the *pattern* will be interpreted as if it was in UTF-8, which would most likely happen within the scope of a 'use utf8;' pragma (hence your spelling of 'utf8'). The behavior of the regex against the string to be matched has to be determined too late for pregcomp(), since it is based on the value of each scalar's SvUTF8 flag. IOW, regardless of the conditions under which the REGEXP* struct was prepared, it has to be ready to deal with either UTF-8 scalars or byte-oriented scalars.

It makes my head hurt to consider what would happen if a UTF8 scalar had to be interpolated into a pattern outside the scope of a utf8 pragma.

    $foo =~ qr/stuff$a_utf8_string/;

I imagine you'd have to perform scalar concatenation, then set the UTF-8 flag on the PMOP based on the value of the UTF8 flag of the concatenated string.
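
A minimal Perl-level sketch of that case, for reference. It assumes nothing beyond core Perl; the variable names are only illustrative, and it says nothing about what happens to the PMOP flags internally. It only shows that a pattern built by interpolating a UTF8-flagged scalar can be applied to both kinds of target string:

```perl
use strict;
use warnings;

# A scalar holding a wide character; Perl stores it with SvUTF8 on.
my $a_utf8_string = "\x{263A}";

# No 'use utf8;' in scope: the pattern text is plain ASCII plus the
# interpolated scalar.
my $re = qr/stuff$a_utf8_string/;

my $chars = "stuff\x{263A}!";   # UTF8-flagged target
my $bytes = "stuff only";       # byte-oriented target

# The same compiled regex has to cope with both kinds of scalar.
my @results = (
    ( $chars =~ $re ? 1 : 0 ),
    ( $bytes =~ $re ? 1 : 0 ),
);
print "flagged target matched: $results[0]\n";
print "byte target matched:    $results[1]\n";
```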

I'm starting to understand why this stuff isn't in the official public API. :)

In my case, what I'll really need to do is retrieve a precompiled regular expression from within a passed-in qr// construct. That's another headache. I think it may mean I don't have to worry about pregcomp() at all, though.

:I'm not
:sure under what circumstances it would be useful to set minend to
:something other than 0, but maybe for the tokenizer it should be 1.

I think this is there to handle the avoidance of infinite zero-length
matches - after a zero-length match at position p, the next match
must end *after* position p. This allows correct behaviour for things
like:
  "abc" =~ /( | . )/xg;
returns: ("", "a", "", "b", "", "c", "").

Yes. Snooping the code of pp_match in pp_hot.c, I see that the variable minmatch starts out as 0, but if global matching is in effect, it can be reset to something else on subsequent loops.

When doing //g matches, supply 0 for the first call; for subsequent calls
supply (I think) $+[0] + (matchlen == 0 ? 1 : 0).
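
The zero-length rule can be watched from plain Perl, without touching pregexec() at all. This is only a sketch of the observable behaviour, not of the minend bookkeeping inside pp_match; the variable names are mine:

```perl
use strict;
use warnings;

# List context: //g returns every capture, including the empty ones.
my @pieces = "abc" =~ /( | . )/xg;

# Scalar context: record where each match starts and ends. After an
# empty match at position p, the next match must end after p.
my $text = "abc";
my @spans;
while ( $text =~ /( | . )/xg ) {
    push @spans, [ $-[0], $+[0] ];
}

print scalar(@pieces), " pieces\n";
print join( ",", map { "$_->[0]-$_->[1]" } @spans ), "\n";
```

Each position yields an empty match first and then a one-character match that ends further along, which is why the loop terminates.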

I've decided that I can simplify my tokenizing algo:

sub tokenize {
    my ( $token_re, $source_text ) = @_;
    for ($source_text) {    # alias $_ to the source text
        # accumulate token start_offsets and end_offsets
        my ( @starts, @ends );
        1 while (
            m/$token_re/g
            and push @starts, $-[0]
            and push @ends, $+[0]
        );

        # add the new tokens to the batch
        add_many_tokens( $_, \@starts, \@ends );
    }
}

Unfortunately, experimenting with this uncovered a bug in my algo. @+, @-, and pos() all give answers in terms of characters if the scalar which matched was marked with SvUTF8. But I always need @starts and @ends measured in *bytes*.
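
For comparison, here is a pure-Perl sketch of getting byte offsets without reaching into the regexp struct. It is not what I'm doing at the C level: byte_offsets() is a hypothetical helper, and it re-encodes the prefix of the string for every match, so it is far too slow for a real tokenizer. It assumes the byte offsets wanted are offsets into the UTF-8 encoding of the text:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: return start and end offsets of each token,
# measured in bytes of the UTF-8 encoded text.
sub byte_offsets {
    my ( $token_re, $text ) = @_;
    my ( @starts, @ends );
    while ( $text =~ /$token_re/g ) {
        # @- and @+ are in characters for a UTF8-flagged scalar.
        my ( $char_start, $char_end ) = ( $-[0], $+[0] );
        # Encode everything before each offset and count its bytes.
        push @starts, length encode_utf8( substr( $text, 0, $char_start ) );
        push @ends,   length encode_utf8( substr( $text, 0, $char_end ) );
    }
    return ( \@starts, \@ends );
}

# Two Greek letters (two bytes each in UTF-8), a space, then "ab".
my ( $starts, $ends ) = byte_offsets( qr/\w+/, "\x{3b1}\x{3b2} ab" );
print "starts: @$starts\n";
print "ends:   @$ends\n";
```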

At the C level, I can get at that information using the startp and endp members of the regexp struct. Unfortunately, that's a deeper violation of the private API than I intended. There are two levels of hackery here: there's naughty, and then there's evil. Using pregcomp() and pregexec() is naughty. Using rx->endp[0] is evil.

Oh well. I'm in too deep to quit now. At least I'm learning a lot. See below for a demo app that manages to successfully match once. I haven't figured out how to turn on global matching though, and I definitely need that for the tokenizer.

:One of these days I'll figure out what a "screaming" SV is, but it's
:clear from the Tk example that it can simply be the SV to which
:strarg belongs.

This is the Boyer-Moore optimisation. Off the top of my head, it is
applied only when you 'study()' the target string - this upgrades the
string to a different type (SVt_PVBM) which adds a structure giving
a frequency table and linked lists of occurrences of each character in
the target string. This is useful only when you have one string to which
you plan to apply many patterns.

Interesting. I wonder why Tk bothers with it, then, since it looks like the matching from Tk is all one-shot and the SV* gets discarded.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


#!/usr/bin/perl
use strict;
use warnings;
use Inline C => <<'END_C';

SV*
regex_once(SV *regex_ref, SV *text) {
    SV *regex_sv;
    regexp *rx;
    char *stringarg, *stringbeg, *stringend;
    MAGIC *mg = NULL;

    if (!SvROK(regex_ref)) croak("not a ref");
    regex_sv = SvRV(regex_ref);
    if (!SvMAGICAL(regex_sv)) croak("No magic");
    mg       = mg_find(regex_sv, PERL_MAGIC_qr);
    if (mg == NULL) croak("not a qr// object");  /* mg_find found no qr magic */
    rx       = (REGEXP*)mg->mg_obj;

    stringbeg = SvPV_nolen(text);
    stringarg = stringbeg;
    stringend = SvEND(text);

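    /* args: minend = 1, screamer = text, nosave = 1 */
    if (!pregexec(rx, stringarg, stringend, stringbeg, 1, text, 1))
        croak("no match");

    /* endp[0] is the end of the whole match, as an offset from stringbeg */
    return newSVpvn(stringbeg, rx->endp[0]);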
}

SV*
regex_many(SV *either, SV *text) {
    PMOP *pm;
    SV *regex_sv;
    regexp *rx;
    char *stringarg, *stringbeg, *stringend;
    MAGIC *mg = NULL;
    int safety = 0;

    /* Newz zeroes the PMOP, so the |= below starts from known flags */
    Newz(1, pm, 1, PMOP);
    pm->op_pmflags |= PMf_GLOBAL;

    if (SvROK(either)) {
        regex_sv = SvRV(either);
        if (!SvMAGICAL(regex_sv)) croak("No magic");
        mg       = mg_find(regex_sv, PERL_MAGIC_qr);
        if (mg == NULL) croak("not a qr// object");
        rx       = (REGEXP*)mg->mg_obj;
    }
    else {
        if (!SvPOK(either)) croak("need a pattern");
        rx = pregcomp(SvPVX(either), SvEND(either), pm);
    }

    stringbeg = SvPV_nolen(text);
    stringarg = stringbeg;
    stringend = SvEND(text);

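    /* Advance stringarg past each match. It ends up just after the last
     * successful match, so rx->endp[0] is never read after a failed call. */
    while (pregexec(rx, stringarg, stringend, stringbeg, 1, text, 1)) {
        stringarg = stringbeg + rx->endp[0];
        fprintf(stderr, "%d\n", safety);
        if (safety++ > 10)
            break;
    }

    return newSVpvn(stringbeg, stringarg - stringbeg);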
}

END_C

my $regex = qr/../;
my $string = join '', 'a' .. 'z';
my $matched = regex_once($regex, $string);
print "Matched once: $matched\n";

$matched = regex_many('..', $string);
print "regex_many with pattern: $matched\n";

$matched = regex_many($regex, $string);
print "regex_many with qr// construct: $matched\n";
