Regexes from XS

Marvin Humphrey Thu, 18 May 2006 22:01:29 -0700

Greets,

I have a tokenizing algorithm which uses regexes, and it wouldpresumably faster if it were implemented in XS. The algorithmitself, I've appended below, as it's less important than the moregeneral concept of how to get at regexes from XS.

It looks like the relevant functions are pregcomp() and pregexec().There isn't anything about these in perlapi, so accessing them mightbe a little naughty. However, I have found some prior art: Tk usesthem, in the file tkGlue.c.

What follows is what I've been able to deduce so far. If someone canhelp fill in the blanks, I'll be much obliged.


This function header is from regcomp.c:

regexp *
Perl_pregcomp(pTHX_ char *exp, char *xend, PMOP *pm)

I gather that the first two arguments to pregcomp are the start andthe limit (a la SvEND) of the pattern. The returned regexp*, itlooks like I would immediately supply to pregexec(). I'm not toosure how to supply a PMOP*, but I saw in a Nick Ing-Simmons post top5p that you have to "fake an op" in order to make this work. Lookslike that's what this function from Tk does:


/* An "XS" routine to call with G_EVAL set */
static void
do_comp(pTHX_ CV *cv)
{
  dMARK;
  dAX;

struct WrappedRegExp *p = (struct WrappedRegExp *) CvXSUBANY(cv).any_ptr;

  int len = 0;
  char *string = Tcl_GetStringFromObj(p->source,&len);
  p->op.op_pmdynflags |= PMdf_DYN_UTF8;
  p->pat = pregcomp(string,string+len,&p->op);
#if 0
  LangDebug("/%.*s/ => %p\n",len,string,p->pat);
#endif
  XSRETURN(0);
}

It seems the PMOP stores some flags which affect how pregcomp()behaves. In this case, it appears that pregcomp() needs to know thatUTF-8 is in effect. Comments elsewhere in tkGlue.c indicate that anystring coming from Tk will be UTF-8.


This function header is from regexec.c:

/*
- pregexec - match a regexp against a string
*/
I32

Perl_pregexec(pTHX_ register regexp *prog, char *stringarg, registerchar *strend,

     char *strbeg, I32 minend, SV *screamer, U32 nosave)
/* strend: pointer to null at end of string */
/* strbeg: real beginning of string */
/* minend: end of match must be >=minend after stringarg. */
/* nosave: For optimizations. */
{

I think I understand most of that. stringarg may differ from strbegif, for example, we're in the middle of an m//g sequence. I'm notsure under what circumstances it would be useful to set minend tosomething other than 0, but maybe for the tokenizer it should be 1.One of these days I'll figure out what a "screaming" SV is, but it'sclear from the Tk example that it can simply be the SV that to whichstrarg belongs. nosave looks like it affects whether matches will besaved, though I'm not clear whether that means $1 $2 etc, or $` etc,or both.

Most of the code in the Tk function which invokes pregexec() isconcerned with wrapping an SV around a C string. The actual matchingonly takes one line...


int
Tcl_RegExpExec(interp, re, cstring, cstart)
Tcl_Interp *interp;
Tcl_RegExp re;
CONST char *cstring;
CONST char *cstart;
{
  dTHX;
  SV *tmp = sv_newmortal();
  int code;
  sv_upgrade(tmp,SVt_PV);
  SvCUR_set(tmp,strlen(cstring));
  SvPVX(tmp) = (char *) cstring;
  SvLEN(tmp) = 0;
  SvREADONLY_on(tmp);
  SvPOK_on(tmp);
  /* From Tk all strings are UTF-8 */
  SvUTF8_on(tmp);
#ifdef ROPT_MATCH_UTF8
  RX_MATCH_UTF8_on(re->pat);
#else
  /* eeek what do we do now ... */
#endif
  code = pregexec(re->pat,SvPVX(tmp),SvEND(tmp),(char *) cstart,0,
                 tmp,REXEC_COPY_STR);
#if 0
  LangDebug("%d '%.*s'\n",code,SvCUR(tmp),SvPVX(tmp));
  sv_dump(tmp);
  regdump(re->pat);
#endif
  return code;
}

If I were to rewrite the tokenizer algo below in C, I surmise that Iwould keep calling pregexex(), advancing stringarg each time, untilthe match fails.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

sub tokenize {
    for ( $_[0] ) {
        pos = 0;
        my ( @starts, @ends );
        1 while (
            m/$separator_re/g        # could be qr/\W*/
            and push @starts, pos
            and m/$token_re/g        # could be qr/\w+/
            and push @ends, pos
        );
        add_many_tokens( $_,  [EMAIL PROTECTED], [EMAIL PROTECTED] );
    }
}

Regexes from XS

Reply via email to