Greets,

I have a tokenizing algorithm which uses regexes, and it would presumably faster if it were implemented in XS. The algorithm itself, I've appended below, as it's less important than the more general concept of how to get at regexes from XS.

It looks like the relevant functions are pregcomp() and pregexec(). There isn't anything about these in perlapi, so accessing them might be a little naughty. However, I have found some prior art: Tk uses them, in the file tkGlue.c.

What follows is what I've been able to deduce so far. If someone can help fill in the blanks, I'll be much obliged.

This function header is from regcomp.c:

regexp *
Perl_pregcomp(pTHX_ char *exp, char *xend, PMOP *pm)

I gather that the first two arguments to pregcomp are the start and the limit (a la SvEND) of the pattern. The returned regexp*, it looks like I would immediately supply to pregexec(). I'm not too sure how to supply a PMOP*, but I saw in a Nick Ing-Simmons post to p5p that you have to "fake an op" in order to make this work. Looks like that's what this function from Tk does:

/* An "XS" routine to call with G_EVAL set */
static void
do_comp(pTHX_ CV *cv)
{
  dMARK;
  dAX;
struct WrappedRegExp *p = (struct WrappedRegExp *) CvXSUBANY (cv).any_ptr;
  int len = 0;
  char *string = Tcl_GetStringFromObj(p->source,&len);
  p->op.op_pmdynflags |= PMdf_DYN_UTF8;
  p->pat = pregcomp(string,string+len,&p->op);
#if 0
  LangDebug("/%.*s/ => %p\n",len,string,p->pat);
#endif
  XSRETURN(0);
}


It seems the PMOP stores some flags which affect how pregcomp() behaves. In this case, it appears that pregcomp() needs to know that UTF-8 is in effect. Comments elsewhere in tkGlue.c indicate that any string coming from Tk will be UTF-8.

This function header is from regexec.c:

/*
- pregexec - match a regexp against a string
*/
I32
Perl_pregexec(pTHX_ register regexp *prog, char *stringarg, register char *strend,
     char *strbeg, I32 minend, SV *screamer, U32 nosave)
/* strend: pointer to null at end of string */
/* strbeg: real beginning of string */
/* minend: end of match must be >=minend after stringarg. */
/* nosave: For optimizations. */
{

I think I understand most of that. stringarg may differ from strbeg if, for example, we're in the middle of an m//g sequence. I'm not sure under what circumstances it would be useful to set minend to something other than 0, but maybe for the tokenizer it should be 1. One of these days I'll figure out what a "screaming" SV is, but it's clear from the Tk example that it can simply be the SV that to which strarg belongs. nosave looks like it affects whether matches will be saved, though I'm not clear whether that means $1 $2 etc, or $` etc, or both.

Most of the code in the Tk function which invokes pregexec() is concerned with wrapping an SV around a C string. The actual matching only takes one line...

int
Tcl_RegExpExec(interp, re, cstring, cstart)
Tcl_Interp *interp;
Tcl_RegExp re;
CONST char *cstring;
CONST char *cstart;
{
  dTHX;
  SV *tmp = sv_newmortal();
  int code;
  sv_upgrade(tmp,SVt_PV);
  SvCUR_set(tmp,strlen(cstring));
  SvPVX(tmp) = (char *) cstring;
  SvLEN(tmp) = 0;
  SvREADONLY_on(tmp);
  SvPOK_on(tmp);
  /* From Tk all strings are UTF-8 */
  SvUTF8_on(tmp);
#ifdef ROPT_MATCH_UTF8
  RX_MATCH_UTF8_on(re->pat);
#else
  /* eeek what do we do now ... */
#endif
  code = pregexec(re->pat,SvPVX(tmp),SvEND(tmp),(char *) cstart,0,
                 tmp,REXEC_COPY_STR);
#if 0
  LangDebug("%d '%.*s'\n",code,SvCUR(tmp),SvPVX(tmp));
  sv_dump(tmp);
  regdump(re->pat);
#endif
  return code;
}

If I were to rewrite the tokenizer algo below in C, I surmise that I would keep calling pregexex(), advancing stringarg each time, until the match fails.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

sub tokenize {
    for ( $_[0] ) {
        pos = 0;
        my ( @starts, @ends );
        1 while (
            m/$separator_re/g        # could be qr/\W*/
            and push @starts, pos
            and m/$token_re/g        # could be qr/\w+/
            and push @ends, pos
        );
        add_many_tokens( $_,  [EMAIL PROTECTED], [EMAIL PROTECTED] );
    }
}



Reply via email to