Marvin Humphrey <[EMAIL PROTECTED]> wrote:
:You may or may not recall that I brought up the subject of regexes
:from on Perlmonks. I'm back on the case. This message was just sent
:to the Perl XS list. Perhaps it will hold some interest for you.

I'm not subscribed to that list, but I'm interested in the discussion.
I'll cc this to the list too.

:This function header is from regcomp.c:
:
:regexp *
:Perl_pregcomp(pTHX_ char *exp, char *xend, PMOP *pm)
:
:I gather that the first two arguments to pregcomp are the start and
:the limit (a la SvEND) of the pattern.

Yes, xend points just after the last character of the pattern to be
compiled.

:The returned regexp*, it
:looks like I would immediately supply to pregexec(). I'm not too
:sure how to supply a PMOP*, but I saw in a Nick Ing-Simmons post to
:p5p that you have to "fake an op" in order to make this work. Looks
:like that's what this function from Tk does:
:
:/* An "XS" routine to call with G_EVAL set */
:static void
:do_comp(pTHX_ CV *cv)
:{
:    dMARK;
:    dAX;
:    struct WrappedRegExp *p = (struct WrappedRegExp *) CvXSUBANY
:(cv).any_ptr;
:    int len = 0;
:    char *string = Tcl_GetStringFromObj(p->source,&len);
:    p->op.op_pmdynflags |= PMdf_DYN_UTF8;
:    p->pat = pregcomp(string,string+len,&p->op);
:#if 0
:    LangDebug("/%.*s/ => %p\n",len,string,p->pat);
:#endif
:    XSRETURN(0);
:}
:
:
:It seems the PMOP stores some flags which affect how pregcomp()
:behaves. In this case, it appears that pregcomp() needs to know that
:UTF-8 is in effect. Comments elsewhere in tkGlue.c indicate that any
:string coming from Tk will be UTF-8.

Yes, you probably want to avoid that if you don't need utf8. See op.h
for the other PM* flags - note particularly the ones that represent
flags on the pattern, like PMf_GLOBAL, PMf_EXTENDED etc.

:This function header is from regexec.c:
:
:/*
:- pregexec - match a regexp against a string
:*/
:I32
:Perl_pregexec(pTHX_ register regexp *prog, char *stringarg, register
:char *strend,
:    char *strbeg, I32 minend, SV *screamer, U32 nosave)
:/* strend: pointer to null at end of string */
:/* strbeg: real beginning of string */
:/* minend: end of match must be >=minend after stringarg. */
:/* nosave: For optimizations. */
:{
:
:I think I understand most of that. stringarg may differ from strbeg
:if, for example, we're in the middle of an m//g sequence. I'm not
:sure under what circumstances it would be useful to set minend to
:something other than 0, but maybe for the tokenizer it should be 1.

I think this is there to handle the avoidance of infinite zero-length
matches - after a zero-length match at position p, the next match must
end *after* position p. This allows correct behaviour for things like:

    "abc" =~ /( | . )/xg;

returns: ("", "a", "", "b", "", "c", "").

When doing //g matches, supply 0 for the first call; for subsequent
calls supply (I think) $+[0] + (matchlen == 0 ? 1 : 0).

:One of these days I'll figure out what a "screaming" SV is, but it's
:clear from the Tk example that it can simply be the SV that to which
:strarg belongs.

This is the Boyer-Moore optimisation. Off the top of my head, it is
applied only when you 'study()' the target string - this upgrades the
string to a different type (SVt_PVBM) which adds a structure giving a
frequency table and linked lists of occurrences of each character in
the target string. This is useful only when you have one string to
which you plan to apply many patterns.

:nosave looks like it affects whether matches will be
:saved, though I'm not clear whether that means $1 $2 etc, or $` etc,
:or both.

I think this affects whether matches are copied or just the offsets
within the string saved, and that it affects all the match variables.

[...]

Hugo
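
To make the minend point above concrete, here is a rough, self-contained C sketch of a //g-style loop. It does not call pregexec() or any other Perl API: toy_match() and toy_match_all() are hypothetical names invented for illustration, and the toy matcher only mimics the pattern /( | . )/x (empty alternative first, then any one character). It also assumes minend is counted from the start pointer, following the comment in the regexec.c header, which may not be exactly how a real caller should compute it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a compiled /( | . )/x pattern.
 * Returns the length of the match starting at 'start', or -1 if no
 * alternative ends at least 'minend' characters after 'start'. */
static ptrdiff_t
toy_match(const char *start, const char *strend, ptrdiff_t minend)
{
    /* Alternative 1: the empty string, allowed only when minend is 0. */
    if (minend <= 0)
        return 0;
    /* Alternative 2: any single character, if one is left. */
    if (start < strend && minend <= 1)
        return 1;
    return -1;
}

/* Emulate a //g loop over 'str', storing each match length in lens[].
 * Returns the number of matches recorded (at most 'max'). */
static size_t
toy_match_all(const char *str, ptrdiff_t *lens, size_t max)
{
    const char *strend = str + strlen(str);
    const char *pos = str;
    ptrdiff_t minend = 0;       /* first call: no constraint */
    size_t n = 0;

    assert(lens != NULL);
    while (n < max) {
        ptrdiff_t len = toy_match(pos, strend, minend);
        if (len < 0)
            break;              /* no further match possible */
        lens[n++] = len;
        pos += len;
        /* After a zero-length match the next match must end past pos;
         * without this the loop would match "" at pos forever. */
        minend = (len == 0) ? 1 : 0;
    }
    return n;
}
```

Calling toy_match_all("abc", lens, 16) records the lengths 0, 1, 0, 1, 0, 1, 0, which lines up with the ("", "a", "", "b", "", "c", "") list above. Dropping the minend update leaves the loop matching the empty string at position 0 until max is reached.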