Re: [GSOC idea] enhance regular expressions?

Dmitry Olshansky Sun, 10 Apr 2011 13:26:47 -0700

On 02.04.2011 14:58, Peter Chadwick wrote:

Dmitry Olshansky<[email protected]>  writes:

On 31.03.2011 22:16,[email protected]  wrote:

Dmitry Olshansky<[email protected]>   writes:

--- somewhat informal draft ---

Hi Dmitry,

It's good to know that there is interest in improving regular
expressions in the D standard library. I've run into a number of
problems using std.regex myself, so I'm keen so see either fixes for
std.regex or an improved replacement.

Thanks for the interest.
Indeed there are problems, though the engine itself is quite capable
and optimized, I think it could be fixed.

Great.

I've been working on a regular expression library myself based around
Russ Cox article athttp://swtch.com/~rsc/regexp/regexp2.html  and the
std.regex range interface. It includes a backtracking and Thompson
engine. If you are interested it is on github:

https://github.com/PeteChadwick/pcregexeng

The ddoc documentation is here:
http://petechadwick.github.com/pcregexeng/

Wow, I'll definitely look into it.
Overall it seems solid even though incomplete. From glance it seems
that parsing a pattern is somewhat convulted, but I'll defer any
critics or decisions until I looked in closer detail.

I'm open to any criticisms or suggestions you may have.


Ok, I think I got my criticisms sorted out.

Sorry for taking so much time, I've got really involved week, attendingto a couple events and, in fact, helping organize one.

First things first - I like your structured way of expressing regex VMinstructions by introducing distinct types for each, but let's examinewhat good does it get you in the end.

Function printProgram gets us first impressions:
Inst* inst = cast(Inst*)&program[pos];
final switch( *instType )
 {
...
case REInst.IChar:
     InstIChar* inst = cast(InstIChar*)&program[pos]; //disadvantage

 writefln( "IChar %s", inst.c ); //benefit

pos += InstIChar.sizeof;//not used //well to actuallybenefit of introduced types, should it be (*inst).sizeof ? ...


Another use case before jumping to conclusions:
auto instChar = cast(InstChar*)inst;//disadvantage
if ( instChar.c != thisChar )//benefit
    return 0;
state_pc += InstChar.sizeof;//not used

Which means that the whole typing isn't exploited much, but gets in yourway far to often.The idea to avoid all the mess with double casting is to introduce tagunion:

struct IR{
    uint opcode;
    union{
        struct {
            dchar c;
        }
        struct {
            ubyte[8] bitmap;
        }
//etc..
    }
}
Then the double cast is removed:
auto inst = cast(IR*)&program[pos];
switch(inst.InstType){
...
case REInst.IChar:

 writefln( "IChar %s", inst.c ); pos += InstIChar.sizeof;


}

The only problem that remains is that we've lost the actual size ofstruct which varies based on an opcode value, and constructors...

Solving both at once :
struct WrapedIR(REInst InstType,T)
{
    enum length = instLength(InstType);
    @property auto bytes(){// see later
        return (cast(byte*)data)[0..length];
    }
    T data;
   alias data this;
}
where instLength is the unsafe place :
size_t instLength(REInst type){
    switch(type){
    case REInst.Char: case REInst.IChar:
            return sizeof(uint)+sizeof(dchar);
    ...
    }
}
and having helper function (which may use @property) like :
auto makeInst(IRInst inst)() { return WrappedIR!(inst,IR)(); }

will get us pretty much the same thing (or better) everywhere you createinstructions:

auto charInst = makeInst!(REInst.Char)();

Which gets us to how you are creating instructions actually:
  static string MakeREInst( string instType, string instName )
    {
        string result =
            "byte["~instType~".sizeof] "~instName~"Buf;"~

instType~"* "~instName~" =cast("~instType~"*)&"~instName~"Buf[0];"~

            "*"~instName~" = "~instType~"();";

        return result;
    }
which is this after mix-in:
byte[instType.sizeof] instNameBuf;
instType* instName = cast(InstType*)&instNameBuf[0];
*instName = instType();

Even disregarding the suggested approach to IR representation, thebetter alternative I think would be the opposite - stack allocatestruct, and treat it as array of bytes when need be.Isn't this one of prime cases for introducing types in the first place?Also, why mixins?instType instName = instType(); //can't be any worse thenmixin(MakeREInst( "instType", "instName" );

Then for getting bytes of any struct use this one-liner (analogous to'bytes' of the above IR tag union):

ubyte[] bytesOf(T)(T* t){
    return (cast(byte*)t)[0..T.sizeof];
}

Essentially you need to treat instructions as bytes exactly once: whenappending or inserting them into VM program.

program ~=  splitInstBuf  //append magically mixed-in buffer
    ...
becomes
program ~=  bytesOf(&splitInst)
    ...
Doesn't look half bad, and compiler wouldn't allow missing '&' if anything.

Look & feel clearly could be enhanced by changing pointer argument toref argument in bytesOf, but I don't think that current DMD inlinerworks well with ref. (and we _do_ need inliner for such functions)

One more unrelated thing is the way you do the insertion of some IRinstructions (this one from parseRegex, there are more):

program =

program[0..progConcatStart] // gets a slice of program, but since it'sfollowed by other data it has capacity of 0

               ~ splitInstBuf //and so this always allocates new array!

~ program[progConcatStart..$]; //and this might ormight not allocate

So no matter how innocent  it looks, it's  1-2 allocation _every_ time.

To solve this troublesome consequence you need something like this (notactually tested ):

program.length += splitInstBuf.length;
memmove(
&program[progConcatStart+splitInstBuf.length],
&program[progConcatStart],
    program[0].sizeof*(program.length-progConcatStart)
);
program[progConcatStart..progConcatStart+splitBuf.length] = splitBuf[0..$];

...or simply cheat ;) and use std.array.insert:
insert(program,progConcatStart,splitBuf);

To further improve things also check on how to exploit Appender inPhobos, it should be faster for continuous appending.


--
Dmitry Olshansky

Re: [GSOC idea] enhance regular expressions?

Reply via email to