On 02.04.2011 14:58, Peter Chadwick wrote:
Dmitry Olshansky<[email protected]>  writes:

On 31.03.2011 22:16,[email protected]  wrote:
Dmitry Olshansky<[email protected]>   writes:
--- somewhat informal draft ---

Hi Dmitry,

It's good to know that there is interest in improving regular
expressions in the D standard library. I've run into a number of
problems using std.regex myself, so I'm keen so see either fixes for
std.regex or an improved replacement.
Thanks for the interest.
Indeed there are problems, though the engine itself is quite capable
and optimized, I think it could be fixed.
Great.

I've been working on a regular expression library myself based around
Russ Cox article athttp://swtch.com/~rsc/regexp/regexp2.html  and the
std.regex range interface. It includes a backtracking and Thompson
engine. If you are interested it is on github:

https://github.com/PeteChadwick/pcregexeng

The ddoc documentation is here:
http://petechadwick.github.com/pcregexeng/

Wow, I'll definitely look into it.
Overall it seems solid even though incomplete. From glance it seems
that parsing a pattern is somewhat convulted, but I'll defer any
critics or decisions until I looked in closer detail.
I'm open to any criticisms or suggestions you may have.

Ok, I think I got my criticisms sorted out.
Sorry for taking so much time, I've got really involved week, attending to a couple events and, in fact, helping organize one.

First things first - I like your structured way of expressing regex VM instructions by introducing distinct types for each, but let's examine what good does it get you in the end.
Function printProgram gets us first impressions:
Inst* inst = cast(Inst*)&program[pos];
final switch( *instType )
 {
...
case REInst.IChar:
     InstIChar* inst = cast(InstIChar*)&program[pos]; //disadvantage

 writefln( "IChar %s", inst.c ); //benefit
pos += InstIChar.sizeof;//not used //well to actually benefit of introduced types, should it be (*inst).sizeof ? ...

Another use case before jumping to conclusions:
auto instChar = cast(InstChar*)inst;//disadvantage
if ( instChar.c != thisChar )//benefit
    return 0;
state_pc += InstChar.sizeof;//not used

Which means that the whole typing isn't exploited much, but gets in your way far to often. The idea to avoid all the mess with double casting is to introduce tag union:
struct IR{
    uint opcode;
    union{
        struct {
            dchar c;
        }
        struct {
            ubyte[8] bitmap;
        }
//etc..
    }
}
Then the double cast is removed:
auto inst = cast(IR*)&program[pos];
switch(inst.InstType){
...
case REInst.IChar:

 writefln( "IChar %s", inst.c ); pos += InstIChar.sizeof;


}
The only problem that remains is that we've lost the actual size of struct which varies based on an opcode value, and constructors...
Solving both at once :
struct WrapedIR(REInst InstType,T)
{
    enum length = instLength(InstType);
    @property auto bytes(){// see later
        return (cast(byte*)data)[0..length];
    }
    T data;
   alias data this;
}
where instLength is the unsafe place :
size_t instLength(REInst type){
    switch(type){
    case REInst.Char: case REInst.IChar:
            return sizeof(uint)+sizeof(dchar);
    ...
    }
}
and having helper function (which may use @property) like :
auto makeInst(IRInst inst)() { return WrappedIR!(inst,IR)(); }

will get us pretty much the same thing (or better) everywhere you create instructions:
auto charInst = makeInst!(REInst.Char)();

Which gets us to how you are creating instructions actually:
  static string MakeREInst( string instType, string instName )
    {
        string result =
            "byte["~instType~".sizeof] "~instName~"Buf;"~
instType~"* "~instName~" = cast("~instType~"*)&"~instName~"Buf[0];"~
            "*"~instName~" = "~instType~"();";

        return result;
    }
which is this after mix-in:
byte[instType.sizeof] instNameBuf;
instType* instName = cast(InstType*)&instNameBuf[0];
*instName = instType();

Even disregarding the suggested approach to IR representation, the better alternative I think would be the opposite - stack allocate struct, and treat it as array of bytes when need be. Isn't this one of prime cases for introducing types in the first place? Also, why mixins? instType instName = instType(); //can't be any worse then mixin(MakeREInst( "instType", "instName" );

Then for getting bytes of any struct use this one-liner (analogous to 'bytes' of the above IR tag union):
ubyte[] bytesOf(T)(T* t){
    return (cast(byte*)t)[0..T.sizeof];
}
Essentially you need to treat instructions as bytes exactly once: when appending or inserting them into VM program.
program ~=  splitInstBuf  //append magically mixed-in buffer
    ...
becomes
program ~=  bytesOf(&splitInst)
    ...
Doesn't look half bad, and compiler wouldn't allow missing '&' if anything.
Look & feel clearly could be enhanced by changing pointer argument to ref argument in bytesOf, but I don't think that current DMD inliner works well with ref. (and we _do_ need inliner for such functions)

One more unrelated thing is the way you do the insertion of some IR instructions (this one from parseRegex, there are more):
program =
program[0..progConcatStart] // gets a slice of program, but since it's followed by other data it has capacity of 0
               ~ splitInstBuf //and so this always allocates new array!
~ program[progConcatStart..$]; //and this might or might not allocate
So no matter how innocent  it looks, it's  1-2 allocation _every_ time.
To solve this troublesome consequence you need something like this (not actually tested ):
program.length += splitInstBuf.length;
memmove(
&program[progConcatStart+splitInstBuf.length],
&program[progConcatStart],
    program[0].sizeof*(program.length-progConcatStart)
);
program[progConcatStart..progConcatStart+splitBuf.length] = splitBuf[0..$];

...or simply cheat ;) and use std.array.insert:
insert(program,progConcatStart,splitBuf);

To further improve things also check on how to exploit Appender in Phobos, it should be faster for continuous appending.

--
Dmitry Olshansky

Reply via email to