I'm underway on an implementation, so this is just a quick follow-up.
First, thanks to a side comment by Ivan Goddard, I'm doing away with the
notion of token priorities. Token IDs will remain, because we need them for
lexing. I'm moving instead to a two-priority system: an RE that consists
exclusively of single-character matching and concatenation (e.g. "do") is
assumed to match in preference to one that uses any other operations (e.g.
"[_a-zA-Z][_a-zA-Z0-9]*"). At some point I'll probably extend that to
handle case-insensitive keyword matching somehow, but for now I don't need it.
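For concreteness, the two-priority classification can be as crude as scanning
the RE for metacharacters. The helper below is purely hypothetical — a sketch
of the rule as I understand it, not anything from the implementation:

```python
def is_literal_only(pattern):
    """Sketch of the two-priority rule: True when the RE uses only
    single-character matching and concatenation (i.e. it is a plain
    literal string), which gives it the higher of the two priorities.
    Any metacharacter (class, repetition, alternation, escape, ...)
    demotes it to the lower priority."""
    metachars = set(".^$*+?{}[]()|\\")
    return all(ch not in metachars for ch in pattern)

# the keyword "do" outranks the identifier RE
assert is_literal_only("do")
assert not is_literal_only("[_a-zA-Z][_a-zA-Z0-9]*")
```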
I've also zeroed in on a slightly different bytecode from the one previously
described. The differences exist mainly for convenience of representation:
STOP (current thread)
STOPALL (all threads)
ACCEPT *tokenid*
JUMP *targetpc*
FORK *targetpc*
MATCH {   (extended opcode taking one of the following parameter forms)
  BASIC *first* *last*      (code point range within the Basic Multilingual Plane)
  EXTENDED *first* *last*   (any Unicode code point range)
}
The BASIC parameter is purely a representation optimization; anything you
can encode with BASIC can also be encoded with EXTENDED.
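To illustrate what "representation optimization" buys here — and this packing
scheme is entirely my guess, not the actual encoding — a BASIC range can fit
both endpoints in one 32-bit word, since BMP code points are at most 0xFFFF,
while an EXTENDED range needs a full word per endpoint to reach 0x10FFFF:

```python
import struct

def encode_basic(first, last):
    # Hypothetical packing: two 16-bit endpoints in one 32-bit word.
    assert first <= last <= 0xFFFF
    return struct.pack("<HH", first, last)

def encode_extended(first, last):
    # Hypothetical packing: one 32-bit word per endpoint, covering
    # code points up to 0x10FFFF.
    assert first <= last <= 0x10FFFF
    return struct.pack("<II", first, last)

# the same BMP range costs 4 bytes as BASIC but 8 as EXTENDED
assert len(encode_basic(0x41, 0x5A)) == 4
assert len(encode_extended(0x41, 0x5A)) == 8
```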
FORK is just a different encoding of SPLIT, for instruction-space reasons:
it takes a single target and implicitly continues at the next instruction,
where SPLIT would need two target operands.
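To check my own reading of the opcode list, here is a rough Python sketch of
a Pike-style, breadth-first interpreter for this bytecode. Everything beyond
the opcode names is an assumption on my part: the fall-through behavior of
FORK, the "longest match, then program order" tie-break (which is where the
two-priority rule would hook in, by emitting literal-only token programs
first), and the example program encoding "do" versus an identifier:

```python
STOP, STOPALL, ACCEPT, JUMP, FORK, MATCH = range(6)

def add_thread(prog, pc, threads, accepts, pos):
    # Follow non-consuming instructions until the thread parks on a
    # MATCH or dies.  FORK is assumed to fall through to the next
    # instruction and also spawn a thread at *targetpc*.
    op = prog[pc]
    if op[0] == JUMP:
        add_thread(prog, op[1], threads, accepts, pos)
    elif op[0] == FORK:
        add_thread(prog, pc + 1, threads, accepts, pos)
        add_thread(prog, op[1], threads, accepts, pos)
    elif op[0] == ACCEPT:
        accepts.append((pos, op[1]))
    elif op[0] == MATCH:
        if pc not in threads:
            threads.append(pc)
    # STOP kills the thread; STOPALL (kill every thread) elided here.

def run(prog, text):
    # Returns (match_length, tokenid) of the longest match, or None.
    # Ties on length go to the earliest-recorded ACCEPT, i.e. to
    # thread order -- literal-only token programs are forked first.
    accepts, threads = [], []
    add_thread(prog, 0, threads, accepts, 0)
    for i, ch in enumerate(text):
        cp, new_threads = ord(ch), []
        for pc in threads:
            ranges = prog[pc][1]   # (first, last) pairs; BASIC vs
                                   # EXTENDED differ only in encoding
            if any(lo <= cp <= hi for lo, hi in ranges):
                add_thread(prog, pc + 1, new_threads, accepts, i + 1)
        threads = new_threads
        if not threads:
            break
    best = None
    for length, tok in accepts:
        if best is None or length > best[0]:
            best = (length, tok)
    return best

# "do" as keyword token 1, [_a-zA-Z][_a-zA-Z0-9]* as identifier token 2
IDENT1 = [(0x5F, 0x5F), (0x41, 0x5A), (0x61, 0x7A)]
IDENT2 = IDENT1 + [(0x30, 0x39)]
PROG = [
    (FORK, 4),                 # 0: keyword thread first => higher priority
    (MATCH, [(0x64, 0x64)]),   # 1: 'd'
    (MATCH, [(0x6F, 0x6F)]),   # 2: 'o'
    (ACCEPT, 1),               # 3: keyword "do"
    (MATCH, IDENT1),           # 4: [_a-zA-Z]
    (FORK, 8),                 # 5: either extend the identifier...
    (MATCH, IDENT2),           # 6: [_a-zA-Z0-9]
    (JUMP, 5),                 # 7: ...and loop,
    (ACCEPT, 2),               # 8: ...or accept as identifier
]

assert run(PROG, "do") == (2, 1)    # keyword wins the tie at length 2
assert run(PROG, "dogs") == (4, 2)  # longer identifier match wins
assert run(PROG, "99") is None
```

With SPLIT-style two-target operands, instruction 0 would carry both 1 and 4
explicitly; the single-target FORK gets the same effect with one operand.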
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev