I think the problem here is that I referenced stringf higher up on the page than I defined it.
As an aside, I used the 'f' suffix to mark a function which produces a fragment result. In this case, it converts a string to a fragment which would match that string: stringf=: [: > [: seqf&.>/ chf&.> If you search the original text for stringf=: you should find that definition. FYI, -- Raul On Sun, Jan 20, 2013 at 7:45 AM, Linda Alvord <[email protected]> wrote: > To try this we need stringf. > > Linda > > -----Original Message----- > From: [email protected] > [mailto:[email protected]] On Behalf Of Raul Miller > Sent: Saturday, January 19, 2013 3:12 PM > To: Programming forum > Subject: [Jprogramming] implementing regular expressions in J > > I was revisiting regular expressions, and I decided to do an implementation > in J. > > To keep it simple, I left out: > y > 1) irregular expressions (parenthesized references) > 2) parsing > 3) optimizations > > While all of these are possible, they complicate things. > > To represent a regular expression, I decided to use a bit array with three > dimensions, representing three different kinds of sets in a regular > expression recognizer: > > The first dimension represents "current state" > The second dimension represents "ascii character" > The third dimension represents "next state" > > This is not quite complete -- we also need to know which state(s) to start > in, so a regular expression will actually be a boxed pair, with the first > pair being a bit list with 1 for current state, and the second box will > contain the transition array. > > Since a three dimensional bit array is uncomfortable to read, one of the > first things I built was a routine to display this representation in a > readable format. > > disp=: <@(#&a.)"1@(1&|:)`I.@.(1=#@$)L:0 > > And, one of the last things I wrote was a routine to translate a string into > a regular fragment which would match that string, and this looks like this > (note that this needs to be viewed in a fixed width > font): > > disp stringf 'this' > +-+----------+ > |0|++-+-+-+-+| > | |||t| | | || > | |++-+-+-+-+| > | ||| |h| | || > | |++-+-+-+-+| > | ||| | |i| || > | |++-+-+-+-+| > | ||| | | |s|| > | |++-+-+-+-+| > +-+----------+ > > Or, if I express this in the pcre syntax for regular expressions, it would > be simply: 'this' > > A "regular fragment" is just like the "regular expression" data structure I > described, except that the details of the final state are not specified. > > In this display the 0 in the leftmost box means that this fragment has only > state 0 for its initial state. And in the second box we see the state > definitions (from top to bottom) and the states we transition to (from left > to right) and the character which performs that transition (in the inner > boxes). > > So... how do we build this kind of thing? > > First, I needed something that would build a transition for a set of > characters: > > chf=: 1 0; ,:@,.~&0@e.~&a. > > Here's how this works: > > disp chf 'this' > +-+-------+ > |0|++----+| > | |||hist|| > | |++----+| > +-+-------+ > > In other words, any of the characters in the argument to chf will give us a > transition from the first state to the final state of this fragment. In the > pcre syntax for regular expressions, this would be '[this]' > > Next, we need a routine to let us combine fragments sequentially. > This is a bit more complicated: > > checksize=: -: _1 256 0 + (,.1 0 1) +/ .* ] > and=: * > or=: and&.-. > > seqf=:4 :0 > 'IX TX'=. x assert. TX checksize&$ IX > 'IY TY'=. y assert. TY checksize&$ IY > I=. (}:IX),({:IX)and IY > T=. ((}:"1 TX),"1({:"1 TX)and/IY),(-($TY)+0 0,<:{:$TX){. TY > I;T > ) > > Note that you could use boolean definitions for 'and' and 'or' instead of > bayesian (*. and +. instead of * and *&.-.) - they are equivalent for bit > data. > > Note that the assert. statements do nothing if we set a global parameter to > disable them. But I like them there because they catch some simple mistakes > > Anyways, what's happening here is that when I combine my initial states, the > final state of the first fragment becomes any of the initial states in the > second fragment. Also, any transition to the final state of the first > fragment becomes a transition to any of the initial states of the second > fragment. > > Another thing we might want to do involves alternation -- we have two > regular expression fragments and we want to have a regular expression that > matches if either of our alternatives match. That's something like this: > > orf=:4 :0 > 'IX TX'=. x assert. TX checksize&$ IX > 'IY TY'=. y assert. TY checksize&$ IY > I=. (}:IX),(}:IY),IX or&{: IY > T=. TX, (-($TY)+0 0,<:{:$TX){. TY > I;T > ) > > Here, our new start state is almost the start states of our two options > strung together. The only thing we have to do is recognize that the final > state of the combination is present in the start state if it was in either > of our original start states. > > And the states transitions themselves are used "as is" (except that we > string them together. > > So now we can do something like this: > > disp (chf 'b') orf (chf 'c') seqf chf 'd' > +---+--------+ > |0 1|++-+-+-+| > | |||b| | || > | |++-+-+-+| > | ||| |c| || > | |++-+-+-+| > | ||| | |d|| > | |++-+-+-+| > +---+--------+ > > Note that we now start with two states. We can't put our 'b' and 'c' > transitions in the same state because 'd' is valid after 'c' but not after > 'b'. This corresponds to the pcre expression 'b|cd'. > > Now watch what happens if I add an 'a' in front of this last expression > (something like this pcre expression: 'a(b|cd)'): > > disp (chf 'a') seqf (chf 'b') orf (chf 'c') seqf chf 'd' > +-+----------+ > |0|++-+-+-+-+| > | |||a|a| | || > | |++-+-+-+-+| > | ||| |b| | || > | |++-+-+-+-+| > | ||| | |c| || > | |++-+-+-+-+| > | ||| | | |d|| > | |++-+-+-+-+| > +-+----------+ > > Now we only have one starting state. But an 'a' in that starting state > gives us a transition to two different states. > > Another thing we might want to do with a regular expression fragment is to > say that it can be repeated an arbitrary number of times. This is > equivalent to saying that any transition to the end state must also be a > transition to the start states. > > repf=:3 :0 > 'I T'=. y assert. T checksize&$ I > I;T+.({:"1 T) and/ I > ) > > NB. pcre: '(a(b|cd))+' > disp repf (chf 'a') seqf (chf 'b') orf (chf 'c') seqf chf 'd' > +-+-----------+ > |0|+-+-+-+-+-+| > | || |a|a| | || > | |+-+-+-+-+-+| > | || | |b| | || > | |+-+-+-+-+-+| > | || | | |c| || > | |+-+-+-+-+-+| > | ||d| | | |d|| > | |+-+-+-+-+-+| > +-+-----------+ > > Another thing we might want to say is that a regular expression is optional. > This is equivalent to saying that the end state is an initial state: > > optf=:3 :0 > 'I T'=. y assert. T checksize&$ I > (I+.1 {.~-#I);T > ) > > NB. pcre: 'a?' > disp optf chf 'a' > +---+----+ > |0 1|++-+| > | |||a|| > | |++-+| > +---+----+ > > Or, for example > > NB. pcre: '(a|(b|cd?))+' > disp repf (chf 'a') seqf (chf 'b') orf (chf 'c') seqf optf chf 'd' > +-+-----------+ > |0|+-+-+-+-+-+| > | || |a|a| | || > | |+-+-+-+-+-+| > | || | |b| | || > | |+-+-+-+-+-+| > | ||c| | |c|c|| > | |+-+-+-+-+-+| > | ||d| | | |d|| > | |+-+-+-+-+-+| > +-+-----------+ > > Did I leave out anything important? Quite possibly, this is getting long. > > But here are a few other routines, and some tests: > > NB. string to fragment > stringf=: [: > [: seqf&.>/ chf&.> > > NB. turn a regular fragment into a regular expression > complete=: ({.~ (>.|.)@$)L:0 > > match=:4 :0 > 'STATE NEXT'=. complete y > for_CHAR. a.i.x do. > STATE=. STATE or/ .and CHAR {"2 NEXT > end. > {:STATE > ) > > assert (disp chf'a')-:,L:0]0;<,:'';'a' > assert (disp 'a' seqf&chf 'b')-:,L:0]0;<('';'a'),:'';'';'b' > assert (disp 'a' orf&chf 'bc')-:,L:0]0 1;<('';'a'),:'';'';'bc' > assert (disp optf chf 'a')-:,L:0]0 1;<,:'';'a' > assert (disp repf chf 'a')-:,L:0]0;<,:'a';'a' > assert 'abc' match stringf 'abc' > assert 'aaab' match (repf chf 'a') seqf chf 'b' > assert 'aaab' match (repf optf chf 'a') seqf chf 'b' > assert 'b' match (repf optf chf 'a') seqf chf 'b' > assert -.'aaabb' match (repf optf chf 'a') seqf chf 'b' > > I should also note that pcre has an irregular "reference" feature associated > with its parenthesis. It also offers a more complex syntax that provides > groups that cannot be referenced, but I thought it would be simpler to just > ignore the irregularities. > > Also, note that there are a variety of optimizations that could be > implemented, which I did not implement. For example > > NB. pcre 'a|b' > disp 'a' orf&chf 'b' > NB. pcre '[ab]' > disp chf 'ab' > > These two fragments match exactly the same way, but the data structures are > different. We could build an code which translates the longer form to the > shorter form. > > Another kind of optimization involves a scanner that instead of scanning > sequentially scans every nth character to see if the match would be even > possible. This could be difficult for the general case, but for a simple > string match it's very easy to implement. > > It's also perhaps worth implementing reference-able sub-matches. But that > makes the matching algorithm considerably more complicated. The pcre code > handles this by using an algorithm with exponential (slow) performance in > some cases. > > Note that I did not show any example like 'a*' in pcre. Here are two ways > of doing that: > > disp optf repf chf 'a' > disp repf optf chf 'a' > > We could even define: > > kleenestare=: optf@repf > > and this should always produce the same result as repf@optf > > Since this implementation is very simple, it should be not-too-difficult to > implement a parser or to implement a more concise set of operations for > building regular expressions from characters. > But that's enough, for me, for now. > > (And, hopefully I've not implemented any really badly erroneous verbs here. > But when building this code I encountered cases where routines that worked > on simple data structures failed on longer structures. > Usually, simplifying the code fixed these problems.) > > FYI, > > -- > Raul > ---------------------------------------------------------------------- > For information about J forums see http://www.jsoftware.com/forums.htm > > ---------------------------------------------------------------------- > For information about J forums see http://www.jsoftware.com/forums.htm ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm
