Re: [Jprogramming] implementing regular expressions in J

Raul Miller Sun, 20 Jan 2013 04:49:14 -0800

I think the problem here is that I referenced stringf higher up on the
page than I defined it.


As an aside, I used the 'f' suffix to mark a function which produces a
fragment result.  In this case, it converts a string to a fragment
which would match that string:

stringf=: [: > [: seqf&.>/ chf&.>

If you search the original text for stringf=: you should find that definition.

FYI,

-- 
Raul

On Sun, Jan 20, 2013 at 7:45 AM, Linda Alvord <[email protected]> wrote:
> To try this we need stringf.
>
> Linda
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Raul Miller
> Sent: Saturday, January 19, 2013 3:12 PM
> To: Programming forum
> Subject: [Jprogramming] implementing regular expressions in J
>
> I was revisiting regular expressions, and I decided to do an implementation
> in J.
>
> To keep it simple, I left out:
> y
> 1) irregular expressions (parenthesized references)
> 2) parsing
> 3) optimizations
>
> While all of these are possible, they complicate things.
>
> To represent a regular expression, I decided to use a bit array with three
> dimensions, representing three different kinds of sets in a regular
> expression recognizer:
>
> The first dimension represents "current state"
> The second dimension represents "ascii character"
> The third dimension represents "next state"
>
> This is not quite complete -- we also need to know which state(s) to start
> in, so a regular expression will actually be a boxed pair, with the first
> pair being a bit list with 1 for current state, and the second box will
> contain the transition array.
>
> Since a three dimensional bit array is uncomfortable to read, one of the
> first things I built was a routine to display this representation in a
> readable format.
>
> disp=: <@(#&a.)"1@(1&|:)`I.@.(1=#@$)L:0
>
> And, one of the last things I wrote was a routine to translate a string into
> a regular fragment which would match that string, and this looks like this
> (note that this needs to be viewed in a fixed width
> font):
>
>    disp stringf 'this'
> +-+----------+
> |0|++-+-+-+-+|
> | |||t| | | ||
> | |++-+-+-+-+|
> | ||| |h| | ||
> | |++-+-+-+-+|
> | ||| | |i| ||
> | |++-+-+-+-+|
> | ||| | | |s||
> | |++-+-+-+-+|
> +-+----------+
>
> Or, if I express this in the pcre syntax for regular expressions, it would
> be simply: 'this'
>
> A "regular fragment" is just like the "regular expression" data structure I
> described, except that the details of the final state are not specified.
>
> In this display the 0 in the leftmost box means that this fragment has only
> state 0 for its initial state.  And in the second box we see the state
> definitions (from top to bottom) and the states we transition to (from left
> to right) and the character which performs that transition (in the inner
> boxes).
>
> So... how do we build this kind of thing?
>
> First, I needed something that would build a transition for a set of
> characters:
>
> chf=: 1 0; ,:@,.~&0@e.~&a.
>
> Here's how this works:
>
>    disp chf 'this'
> +-+-------+
> |0|++----+|
> | |||hist||
> | |++----+|
> +-+-------+
>
> In other words, any of the characters in the argument to chf will give us a
> transition from the first state to the final state of this fragment.  In the
> pcre syntax for regular expressions, this would be '[this]'
>
> Next, we need a routine to let us combine fragments sequentially.
> This is a bit more complicated:
>
> checksize=: -: _1 256 0 + (,.1 0 1) +/ .* ]
> and=: *
> or=: and&.-.
>
> seqf=:4 :0
>   'IX TX'=. x assert. TX checksize&$ IX
>   'IY TY'=. y assert. TY checksize&$ IY
>   I=. (}:IX),({:IX)and IY
>   T=. ((}:"1 TX),"1({:"1 TX)and/IY),(-($TY)+0 0,<:{:$TX){. TY
>   I;T
> )
>
> Note that you could use boolean definitions for 'and' and 'or' instead of
> bayesian (*. and +. instead of * and *&.-.) - they are equivalent for bit
> data.
>
> Note that the assert. statements do nothing if we set a global parameter to
> disable them.  But I like them there because they catch some simple mistakes
>
> Anyways, what's happening here is that when I combine my initial states, the
> final state of the first fragment becomes any of the initial states in the
> second fragment.  Also, any transition to the final state of the first
> fragment becomes a transition to any of the initial states of the second
> fragment.
>
> Another thing we might want to do involves alternation -- we have two
> regular expression fragments and we want to have a regular expression that
> matches if either of our alternatives match.  That's something like this:
>
> orf=:4 :0
>   'IX TX'=. x assert. TX checksize&$ IX
>   'IY TY'=. y assert. TY checksize&$ IY
>   I=. (}:IX),(}:IY),IX or&{: IY
>   T=. TX, (-($TY)+0 0,<:{:$TX){. TY
>   I;T
> )
>
> Here, our new start state is almost the start states of our two options
> strung together.  The only thing we have to do is recognize that the final
> state of the combination is present in the start state if it was in either
> of our original start states.
>
> And the states transitions themselves are used "as is" (except that we
> string them together.
>
> So now we can do something like this:
>
>    disp (chf 'b') orf (chf 'c') seqf chf 'd'
> +---+--------+
> |0 1|++-+-+-+|
> |   |||b| | ||
> |   |++-+-+-+|
> |   ||| |c| ||
> |   |++-+-+-+|
> |   ||| | |d||
> |   |++-+-+-+|
> +---+--------+
>
> Note that we now start with two states.  We can't put our 'b' and 'c'
> transitions in the same state because 'd' is valid after 'c' but not after
> 'b'.  This corresponds to the pcre expression 'b|cd'.
>
> Now watch what happens if I add an 'a' in front of this last expression
> (something like this pcre expression:  'a(b|cd)'):
>
>    disp (chf 'a') seqf (chf 'b') orf (chf 'c') seqf chf 'd'
> +-+----------+
> |0|++-+-+-+-+|
> | |||a|a| | ||
> | |++-+-+-+-+|
> | ||| |b| | ||
> | |++-+-+-+-+|
> | ||| | |c| ||
> | |++-+-+-+-+|
> | ||| | | |d||
> | |++-+-+-+-+|
> +-+----------+
>
> Now we only have one starting state.  But an 'a' in that starting state
> gives us a transition to two different states.
>
> Another thing we might want to do with a regular expression fragment is to
> say that it can be repeated an arbitrary number of times.  This is
> equivalent to saying that any transition to the end state must also be a
> transition to the start states.
>
> repf=:3 :0
>   'I T'=. y assert. T checksize&$ I
>   I;T+.({:"1 T) and/ I
> )
>
>    NB. pcre:  '(a(b|cd))+'
>    disp repf (chf 'a') seqf (chf 'b') orf (chf 'c') seqf chf 'd'
> +-+-----------+
> |0|+-+-+-+-+-+|
> | || |a|a| | ||
> | |+-+-+-+-+-+|
> | || | |b| | ||
> | |+-+-+-+-+-+|
> | || | | |c| ||
> | |+-+-+-+-+-+|
> | ||d| | | |d||
> | |+-+-+-+-+-+|
> +-+-----------+
>
> Another thing we might want to say is that a regular expression is optional.
> This is equivalent to saying that the end state is an initial state:
>
> optf=:3 :0
>   'I T'=. y assert. T checksize&$ I
>   (I+.1 {.~-#I);T
> )
>
>    NB. pcre:  'a?'
>    disp optf chf 'a'
> +---+----+
> |0 1|++-+|
> |   |||a||
> |   |++-+|
> +---+----+
>
> Or, for example
>
>    NB. pcre: '(a|(b|cd?))+'
>    disp repf (chf 'a') seqf (chf 'b') orf (chf 'c') seqf optf chf 'd'
> +-+-----------+
> |0|+-+-+-+-+-+|
> | || |a|a| | ||
> | |+-+-+-+-+-+|
> | || | |b| | ||
> | |+-+-+-+-+-+|
> | ||c| | |c|c||
> | |+-+-+-+-+-+|
> | ||d| | | |d||
> | |+-+-+-+-+-+|
> +-+-----------+
>
> Did I leave out anything important? Quite possibly, this is getting long.
>
> But here are a few other routines, and some tests:
>
> NB. string to fragment
> stringf=: [: > [: seqf&.>/ chf&.>
>
> NB. turn a regular fragment into a regular expression
> complete=: ({.~ (>.|.)@$)L:0
>
> match=:4 :0
>   'STATE NEXT'=. complete y
>   for_CHAR. a.i.x do.
>       STATE=. STATE or/ .and CHAR {"2 NEXT
>   end.
>   {:STATE
> )
>
> assert (disp chf'a')-:,L:0]0;<,:'';'a'
> assert (disp 'a' seqf&chf 'b')-:,L:0]0;<('';'a'),:'';'';'b'
> assert (disp 'a' orf&chf 'bc')-:,L:0]0 1;<('';'a'),:'';'';'bc'
> assert (disp optf chf 'a')-:,L:0]0 1;<,:'';'a'
> assert (disp repf chf 'a')-:,L:0]0;<,:'a';'a'
> assert 'abc' match stringf 'abc'
> assert 'aaab' match (repf chf 'a') seqf chf 'b'
> assert 'aaab' match (repf optf chf 'a') seqf chf 'b'
> assert 'b' match (repf optf chf 'a') seqf chf 'b'
> assert -.'aaabb' match (repf optf chf 'a') seqf chf 'b'
>
> I should also note that pcre has an irregular "reference" feature associated
> with its parenthesis.  It also offers a more complex syntax that provides
> groups that cannot be referenced, but I thought it would be simpler to just
> ignore the irregularities.
>
> Also, note that there are a variety of optimizations that could be
> implemented, which I did not implement.  For example
>
>    NB. pcre 'a|b'
>    disp 'a' orf&chf 'b'
>    NB. pcre '[ab]'
>    disp chf 'ab'
>
> These two fragments match exactly the same way, but the data structures are
> different.  We could build an code which translates the longer form to the
> shorter form.
>
> Another kind of optimization involves a scanner that instead of scanning
> sequentially scans every nth character to see if the match would be even
> possible.  This could be difficult for the general case, but for a simple
> string match it's very easy to implement.
>
> It's also perhaps worth implementing reference-able sub-matches.  But that
> makes the matching algorithm considerably more complicated.  The pcre code
> handles this by using an algorithm with exponential (slow) performance in
> some cases.
>
> Note that I did not show any example like 'a*' in pcre.  Here are two ways
> of doing that:
>
>    disp optf repf chf 'a'
>    disp repf optf chf 'a'
>
> We could even define:
>
> kleenestare=: optf@repf
>
> and this should always produce the same result as repf@optf
>
> Since this implementation is very simple, it should be not-too-difficult to
> implement a parser or to implement a more concise set of operations for
> building regular expressions from characters.
> But that's enough, for me, for now.
>
> (And, hopefully I've not implemented any really badly erroneous verbs here.
> But when building this code I encountered cases where routines that worked
> on simple data structures failed on longer structures.
> Usually, simplifying the code fixed these problems.)
>
> FYI,
>
> --
> Raul
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jprogramming] implementing regular expressions in J

Reply via email to