[Jprogramming] implementing regular expressions in J

Raul Miller Sat, 19 Jan 2013 12:12:05 -0800

I was revisiting regular expressions, and I decided to do an
implementation in J.


To keep it simple, I left out:

1) irregular expressions (parenthesized references)
2) parsing
3) optimizations

While all of these are possible, they complicate things.

To represent a regular expression, I decided to use a bit array with
three dimensions, representing three different kinds of sets in a
regular expression recognizer:

The first dimension represents "current state"
The second dimension represents "ascii character"
The third dimension represents "next state"

This is not quite complete -- we also need to know which state(s) to
start in, so a regular expression will actually be a boxed pair, with
the first pair being a bit list with 1 for current state, and the
second box will contain the transition array.

Since a three dimensional bit array is uncomfortable to read, one of
the first things I built was a routine to display this representation
in a readable format.

disp=: <@(#&a.)"1@(1&|:)`I.@.(1=#@$)L:0

And, one of the last things I wrote was a routine to translate a
string into a regular fragment which would match that string, and this
looks like this (note that this needs to be viewed in a fixed width
font):

   disp stringf 'this'
+-+----------+
|0|++-+-+-+-+|
| |||t| | | ||
| |++-+-+-+-+|
| ||| |h| | ||
| |++-+-+-+-+|
| ||| | |i| ||
| |++-+-+-+-+|
| ||| | | |s||
| |++-+-+-+-+|
+-+----------+

Or, if I express this in the pcre syntax for regular expressions, it
would be simply: 'this'

A "regular fragment" is just like the "regular expression" data
structure I described, except that the details of the final state are
not specified.

In this display the 0 in the leftmost box means that this fragment has
only state 0 for its initial state.  And in the second box we see the
state definitions (from top to bottom) and the states we transition to
(from left to right) and the character which performs that transition
(in the inner boxes).

So... how do we build this kind of thing?

First, I needed something that would build a transition for a set of characters:

chf=: 1 0; ,:@,.~&0@e.~&a.

Here's how this works:

   disp chf 'this'
+-+-------+
|0|++----+|
| |||hist||
| |++----+|
+-+-------+

In other words, any of the characters in the argument to chf will give
us a transition from the first state to the final state of this
fragment.  In the pcre syntax for regular expressions, this would be
'[this]'

Next, we need a routine to let us combine fragments sequentially.
This is a bit more complicated:

checksize=: -: _1 256 0 + (,.1 0 1) +/ .* ]
and=: *
or=: and&.-.

seqf=:4 :0
  'IX TX'=. x assert. TX checksize&$ IX
  'IY TY'=. y assert. TY checksize&$ IY
  I=. (}:IX),({:IX)and IY
  T=. ((}:"1 TX),"1({:"1 TX)and/IY),(-($TY)+0 0,<:{:$TX){. TY
  I;T
)

Note that you could use boolean definitions for 'and' and 'or' instead
of bayesian (*. and +. instead of * and *&.-.) - they are equivalent
for bit data.

Note that the assert. statements do nothing if we set a global
parameter to disable them.  But I like them there because they catch
some simple mistakes

Anyways, what's happening here is that when I combine my initial
states, the final state of the first fragment becomes any of the
initial states in the second fragment.  Also, any transition to the
final state of the first fragment becomes a transition to any of the
initial states of the second fragment.

Another thing we might want to do involves alternation -- we have two
regular expression fragments and we want to have a regular expression
that matches if either of our alternatives match.  That's something
like this:

orf=:4 :0
  'IX TX'=. x assert. TX checksize&$ IX
  'IY TY'=. y assert. TY checksize&$ IY
  I=. (}:IX),(}:IY),IX or&{: IY
  T=. TX, (-($TY)+0 0,<:{:$TX){. TY
  I;T
)

Here, our new start state is almost the start states of our two
options strung together.  The only thing we have to do is recognize
that the final state of the combination is present in the start state
if it was in either of our original start states.

And the states transitions themselves are used "as is" (except that we
string them together.

So now we can do something like this:

   disp (chf 'b') orf (chf 'c') seqf chf 'd'
+---+--------+
|0 1|++-+-+-+|
|   |||b| | ||
|   |++-+-+-+|
|   ||| |c| ||
|   |++-+-+-+|
|   ||| | |d||
|   |++-+-+-+|
+---+--------+

Note that we now start with two states.  We can't put our 'b' and 'c'
transitions in the same state because 'd' is valid after 'c' but not
after 'b'.  This corresponds to the pcre expression 'b|cd'.

Now watch what happens if I add an 'a' in front of this last
expression (something like this pcre expression:  'a(b|cd)'):

   disp (chf 'a') seqf (chf 'b') orf (chf 'c') seqf chf 'd'
+-+----------+
|0|++-+-+-+-+|
| |||a|a| | ||
| |++-+-+-+-+|
| ||| |b| | ||
| |++-+-+-+-+|
| ||| | |c| ||
| |++-+-+-+-+|
| ||| | | |d||
| |++-+-+-+-+|
+-+----------+

Now we only have one starting state.  But an 'a' in that starting
state gives us a transition to two different states.

Another thing we might want to do with a regular expression fragment
is to say that it can be repeated an arbitrary number of times.  This
is equivalent to saying that any transition to the end state must also
be a transition to the start states.

repf=:3 :0
  'I T'=. y assert. T checksize&$ I
  I;T+.({:"1 T) and/ I
)

   NB. pcre:  '(a(b|cd))+'
   disp repf (chf 'a') seqf (chf 'b') orf (chf 'c') seqf chf 'd'
+-+-----------+
|0|+-+-+-+-+-+|
| || |a|a| | ||
| |+-+-+-+-+-+|
| || | |b| | ||
| |+-+-+-+-+-+|
| || | | |c| ||
| |+-+-+-+-+-+|
| ||d| | | |d||
| |+-+-+-+-+-+|
+-+-----------+

Another thing we might want to say is that a regular expression is
optional.  This is equivalent to saying that the end state is an
initial state:

optf=:3 :0
  'I T'=. y assert. T checksize&$ I
  (I+.1 {.~-#I);T
)

   NB. pcre:  'a?'
   disp optf chf 'a'
+---+----+
|0 1|++-+|
|   |||a||
|   |++-+|
+---+----+

Or, for example

   NB. pcre: '(a|(b|cd?))+'
   disp repf (chf 'a') seqf (chf 'b') orf (chf 'c') seqf optf chf 'd'
+-+-----------+
|0|+-+-+-+-+-+|
| || |a|a| | ||
| |+-+-+-+-+-+|
| || | |b| | ||
| |+-+-+-+-+-+|
| ||c| | |c|c||
| |+-+-+-+-+-+|
| ||d| | | |d||
| |+-+-+-+-+-+|
+-+-----------+

Did I leave out anything important? Quite possibly, this is getting long.

But here are a few other routines, and some tests:

NB. string to fragment
stringf=: [: > [: seqf&.>/ chf&.>

NB. turn a regular fragment into a regular expression
complete=: ({.~ (>.|.)@$)L:0

match=:4 :0
  'STATE NEXT'=. complete y
  for_CHAR. a.i.x do.
      STATE=. STATE or/ .and CHAR {"2 NEXT
  end.
  {:STATE
)

assert (disp chf'a')-:,L:0]0;<,:'';'a'
assert (disp 'a' seqf&chf 'b')-:,L:0]0;<('';'a'),:'';'';'b'
assert (disp 'a' orf&chf 'bc')-:,L:0]0 1;<('';'a'),:'';'';'bc'
assert (disp optf chf 'a')-:,L:0]0 1;<,:'';'a'
assert (disp repf chf 'a')-:,L:0]0;<,:'a';'a'
assert 'abc' match stringf 'abc'
assert 'aaab' match (repf chf 'a') seqf chf 'b'
assert 'aaab' match (repf optf chf 'a') seqf chf 'b'
assert 'b' match (repf optf chf 'a') seqf chf 'b'
assert -.'aaabb' match (repf optf chf 'a') seqf chf 'b'

I should also note that pcre has an irregular "reference" feature
associated with its parenthesis.  It also offers a more complex syntax
that provides groups that cannot be referenced, but I thought it would
be simpler to just ignore the irregularities.

Also, note that there are a variety of optimizations that could be
implemented, which I did not implement.  For example

   NB. pcre 'a|b'
   disp 'a' orf&chf 'b'
   NB. pcre '[ab]'
   disp chf 'ab'

These two fragments match exactly the same way, but the data
structures are different.  We could build an code which translates the
longer form to the shorter form.

Another kind of optimization involves a scanner that instead of
scanning sequentially scans every nth character to see if the match
would be even possible.  This could be difficult for the general case,
but for a simple string match it's very easy to implement.

It's also perhaps worth implementing reference-able sub-matches.  But
that makes the matching algorithm considerably more complicated.  The
pcre code handles this by using an algorithm with exponential (slow)
performance in some cases.

Note that I did not show any example like 'a*' in pcre.  Here are two
ways of doing that:

   disp optf repf chf 'a'
   disp repf optf chf 'a'

We could even define:

kleenestare=: optf@repf

and this should always produce the same result as repf@optf

Since this implementation is very simple, it should be
not-too-difficult to implement a parser or to implement a more concise
set of operations for building regular expressions from characters.
But that's enough, for me, for now.

(And, hopefully I've not implemented any really badly erroneous verbs
here.  But when building this code I encountered cases where routines
that worked on simple data structures failed on longer structures.
Usually, simplifying the code fixed these problems.)

FYI,

-- 
Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

[Jprogramming] implementing regular expressions in J

Reply via email to