Re: [Jprogramming] HTML parsing - how to optimize substring based selection?

Raul Miller Mon, 18 Jun 2007 05:54:58 -0700

On 6/18/07, Yuvaraj Athur Raghuvir <[EMAIL PROTECTED]> wrote:

Q1: Why are there 6 columns when the input options are only 5?


The sixth column was for other characters (for example, '='),
but it looks like it's redundant, since the fifth column
already covers that case.

Q2: How does the states verb work?


Here's the body of that verb
 0 10 #: <. 10 * > -.&a: <@".;._2] 0 :0

From right to left:

  0 :0 reads a multi-line noun (character vector)
  ] breaks up parsing so _2 and 0 are in different words
  <@".;._2 breaks that text into lines, converting each to a box of numbers
  -.&a: removes blank lines (just NB. with no numbers)
  > converts to simple array of numbers
  10 * multiplies by 10 (4.1 becomes 41, 1 becomes 10)
  <. changes type from floating to integer (this is not very important)
  0 10 #: splits each number into two (next state, op code)
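
To make the last three steps concrete, here is a small session on
made-up numbers (an illustration only, not a row of the actual state
table):

```j
   NB. 4.1 means next state 4, op code 1; a bare 1 means next state 1, op code 0
   0 10 #: <. 10 * 4.1 1 3
4 1
1 0
3 0
   NB. applied to a table, the result gains a trailing axis of length 2
   $ 0 10 #: <. 10 * 2 3 $ 4.1 1 3 0 2.1 3
2 3 2
```

The leading 0 in 0 10 #: leaves the high digit unbounded, so a state
number of 10 or more would still come through intact.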

Basically, this allows me to specify my state machine as a two
dimensional array of numbers, where the opcode is the fractional part
and the "next state" is the integer part.  I think I remember
a conversation with Ken Iverson where he suggested this format,
and I suggested that the integer format (what ;: uses) would
save a conversion cost which might matter for something like a
state machine.

Q3: How does the st1 verb work in creating the other rows of the state
machine? The last two lines of st1 verb are quite involved....


Well, basically it's a fixed prefix for the first four states and
then a dynamically built part which matches its argument.

Just to recap those first states:

In any state, getting the '<' character sends you to state 4
(the custom part of the state machine), and begins a new
word.

State 0, 1 and 3 mostly just stay in that state for other
characters, except state 1 sends you on to state 2 when
it gets a '>' character.

Note also that state 0 and state 3 have the same behavior.
You could eliminate state 3 by moving all later states up
one, and replacing transitions to state 3 with transitions to
state 0.  I think this would simplify the generation of the
dynamic part of the machine.

Anyways, state 2 ends the word with the previous characters
(and, as I mentioned above, starts a new word if this next
character is '<').

I think you probably understood all of that.  So, anyways, here's
the line which generates the dynamic part of the array:
  rows=.rows, 4.1,.3,.3,.~(3+(_2,~[:}:[EMAIL PROTECTED]) * (=/~.))y

Here,
  (=/~.)y
generates the basic structure of this dynamic array.  Rows
correspond to states, columns correspond to represented
characters.  Note that I've assumed that '<>' do not appear
in y.  Probably I should have instead used
  (=/ '<>' ~.@, ]) y
and restructured the rest of the line to fit.

Note that I need the ~. for matching names like
'button' which have repeated characters.
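
For example, with 'button' as the argument (an illustrative session;
the last expression is the '<>' variant mentioned above, shown only to
make the two extra columns visible):

```j
   ~. 'button'
buton
   (=/~.) 'button'             NB. one row per character, one column per unique character
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1
   (=/ '<>' ~.@, ]) 'button'   NB. leading all-zero columns for '<' and '>'
0 0 1 0 0 0 0
0 0 0 1 0 0 0
0 0 0 0 1 0 0
0 0 0 0 1 0 0
0 0 0 0 0 1 0
0 0 0 0 0 0 1
```

The two 't' rows share a column, which is why the first table has
five columns rather than six.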

Anyways,
  _2,~[:}:[EMAIL PROTECTED]) y
generates the basic sequence numbers for "next state" which will be
multiplied by the structure array (from =/~.).  Note that the numbers here
are "off by three" so that the "default values" (non-match transitions)
fall into state 3 after I add three to everything.  [If I had just used
state 0 for my "doesn't match" state, this part would have been more
comprehensible.]

For an overview of what I'm doing here, consider
  (([EMAIL PROTECTED]) * (=/~.))'button'
or, more simply
  1 2 3 4 5 6 * (=/~.)'button'
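
The simpler form evaluates as follows; each row of the match table is
scaled by its position in the name:

```j
   1 2 3 4 5 6 * (=/~.) 'button'
1 0 0 0 0
0 2 0 0 0
0 0 3 0 0
0 0 4 0 0
0 0 0 5 0
0 0 0 0 6
```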

Anyways, the point is that for each matching state, the state machine
advances on to the next state, and I've rigged the final 'next matching
state' to send me to state 1 (where we wait for a '>' and then accept
the entire sequence as a word).
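
A sketch of the complete arithmetic for 'button', under the assumption
that the sequence starts from 1 + i. (taken from the 1 2 3 4 5 6 in the
overview; the corresponding part of the original line is elided in this
copy and may differ):

```j
   seq =. _2 ,~ }: 1 + i. # 'button'   NB. assumed offsets, see note above
   seq
1 2 3 4 5 _2
   3 + seq * (=/~.) 'button'
4 3 3 3 3
3 5 3 3 3
3 3 6 3 3
3 3 7 3 3
3 3 3 8 3
3 3 3 3 1
```

Every non-matching entry is 0 before the addition and so becomes 3,
and the _2 in the last row becomes 1.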

I hope this makes sense.  Let me know if I skipped anything important.

I was a bit sloppy in constructing this state machine -- it looks like I
have both an extra row and an extra column.  These do not interfere
with the function of the machine, but do make it bulkier and harder
to understand.

Thanks,

--
Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
