[Jprogramming] HTML parsing - how to optimize substring based selection?

Yuvaraj Athur Raghuvir Fri, 15 Jun 2007 13:40:09 -0700

Hello,

To parse an HTML file and extract specific tags, my strategy has been the following:
1) Use ;: as a parser to generate the array of tags
2) Filter on the array to get tags of interest.


For (1) I have done as follows:
st NB. state machine description
+-+---+-----+
|0|1 1|+-+-+|
| |0 0||<|>||
| |0 0|+-+-+|
| |   |     |
| |1 1|     |
| |2 0|     |
| |1 0|     |
| |   |     |
| |1 1|     |
| |0 3|     |
| |0 3|     |
+-+---+-----+
i =. freads h  NB. sample html file h read into i
j =. st ;: i

For (2) I have created a verb seltag as follows:
seltag =: 4 : 'y{~I.@:(a: &i.) @: (x&(I.@:E.) each)y'

To find all the anchor tags, I do the following:
anc =. '<a'
k =. anc seltag j
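
To make the steps inside seltag easier to follow, here is a small sketch that runs them one at a time. The list t and the names p and m are made up for illustration and stand in for the real parser output j:

```j
seltag =: 4 : 'y{~I.@:(a: &i.) @: (x&(I.@:E.) each)y'

NB. t is a hypothetical stand-in for the boxed tag list j
t =. '<a href="x">' ; '<p>' ; '</a>' ; '<a name="y">'

NB. step 1: for each box, the positions at which '<a' occurs (empty if none)
p =. '<a'&(I.@:E.) each t

NB. step 2: 0 where a box of p is empty (it matches a:), 1 otherwise
m =. a: i. p                 NB. m is 1 0 0 1 for this t

NB. step 3: indices of the non-empty boxes, then select those tags from t
k =. t {~ I. m
```

Here k holds the first and last boxes of t, the same result as '<a' seltag t. Note that step 1 builds a full index list for every box before anything is discarded.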

Now, for the sample file I looked at, the space required to run seltag is
1000 times the size of j! I do not think this is acceptable.
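
A comparison of this kind can be made with the foreigns 7!:5 (space used by a named object), 7!:2 (space needed to execute a sentence) and 6!:2 (time to execute a sentence). The following is only a minimal sketch: t is a hypothetical stand-in for the real tag list j, and the names sj, ss and secs are made up for illustration.

```j
seltag =: 4 : 'y{~I.@:(a: &i.) @: (x&(I.@:E.) each)y'

NB. hypothetical stand-in for j; global names, in case this is run from a script
t   =: 1000 $ '<a href="x">' ; '<p>' ; '</a>'
anc =: '<a'

sj   =: 7!:5 <'t'              NB. bytes used by the boxed tag list
ss   =: 7!:2 'anc seltag t'    NB. bytes needed to run the selection
secs =: 6!:2 'anc seltag t'    NB. seconds needed to run the selection

ss % sj                        NB. ratio of working space to the size of the list
```

With the real data, replace t by j in the three measured lines.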

Any suggestions on how to speed up selection from the array based on a
substring match?

Also, pointers on where I am consuming the extra space would help me learn.

Thanks and Regards,
Yuva
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
