Re: [Jprogramming] HTML parsing - how to optimize substring based selection?

Raul Miller Fri, 15 Jun 2007 15:10:10 -0700

Here's some code which extracts the tags of a named element from
an XML string:


st1=:3 :0
 cols=.<"0'<>',~.y
 n=.1+#cols
 rows=.    ,:n {.4.1 NB. start
 rows=.rows, 4.1 2,}.n#1 NB. will accept
 rows=.rows, 4.2,n#0.3 NB. accept
 rows=.rows, 4.1,n#3 NB. ignore
 rows=.rows, 4.1,.3,.3,.~(3+(_2,~[:}:[EMAIL PROTECTED]) * (=/~.))y
 0;(0 10#:10*rows);<cols
)
selel=: [EMAIL PROTECTED] }:@;: '<',~]

Example use:
  'a ' selel i
where i is some XML or HTML text.

Note that this returns some tags which seltag did not
recognize.  This seems to stem from the way st was
designed, but I have not examined the issue closely.

Note also that I recommend explicitly including a space after
the element name.  Perhaps this should instead be incorporated
into the body of the definition of st1.  Either way, the space
matters: without it, <applet or <abbr would be treated as <a.
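
The effect of the trailing space can be seen with a plain substring
search, using E. rather than the state machine above.  This is only
an illustration: s is a made-up sample string, not data from this
thread.

```j
s =. '<abbr><a href="x">'
I. '<a' E. s     NB. 0 6 : also matches the start of <abbr
I. '<a ' E. s    NB. 6   : matches only the real <a element
```

The same caveat applies in the other direction: an element written
as <a> with no attributes has no space after the name, so the
trailing-space form would not find it.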

Note that this technique only works for element names, and not
for attributes.  However, unless you are looking for the same
attribute on several different elements, you can still speed
things up by first restricting the data you search to the
elements of interest and then looking for the attribute there.
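
A rough sketch of that second step, assuming the selected tags are
available as a list of boxed strings.  Here tags is hypothetical
sample data and hasattr is a name made up for the example; the test
is a plain substring search, not a real attribute parser.

```j
tags =. '<a href="x">';'<a name="y">';'<a href="z">'
hasattr =. +./@('href=' E. ])&> tags   NB. 1 0 1 : one flag per tag
hasattr # tags                          NB. keeps the first and third tags
```

The attribute search then only touches the few tags that were
selected, instead of scanning the whole document.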

FYI,

--
Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm