A practical strategy for parsing HTML has two steps:
- run tidy to convert the HTML to XHTML
  http://tidy.sourceforge.net/
- apply a custom SAX handler from the xml/sax addon
  to process the result or convert it to J structures
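
To illustrate the second step, here is a minimal sketch of a SAX handler that
collects anchor tags. It uses Python's standard xml.sax module as a stand-in,
since the API of the J xml/sax addon is not shown in this message; the names
AnchorCollector and anchor_hrefs are hypothetical, and the input is assumed to
be well-formed XHTML such as tidy would produce.

```python
import xml.sax


class AnchorCollector(xml.sax.ContentHandler):
    """Hypothetical handler: records the href of every <a> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def startElement(self, name, attrs):
        # Called once per opening tag; attrs behaves like a read-only mapping.
        if name == "a" and "href" in attrs:
            self.hrefs.append(attrs["href"])


def anchor_hrefs(xhtml_text):
    """Parse XHTML text (assumed already cleaned by tidy) and return hrefs."""
    handler = AnchorCollector()
    xml.sax.parseString(xhtml_text.encode("utf-8"), handler)
    return handler.hrefs


# Small hand-written XHTML sample standing in for tidy output.
sample = (
    '<html><body><p>'
    '<a href="http://www.jsoftware.com/">J</a> and '
    '<a href="http://tidy.sourceforge.net/">tidy</a>'
    '</p></body></html>'
)
print(anchor_hrefs(sample))
```

Because the handler only sees one element at a time, the whole document never
has to be held as an array of tags, which may help with the space concern
raised in the quoted message.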
--- Yuvaraj Athur Raghuvir <[EMAIL PROTECTED]> wrote:
> Hello,
>
> To parse an HTML file and get specific tags, my strategy has been the following:
> 1) Use ;: as a parser to generate the array of tags
> 2) Filter on the array to get tags of interest.
>
> For (1) I have done as follows:
> st NB. state machine description
> +-+---+-----+
> |0|1 1|+-+-+|
> | |0 0||<|>||
> | |0 0|+-+-+|
> | |   |     |
> | |1 1|     |
> | |2 0|     |
> | |1 0|     |
> | |   |     |
> | |1 1|     |
> | |0 3|     |
> | |0 3|     |
> +-+---+-----+
> i =. freads h NB. sample html file h read into i
> j =. st ;: i
>
> For (2) I have created a verb seltag as follows:
> seltag =: 4 : 'y{~I.@:(a: &i.) @: (x&(I.@:E.) each)y'
>
> To find all the anchor tags, I do the following:
> anc =. '<a'
> k =. anc seltag j
>
> Now, for the sample file I looked into, the space requirement for running
> seltag is 1000 times the size of j! I think this is not ok.
>
> Any suggestions on how to speed up the selection in the array based on
> substring match?
>
> Also, pointers on where I am consuming more space will help me learn.
>
> Thanks and Regards,
> Yuva
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>