Re: [Jprogramming] HTML parsing - how to optimize substring based selection?

Oleg Kobchenko Fri, 15 Jun 2007 16:25:04 -0700

A practical strategy for parsing HTML has two steps:

 - apply tidy to convert to XHTML
   http://tidy.sourceforge.net/


 - apply a custom SAX handler from the xml/sax addon
   to process the result or convert it to J structures
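
The second step can be sketched as follows. This is only an illustration
written with Python's standard xml.sax module, not the API of the J xml/sax
addon. The names AnchorCollector, anchor_hrefs and SAMPLE are made up for
the example, and the input is assumed to be well-formed XHTML of the kind
that tidy -asxhtml produces.

```python
import xml.sax


class AnchorCollector(xml.sax.ContentHandler):
    """Collect the href attribute of every <a> element seen."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def startElement(self, name, attrs):
        # Called once per opening tag; no intermediate list of all tags is built.
        if name == "a":
            href = attrs.get("href")
            if href is not None:
                self.hrefs.append(href)


def anchor_hrefs(xhtml_bytes):
    """Return the href values of all anchors in a well-formed XHTML document."""
    handler = AnchorCollector()
    xml.sax.parseString(xhtml_bytes, handler)
    return handler.hrefs


# Hypothetical sample standing in for tidy's output.
SAMPLE = b"""<html><body>
<p><a href="http://www.jsoftware.com/">J</a> and <a name="top">no href</a></p>
<a href="page2.html">next</a>
</body></html>"""

print(anchor_hrefs(SAMPLE))  # ['http://www.jsoftware.com/', 'page2.html']
```

Because the handler sees one element at a time, the tags of interest are
picked out during the parse and nothing else is kept, so the extra space
is proportional to the matches rather than to the whole document.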


--- Yuvaraj Athur Raghuvir <[EMAIL PROTECTED]> wrote:

> Hello,
> 
> To parse an HTML file and get specific tags, my strategy has been the following:
> 1) Use ;: as a parser to generate the array of tags
> 2) Filter on the array to get tags of interest.
> 
> For (1) I have done as follows:
> st NB. state machine description
> +-+---+-----+
> |0|1 1|+-+-+|
> | |0 0||<|>||
> | |0 0|+-+-+|
> | |   |     |
> | |1 1|     |
> | |2 0|     |
> | |1 0|     |
> | |   |     |
> | |1 1|     |
> | |0 3|     |
> | |0 3|     |
> +-+---+-----+
> i =. freads h  NB. sample html file h read into i
> j =. st ;: i
> 
> For (2) I have created a verb seltag as follows:
> seltag =: 4 : 'y{~I.@:(a: &i.) @: (x&(I.@:E.) each)y'
> 
> To find all the anchor tags, I do the following:
> anc =. '<a'
> k =. anc seltag j
> 
> Now, for the sample file I looked at, running seltag requires 1000 times
> the space occupied by j! That seems excessive.
> 
> Any suggestions on how to speed up the selection in the array based on
> substring match?
> 
> Also, pointers to where I am consuming the extra space would help me learn.
> 
> Thanks and Regards,
> Yuva
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
> 



       