Re: Update on HTMLParser

peter lin Wed, 08 Oct 2003 06:12:43 -0700

 
If you're willing to write some code, I'm definitely willing to benchmark the heck out 
of it and add support for both. I tend to do some fairly heavy load testing, so for me 
every bit of performance is worth it.
 
Originally, I tried using SAX and a custom ContentHandler to parse the desired tags, 
but HTML being dirty, the parser complained at every other tag :)
 
My next thought was to write an HTML compiler/parser that reads the bytes directly and 
uses a stack-based approach. After some thought I decided to look for existing parsers 
and found HtmlParser. I believe (possibly naively) a custom stack-based parser could 
potentially be the fastest way to parse the HTML, but it would take considerable time 
to write by hand. I thought about using JavaCC to generate an HTML compiler. I haven't 
totally ruled out that approach, but I don't think I have the time right now to do it.
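
As a rough illustration of the hand-written route, here is a minimal Java sketch of a single-pass scanner that pulls SRC values out of IMG tags without building a DOM. It is not HtmlParser code and not the stack-based design mentioned above, just a flat scan. The class and method names (ImgSrcScanner, imageSources, attribute) are made up for this example, and it ignores comments, scripts, and '>' characters inside attribute values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class; not part of HtmlParser or JMeter.
class ImgSrcScanner {

    /** Returns the SRC value of every IMG tag in html, in document order. */
    static List<String> imageSources(String html) {
        List<String> urls = new ArrayList<>();
        int pos = 0;
        while ((pos = html.indexOf('<', pos)) >= 0) {
            int end = html.indexOf('>', pos);
            if (end < 0) {
                break; // unterminated tag: stop rather than guess
            }
            String tag = html.substring(pos + 1, end);
            // Accept "img" in any case, followed by whitespace (so it has attributes).
            if (tag.regionMatches(true, 0, "img", 0, 3)
                    && tag.length() > 3
                    && Character.isWhitespace(tag.charAt(3))) {
                String src = attribute(tag, "src");
                if (src != null) {
                    urls.add(src);
                }
            }
            pos = end + 1;
        }
        return urls;
    }

    /** Finds name=value inside one tag; value may be quoted or bare. */
    static String attribute(String tag, String name) {
        for (int i = 1; i + name.length() <= tag.length(); i++) {
            // The attribute name must follow whitespace and match ignoring case.
            if (!Character.isWhitespace(tag.charAt(i - 1))) continue;
            if (!tag.regionMatches(true, i, name, 0, name.length())) continue;
            int p = skipSpace(tag, i + name.length());
            if (p >= tag.length() || tag.charAt(p) != '=') continue;
            p = skipSpace(tag, p + 1);
            if (p >= tag.length()) return null;
            char q = tag.charAt(p);
            if (q == '"' || q == '\'') {
                int close = tag.indexOf(q, p + 1);
                return close < 0 ? null : tag.substring(p + 1, close);
            }
            // Bare value: runs until the next whitespace or the end of the tag.
            int stop = p;
            while (stop < tag.length() && !Character.isWhitespace(tag.charAt(stop))) stop++;
            return tag.substring(p, stop);
        }
        return null;
    }

    private static int skipSpace(String s, int i) {
        while (i < s.length() && Character.isWhitespace(s.charAt(i))) i++;
        return i;
    }

    public static void main(String[] args) {
        System.out.println(imageSources("<p><IMG alt=x SRC = \"a.gif\"><img src=b.png></p>"));
    }
}
```

A scan like this tolerates unquoted and mixed-case attributes, which is where a strict XML parser gives up, but it would still need to be benchmarked against HtmlParser before drawing any conclusion about speed.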
 
A regexp is one type of state machine, and some implementations use a stack approach, so 
the regexp approach could be significantly faster. I just haven't done the performance 
comparison to find out if it really is. I'll post the additional benchmarks I ran 
later today. It looks like the throughput using HtmlParser is 2-3x higher than with 
Tidy.
 
peter



Jordi Salvat i Alabart <[EMAIL PROTECTED]> wrote:
I was not thinking about using regexps instead of a decent HTML parser, 
but if they were really faster, it could well be worth having both 
methods available. It would need to be _really_ faster to be worth the 
hassle, but from experience I know it could well be (although you also 
gave reasons to think it won't be).

You're right that HTML is dirty and the regexps will be difficult, but 
I'm familiar with the issue and already have some that I used previously in 
Perl scripts... for example, get all image URIs by:

(?si)<IMG[^>]*?\sSRC\s*=\s*"([^">]*)"
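
For comparison from the Java side, the image pattern could be applied with java.util.regex (in the JDK since 1.4) roughly as below. This is only a sketch: it assumes the pattern begins with `<IMG[^>`, the class and method names (ImageUriRegex, imageUris) are invented here, and like the Perl version it only sees double-quoted SRC values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; not part of JMeter.
class ImageUriRegex {

    // Assumed form of the image pattern: tag name IMG, double-quoted SRC only.
    // (?s) lets '.' match newlines, (?i) makes the match case-insensitive.
    // Compiled once so repeated calls do not pay the compilation cost again.
    static final Pattern IMG_SRC =
            Pattern.compile("(?si)<IMG[^>]*?\\sSRC\\s*=\\s*\"([^\">]*)\"");

    /** Returns capture group 1 (the URI) of every match, in document order. */
    static List<String> imageUris(CharSequence html) {
        List<String> uris = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            uris.add(m.group(1));
        }
        return uris;
    }

    public static void main(String[] args) {
        String html = "<img alt=\"x\" src=\"a.gif\">\n<IMG\n SRC = \"b.png\"><img src=c.jpg>";
        // The unquoted src=c.jpg is not matched by this pattern.
        System.out.println(imageUris(html));
    }
}
```

Keeping the compiled Pattern in a static field matters for a load-testing tool, since the pattern would otherwise be recompiled for every sampled response.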

Others are more difficult -- for example stylesheets:

m{(?si)<LINK(?:[^>]*?\s(?:HREF\s*=\s*"([^">]*)"|REL\s*=\s*"stylesheet")){2,}}g

I'll give it a shot so that we can compare -- it's important, because 
I've seen that processing responses is one of JMeter's biggest CPU hogs. 
We will probably be able to use the results for extractors, too.

-- 
Salut,

Jordi.

peter lin wrote:
> 
> I'm not convinced a regexp approach would be better
> than HtmlParser for a couple of reasons.
> 
> - HtmlParser already works on the stream directly
> using readers.
> 
> - Java regexp is decent, but not blazing fast like
> Perl regexp.
> 
> - To make it easy to extend, regexp isn't ideal.
> 
> - HTML is dirty, so a developer would need sufficient
> expertise with regexp to get it to work correctly.
> 
> - I'm not a regexp guru, but if someone else is
> willing to try to write a generalized package for
> scanning specific tags that can handle dirty HTML, it
> would be great.
> 
> - I'd rather write an HTML compiler reading the bytes
> directly than use regexp.
> 
> - HtmlParser is sufficiently fast and efficient that I
> think it is a good candidate to replace Tidy. Plus I
> don't like having to build a DOM just to get the images.
> 
> I'm open to ideas. If no one objects, I will continue
> as planned and complete the new sampler using
> HtmlParser.
> 
> peter
> 
> 
> --- Jordi Salvat i Alabart wrote:
> 
>>My experience with -Xincgc is that it never helps:
>>the overhead it adds 
>>is so huge that the shorter GC pauses never
>>compensate for it.
>>
>>Have you thought about a regexp-based
>>implementation? It would be less 
>>correct, but probably good enough, and possibly much
>>faster.
>>
>>-- 
>>Salut,
>>
>>Jordi.
>>
>>peter lin wrote:
>>
>>>I ran some benchmarks today with a new version of
>>>httpsamplerfull using HtmlParser. The results are
>>>interesting. Perhaps the biggest and most
>>>interesting discovery for me is the dramatic
>>>difference in performance with and without
>>>-Xincgc.
>>
>>> 
>>>http://tao.altern8.net:8080/comparison_summary.pdf
>>> 
>>>The results are in PDF format.
>>> 
>>>When I run JMeter with incremental GC, the HtmlParser
>>>version beats Tidy easily, but without incremental
>>>GC, the performance gain is marginal as the number
>>>of threads increases.
>>
>>> 
>>>It would appear incremental GC hinders DOM and
>>>Tidy performance and results in a steady increase in
>>>heap size. Without incremental GC, the response time
>>>with HtmlParser is generally 5-10% faster than with
>>>Tidy. Under which circumstances is using -Xincgc
>>>better for JMeter?
>>
>>> 
>>>The JDK I am using is 1.4.1 on Windows.
>>> 
>>>peter
>>> 
>>> 
> 
> ---------------------------------------------------------------------
> 
>>To unsubscribe, e-mail:
>>[EMAIL PROTECTED]
>>For additional commands, e-mail:
>>[EMAIL PROTECTED]

