Re: Update on HTMLParser

peter lin Wed, 08 Oct 2003 18:01:12 -0700

 
I've updated the PDF with additional results. I ran a benchmark using the default 
Tomcat pages in console mode. Although there isn't a lot of difference in memory and 
CPU usage, both are consistently lower than with Tidy.
 
The big improvement in console mode with 5 clients is that the throughput in JMeter 
goes from 397 to 1075, roughly 2.7x higher. To me that looks pretty impressive 
and should make it easier for others to load test servers with fewer JMeter clients 
running.
 
I should be done duplicating the existing functionality in the next day or two. I 
will then make a list of the features people have requested for parsing HTML and start 
working on the ones that seem to add the most value first.
 
peter



Jordi Salvat i Alabart <[EMAIL PROTECTED]> wrote:
I was not thinking about using regexps instead of a decent HTML parser, 
but if they were really faster, it could well be worth having both 
methods available. It would need to be _really_ faster to be worth the 
hassle, but from experience I know it could well be (although you also 
gave reasons to think it won't be).

You're right that HTML is dirty and the regexps will be difficult, but 
I'm familiar with the issue and already have some that I previously used 
in Perl scripts... for example, get all image URIs by:

(?si)<IMG[^>]*?\sSRC\s*=\s*"([^">]*)"

Others are more difficult -- for example stylesheets:

m{(?si)<LINK(?:[^>]*?\s(?:HREF\s*=\s*"([^">]*)"|REL\s*=\s*"stylesheet")){2,}}g

I'll give it a shot so that we can compare -- it's important, because 
I've seen that processing responses is one of JMeter's biggest CPU hogs. 
We will probably be able to use the results for extractors, too.
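
As a rough illustration of the image-URI idea above, here is a minimal Java sketch using java.util.regex. It assumes the pattern is meant to start at an <img tag (the tag prefix is an assumption, since the archived copy of the expression appears to have lost it), and the class and method names are hypothetical, not part of JMeter. Like the Perl version, it only sees double-quoted SRC values, so single-quoted or unquoted attributes would be missed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an existing JMeter class.
public class ImageUriExtractor {

    // (?si): case-insensitive, and '.' matches line terminators.
    // The leading "<img" is an assumed reconstruction of the pattern's start.
    private static final Pattern IMG_SRC =
        Pattern.compile("(?si)<img[^>]*?\\sSRC\\s*=\\s*\"([^\">]*)\"");

    // Returns the SRC value of every double-quoted <img ... src="..."> tag.
    public static List<String> extract(String html) {
        List<String> uris = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            uris.add(m.group(1)); // group 1 is the text between the quotes
        }
        return uris;
    }

    public static void main(String[] args) {
        String html = "<p><IMG alt=\"a\" src=\"/img/a.png\"><img\n src = \"b.gif\"></p>";
        System.out.println(extract(html)); // prints [/img/a.png, b.gif]
    }
}
```

Compiling the Pattern once and reusing it across responses matters here, since recompiling per response would add to the CPU cost this is meant to reduce.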

-- 
Salut,

Jordi.

peter lin wrote:
> 
> I'm not convinced a regexp approach would be better
> than HtmlParser for a couple of reasons.
> 
> - HtmlParser already works on the stream directly
> using readers.
> 
> - Java regexp is decent, but not blazing fast like
> Perl regexp.
> 
> - to make it easy to extend, regexp isn't ideal.
> 
> - html is dirty, so a developer would need sufficient
> expertise with regexp to get it to work correctly.
> 
> - I'm not a regexp guru, but if someone else is
> willing to try to write a generalized package for
> scanning specific tags that can handle dirty HTML, it
> would be great.
> 
> - I'd rather write an HTML compiler reading the bytes
> directly than use regexp.
> 
> - HtmlParser is sufficiently fast and efficient that I
> think it is a good candidate to replace tidy. Plus I
> don't like having to build DOM just to get the images.
> 
> I'm open to ideas. If no one objects, I will continue
> as planned and complete the new sampler using
> HtmlParser.
> 
> peter
> 
> 
> --- Jordi Salvat i Alabart wrote:
> 
>>My experience with -Xincgc is that it never helps:
>>the overhead it adds 
>>is so huge that the shorter GC pauses never
>>compensate for it.
>>
>>Have you thought about a regexp-based
>>implementation? It would be less 
>>correct, but probably good enough, and possibly much
>>faster.
>>
>>-- 
>>Salut,
>>
>>Jordi.
>>
>>peter lin wrote:
>>
>>>I ran some benchmarks today with a new version of
>>>httpsamplerfull using HtmlParser. the results are
>>>interesting. Perhaps the biggest and most
>>>interesting discovery for me is the dramatic
>>>difference in performance between with and without
>>>-Xincgc.
>>> 
>>>http://tao.altern8.net:8080/comparison_summary.pdf
>>> 
>>>the results are in pdf format.
>>> 
>>>when I run JMeter with incremental GC, the HtmlParser
>>>version beats Tidy easily, but without incremental
>>>GC, the performance gain is marginal as the number
>>>of threads increases.
>>> 
>>>it would appear incremental GC hinders DOM and
>>>Tidy performance and results in a steady increase in
>>>heap size. Without incremental GC, the response time
>>>with HtmlParser is generally faster than with Tidy
>>>by 5-10%. Under which circumstances is using -Xincgc
>>>better for JMeter?
>>> 
>>>the jdk I am using is 1.4.1 on windows.
>>> 
>>>peter


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

