Re: Update on HTMLParser

Jordi Salvat i Alabart Wed, 08 Oct 2003 05:41:53 -0700

I was not thinking about using regexps instead of a decent HTML parser, but if they were really faster, it could well be worth having both methods available. It would need to be _really_ faster to be worth the hassle, but from experience I know it could well be (although you also gave reasons to think it won't be).

You're right that HTML is dirty and the regexps will be difficult, but I'm familiar with the issue and already have some that I used previously in Perl scripts. For example, this gets all image URIs:

(?si)<IMG(?=\s)[^\>]*?\sSRC\s*=\s*"([^">]*)"
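For comparison on the Java side, here is a minimal sketch of how the pattern above could be driven through java.util.regex. The class and method names (ImageUriExtractor, extract) are made up for illustration and are not part of JMeter or HtmlParser; the code also assumes a Java version with generics, which is newer than the JDK 1.4.1 mentioned in this thread. Like the Perl original, it only sees double-quoted SRC values, so single-quoted or unquoted attributes are missed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an existing JMeter class.
class ImageUriExtractor {
    // Same pattern as in the message, with backslashes and quotes
    // escaped for a Java string literal.
    private static final Pattern IMG_SRC = Pattern.compile(
        "(?si)<IMG(?=\\s)[^\\>]*?\\sSRC\\s*=\\s*\"([^\">]*)\"");

    // Returns the SRC value of every <IMG> tag found, in document order.
    static List<String> extract(CharSequence html) {
        List<String> uris = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            uris.add(m.group(1));
        }
        return uris;
    }

    public static void main(String[] args) {
        String html = "<p><img alt=\"logo\" src=\"/images/logo.png\">\n"
                    + "<IMG\n  SRC = \"a.gif\" width=\"1\"></p>";
        System.out.println(extract(html)); // prints [/images/logo.png, a.gif]
    }
}
```

Compiling the pattern once into a static field matters for any timing comparison, since recompiling it per response would add cost that has nothing to do with the matching itself.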

Others are more difficult -- for example stylesheets:

m{(?si)<LINK(?=\s)(?:[^\>]*?\s(?:HREF\s*=\s*"([^">]*)"|REL\s*=\s*"stylesheet")){2,}}g
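A rough Java equivalent of the Perl m{...}g loop above, again only as a sketch: the names StylesheetUriExtractor and extract are hypothetical. The {2,} repetition means the tag must carry at least two of the listed attributes, so a LINK with only an HREF and some other REL value is skipped. Group 1 keeps the HREF from whichever repetition matched it; the null check covers the odd case where no HREF was captured at all (for example REL="stylesheet" written twice).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an existing JMeter class.
class StylesheetUriExtractor {
    // The stylesheet pattern from the message, minus the Perl m{...}g
    // wrapper, escaped for a Java string literal.
    private static final Pattern LINK_CSS = Pattern.compile(
        "(?si)<LINK(?=\\s)(?:[^\\>]*?\\s(?:HREF\\s*=\\s*\"([^\">]*)\""
        + "|REL\\s*=\\s*\"stylesheet\")){2,}");

    // Returns the HREF of every <LINK> tag the pattern accepts.
    static List<String> extract(CharSequence html) {
        List<String> uris = new ArrayList<>();
        Matcher m = LINK_CSS.matcher(html);
        while (m.find()) {            // plays the role of Perl's /g
            String href = m.group(1);
            if (href != null) {       // no HREF captured: nothing to report
                uris.add(href);
            }
        }
        return uris;
    }

    public static void main(String[] args) {
        String html = "<link rel=\"stylesheet\" type=\"text/css\" href=\"main.css\">\n"
                    + "<LINK HREF=\"print.css\" REL=\"stylesheet\">\n"
                    + "<link rel=\"icon\" href=\"favicon.ico\">";
        System.out.println(extract(html)); // prints [main.css, print.css]
    }
}
```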

I'll give it a shot so that we can compare -- it's important, because I've seen that processing responses is one of JMeter's biggest CPU hogs. We will probably be able to use the results for extractors, too.

--
Salut,

Jordi.

peter lin wrote:


I'm not convinced a regexp approach would be better
than HtmlParser for a couple of reasons.

- HtmlParser already works on the stream directly
using readers.

- Java regexp is decent, but not blazing fast like
Perl regexp.

- Regexp isn't ideal for making it easy to extend.

- HTML is dirty, so a developer would need sufficient
expertise with regexp to get it to work correctly.

- I'm not a regexp guru, but if someone else is
willing to try to write a generalized package for
scanning specific tags that can handle dirty HTML, it
would be great.

- I'd rather write an HTML compiler that reads the bytes
directly than use regexp.

- HtmlParser is sufficiently fast and efficient that I
think it is a good candidate to replace tidy. Plus I
don't like having to build DOM just to get the images.

I'm open to ideas. If no one objects, I will continue
as planned and complete the new sampler using
HtmlParser.

peter

--- Jordi Salvat i Alabart <[EMAIL PROTECTED]> wrote:

My experience with -Xincgc is that it never helps: the overhead it adds is so huge that the shorter GC pauses never compensate for it.

Have you thought about a regexp-based implementation? It would be less correct, but probably good enough, and possibly much faster.
--
Salut,
Jordi.

peter lin wrote:

I ran some benchmarks today with a new version of
httpsamplerfull using HtmlParser. The results are
interesting. Perhaps the biggest and most
interesting discovery for me is the dramatic
difference in performance with and without
-Xincgc.
http://tao.altern8.net:8080/comparison_summary.pdf

The results are in PDF format.

When I run JMeter with incremental GC, the HtmlParser
version beats Tidy easily, but without incremental
GC, the performance gain is marginal as the number
of threads increases.
It would appear incremental GC hinders DOM and
Tidy performance and results in a steady increase in
heap size. Without incremental GC, the response time
with HtmlParser is generally faster than with Tidy
by 5-10%. Under which circumstances is using -Xincgc
better for JMeter?
The JDK I am using is 1.4.1 on Windows.

peter
---------------------------------------------------------------------

To unsubscribe, e-mail:
[EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]

