You're right that HTML is dirty and the regexps will be difficult, but I'm familiar with the issue and already have some that I used previously in Perl scripts. For example, this gets all image URIs:
(?si)<IMG(?=\s)[^\>]*?\sSRC\s*=\s*"([^">]*)"
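For comparison on the Java side, here is a minimal sketch of how that pattern might be driven from java.util.regex. The class and method names (ImageUriExtractor, extract) are made up for illustration and are not JMeter code. Like the Perl pattern, it only handles double-quoted SRC values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not JMeter code: applies the IMG pattern above to a page.
class ImageUriExtractor {
    // Same expression as above; (?si) turns on DOTALL and CASE_INSENSITIVE.
    private static final Pattern IMG_SRC = Pattern.compile(
        "(?si)<IMG(?=\\s)[^>]*?\\sSRC\\s*=\\s*\"([^\">]*)\"");

    // Returns the SRC value of every IMG tag found, in document order.
    static List<String> extract(String html) {
        List<String> uris = new ArrayList<String>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            uris.add(m.group(1)); // group 1 is the quoted SRC value
        }
        return uris;
    }

    public static void main(String[] args) {
        String html = "<p><img src=\"a.png\"><IMG ALT=\"x\" SRC=\"b.gif\"></p>";
        System.out.println(extract(html));
    }
}
```

Compiling the Pattern once and reusing it across responses avoids paying the compilation cost on every sample.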
Others are more difficult -- for example stylesheets:
m{(?si)<LINK(?=\s)(?:[^\>]*?\s(?:HREF\s*=\s*"([^">]*)"|REL\s*=\s*"stylesheet")){2,}}g
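A rough Java equivalent, again only a sketch: the Perl m{...}g wrapper becomes a find() loop, and the names StylesheetUriExtractor and extract are hypothetical. Note that the {2,} repetition only counts attributes, so the pattern would also accept a LINK with two HREFs or two RELs; the null check below covers the case where no HREF was captured.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not JMeter code: applies the LINK pattern above to a page.
class StylesheetUriExtractor {
    // Same expression as above, without the Perl m{...}g delimiters.
    private static final Pattern LINK_CSS = Pattern.compile(
        "(?si)<LINK(?=\\s)(?:[^>]*?\\s"
        + "(?:HREF\\s*=\\s*\"([^\">]*)\"|REL\\s*=\\s*\"stylesheet\")){2,}");

    // Returns the HREF of every LINK tag the pattern accepts, in document order.
    static List<String> extract(String html) {
        List<String> uris = new ArrayList<String>();
        Matcher m = LINK_CSS.matcher(html);
        while (m.find()) {
            // Group 1 is null when every repetition matched REL and none matched HREF.
            if (m.group(1) != null) {
                uris.add(m.group(1));
            }
        }
        return uris;
    }

    public static void main(String[] args) {
        String html = "<link rel=\"stylesheet\" type=\"text/css\" href=\"s.css\">"
            + "<link rel=\"icon\" href=\"f.ico\">";
        System.out.println(extract(html));
    }
}
```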
I'll give it a shot so that we can compare -- it's important, because I've seen that processing responses is one of JMeter's biggest CPU hogs. We will probably be able to use the results for extractors, too.
-- Salut,
Jordi.
peter lin wrote:
I'm not convinced a regexp approach would be better than HtmlParser for a couple of reasons.
- HtmlParser already works on the stream directly using readers.
- Java regexp is decent, but not blazing fast like Perl regexp.
- to make it easy to extend, regexp isn't ideal.
- html is dirty, so a developer would need sufficient expertise with regexp to get it to work correctly.
- I'm not a regexp guru, but if someone else is willing to try to write a generalized package for scanning specific tags that can handle dirty HTML, it would be great.
- I'd rather write an HTML compiler reading the bytes directly than use regexp.
- HtmlParser is sufficiently fast and efficient that I think it is a good candidate to replace Tidy. Plus I don't like having to build a DOM just to get the images.
I'm open to ideas. If no one objects, I will continue as planned and complete the new sampler using HtmlParser.
peter
--- Jordi Salvat i Alabart <[EMAIL PROTECTED]> wrote:
My experience with -Xincgc is that it never helps: the overhead it adds is so huge that the shorter GC pauses never compensate for it.
Have you thought about a regexp-based implementation? It would be less correct, but probably good enough, and possibly much faster.
-- Salut,
Jordi.
peter lin wrote:
I ran some benchmarks today with a new version of httpsamplerfull using HtmlParser. The results are interesting. Perhaps the biggest and most interesting discovery for me is the dramatic difference in performance with and without -Xincgc.
http://tao.altern8.net:8080/comparison_summary.pdf
The results are in PDF format.
When I run JMeter with incremental GC, the HtmlParser version beats Tidy easily, but without incremental GC the performance gain is marginal as the number of threads increases.
It would appear that incremental GC hinders DOM and Tidy performance and results in a steady increase in heap size. Without incremental GC, the response time with HtmlParser is generally 5-10% faster than with Tidy. Under which circumstances is using -Xincgc better for JMeter?
The JDK I am using is 1.4.1 on Windows.
peter
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]