Re: HTTPSamplerFull performance: HTMLParser vs. Regexp-based

peter lin Sun, 23 Nov 2003 20:01:17 -0800

hi jordi,

I couldn't download the attachment you added to the
bug. Can you send it to me directly? I'll try to
get to it next month, after the holidays.


thanks.


peter

--- Jordi Salvat i Alabart <[EMAIL PROTECTED]> wrote:
> No, it doesn't. JTidy works well.
> 
> I'm suspecting your guess is wrong... :-)
> 
> -- 
> Salut,
> 
> Jordi.
> 
> peter lin wrote:
> > can you verify if the old JTidy implementation
> > contains the same bug?
> > 
> > I'm going to guess it's how I'm using htmlparser.
> > 
> > peter
> > 
> > 
> > --- Jordi Salvat i Alabart <[EMAIL PROTECTED]> wrote:
> > 
> >>Responding to myself again...
> >>
> >>I've been running some more tests with JVM arguments that I
> >>believe are more sensible, namely:
> >>
> >>-Xms256m -Xmx256m -XX:NewSize=64m -XX:MaxNewSize=64m
> >>-XX:MaxLiveObjectEvacuationRatio=40
> >>-XX:SurvivorRatio=8
> >>
> >>With this, the performance difference has almost disappeared: I'm
> >>getting ca. 12 samples/second with the htmlparser and 15
> >>samples/second with the regexp approach. The htmlparser solution
> >>generates about 5 times more garbage than the regexp solution,
> >>which explains why the results were so tremendously different
> >>using -Xincgc.
> >>
> >>In this situation, I don't believe it's worth providing users with
> >>the ability to choose which parser they want. I won't remove them
> >>now, but I believe HtmlParser is the best choice once we have
> >>managed to clean up the outstanding bugs.
> >>
> >>The bugs I mentioned before (failure to parse a couple of image
> >>URLs) still hold. I'll file them now.
> >>
> >>-- 
> >>Salut,
> >>
> >>Jordi.
> >>
> >>Jordi Salvat i Alabart wrote:
> >>
> >>>Hi.
> >>>
> >>>I've finally found some time to test the performance of the
> >>>HTTPSamplerFull implementation currently in CVS (developed by
> >>>Peter Lin using HTMLParser) against the implementation I sent a
> >>>while ago to the list (developed by me using Regexps). [Remember:
> >>>the objective is not to decide which is best, but whether it's
> >>>worth having both available to script developers].
> >>>
> >>>The results are not conclusive, but they show that the issue
> >>>deserves further analysis:
> >>>
> >>>1/ On the example I've been using, the Regexp-based implementation
> >>>was more accurate than the HTMLParser-based one. This is very
> >>>surprising to me, since I expected the Regexp-based implementation
> >>>to be generally less accurate. I'll need some help on this one.
> >>>More details later.
> >>
> >>>2/ On the example I've been using, the Regexp-based implementation
> >>>was at least 7 times faster than the HTMLParser-based one. A quick
> >>>look at the code suggests that the HTML parser is being called 5
> >>>times (once for each tag of interest: img, applet, input, body,
> >>>table). Am I correct? The regexp-based implementation only scans
> >>>through the HTML once. This could well explain most of the
> >>>performance difference. Is there any way to recode the
> >>>HTMLParser-based implementation to do the job in a single scan?
> >>>
> >>>How to reproduce the test:
> >>>- Get Apache and JMeter running (I'm running both on the same box,
> >>>  which is probably a bad idea).
> >>>- Uncompress the attached test-httpsamplerfull.tgz in the Apache
> >>>  docroot. It contains a Yahoo home page saved using Mozilla 1.5.
> >>>  (A proper test would use several other samples.)
> >>>- Run the attached script and look at the Rate in the Aggregate
> >>>  Report.
> >>
> >>>On my IBM T30 with Pentium 4 M @ 2.2 GHz, 1 GB RAM, with JDK
> >>>1.4.2_02, no fiddling with the java arguments (yes, that means I'm
> >>>using -Xincgc, which is probably the worst possible choice) I'm
> >>>getting around 1 sample/second with the HTMLParser-based sampler
> >>>and around 7 samples/second with the Regexp-based one.
> >>>
> >>>In addition, the HTMLParser-based implementation is failing to
> >>>download two images: powrdbyhp_blu_84x28_yahoo.gif (it is
> >>>downloading the HTML page again instead) and 031121_l300.gif (it
> >>>downloads nothing). I've used Mozilla's "Live HTTP Headers" to see
> >>>what Mozilla does and it matches what the Regexp-based
> >>>implementation is doing. I'd say there's a
> 
=== message truncated ===
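
A note on the single-scan question in item 2/ of the quoted message:
the sketch below shows one way a regexp-based scan can pick up all five
tags of interest in a single pass over the page. It is not the code of
either implementation discussed in this thread. The class name
SingleScanExtractor, the method names, and the tag-to-attribute mapping
(src for img and input, code for applet, background for body and table)
are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JMeter: illustrates a one-pass scan.
class SingleScanExtractor {
    // One alternation covers all five tags, so the page is walked once
    // instead of once per tag.
    private static final Pattern TAG = Pattern.compile(
            "<(img|applet|input|body|table)\\b([^>]*)>",
            Pattern.CASE_INSENSITIVE);

    // name=value pairs inside a single tag; the value may be
    // double-quoted, single-quoted or bare.
    private static final Pattern ATTR = Pattern.compile(
            "([\\w-]+)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))");

    // Assumed mapping from tag name to the attribute holding the URL.
    static String urlAttributeFor(String tag) {
        switch (tag) {
            case "img":
            case "input":
                return "src";
            case "applet":
                return "code";
            default:
                return "background"; // body, table
        }
    }

    // Returns the resource references in document order.
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            String wanted =
                    urlAttributeFor(tag.group(1).toLowerCase(Locale.ROOT));
            // Secondary scan over this tag's attribute text only.
            Matcher attr = ATTR.matcher(tag.group(2));
            while (attr.find()) {
                if (!attr.group(1).equalsIgnoreCase(wanted)) {
                    continue;
                }
                String value = attr.group(2) != null ? attr.group(2)
                        : attr.group(3) != null ? attr.group(3)
                        : attr.group(4);
                urls.add(value);
                break; // first matching attribute wins
            }
        }
        return urls;
    }
}
```

Calling SingleScanExtractor.extract(html) on a page body returns the
referenced resources in the order they appear. The sketch ignores
comments, scripts, relative URL resolution and a '>' inside attribute
values, so it says nothing about the accuracy differences reported
above.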


