Those are interesting results. My main reason for not
using regexps is that I am not an expert at them. My
guess is that regexp will be faster than either JTidy or
HTMLParser, but the hard part is extensibility. For me,
extending HTMLParser is considerably easier than
mastering regexp.


--- Jordi Salvat i Alabart <[EMAIL PROTECTED]> wrote:
> Hi.
> 
> I've finally found some time to test the performance
> of the 
> HTTPSamplerFull implementation currently in CVS
> (developed by Peter Lin 
> using HTMLParser) against the implementation I sent
> a while ago to the 
> list (developed by me using Regexps). [Remember:
> the objective is not 
> to decide which is best, but whether it's worth
> having both available to 
> script developers].
> 
> The results are not conclusive, but they prove that
> the issue deserves 
> further analysis:
> 
> 1/ On the example I've been using, the Regexp-based
> implementation was 
> more accurate than the HTMLParser-based one. This is
> very surprising to 
> me, since I expected the Regexp-based implementation
> to be generally 
> less accurate. I'll need some help on this one. More
> details later.

I can file a bug report with the HTMLParser
developers. It worked for the tests I ran, which were
basically the benchmark classes in the test directory.
It could also be a bug in how I originally implemented
it in NewHTTPSampler, which Sebastian refactored last
week. HTMLParser works by registering listeners
(scanners), so it does the job in one pass. I believe
the cost is in building a structured object.

protected static void addTagListeners(Parser parser)
{
    log.debug("Start : addTagListeners");
    // add body tag scanner
    parser.addScanner(new BodyScanner());
    // add ImageTag scanner
    LinkScanner linkScanner = new LinkScanner(LinkTag.LINK_TAG_FILTER);
    // parser.addScanner(linkScanner);
    parser.addScanner(linkScanner.createImageScanner(ImageTag.IMAGE_TAG_FILTER));
    // add input tag scanner
    parser.addScanner(new InputTagScanner());
    // add applet tag scanner
    parser.addScanner(new AppletScanner());
}

You'll see that parse is only called once.

try {
    // we start to iterate through the elements
    for (NodeIterator e = parser.elements(); e.hasMoreNodes();)
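
The one-pass idea can be illustrated without the library. The sketch below is not the HTMLParser API: OnePassTagDispatcher, addListener and parse are hypothetical names made up for illustration, and the tag detection is deliberately naive (it would be confused by a '>' inside an attribute value). It only shows that several listeners can be served by a single walk over the document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical stand-in for a parser with registered tag listeners.
// This is NOT the HTMLParser API; it only illustrates one-pass dispatch.
class OnePassTagDispatcher {
    private final Map<String, Consumer<String>> listeners = new HashMap<>();

    // Register a listener for one tag name (case-insensitive).
    void addListener(String tagName, Consumer<String> listener) {
        listeners.put(tagName.toLowerCase(Locale.ROOT), listener);
    }

    // Walks the document once; each opening tag goes to its listener, if any.
    void parse(String html) {
        int pos = 0;
        while ((pos = html.indexOf('<', pos)) >= 0) {
            int end = html.indexOf('>', pos);
            if (end < 0) {
                break; // unterminated tag: stop scanning
            }
            String tag = html.substring(pos + 1, end);
            // The tag name is the leading run of letters/digits.
            // Closing tags start with '/', so their name is empty and is skipped.
            int i = 0;
            while (i < tag.length() && Character.isLetterOrDigit(tag.charAt(i))) {
                i++;
            }
            String name = tag.substring(0, i).toLowerCase(Locale.ROOT);
            Consumer<String> listener = listeners.get(name);
            if (listener != null) {
                listener.accept(tag);
            }
            pos = end + 1;
        }
    }

    public static void main(String[] args) {
        List<String> images = new ArrayList<>();
        List<String> inputs = new ArrayList<>();
        OnePassTagDispatcher d = new OnePassTagDispatcher();
        d.addListener("img", images::add);
        d.addListener("input", inputs::add);
        d.parse("<body><img src=\"a.gif\"><input type=\"image\" src=\"b.gif\"></body>");
        System.out.println("img tags:   " + images);
        System.out.println("input tags: " + inputs);
    }
}
```

Adding another tag of interest is one more addListener call, which is the extensibility argument in miniature. It says nothing about the cost of building node objects, which is where I suspect the time goes.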

> 
> 2/ On the example I've been using, the Regexp-based
> implementation was 
> at least 7 times faster than the HTMLParser-based
> one. A quick look at 
> the code suggests that the HTML Parser is being
> called 5 times (one for 
> each tag of interest: img, applet, input, body,
> table). Am I correct? 
> The regexp-based implementation only scans through
> the HTML once. This 
> could well explain most of the performance
> difference. Is there any way 
> to recode the HTMLParser-based implementation to do
> the job in a single 
> scan?

By design, a regexp is essentially a finite state
machine and therefore should be faster. But the
challenge is this: if you need to parse an element that
has subnodes, doing it with regexp is harder. For
example, say users want the ability to parse a specific
table. I believe doing it with regexp would require
multiple steps. It may still be faster, but it would
probably take much more work to do the same task. I
could be wrong.
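
To make the trade-off concrete, here is a minimal sketch using only java.util.regex. The class name, method names and both patterns are assumptions written for this example; they are not taken from either sampler implementation. Flat tags such as img and input are handled in one scan, while a naive pattern for a table stops at the first closing tag and so cuts a nested table short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: these patterns are assumptions, not the ones used
// in either HTTPSamplerFull implementation.
class RegexScanSketch {
    // Flat case: one pattern, one scan, collects src values of img/input tags.
    private static final Pattern SRC = Pattern.compile(
            "<(?:img|input)\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Nested case: a naive "everything up to </table>" pattern.
    private static final Pattern TABLE = Pattern.compile(
            "<table\\b.*?</table>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<String> findSources(String html) {
        List<String> sources = new ArrayList<>();
        Matcher m = SRC.matcher(html);
        while (m.find()) {
            sources.add(m.group(1));
        }
        return sources;
    }

    // Returns the first match of the naive table pattern, or null if none.
    static String firstTable(String html) {
        Matcher m = TABLE.matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String flat = "<img src=\"a.gif\"><input type=\"image\" src='b.gif'>";
        System.out.println(findSources(flat));

        String nested = "<table id=outer><tr><td>"
                + "<table id=inner></table>"
                + "</td></tr></table>";
        // The match ends at the inner </table>, so the outer table is truncated.
        System.out.println(firstTable(nested));
    }
}
```

Getting the whole outer table would need extra steps, for example counting opening and closing tags after each match, which is the kind of additional work I mean.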

> In addition, the HTMLParser-based implementation is
> failing to download 
> two images: powrdbyhp_blu_84x28_yahoo.gif (it is
> downloading the HTML 
> page again instead) and 031121_l300.gif (it
> downloads nothing). I've 
> used Mozilla's "Live HTTP Headers" to see what
> Mozilla does and it 
> matches what the Regexp-based implementation is
> doing. I'd say there's a 
> bug in the HTMLParser. Can someone familiar with it
> have a look? (Hi 
> Peter!).

You can look at the benchmark classes I wrote to test
the performance against JTidy before I implemented the
sampler.

http://cvs.apache.org/viewcvs/jakarta-jmeter/src/htmlparser/org/htmlparser/tests/BenchmarkP.java
http://cvs.apache.org/viewcvs/jakarta-jmeter/src/htmlparser/org/htmlparser/tests/BenchmarkTidy.java

When I ran a test using the CNet and Yahoo homepages,
it correctly got all the image tags. Is the image a
banner? I did notice banners weren't loaded in my
test, but that was because the link pointed to another
server. I believe this may be a result of how I
implemented support. The current implementation gets
the image and input tags. That was how the JTidy
version was implemented, so I unknowingly ported the
bad implementation. Or am I missing something?
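
For checking the banner case, it may help to see how an image's src resolves against the page URL. This is a minimal sketch with java.net.URI; the class name, the helper names and the example.com / example.net URLs are made up for illustration, and this is not how either sampler resolves URLs. I am not claiming this is the cause of the two failed downloads.

```java
import java.net.URI;

// Illustrative helper: names and URLs here are hypothetical.
class ImageUrlResolver {
    // Resolve a src attribute value against the URL of the page it came from.
    static URI resolve(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src.trim());
    }

    // True when the resolved resource lives on the same host as the page.
    static boolean sameHost(String pageUrl, URI resource) {
        String pageHost = URI.create(pageUrl).getHost();
        return pageHost != null && pageHost.equalsIgnoreCase(resource.getHost());
    }

    public static void main(String[] args) {
        String page = "http://www.example.com/dir/page.html";
        String[] sources = {
            "images/logo.gif",                   // relative to the page's directory
            "/img/spacer.gif",                   // relative to the server root
            "http://ads.example.net/banner.gif"  // absolute, different server
        };
        for (String src : sources) {
            URI resolved = resolve(page, src);
            System.out.println(resolved + " sameHost=" + sameHost(page, resolved));
        }
    }
}
```

A banner served from another host resolves to a perfectly valid URL, so whether it gets downloaded is a decision made by the sampler, not a limitation of the parsing step.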

I'm pretty busy these days, so I may not have time to
fix it for a week or two. I think doing what works, or
what users expect, is the right decision. If HTMLParser
doesn't meet the performance requirements, then I see
no reason to lock JMeter to HTMLParser. In defense of
HTMLParser, it is a solid library and does improve the
throughput of JMeter. To me, the design is sound and
very extensible. Everyone has different preferences, so
maybe we should support both ways of parsing HTML? For
myself, I don't really have time to become a regexp
guru. One of the approaches I considered was to write
an HTML parser from scratch using a stack-based
parser. Ultimately I decided HTMLParser provided the
features and extensibility. Plus, I wasn't confident
that, for complex use cases, a stack-based parser
would be easy for others to extend and use.

On a benchmark note, I ran standalone benchmarks with
just JTidy and HTMLParser, and with NewHTTPSamplerFull.
I also profiled the sampler using OptimizeIt. All of
the data I generated showed that HTMLParser provides a
real benefit.

peter



