Good discussion. I get the feeling we need to unify the parsing code. I plan on 
debugging the bug Jordi found; unfortunately I won't have time until after the 
holidays.
 
If I can fix the bug with HTMLParser, do others feel it would be a strong candidate?
 
I have mixed feelings about having just one. For the casual user, one parser makes a 
lot of sense. For the hardcore developer, having support for multiple parsers is nice. 
I tend to use JMeter for heavy stress testing of sites that get a million+ pageviews a 
day. These types of systems typically have a dedicated image server, so I normally 
test without getting the images. For me at least, accuracy is not as important as 
simulating huge loads.
 
For functional testing, performance isn't an issue and simulating browser behavior 
exactly would be great. I haven't used JMeter in a functional test mode, so I'm not a 
good judge of those requirements. The primary challenge I can see is sites that 
have lots of banners. The only reliable way would be to search for all instances of 
".gif", ".jpg", and so on. I believe both JTidy and HTMLParser would be considerably 
slower than the regexp approach, but that's just a guess.
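 
To make that concrete, here's roughly the kind of thing I mean by the regexp 
approach -- just a sketch using java.util.regex (the actual ParseRegexp may well use 
a different regexp package), completely untested:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexpImageScan {
        // Grab src/background attribute values that look like images.
        // Note this deliberately ignores document structure -- it can't tell
        // which form or table a reference came from.
        private static final Pattern IMAGE_REF = Pattern.compile(
            "(?:src|background)\\s*=\\s*[\"']?([^\"'\\s>]+\\.(?:gif|jpe?g|png))",
            Pattern.CASE_INSENSITIVE);

        public static List extractImageUrls(String html) {
            List urls = new ArrayList();
            Matcher m = IMAGE_REF.matcher(html);
            while (m.find()) {
                urls.add(m.group(1));
            }
            return urls;
        }
    }

A single scan like that should be fast, but it tells you nothing about where in the 
page each reference lives.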
 
HTMLParser uses the Visitor pattern and parses the content once. As long as you 
register a listener for a tag, it will parse it. HTMLParser, like JTidy, maintains 
hierarchy information. This is necessary if you want to manipulate/parse multiple 
forms on a given page, or, more simply, if you want to parse multiple tables and keep 
the content of each separate.
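 
Roughly, the Visitor idea looks like this -- note these class and method names are 
made up for illustration, not the actual org.htmlparser API:

    // The parser walks the document once and calls back for every tag;
    // a visitor only reacts to the tags it cares about.
    interface TagVisitor {
        void visitTag(String tagName, java.util.Map attributes);
    }

    class ImageCollector implements TagVisitor {
        java.util.List urls = new java.util.ArrayList();

        public void visitTag(String tagName, java.util.Map attributes) {
            if ("img".equalsIgnoreCase(tagName)) {
                urls.add(attributes.get("src"));
            }
        }
    }

So you can collect images, forms, tables, etc. in the same single pass just by 
registering more visitors.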
 
If we go with the regexp approach, can it be made such that it maintains hierarchy 
information? I can think of one example. Say I want to use JMeter for functional 
testing on Superpages.com. Superpages.com populates the simple search form with the 
previous search. Let's say I want to verify that the search parameters match the first 
10 results using keyword comparison. If the search fails for the keyword, the advanced 
search form is displayed. If I want to compare the two search forms, I would need to 
traverse each form and compare the input nodes.
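 
With a tree-style parse (e.g. the W3C DOM that JTidy produces) that comparison is 
easy, because each input stays attached to its parent form. A rough, untested sketch 
using the standard org.w3c.dom API:

    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class FormCompare {
        // Collect the name=value pairs of the inputs inside one form element.
        static java.util.Map inputsOf(Element form) {
            java.util.Map params = new java.util.HashMap();
            NodeList inputs = form.getElementsByTagName("input");
            for (int i = 0; i < inputs.getLength(); i++) {
                Element input = (Element) inputs.item(i);
                params.put(input.getAttribute("name"), input.getAttribute("value"));
            }
            return params;
        }

        // Compare the inputs of the first two forms on the page.
        static boolean firstTwoFormsMatch(Document doc) {
            NodeList forms = doc.getElementsByTagName("form");
            if (forms.getLength() < 2) {
                return false;
            }
            return inputsOf((Element) forms.item(0))
                    .equals(inputsOf((Element) forms.item(1)));
        }
    }

A flat list of regexp matches would give me all the inputs on the page, but not which 
form each one belongs to.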
 
peter
 
 
 


Jordi Salvat i Alabart <[EMAIL PROTECTED]> wrote:


BAZLEY, Sebastian wrote:
> As you are no doubt aware, I recently refactored the HTML Parsing code.
> - JTidy and HTMLParser now have their own separate class files.

There's a third one since yesterday evening: ParseRegexp -- a 
regexp-based parser, which performs much better memory-wise than the 
HTMLParser-based one. CPU-wise there doesn't seem to be much of a 
difference, though.

> - The parsing method is selected at run-time by HTTPSamplerFull.
> 
> During the refactoring process, I noticed that JTidy was not picking up some
> images, for example background table images, so I added code to catch some
> more images. This is likely to make the performance of JTidy worse, as the
> current design makes a separate pass through the DOM for each tag type - not
> very efficient.
> 
> I'm sure we could improve the JTidy performance by using a single pass
> through the DOM, picking up all the required nodes en route. There is an
> example of this (print nodes) on SourceForge. Whether this is worth it is
> another matter...
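
For reference, a one-pass traversal along those lines might look roughly like this 
(just a sketch against the W3C DOM that JTidy returns, untested):

    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class OnePassScan {
        // Walk the DOM once, collecting every image/applet reference en route.
        static void collect(Node node, java.util.List urls) {
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                Element e = (Element) node;
                String tag = e.getTagName().toLowerCase();
                if (tag.equals("img")) {
                    urls.add(e.getAttribute("src"));
                } else if (tag.equals("applet")) {
                    urls.add(e.getAttribute("code"));
                }
                // background images on body/table/td -- the case that was being missed
                String background = e.getAttribute("background");
                if (background.length() > 0) {
                    urls.add(background);
                }
            }
            NodeList children = node.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                collect(children.item(i), urls);
            }
        }
    }

But whether that effort is worth it depends on the decision below.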

I believe we should put some work into testing and finally deciding on 
one single implementation. I once thought it might be worth keeping two 
implementations around, since I believed one would be more accurate and 
the other more performant... but the accurate one has proven to be not 
that accurate and the performant one is not that performant :-) Keeping 
two around without the advantages/disadvantages of each being clear is 
confusing to users for no reason.

> ** the parser routines not only parse, they also retrieve the images/applets
> and create the sample results. I did not (yet) refactor that part of the
> code back into a common module, but I think it would be useful to do so.

+1 -- even if we finally keep one single implementation, this would make 
for cleaner code.

> But I wonder whether it would not be better for the parser modules to just
> return a list of URLs, and leave it up to the caller to fetch them after
> doing the parse? That would certainly make it easier to write JUnit tests
> for the parsers; it ought to make the parser interface more generally
> useful. And it would help if/when we use a different HTTP protocol stack,
> such as httpunit.

+1
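
Something as simple as this would do -- just a sketch, the name and signature are 
invented, not existing JMeter classes:

    import java.net.URL;
    import java.util.List;

    // The parser only finds the embedded resource URLs; fetching them and
    // building the sample results stays with the caller.
    public interface HTMLResourceParser {
        // baseURL is the URL the page was fetched from, used to resolve
        // relative references.
        List getEmbeddedResourceURLs(String html, URL baseURL);
    }

Each of the three implementations (JTidy, HTMLParser, ParseRegexp) could then sit 
behind the same interface, and the JUnit tests would just compare the returned lists.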

> ** Only images (and applets) are parsed/fetched currently. If the purpose is
> to emulate a browser more closely, then it seems to me that we should
> consider fetching other files such as CSS and Javascript. To do this fully
> would be hard work, but it would be easy enough to fetch at least some such
> files. What do others think?

+0; +1 for the Regexp-based implementation.

I believe the parsing functionality only makes sense if it approaches 
browser behaviour as closely as possible.

-- 
Salut,

Jordi.

