As you are no doubt aware, I recently refactored the HTML Parsing code.
- JTidy and HTMLParser now have their own separate class files.
- The parsing method is selected at run-time by HTTPSamplerFull.

During the refactoring process, I noticed that JTidy was not picking up some
images, for example background table images, so I added code to catch some
more images. This is likely to make the performance of JTidy worse, as the
current design makes a separate pass through the DOM for each tag type - not
very efficient.

I'm sure we could improve the JTidy performance by using a single pass
through the DOM, picking up all the required nodes en route. There is an
example of this (print nodes) on SourceForge. Whether this is worth it is
another matter...
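
For what it's worth, a single-pass walk might look roughly like the sketch
below. It works on any org.w3c.dom tree, which as far as I know is what
JTidy's parseDOM returns. The class name SinglePassScanner, the method
names, and the choice of tags and attributes (img src, applet code,
background on body/table/tr/td/th) are my own assumptions for illustration,
not the current JMeter code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: visits every node once and collects resource
// references, instead of making one pass over the DOM per tag type.
class SinglePassScanner {

    /** Returns the raw attribute values found, in document order. */
    static List<String> collectUrls(Node root) {
        List<String> urls = new ArrayList<>();
        walk(root, urls);
        return urls;
    }

    private static void walk(Node node, List<String> urls) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName().toLowerCase(Locale.ROOT);
            switch (name) {
                case "img":
                    addAttr(node, "src", urls);
                    break;
                case "applet":
                    addAttr(node, "code", urls);
                    break;
                case "body":
                case "table":
                case "tr":
                case "td":
                case "th":
                    // Background images (assumed set of tags).
                    addAttr(node, "background", urls);
                    break;
                default:
                    break;
            }
        }
        // Recurse into children so each node is visited exactly once.
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), urls);
        }
    }

    private static void addAttr(Node node, String attr, List<String> urls) {
        NamedNodeMap attrs = node.getAttributes();
        if (attrs == null) {
            return;
        }
        Node value = attrs.getNamedItem(attr);
        if (value != null && !value.getNodeValue().isEmpty()) {
            urls.add(value.getNodeValue());
        }
    }
}
```

Adding another tag type then only means another case in the switch, not
another traversal.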

Peter (Lin) and I exchanged some e-mails about this, and concluded that
JTidy was still likely to need more memory than HTMLParser, as the entire
DOM needs to be constructed and saved before use, whereas HTMLParser does
not _have_ to do this (what it does, I don't know). Also, it does not seem
to be possible to switch off the tidy part of JTidy, so it is again likely
to be less efficient. I think the stand-alone image tests that Peter did
showed this.

A couple more items came up in the refactoring:

** The parser routines not only parse, they also retrieve the images/applets
and create the sample results. I have not (yet) refactored that part of the
code back into a common module, but I think it would be useful to do so.

However, I wonder whether it would be better for the parser modules to just
return a list of URLs, and leave it up to the caller to fetch them after
doing the parse. That would certainly make it easier to write JUnit tests
for the parsers, and it ought to make the parser interface more generally
useful. It would also help if/when we use a different HTTP protocol stack,
such as httpunit.

** Only images (and applets) are parsed/fetched currently. If the purpose is
to emulate a browser more closely, then it seems to me that we should
consider fetching other files such as CSS and JavaScript. To do this fully
would be hard work, but it would be easy enough to fetch at least some such
files. What do others think?
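
To make the first item a bit more concrete, here is a rough sketch of a
parser that only returns URLs and leaves the fetching to the caller. All
the names (ParserSketch, ResourceParser, getEmbeddedResourceUrls,
RegexImageParser) are invented for illustration, and the regex-based
implementation is just a stand-in so the example is self-contained; it is
neither JTidy nor HTMLParser, and it only looks at img tags.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParserSketch {

    /** Hypothetical interface: parse only, no fetching, no sample results. */
    interface ResourceParser {
        List<URI> getEmbeddedResourceUrls(String html, URI baseUrl);
    }

    /** Stand-in implementation that finds quoted img src attributes. */
    static class RegexImageParser implements ResourceParser {
        private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

        @Override
        public List<URI> getEmbeddedResourceUrls(String html, URI baseUrl) {
            // LinkedHashSet drops duplicates but keeps document order.
            Set<URI> found = new LinkedHashSet<>();
            Matcher m = IMG_SRC.matcher(html);
            while (m.find()) {
                try {
                    // Resolve relative references against the page URL.
                    found.add(baseUrl.resolve(m.group(1).trim()));
                } catch (IllegalArgumentException e) {
                    // Skip references that are not valid URIs.
                }
            }
            return new ArrayList<>(found);
        }
    }
}
```

A JUnit test can then feed in a string of HTML and compare the returned
list, with no network access needed; the sampler would loop over the list
and fetch each entry with whatever HTTP stack is in use.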

S.
