BAZLEY, Sebastian wrote:
As you are no doubt aware, I recently refactored the HTML Parsing code.
- JTidy and HTMLParser now have their own separate class files.
There's a third one since yesterday evening: ParseRegexp -- a regexp-based parser, which performs much better memory-wise than the HTMLParser-based one. CPU-wise there doesn't seem to be much of a difference, though.
- The parsing method is selected at run-time by HTTPSamplerFull.
During the refactoring process, I noticed that JTidy was not picking up some images, for example background table images, so I added code to catch some more images. This is likely to make the performance of JTidy worse, as the current design makes a separate pass through the DOM for each tag type - not very efficient.
I'm sure we could improve the JTidy performance by using a single pass through the DOM, picking up all the required nodes en route. There is an example of this (print nodes) on SourceForge. Whether this is worth it is another matter...
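To make the single-pass idea concrete, here is a minimal sketch, not the code in the tree. It walks a standard org.w3c.dom tree once and picks up image, applet and background URLs along the way. The class and method names are made up for illustration, and the JDK's own XML parser stands in for JTidy so that the snippet runs on its own (so the input must be well-formed).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, named for illustration only.
class SinglePassCollector {

    // Visits every node exactly once, in document order, and records
    // the URL-bearing attributes of interest as it goes.
    static void collect(Node node, List<String> urls) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element e = (Element) node;
            String tag = e.getTagName().toLowerCase(Locale.ROOT);
            if (tag.equals("img")) {
                add(urls, e.getAttribute("src"));
            } else if (tag.equals("applet")) {
                add(urls, e.getAttribute("code"));
            }
            // Background images can sit on body, table, td, etc.,
            // so check the attribute on every element.
            add(urls, e.getAttribute("background"));
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, urls);
        }
    }

    // getAttribute returns "" when the attribute is absent.
    private static void add(List<String> urls, String value) {
        if (value != null && !value.isEmpty()) {
            urls.add(value);
        }
    }

    // Stand-in for the JTidy parse step: accepts well-formed XHTML only.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception ex) {
            throw new IllegalArgumentException("input is not well-formed", ex);
        }
    }
}
```

The cost is one traversal regardless of how many tag types are of interest, whereas a pass per tag type grows with each new tag we decide to catch.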
I believe we should put some work into testing and finally deciding on one single implementation. I once thought it might be worth keeping two implementations around, since I believed one would be more accurate and the other more performant... but the accurate one has proven not to be that accurate and the performant one is not that performant :-) Keeping two around without the advantages/disadvantages of each being clear is confusing to users for no reason.
** the parser routines not only parse, they also retrieve the images/applets and create the sample results. I did not (yet) refactor that part of the code back into a common module, but I think it would be useful to do so.
+1 -- even if we finally keep one single implementation, this would make for cleaner code.
But I wonder whether it would not be better for the parser modules to just return a list of URLs, and leave it up to the caller to fetch them after doing the parse? That would certainly make it easier to write JUnit tests for the parsers; it ought to make the parser interface more generally useful. And it would help if/when we use a different HTTP protocol stack, such as httpunit.
+1
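For what it's worth, a parser that only returns URLs could look roughly like the sketch below. This is an assumption about the shape of such an interface, not the existing code: EmbeddedUrlParser and RegexImageParser are hypothetical names, and the regexp is deliberately simplistic (quoted src attributes on img tags only).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical interface: parse only, no fetching, no sample results.
interface EmbeddedUrlParser {
    List<URI> parse(String html, URI base);
}

// Hypothetical regexp-based implementation covering <img src="..."> only.
class RegexImageParser implements EmbeddedUrlParser {

    // Matches an img tag and captures a quoted src value.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    @Override
    public List<URI> parse(String html, URI base) {
        List<URI> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            // Resolve relative references against the page URL.
            urls.add(base.resolve(m.group(1)));
        }
        return urls;
    }
}
```

With this split, a JUnit test needs nothing but a string of HTML and an expected list, and the caller is free to fetch the URLs with whatever HTTP stack it likes.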
** Only images (and applets) are parsed/fetched currently. If the purpose is to emulate a browser more closely, then it seems to me that we should consider fetching other files such as CSS and Javascript. To do this fully would be hard work, but it would be easy enough to fetch at least some such files. What do others think?
+0; +1 for the Regexp-based implementation.
I believe the parsing functionality only makes sense if it matches browser behaviour as closely as possible.
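As an illustration of the "easy enough" part, here is a rough regexp sketch that picks out linked stylesheets and external scripts. It is only a sketch under stated assumptions: StylesheetAndScriptFinder is a hypothetical name, only quoted attribute values are handled, and anything a browser would discover later (CSS @import, URLs built by scripts) is out of scope.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
class StylesheetAndScriptFinder {

    // Opening <link ...> and <script ...> tags; group 2 is the attribute text.
    private static final Pattern TAG = Pattern.compile(
            "<(link|script)\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

    // True for rel="stylesheet", so icons and other link types are skipped.
    private static final Pattern REL_STYLESHEET = Pattern.compile(
            "\\brel\\s*=\\s*[\"']?stylesheet", Pattern.CASE_INSENSITIVE);

    static List<String> find(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String tag = m.group(1).toLowerCase(Locale.ROOT);
            String attrs = m.group(2);
            String url;
            if (tag.equals("link")) {
                if (!REL_STYLESHEET.matcher(attrs).find()) {
                    continue;
                }
                url = attr(attrs, "href");
            } else {
                // Inline scripts have no src and yield null here.
                url = attr(attrs, "src");
            }
            if (url != null) {
                urls.add(url);
            }
        }
        return urls;
    }

    // Crude quoted-attribute lookup; would also match e.g. data-src.
    private static String attr(String attrs, String name) {
        Matcher m = Pattern.compile(
                "\\b" + name + "\\s*=\\s*[\"']([^\"']+)[\"']",
                Pattern.CASE_INSENSITIVE).matcher(attrs);
        return m.find() ? m.group(1) : null;
    }
}
```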
-- Cheers,
Jordi.
